Transcript: What We Learned From 5 Million Books (TED Talk)

Erez Lieberman Aiden and Jean-Baptiste Michel

Erez Lieberman Aiden and Jean-Baptiste Michel on "What We Learned From 5 Million Books," delivered at a TED event.

TRANSCRIPT: 

Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we, at Harvard, were wondering if this was really true. So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.

Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there’s a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there’s an X-axis for that, which is the practical axis. This is very, very low.

Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That’s very practical and extremely awesome.

Erez Lieberman Aiden: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.

Now when Google digitizes a book, they put it into a really nice format. Now we’ve got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that’s not the highest quality data. What we’re left with is a collection of 5 million books, 500 billion words, a string of characters a thousand times longer than the human genome — a text which, when written out, would stretch from here to the Moon and back 10 times over — a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, “Stand back. We’re going to try science.”

Jean-Baptiste Michel: Now of course, we were thinking, well let’s just first put the data out there for people to do science to it. Now we’re thinking, what data can we release? Well of course, you want to take the books and release the full text of these 5 million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have 5 million books, that is, 5 million authors, and 5 million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that’s extremely, extremely impractical.
