Erez Lieberman Aiden and Jean-Baptiste Michel on What We Learned From 5 Million Books, at a TED Talk event.
Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we, at Harvard, were wondering if this was really true. So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there’s a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there’s an X-axis for that, which is the practical axis. This is very, very low.
Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That’s very practical and extremely awesome.
Erez Lieberman Aiden: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.
Now when Google digitizes a book, they put it into a really nice format. Now we’ve got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that’s not the highest quality data. What we’re left with is a collection of 5 million books, 500 billion words, a string of characters a thousand times longer than the human genome — a text which, when written out, would stretch from here to the Moon and back 10 times over — a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, “Stand back. We’re going to try science.”
Jean-Baptiste Michel: Now of course, we were thinking, well let’s just first put the data out there for people to do science to it. Now we’re thinking, what data can we release? Well of course, you want to take the books and release the full text of these 5 million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have 5 million books, that is, 5 million authors, and 5 million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that’s extremely, extremely impractical.
Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we’re going to release statistics about the books. So take for instance “A gleam of happiness.” It’s four words; we call that a four-gram. We’re going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of 2 billion lines that tell us about the way culture has been changing.
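The counting described here can be sketched in a few lines of Python. This is only a minimal illustration, not the project's actual pipeline: the `(year, text)` input format, the function names `ngrams` and `ngram_time_series`, and the toy sentences are all assumptions made for the example, and splitting on whitespace is a deliberate simplification of real tokenization.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive words in text, lowercased."""
    words = text.lower().split()  # naive whitespace tokenizer (assumption)
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def ngram_time_series(books, n):
    """Map each n-gram to a per-year count table.

    `books` is an iterable of (year, text) pairs; this toy input
    format is hypothetical, chosen only for illustration.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Invented sample data standing in for the digitized books.
toy_books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "he felt a gleam of happiness and then a gleam of happiness again"),
    (1803, "no such phrase appears here"),
]

series = ngram_time_series(toy_books, 4)
print(dict(series["a gleam of happiness"]))  # {1801: 1, 1802: 2}
```

Each key in `series` corresponds to one line of the big table: a phrase together with its count in every year in which it appears.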
Erez Lieberman Aiden: So those 2 billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let’s suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, “Yesterday, I throve.” Alternatively, I could say, “Yesterday, I thrived.” Well which one should I use? How to know?
Well, as of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you’d say, “Steve, you’re an expert on the irregular verbs. What should I do?” And he’d tell you, “Well most people say thrived, but some people say throve.” And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, “Tom, what should I say?” He’d say, “Well, in my day, most people throve, but some thrived.”