Erez Lieberman Aiden and Jean-Baptiste Michel on What We Learned From 5 Million Books at a TED Talk event…
Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we, at Harvard, were wondering if this was really true. So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google. And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well the best way to learn from them is to read all of these millions of books. Now of course, if there’s a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there’s an X-axis for that, which is the practical axis. This is very, very low.
Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That’s very practical and extremely awesome.
Erez Lieberman Aiden: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.
Now when Google digitizes a book, they put it into a really nice format. Now we’ve got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that’s not the highest quality data. What we’re left with is a collection of 5 million books, 500 billion words, a string of characters a thousand times longer than the human genome — a text which, when written out, would stretch from here to the Moon and back 10 times over — a veritable shard of our cultural genome. Of course what we did when faced with such outrageous hyperbole was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, “Stand back. We’re going to try science.”
Jean-Baptiste Michel: Now of course, we were thinking, well let’s just first put the data out there for people to do science to it. Now we’re thinking, what data can we release? Well of course, you want to take the books and release the full text of these 5 million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have 5 million books, that is 5 million authors, and 5 million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that’s extremely, extremely impractical.
Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well instead of releasing the full text, we’re going to release statistics about the books. So take for instance “A gleam of happiness.” It’s four words; we call that a four-gram. We’re going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular sentence was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of 2 billion lines that tell us about the way culture has been changing.
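[Editor's illustration] The counting described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipeline the speakers actually used: the function names `ngrams` and `count_by_year` are hypothetical, the three-book corpus is invented for illustration, and words are split on whitespace only, which is far cruder than any real tokenizer.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield each run of n consecutive words in text as a tuple."""
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


def count_by_year(books, n):
    """Map each n-gram to a {year: count} time series.

    books is an iterable of (year, text) pairs.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Toy corpus, invented for illustration only.
toy_books = [
    (1801, "a gleam of happiness lit her face"),
    (1801, "he felt a gleam of happiness"),
    (1803, "not a gleam of happiness remained"),
]

table = count_by_year(toy_books, 4)
series = table[("a", "gleam", "of", "happiness")]
print(sorted(series.items()))  # [(1801, 2), (1803, 1)]
```

Each key of `table` plays the role of one line in the big table: a phrase paired with its year-by-year counts. Years in which the phrase never appears are simply absent from the series and read back as zero.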
Erez Lieberman Aiden: So those 2 billion lines, we call them two billion n-grams. What do they tell us? Well the individual n-grams measure cultural trends. Let me give you an example. Let’s suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, “Yesterday, I throve.” Alternatively, I could say, “Yesterday, I thrived.” Well which one should I use? How to know?
Well, as of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair, and you’d say, “Steve, you’re an expert on the irregular verbs. What should I do?” And he’d tell you, “Well most people say thrived, but some people say throve.” And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, “Tom, what should I say?” He’d say, “Well, in my day, most people throve, but some thrived.”
So now what I’m just going to show you is raw data. Two rows from this table of 2 billion entries. What you’re seeing is year by year frequency of “thrived” and “throve” over time. Now this is just two out of 2 billion rows. So the entire data set is a billion times more awesome than this slide.
Jean-Baptiste Michel: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.
Erez Lieberman Aiden: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.
Jean-Baptiste Michel: You might also want to have a look at this particular n-gram, and that’s to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.
Erez Lieberman Aiden: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the ’30s and ’40s, no one cared. Suddenly, in the mid-’40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big. But nothing got people interested in 1950 like the year 1950. People were walking around obsessed. They couldn’t stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened in ’51, ’52, ’53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé. And just like that, the bubble burst.