Ben Wellington – Data scientist
Six thousand miles of road, 600 miles of subway track, 400 miles of bike lanes, and a half a mile of tram track, if you’ve ever been to Roosevelt Island. These are the numbers that make up the infrastructure of NYC, these are the statistics of our infrastructure. They’re the kind of numbers released in reports by city agencies.
For example, the Department of Transportation will probably tell you how many miles of road they maintain. The MTA will boast how many miles of subway track there are. But most city agencies give us statistics. This is from a report this year from the Taxi & Limousine Commission, where we’ve learned that there are about 13,500 taxis here in NYC. Pretty interesting, right?
But did you ever think about where these numbers came from? Because for these numbers to exist somebody at the city agency has to stop and say hmm, here’s a number that somebody might want to know. Here’s a number that our citizens want to know. So they go back to their raw data, they count, they add, they calculate, and then they put out reports. And those reports will have numbers like this. The problem is, how do they know all of our questions?
We have lots of questions. In fact, in some ways there’s literally an infinite number of questions that we can ask about our city. So the agencies can never keep up. So the paradigm isn’t exactly working and I think our policy makers realize that, because in 2012, Mayor Bloomberg signed into law what he called the most ambitious and comprehensive open data legislation in the country. In a lot of ways he’s right. In the last two years the city’s released 1,000 data sets on our open data portal, and it’s pretty awesome. You look at data like this, and instead of counting the number of cabs, we can start to ask different questions.
So I had a question: When is rush hour in NYC? It can be pretty bothersome. When is rush hour exactly? And I thought to myself, these cabs aren’t just numbers, these are GPS recorders driving around in our city’s streets recording each and every ride they take. There’s data there. And I looked at that data and I made a plot of the average speed of taxis in NYC throughout the day.
You can see that from around midnight to around 5:18 AM, speed increases, and at that point, things turn around. They get slower, slower and slower until about 8:35 AM when they end up at 11.5 mph. The average taxi is going at 11.5 mph in our city streets, and it turns out it stays that way for the entire day. So I said to myself, I guess there’s no rush hour in NYC, there’s just a “rush day.” Makes sense.
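A plot like this one boils down to grouping trips by the hour they started and averaging speed within each hour. The talk does not say how the averaging was done, so the following is only a minimal sketch under stated assumptions: each trip record is a hypothetical tuple of pickup time, distance in miles, and duration in seconds, and the average is total miles divided by total hours per bucket.

```python
from collections import defaultdict
from datetime import datetime


def average_speed_by_hour(trips):
    """Return {hour_of_day: average mph} for an iterable of trips.

    Each trip is a hypothetical (pickup_time, distance_miles, duration_seconds)
    tuple; the field layout is an assumption, not the real taxi data schema.
    """
    miles = defaultdict(float)
    hours = defaultdict(float)
    for pickup_time, distance_miles, duration_seconds in trips:
        # Skip records that cannot yield a meaningful speed.
        if distance_miles <= 0 or duration_seconds <= 0:
            continue
        hour = datetime.fromisoformat(pickup_time).hour
        miles[hour] += distance_miles
        hours[hour] += duration_seconds / 3600.0
    # Total distance over total time, so long trips weigh more than short ones.
    return {hour: miles[hour] / hours[hour] for hour in sorted(miles)}
```

Dividing total miles by total hours is one choice among several; averaging each trip's own speed would weight short hops and long rides equally and could give a somewhat different curve.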
This is important for a couple of reasons. If you are a transportation planner, this might be pretty interesting to know. But if you want to get somewhere quickly you now know to set your alarm for 4:45 AM and you’re all set. New York, right? But there’s a story behind this data; it wasn’t just available, as it turns out. It actually came from something called a Freedom of Information Law Request, or a FOIL Request. This is a form you can find on the Taxi & Limousine Commission website. In order to access this data, you need to go get this form, fill it out, and they will notify you. And a guy named Chris Whong did exactly that.
Chris went down and they told him, “Just bring a brand new hard drive to our office, leave it here for 5 hours, we’ll copy the data and you take it back.” And that’s where this data came from. Now, Chris is the kind of guy that wants to make the data public, so it ended up online for all to use and that’s where this graph came from. And the fact that it exists is amazing. These GPS recorders – really cool!
But the fact that we have citizens walking around with hard drives picking up data from city agencies to make it public – it was already kind of public, you could get to it, but it was “public”, it wasn’t public. And we can do better than that as a city, we don’t need our citizens walking around with hard drives. Now, not every dataset is behind a FOIL request.
Here’s a map I made with the most dangerous intersections in NYC based on cyclist accidents. So the red areas are more dangerous. What it shows is first the East side of Manhattan, especially in the lower area of Manhattan, has more cycle accidents. That might make sense because there are more cyclists coming off the bridges over there. But there are other hotspots worth studying. There’s Williamsburg. There’s Roosevelt Avenue in Queens. This is exactly the type of data we need for Vision Zero. This is exactly what we’re looking for.
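Ranking intersections by cyclist accidents is, at its core, a counting exercise. As a rough sketch only (the record layout here is hypothetical, and the talk does not describe how the map was computed), one could tally injuries per intersection and pick the largest totals:

```python
from collections import Counter


def rank_intersections(crashes, top_n=3):
    """Return the top_n (intersection, total) pairs, highest total first.

    Each crash is a hypothetical (intersection_name, cyclists_involved) pair;
    real crash records would need their names normalized before counting.
    """
    totals = Counter()
    for intersection, cyclists_involved in crashes:
        totals[intersection] += cyclists_involved
    return totals.most_common(top_n)
```

A map would then color each intersection by its total, which is where the red hotspots come from.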
But there’s a story behind this data as well. This data didn’t just appear. How many of you guys know this logo? Yeah, I see some shakes. Have you ever tried to copy and paste data out of a PDF and make sense of it? I see more shakes. More of you tried copying and pasting than knew the logo. I like that. What happened is, the data that you just saw was actually in a PDF. In fact, hundreds and hundreds of pages of PDF put out by our own NYPD, and in order to access it, you either have to copy and paste for hundreds and hundreds of hours, or you could be John Krauss. John Krauss is like, I’m not going to copy and paste this data, I’m going to write a program. It’s called the NYPD Crash Data Band-Aid. And it goes to the NYPD’s website and it would download PDFs. Every day it would search; if it found a PDF, it would download it, and it would run some PDF-scraping program, and out would come the text and it would go on the Internet, and people could make maps like that.
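The last step of a pipeline like that, turning scraped text back into a table, can be sketched as follows. This is not the Band-Aid's actual code: the line layout (an intersection name followed by a fixed number of integer columns), the function names, and the sample row are all made up for illustration, and the text is assumed to have already been extracted from the PDF by some other tool.

```python
import csv
import io


def lines_to_rows(text, n_numeric):
    """Parse extracted PDF text into rows of [name, int, int, ...].

    Assumes (hypothetically) that every data line ends with n_numeric
    integer columns; anything else is treated as a header, footer, or
    page number and skipped.
    """
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) <= n_numeric:
            continue
        tail = tokens[-n_numeric:]
        if not all(token.isdigit() for token in tail):
            continue
        name = " ".join(tokens[:-n_numeric])
        rows.append([name] + [int(token) for token in tail])
    return rows


def rows_to_csv(rows, header):
    """Serialize parsed rows as CSV text, ready to publish."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample of what extracted text might look like.
sample = "Intersection Collisions Cyclists\nROOSEVELT AVENUE and 90 STREET 3 1\nPage 2\n"
print(rows_to_csv(lines_to_rows(sample, 2), ["intersection", "collisions", "cyclists"]))
```

Even this toy version has to guess which lines are data and which are page furniture, which is exactly the kind of work that disappears when an agency publishes the table directly.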
And the fact that the data is here, that we can have access to it – every accident, by the way, is a row on this table. You can imagine how many PDFs that is. The fact that we have access to that is great. But let’s not release it in PDF form. Because then we’re having our citizens write PDF scrapers. It’s not the best use of our citizens’ time, and we, as a city, can do better than that. The good news is that the de Blasio Administration actually released this data a few months ago, so now, we can have access to it.
But there’s a lot of data still entombed in PDF. For example, our crime data is still only available in PDF. And not just our crime data, our own city budget. Our city budget is only readable right now in PDF form. And it’s not just us that can’t analyze it – our own legislators, who vote for the budget, also only get it in PDF. So our legislators cannot analyze the budget that they are voting for. And I think as a city we can do a little better than that as well.
Now, there’s a lot of data that’s not hidden in PDFs. This is an example of a map I made. And these are the dirtiest waterways in NYC. How do I measure dirty? Well, it’s kind of a little weird, but I looked at the level of fecal coliform, which is a measurement of fecal matter in each of our waterways. The larger the circle, the dirtier the water. The large circles are dirty waters, the smaller circles are cleaner. What you see is inland waterways. This is all data that was sampled by the city over the last 5 years. And inland waterways are, in general, dirtier. That makes sense, right?