This week, Chris Whong, a self-professed “urbanist, mapmaker, data junkie” in Brooklyn, put out one of the more thrilling data projects we’ve seen recently. It’s a visualization of a day in the life of a New York City taxi, and each time you load the page, you get a different taxi’s journey. There’s a voyeuristic thrill to watching a taxi driver go from Midtown to SoHo, only to have to turn around and go right back up. While the map is churning, a box in the upper right keeps track of the car’s fares, passenger count and tips (credit card only). It’s a captivating bird’s-eye view of what New York City residents normally only glimpse from the ground.
I was so enamored with the visualization that I called Whong, a 33-year-old data solutions architect at Socrata, an open-data organization, to chat about how it came to be. If you’re intrigued, visit the visualization and then come back for a little making-of Q&A (lightly edited for clarity and brevity).
Chadwick Matlin: This taxi thing is really impressive. Narrate me through your process a bit.
Whong: I was always aware that this data existed because I had seen it presented to me in a class at NYU. And when the TLC [the New York City Taxi and Limousine Commission] showed up on Twitter I responded with, “Is the data available?” knowing that it wasn’t, but I wanted to hear what they said. Their response was, “You can FOIL it” [file a request under New York’s Freedom of Information Law], which I wasn’t actually expecting. Then it just became an adventure of seeing where this would take me, because I had never actually done a FOIL request before. I’ve heard lots of horror stories about getting back giant reams of paper, or printouts of PDFs, and printouts of charts, or even just them saying they don’t have the data or charging money for it. I have to give the TLC credit; it was relatively painless, despite having to make two trips downtown and provide a brand-new hard drive at personal expense. But overall, they were very responsive and I didn’t spend a lot of time waiting.
CM: When you see a data set, how do you know there’s more in there for you to get at?
Whong: If there is a data set that’s got events that happened with a location tied to them and a time tied to them, those are usually going to be perfect for this sort of thing. We call it spatiotemporal data. I’m not entirely sure about where this idea for a day in the life came into my head, but it popped in at some point.
After [the data] was hosted, it kind of became this darling data set of data scientists everywhere. I’ve seen researchers from universities all over the world putting out analyses of this data. But what was interesting was that somebody put it into Google BigQuery, and there was a very interesting post on Reddit about it. The fact that this thing existed on Reddit — the answers were a question away. I just had to say, “Hey, can someone help me with this?” and somebody did.
If that hadn’t happened, it would have been probably beyond my skill set to move a data set this large into a data viz and start building out queries for it. I probably could have figured that out; it’s just not something I do on a daily basis, especially with the kind of query I would’ve needed to pull out a single day’s rides for one cab. It’s not extremely difficult. It’s just outside of my comfort zone.
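For readers curious what "a single day's rides for one cab" looks like as a query, here is a minimal sketch in Python. It is not Whong's BigQuery query, which the interview does not show. The records, the field names (`medallion`, `pickup`, `dropoff`) and the helper `rides_for_cab_on_day` are all hypothetical, chosen only to illustrate filtering spatiotemporal events by vehicle and date.

```python
from datetime import date, datetime

# Hypothetical trip records. Field names and values are illustrative
# assumptions, not taken from the actual taxi data set.
TRIPS = [
    {"medallion": "CAB_A", "pickup": "2013-05-01 14:05:00", "dropoff": "2013-05-01 14:20:00"},
    {"medallion": "CAB_B", "pickup": "2013-05-01 08:10:00", "dropoff": "2013-05-01 08:25:00"},
    {"medallion": "CAB_A", "pickup": "2013-05-01 08:15:00", "dropoff": "2013-05-01 08:32:00"},
    {"medallion": "CAB_A", "pickup": "2013-05-02 09:00:00", "dropoff": "2013-05-02 09:12:00"},
]


def parse(timestamp):
    """Turn a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")


def rides_for_cab_on_day(trips, medallion, day):
    """Return one cab's trips that began on `day`, ordered by pickup time."""
    selected = [
        trip
        for trip in trips
        if trip["medallion"] == medallion and parse(trip["pickup"]).date() == day
    ]
    return sorted(selected, key=lambda trip: parse(trip["pickup"]))


day_rides = rides_for_cab_on_day(TRIPS, "CAB_A", date(2013, 5, 1))
for ride in day_rides:
    print(ride["pickup"], "to", ride["dropoff"])
```

On a data set as large as the one described here, the same filter would more plausibly run inside a database or a hosted query service than in a Python loop; the sketch only shows the shape of the question being asked.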
CM: How long do you think you spent overall with the data to get to the visualization I saw?
Whong: I’ve probably been working on that for a month, and that was all evenings and weekends. I can’t give you a good hour count, but if you were to add it all up, I’d probably say it took four or five 10- to 12-hour working days.
CM: What was your motivation on those nights and weekends?
Whong: I’m a civic hacker. I’m always interested in some new approach to understanding the civic environment, understanding cities via technology. For me, that manifests itself usually in visualization and usually in playing with urban data. So I’m always looking for another juicy data set to find some nuggets of truth or some nuggets of information that are not otherwise available just by looking at the rows and columns.
CM: In the past, hacker types were activists in one way or another. But there isn’t quite an ideological bent to what you’re doing. You’re helping something about the city become more clear. Do you think there’s something to that, or is it just spreading information?
Whong: The broader activist cause is going to be open data and transparency in general, which I fully support. But then if you open up the right data sets in the right context with the right message, the value add to the city can be extremely high. The much more granular activist bent would be, “Can we use this data to actually make decisions and build a case about the efficiencies of our current regulatory system for taxicabs?” We’re in the middle of all this with Uber and Lyft.
CM: What did you learn from doing the project and from watching the end result?
Whong: There are just surprisingly large gaps in the travels of a taxi. You’d assume they’re always out there, but you can’t have all 13,000 cabs on the street all the time. But the gaps come in very interesting parts [of the day], so I don’t really know the answer to why the gaps in service are happening when they do in the middle of the day. Some cabs have an hour and 45 minutes with no fare. Is that a lunch break? Is it going back to the base and switching shifts? Or switching drivers somewhere? Is it just refueling? Or is it just bad luck and not finding a fare?
I’ve always wondered how much these guys make in the day. What is the economy of the medallion? And I know medallions are extremely expensive; I know most of these cabs are owned by companies, and the drivers rent their shifts, but how much can they make in a day? I’ve seen spreads from $400 up to over $700, $800 bucks in one 24-hour period. Obviously that’s spread out among different shifts. [The data set breaks the fares down by driver. Whong’s initial visualization didn’t distinguish between different drivers taking different shifts in the same cab.]
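The two questions Whong raises, how long a cab sits between fares and how much each driver takes in, can both be asked of a single cab's day. The sketch below is one way to do that in Python, under stated assumptions: the rides, the field names (`driver`, `pickup`, `dropoff`, `fare`, `tip`) and the one-hour threshold are invented for illustration and do not come from the data set or from Whong's code.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical rides for one cab on one day, covering two drivers' shifts.
# All values are made up for illustration.
RIDES = [
    {"driver": "D1", "pickup": "2013-05-01 08:00:00", "dropoff": "2013-05-01 08:20:00", "fare": 12.50, "tip": 2.50},
    {"driver": "D1", "pickup": "2013-05-01 08:30:00", "dropoff": "2013-05-01 09:00:00", "fare": 20.00, "tip": 4.00},
    {"driver": "D1", "pickup": "2013-05-01 10:45:00", "dropoff": "2013-05-01 11:00:00", "fare": 9.00, "tip": 0.00},
    {"driver": "D2", "pickup": "2013-05-01 17:30:00", "dropoff": "2013-05-01 17:50:00", "fare": 15.00, "tip": 3.00},
]


def parse(timestamp):
    """Turn a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")


def idle_gaps(rides, threshold):
    """List (dropoff, next pickup, gap) for gaps of at least `threshold`."""
    ordered = sorted(rides, key=lambda ride: parse(ride["pickup"]))
    gaps = []
    for previous, following in zip(ordered, ordered[1:]):
        gap = parse(following["pickup"]) - parse(previous["dropoff"])
        if gap >= threshold:
            gaps.append((previous["dropoff"], following["pickup"], gap))
    return gaps


def earnings_by_driver(rides):
    """Sum fare plus tip for each driver who used the cab."""
    totals = defaultdict(float)
    for ride in rides:
        totals[ride["driver"]] += ride["fare"] + ride["tip"]
    return dict(totals)


long_gaps = idle_gaps(RIDES, timedelta(hours=1))
totals = earnings_by_driver(RIDES)

for start, end, gap in long_gaps:
    print("No fare from", start, "to", end, "(", gap, ")")
print(totals)
```

A gap found this way says only that the meter was off. It cannot tell a lunch break from a shift change or a run of bad luck, which is exactly the ambiguity Whong describes. Grouping by driver rather than by cab is what separates one shift's earnings from another's.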