DJ Patil was recently named the White House’s deputy chief technology officer for data policy and chief data scientist, making him the first-ever national data scientist. DJ even coined the term “data scientist” back in 2008. He was most recently the vice president of product at RelateIQ, and was previously the head of data products and chief scientist at LinkedIn. In a phone call Monday, I spoke with DJ about open data, his transition from the private sector to government, and the Obama administration’s data-focused initiatives and transparency record. Here is a lightly edited transcript of our conversation.
Andrew Flowers: Thanks so much for speaking with FiveThirtyEight. We’re obviously big fans of data and were very excited to see the White House appointed a chief data scientist.
DJ Patil: And I’m excited, too. It’s really been impressive how much you guys have been doing. I’m really amazed at how much content you have been able to create that is super-high quality.
AF: So in your role as chief data scientist of the United States, are you primarily an ambassador from the government to the data-science community (by, for example, promoting the use of open government data)? Or are you more of an import from the data-science community into the government (brought in to help it make better use of its data and run more efficiently, for instance)?
DJP: My career arc runs through academia, government (I've spent time there before) and industry. When this opportunity came up and I really started to dig in, we asked ourselves: What would be the mission of this role? And it really comes down to this: How do we responsibly unleash the power of data for the benefit of the American public and maximize the nation's return on investment in data? Almost everything flows from that.
Three big areas fall out of that. The first is the Precision Medicine Initiative. This is the idea of, how do we take data science, plus all the work that’s happened over the years in bioinformatics, bring that together to provide the next generation of health care, understanding of cancer treatments, chronic disease, etc.?
The second one is the idea of how we open up the data. The president has this executive order that data should be machine-readable. We're like, "Of course. That's how we operate now." It's happening already. The National Weather Service doesn't just produce paper; it produces netCDF files. But how do you open that up, to provide a place where people can easily find that data and build on it?
The third bucket — and I think this is a really unique moment for it, and a really interesting angle for FiveThirtyEight — is the intersection of data science across the federal government. How do you use data to do something really beneficial? To turn it into data products?
And the interesting thing here is the government has an incredibly rich history of people who are appropriately called data scientists. But we call them statisticians. Like the guys who do the census. You think about the work these guys have to do each decade to get that precision — it’s astonishing.
What does data science bring to this that statisticians don’t have? We don’t want to ignore statistics or statisticians. The census is the building block. Then data scientists can start thinking about it and turning it into something that will have value to different types of people. What does it mean to have a data product? You don’t have to see the data; it’s something that facilitates an end goal through the use of data.
AF: What challenges and restrictions do you expect in the world of government that you may not have faced in the private sector?
DJP: It’s almost the other way around, in many cases. Because inside the government, there are so many opportunities for the data science community to contribute, to build more things.
So actually, it’s been the flip of the problem: How do we get the industry to start building more on this? You’ve got weather data, census data, health data, climate data. How do you start taking that information to build new things?
AF: We at FiveThirtyEight are big fans of open data — we just built an interactive data visualization of airline delays and flight times, which used data from the Bureau of Transportation Statistics. You’ve touted data.gov and the more than 100,000 data sets it makes available, but which data sets should the government be publishing but isn’t right now? And what are the underused data sets that most people don’t know about, or aren’t accessible enough?
DJP: This is exactly one of the things that we're working hard to figure out. The first step is opening up the data. Now, opening it up is mostly a question of being thoughtful about not exposing anyone to risk. That is a very critical component of this. Because the key component of that mission statement is to responsibly unleash the power of data. We want to do that in a smart and well-thought-through way.
What data should be open is equally driven by the ecosystem's evolving sophistication. As you process data, you probably don't want the raw, raw data. You want the clean data. If the data isn't treated as a product, shaped by how the ecosystem wants to use it, it's really not usable.
A perfect example of this: if we took all the satellite imagery and just gave you the exact waveform in the IR spectrum, that's not useful. You want the IR spectrum translated into an image. And you want those images to be tiles that you can combine, so you can overlay them on top of a weather pattern in some graphical viewer like Google Earth.
If I gave you a raw binary, you’d be like, “Great, what am I supposed to do with this?”
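Patil's raw-waveform-versus-tiles distinction can be sketched concretely. The snippet below is a hypothetical illustration, not any agency's actual pipeline: it takes a made-up 2-D grid of raw IR readings, normalizes it into 8-bit pixel values, and slices it into square tiles of the kind a viewer like Google Earth could overlay. (Real satellite products also involve calibration, map projection and geo-referencing, all omitted here.)

```python
import numpy as np

def ir_to_tiles(ir_grid, tile_size=256):
    """Normalize raw IR readings to 8-bit pixels and cut them into square tiles.

    Illustrative sketch only: calibration, projection and geo-referencing
    that a real imagery pipeline would need are deliberately left out.
    """
    # Scale the raw floating-point readings into the 0-255 pixel range.
    lo, hi = ir_grid.min(), ir_grid.max()
    pixels = ((ir_grid - lo) / (hi - lo) * 255).astype(np.uint8)

    # Pad so the grid divides evenly into tile_size x tile_size blocks.
    pad_rows = (-pixels.shape[0]) % tile_size
    pad_cols = (-pixels.shape[1]) % tile_size
    pixels = np.pad(pixels, ((0, pad_rows), (0, pad_cols)))

    # Slice the padded grid into square tiles, row by row.
    return [
        pixels[r:r + tile_size, c:c + tile_size]
        for r in range(0, pixels.shape[0], tile_size)
        for c in range(0, pixels.shape[1], tile_size)
    ]

# A fake 300x500 "raw IR" grid stands in for the binary a satellite produces.
raw = np.random.default_rng(0).uniform(200.0, 320.0, size=(300, 500))
tiles = ir_to_tiles(raw, tile_size=256)
print(len(tiles), tiles[0].shape)  # 4 tiles, each 256x256
```

Handing someone `raw` is the "what am I supposed to do with this?" case; handing them `tiles` is the data product.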
AF: To even have open data, you need government agencies first producing it. And we’ve seen a number of efforts from Republicans in Congress to cut back funding for statistical agencies, and even to eliminate funding for the Census Bureau’s American Community Survey altogether. Are these programs important, and are you worried about the threat to their funding?
DJP: Data, when it’s open, provides transparency. It gives us an ability to view into the government, see how it operates. And it gives us a form of checks and balances, to make sure we’re doing it properly. It’s the American public’s data. They have a right to it.
By producing this data, we retain our competitiveness. My favorite example of this is the National Weather Service. Go look at any research paper in the world — by and large, almost everyone does their analysis on the U.S. National Weather Model. In my case, as a graduate student, I would take over the computers in the math department when people weren’t looking and process that data to understand that the weather is not as chaotic as we perceived. And that can translate into benefits for weather forecasting.
Now who benefits from that? We get to write a bunch of papers, and we're funded by federal funds. But the real beneficiary is the country. Those improvements go back to the National Weather Service. They try it, they play with it, they iterate on it and make it better. Open data is a force multiplier that makes the systems we all depend on better.
AF: You mentioned how data, when it's open, provides transparency, and I think that's right. Data science relies on transparency. But some question this administration's commitment to transparency, given its aggressive investigations into leaks, heavy use of classification, etc. How satisfied were you with the administration's commitment to transparency before taking this job, and will you advocate for more openness?
DJP: The thing that got me excited was the track record. First, the president was the first person to think critically about what a dashboard is, and to have a dashboard for IT spending. Second, this idea of creating www.data.gov and providing things in a centralized way. Third, amplifying that by having a commitment to open data through the executive order, having hackathons, data jams and datapaloozas around the health data. Then along came the idea of the Precision Medicine Initiative — that the foundation of the next transformation in health care is data science plus bioinformatics.
It’s been transformational as a data scientist to see what this has done for us.
AF: You’ve mentioned establishing a data science culture within government departments. What does the next generation of data scientists working in government need to know? What specific statistical knowledge and/or software tools are you looking for in recruits, through programs like the U.S. Digital Service?
DJP: The No. 1 thing is you’ve got to have passion. This rich passion for going ruthlessly after the problem and being deeply intellectually honest with yourself about whether this is a reasonable answer. And you guys have highlighted that extremely well in a bunch of your analysis. You guys know what I mean when I say that.
The second part is having the ability to be extremely clever with the data. And what I mean by that is: You’re working with ambiguity. And very often you can’t approach the problem with the rigor you would a homework assignment. The only way to survive through that is by being clever — to think of a different question that gets at the answer.
As for tooling, I think we get extremely caught up in the tool types. There is extraordinary power from Excel all the way through [the programming language] R. I’m less dogmatic about the particular tool. What I am dogmatic about is: Does the tool allow you to effectively create a narrative? Does that tool allow you to really make sure you’re asking hard questions?