This weekend, more than 150 Southern California college students are staying up late with a big data set.
DataFest, in its fourth year, brings the spirit of hackathons to statistics. Each year, organizers present an original data set to student teams with names such as Bayesball Players and the University of Statistical Champions (from USC). The students have fewer than 48 hours to make sense of the data, often running statistical analyses or creating visualizations.
Trying to get some quick insight from a data set is what we often do here at DataLab. And DataFests face some of the same challenges we do. For instance, lots of really interesting data sets aren’t public, and their owners are reluctant to disclose them for competitive or privacy reasons. DataFest strikes deals with providers that allow students to publish their analyses but not the underlying data; the providers get to benefit from the students’ insights. This year’s provider, the energy-management company GridPoint, has modified its building-energy-consumption data to make it impossible to identify the buildings. Prior data providers include the Los Angeles Police Department and the online-dating site eHarmony (which also has provided data to FiveThirtyEight).
“It’s really rare to get really current data actually being used in the real-live corporate world, so that makes it really special,” Robert Gould, DataFest’s founder and a professor of statistics at UCLA, the host for this weekend’s event, said in a phone interview. “Somehow it’s just not that thrilling for the students to learn all we’ve done is point them to a public data set. There’s something really special to have someone who owns the data present the challenge. It makes the students feel they’re being paid attention to and listened to.”
The students also get the benefit of advice from local data professionals who visit during the marathon sessions. This year they include academics and people from the energy, health and tech sectors, among them representatives from IBM, eHarmony and Google. Some consultants are evaluating students even as they’re helping them, Gould said. “Some of them end up hiring some of the students. They get a good insight as to how mature they are.”
The first DataFest, in 2011, drew 25 students at UCLA. The next year Duke University hosted its own event, and this year, three more schools hosted events. If everyone expected this weekend shows up, more than 400 students will have tried their hands at the GridPoint data.
At Duke, students drew on outside data sources and conversations with company representatives, and applied data-analysis tools such as hierarchical regression and nonparametric data smoothing. (I’m omitting the details at Gould’s request so the UCLA participants start from scratch. No cheating!) Students can use whatever software they prefer, though just about all winners have used R, the open-source statistical programming language, Gould said. Contestants come from many departments, including computer science and even the humanities. “The teams that have done best usually have a mixture of talents,” he said.
Gould’s frustration with the quality of students’ work on class projects inspired him to create the event in 2011. “I was a little frustrated with the way our students, even the bright ones, didn’t rise to the occasion for final projects or special projects,” he said.
At DataFest, students can present up to two slides in no more than five minutes. A panel of faculty and local data pros then judges them for awards such as best use of outside data and best visualization.
Although the data sets would fit by some standards in the buzzy rubric of “big data” — several million rows — students are expected to use whichever techniques from data science or traditional statistics are most appropriate. “The idea is not to fit a statistical model, but to do best by the data,” Gould said. “We’re not specifying any estimation has to happen. Tell us something that surprises us, that we wouldn’t have known.”
There are limits, though, to how much college students — or anyone, for that matter — can accomplish with a novel data set in two days. Gould said some students created an informative crime map with the LAPD data. But asked whether any of the analysis led to changes to police policy, Gould replied: “I would hope not. Mistakes will happen over 48 hours, and these are undergraduates.”