Rich Data, Poor Data

This story appears in ESPN The Magazine’s March 2 Analytics Issue.

In the 2000 edition of Baseball Prospectus, Keith Woolner identified 23 problems — avenues of analysis that had been dead ends for turn-of-the-millennium statheads. (For instance, No. 10: “Projecting minor league pitchers accurately.”) Woolner named these Hilbert Problems, after mathematician David Hilbert, who in 1900 outlined his own set of 23 vexing mathematical problems that he hoped would be solved in the 20th century.

Of Hilbert’s 23 math problems, just 10 have been answered — not a great track record for more than a century’s worth of work. While Woolner’s baseball problems don’t lend themselves to mathematics’ hard-and-fast proofs, we have become a lot better at, say, “measuring the catcher’s role in run prevention” (No. 3). There’s still a margin of error in calculating how valuable Yadier Molina is to the Cardinals; nevertheless, the progress in baseball is remarkable.

Analysts have made huge strides in “separating defense into pitching and fielding” (problem No. 1): The discovery that pitchers have relatively little control over balls in play has increased the value put on fielding and pitchers’ strikeout ability. And research into “determining optimal pitcher usage strategies” (No. 20) has led teams to transform struggling starters into top-shelf middle relievers with ERAs that would make Bob Gibson blush. Indeed, the shift toward pitching and defense reflects the rise of sabermetrics as much as the decline of juiced balls or juiced players.

And all of this has taken just 15 years, not the century-plus since William McKinley was president. Sure, teams could still glean more about “assessing the ‘coachability’ of players” (No. 13) or “quantifying the manager’s impact on winning” (No. 22). But baseball analysts can’t complain, unlike their counterparts in other fields.

As I describe in my book “The Signal and the Noise: Why So Many Predictions Fail but Some Don’t,” the rapid and tangible progress in sports analytics is more the exception than the rule. It’s important to remind sports nerds — who, as they look at streams of PER or wRC+ numbers, have become a bit spoiled — of this fair and maybe even obvious point. Because out there in the wider world, questions far more basic than Woolner’s remain unresolved. We still have tremendous trouble predicting how the economy will perform more than a few months in advance, or understanding why a catastrophic earthquake occurs at a particular place and time, or knowing whether a flu outbreak will turn into a bad one.

It’s not for any lack of interest in data and analytics. For a while, I gave a lot of talks to promote my book and met a lot of people I might not encounter otherwise: from Hollywood producers and CEOs of major companies to the dude from India who hoped to be the Billy Beane of cricket.

But there’s a perfect storm of circumstances in sports that makes rapid analytical progress possible decades before other fields have their Moneyball moments. Here are three reasons sports nerds have it easy:

1. Sports has awesome data.

Give me a sec. Really, I’ll only need a second. I just went to Baseball-Reference.com and looked up how many at-bats have been taken in major league history. It’s 14,260,129.

The volume is impressive. But what’s more impressive is that I can go to RetroSheet.org and, for many of those 14 million at-bats, look up the hitter, the pitcher, who was on base, how many people attended the game and whether the second baseman wore boxers or briefs. It’s not just “big data.” It’s something much better: rich data.

By rich data, I mean data that’s accurate, precise and subjected to rigorous quality control. A few years ago, a debate raged about how many RBIs Cubs slugger Hack Wilson had in 1930. Researchers went to the microfiche, looked up box scores and found that it was 191, not 190. Absolutely nothing changed about our understanding of baseball, but it shows the level of scrutiny to which stats are subjected.

Compare that to something like evaluating the American economy. The problems aren’t in the third decimal place: We sometimes don’t even know whether the sign is positive or negative. When the recession hit in December 2007 — the worst economic collapse since the Great Depression — most economists didn’t believe we were in one at all. The recession wasn’t officially identified until December 2008. Imagine what this would be like in sports! We’re not sure how many points Damian Lillard scored last night, but we’re reasonably confident it was between 27 and negative 2. Check back in a few months.

As if statheads weren’t spoiled enough, we’re getting more data all the time. From PITCHf/x to SportVU, we have nearly a three-dimensional record of every object on the field in real time. Questions once directed at scouts — Does Carmelo really get back on defense? What’s the break on Kershaw’s curve? — are now measurable.

2. In sports, we know the rules.

And they don’t change much. As I noted, there has been little progress in predicting earthquakes. We know a few basic things — you’re more likely to experience an earthquake in California than in New Jersey — but not a lot more.

What’s the problem? “We’re looking at rock,” one seismologist lamented to me for my book. Unlike a thunderstorm, an earthquake can’t be seen coming, nor can we directly observe what triggers it. Scientists have identified lots of correlations in earthquake data, but they have relatively little understanding of what causes one at any particular time. If there are a billion possible relationships in geology’s historical data, you’ll come up with a thousand million-to-one coincidences on the basis of chance alone. In seismology, for instance, there have been failed predictions about earthquake behavior in locations from Peru to Sumatra, all based on patterns that looked foolproof in the historical data but were random after all.
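For readers who want to check the arithmetic behind that thousand, here is a minimal Python sketch. It assumes, purely for illustration, that the billion candidate relationships are independent and that each has the same one-in-a-million chance of looking meaningful by luck; the function names are hypothetical and not drawn from any seismology toolkit.

```python
def expected_chance_coincidences(n_relationships: int, one_in: int) -> float:
    """Expected number of patterns that appear by luck alone.

    Assumes every candidate relationship independently has a
    1-in-`one_in` chance of looking significant (an illustrative
    simplification, not a model of real geological data).
    """
    return n_relationships / one_in


def prob_at_least_one_coincidence(n_relationships: int, one_in: int) -> float:
    """Probability that at least one purely random pattern turns up."""
    p_single = 1 / one_in
    return 1.0 - (1.0 - p_single) ** n_relationships


billion = 1_000_000_000
million = 1_000_000

# A billion candidates, each a million-to-one shot.
print(expected_chance_coincidences(billion, million))  # → 1000.0

# With that many candidates, a fluke is all but guaranteed.
print(prob_at_least_one_coincidence(billion, million))
```

Under those assumptions, a researcher who sifts through everything should expect roughly a thousand "foolproof" patterns that mean nothing, which is why a striking correlation without a causal story deserves suspicion.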

False positives are less of an issue in sports, where rules are explicit and where we know a lot about causality. Take how we evaluate pitcher performance. It turns out that if you want to forecast a pitcher’s future win-loss record, just about the last thing to look at is his previous record. Instead, focus on his ERA, or better yet his strikeout-to-walk ratio, or maybe even the PITCHf/x data on pitch velocity and location.

Why? Winning is the name of the game, and you win by allowing fewer runs than your opponent. So ERA says more about winning than a pitcher’s record. But you can do even better: Runs are prevented by striking out batters (and not walking them), and strikeouts are generated by throwing good pitches, which is why WHIP and strikeouts per nine innings also serve predictive purposes. Understanding the structure of the system gives statistical analysis a much higher batting average.

3. Sports offers fast feedback and clear marks of success.

One hallmark of analytically progressive fields is the daily collection of new data that allows researchers to rapidly test ideas and chuck the silly ones. One example: dramatically improved weather forecasts. The accuracy of hurricane landfall predictions, for instance, has almost tripled over the past 30 years.

Sports, especially baseball, fits in this category too. In Billy Beane’s first few years running the A’s, the team had awful defenses — bad enough that Matt Stairs briefly played center. Beane theorized that because defense was so hard to quantify, he shouldn’t focus on it. His assumption turned out to be completely wrong. As statheads came to learn about defense, it proved to be more important than everyone thought, not less. Because the A’s were playing every day and Beane could study the defensive metrics like dWAR that emerged, he learned quickly and adjusted his approach. His more recent teams have had much-improved defenses.

Contrast this with something like presidential elections, in which lessons come once every four years, if at all. Mitt Romney’s belief that the 2012 election was his for the taking (it wasn’t, according to both public polls and political science research) may have led him to underinvest in his get-out-the-vote operations. He underestimated Barack Obama’s popularity and overestimated his own ability to sway voters with his message. Republicans will have to wait until 2016 to improve their approach.

It also helps that sports has a clear objective: winning. Obvious? Sure. But that’s not the case in other subjects. What counts as “winning” for the U.S. economy, for instance? Is it low inflation or high growth? If it’s growth, does it matter how the income is distributed? You have opinions about that, and I do too, and we might not agree even given all the data in the world.

But the zero-sum nature of sports competition (there are a finite number of wins and championships to go around) also yields the greatest risk to continued innovation. When I was working for Baseball Prospectus a decade ago, most of the innovation was occurring among outsiders like us. It was competitive, but the point of getting a data “scoop” was to publish it for the rest of the world to see.

Now almost all MLB teams employ a statistical analyst, if not a small gaggle of them. But those analysts are working on behalf of just one team — and have less incentive to share. At the MIT Sloan Sports Analytics Conference every year, the panels featuring current employees of major league teams are deathly dull because if the panelists said anything useful to a roomful of their competitors, they would be fired. Sports analytics runs the risk of losing the momentum of the past 15 years.

Woolner, for his part, is now the director of baseball analytics for the Indians. No doubt he has 23 new problems to solve. But now it will take the rest of us longer to know when he has cracked them.

FiveThirtyEight
