It’s time for some probability theory. Imagine you’d passed out on Lime-A-Ritas before kickoff on Sunday night and woken up in a cold sweat at 3 a.m. to read the headline, “New England Patriots win 34-28.” That would have been about the least surprising Super Bowl result imaginable. The Patriots were favored before the game according to Vegas betting lines and FiveThirtyEight’s Elo projections, and it was expected to be high-scoring.
But, oh, what drama you would have missed! The Patriots’ overtime comeback from a 28-3 deficit made for perhaps the most exciting Super Bowl ever, as well as one of the largest comebacks in NFL history, Super Bowl or otherwise. You have to watch a lot of football games to see something like that happen.
If your social media feeds are like mine, they contained a blurry mix of politics and sports on Sunday night. So as the Atlanta Falcons collapsed, there were a lot of people (myself included!) comparing them to Hillary Clinton and the victorious New England Patriots to Donald Trump. But Super Bowl Sunday and election night weren’t really very much alike. On one of them, something highly unlikely occurred — and the other was pretty much par for the course.
The truly unlikely event was the Patriots’ epic comeback. According to ESPN’s win probability model, their chances bottomed out at 0.2 percent in the third quarter, when they trailed 28-3. By contrast, Trump’s chances were 29 percent on election morning, according to FiveThirtyEight’s polls-only model. Those are really different forecasts; if you trust the models, Trump’s Election Day victory was more than 100 times likelier than Tom Brady’s comeback.
Of course, there are a few things to debate here. For instance, Trump’s overall rise to the presidency — from the moment he descended the escalator at Trump Tower onward — is still astonishing, even if he was only a modest underdog on Election Day itself.
And on the football side, one can argue about exactly how unlikely the Pats’ comeback really was. As I mentioned, according to ESPN’s win probability model, the Falcons’ chances topped out at 99.8 percent. Their chances were slightly higher still according to the Pro-Football-Reference.com win probability model, reaching 99.9 percent at several points in the third and fourth quarters. Conversely, the Falcons’ odds peaked at about 96 percent according to Las Vegas bookmakers.
While 96 percent might not sound that different from 99.9 percent, consider the numbers in terms of odds: It’s the difference between a 1-in-25 chance of a Patriots comeback and a 1-in-1,000 chance. (I’d recommend this “trick” as a sanity check whenever you encounter a probabilistic forecast — restate the probability in terms of odds. That usually makes it much clearer how far out you’ve ventured onto the edge of the probability distribution.) In the context of a football game, maybe that doesn’t matter very much. But it’s crucially important to distinguish a 1-in-25 chance from a 1-in-1,000 chance in life-and-death matters, such as when considering a risky medical procedure or assessing the likelihood of an earthquake in a given region.
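The restatement suggested above is simple arithmetic. Here is a minimal sketch in Python; the helper name `one_in_n` is made up for illustration, and the three inputs are the Falcons’ peak win probabilities quoted earlier.

```python
def one_in_n(probability: float) -> float:
    """Return N such that `probability` is the same as a 1-in-N chance."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return 1 / probability

# Falcons' peak win probabilities as quoted in the text, by source.
falcons_peak = {
    "Las Vegas bookmakers": 0.96,
    "ESPN": 0.998,
    "Pro-Football-Reference.com": 0.999,
}

for source, p_falcons in falcons_peak.items():
    p_patriots = 1 - p_falcons  # chance of a Patriots comeback
    print(f"{source}: Patriots about 1 in {one_in_n(p_patriots):,.0f}")
```

Seen this way, the sources that look nearly identical as percentages differ by a factor of 20 to 40 in how often they expect the comeback to happen.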
For various reasons, ranging from p-hacking, to survivorship bias, to using normal distributions in cases where they aren’t appropriate, to treating events as being independent from one another when they aren’t — that latter one was a big problem with election models that underestimated Trump’s chances — people designing statistical models tend to underestimate tail probabilities (when probabilities are close to zero, but not exactly zero). Perhaps not coincidentally, people also tend to underestimate these probabilities when they don’t use statistical models. It’s not always the case, but it’s often true that when a supposedly 2 percent or 0.2 percent or 0.02 percent event occurs, the “real” probability was higher — perhaps even an order of magnitude higher. Maybe an ostensibly 1-in-500 event was really a 1-in-50 event, for instance.
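To see why the independence assumption matters so much, here is a toy simulation in Python. Everything in it is invented for illustration: an underdog trails by 3 points in each of five states and needs all of them, and polling errors are drawn from a normal distribution with a total standard deviation of 4 points. It is not any real election model. The only thing that changes between the two runs is whether part of the error is shared across states.

```python
import random

def underdog_sweep_probability(shared_sd, state_sd, deficit=3.0,
                               n_states=5, n_trials=50_000, seed=0):
    """Estimate how often an underdog trailing by `deficit` points in every
    state wins all of them, given normally distributed polling errors."""
    rng = random.Random(seed)
    sweeps = 0
    for _ in range(n_trials):
        shared_error = rng.gauss(0.0, shared_sd)  # moves every state together
        if all(shared_error + rng.gauss(0.0, state_sd) > deficit
               for _ in range(n_states)):
            sweeps += 1
    return sweeps / n_trials

# Same total error per state (sd of 4 points) in both cases.
independent = underdog_sweep_probability(shared_sd=0.0, state_sd=4.0)
correlated = underdog_sweep_probability(shared_sd=3.0, state_sd=7 ** 0.5)

print(f"errors treated as independent: {independent:.2%}")
print(f"errors partly shared:          {correlated:.2%}")
```

Under these made-up settings, the sweep is far more likely once the errors are allowed to move together, even though each individual state is exactly as uncertain as before. A model that ignores the shared component will put the upset much further out on the tail than it belongs.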
Sports models are sometimes exempt from these problems because sports are data-rich but structurally simple, which makes modeling a lot easier. Still, it’s noteworthy that Super Bowl bettors were picking up on something that the models weren’t considering. Should the models have done more to account for the quality of the trailing team’s quarterback (i.e., some sort of Tom Brady clutch factor)? Should Super Bowls be treated differently from other sorts of football games? (I make a case here that Super Bowls might be different from regular-season games.) These models have also been somewhat overconfident historically, according to Josh Levin at Slate. I don’t know enough about the nuts and bolts of NFL win probability models to render a firm judgment either way, but these would be things to think about if your life depended on whether the Patriots’ chances were really 0.1 percent or 4 percent.
Nonetheless, even if the models didn’t get things quite right, it was almost certainly correct to say that history gave the Patriots long odds. We can be confident about this because of the basic gut-check that I described before: There have been lots of NFL games, and comebacks like the one the Patriots made have been very rare. For example, according to the Pro-Football-Reference.com Play Index, there have been 364 games since 1999 (not counting Super Bowl LI) in which a team trailed by between 24 and 27 points at some point in the third quarter. (The Patriots trailed by 25.) Of those, the trailing team came back to win in only three games, or 0.8 percent of the time. There are various aggravating and mitigating factors you’d want to consider for Super Bowl LI — like that the Patriots had Tom Freakin’ Brady — so I’d guess that those models had New England’s chances too low. Even so, the Pats’ comeback required near-perfect execution, as well as some uncanny luck.
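The historical gut-check above can be made a bit more explicit by putting a range around the observed rate. The sketch below uses a Wilson score interval, which is my choice for illustration and not something the Play Index or the win probability models report. It also treats the 364 games as independent and comparable, which is exactly the kind of assumption worth questioning.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95 percent)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials
                               + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

# Counts quoted in the text: 3 comeback wins in 364 qualifying games.
comebacks, games = 3, 364
low, high = wilson_interval(comebacks, games)

print(f"observed rate:   {comebacks / games:.1%}")
print(f"plausible range: {low:.1%} to {high:.1%}")
```

With only three comebacks in the sample, the range is wide, running from roughly 0.3 percent to 2.4 percent. On this crude reading, history alone cannot settle the argument between the models and the bookmakers, but it does confirm that the Patriots were a very long shot.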
But for the election? That empirical gut-check you can often use for sports — that we’ve been in this situation hundreds or thousands of times before and that we know from history that it almost always turns out a certain way — doesn’t really work. For one thing, there’s not much data. Presidential elections are rare things; FiveThirtyEight’s model was based on only 11 previous elections (since 1972), and others were based on as few as five elections (since 1996). Furthermore, despite the small sample size, there were a fair number of precedents for a modest polling error that would nonetheless be large enough to put Trump into the White House. (In the end, Trump beat his polls by only 2 to 3 points in the average swing state, but that was enough to win the Electoral College.) That’s why we thought it was sort of nuts to give Trump less than a 1 percent chance or a 2 percent chance on election morning, where some other models had him. Those were less empirically driven probabilities than adventurous extrapolations drawn from thin data.
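A deliberately crude way to see how little a handful of elections can tell you: suppose a given kind of upset had never once occurred in n past contests, and treat those contests as independent trials. The largest per-contest probability still consistent with that clean record, at 95 percent confidence, solves (1 - p)^n = 0.05. This is an illustration of the sample-size problem only. Real election models use polls, not just win-loss counts, and the function name is made up.

```python
def upper_bound_with_no_events(n_trials, confidence=0.95):
    """Largest per-trial probability still consistent, at the given confidence,
    with observing zero events in n_trials independent trials."""
    return 1 - (1 - confidence) ** (1 / n_trials)

# 5 and 11 are the election counts mentioned in the text; 364 is the
# number of football games from the earlier comparison.
for n in (5, 11, 364):
    bound = upper_bound_with_no_events(n)
    print(f"{n:>3} trials, zero events: true rate could be up to {bound:.1%}")
```

Even in this generous case with no precedents at all, 11 elections cannot rule out a true rate of about 24 percent, and five cannot rule out about 45 percent. Getting from there to a claim of 1 or 2 percent has to come from assumptions, not from the historical record.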
All right, I don’t want to relitigate the election model wars. Besides, there are some more interesting questions here, such as whether supposedly low-probability events are happening more often than they “should” happen. I don’t mean to suggest that there’s been some sort of a glitch in the matrix. (Although if a 16-seed wins the NCAA tournament, I’ll be ready to conclude that the people who designed the Planet Earth simulation are just toying with us.) But it’s worth asking whether statistical models are systematically underestimating tail risks or if this is just a matter of perception.
My answer is probably some of both. On the one hand, people — and I don’t just mean Twitter eggs or your 6-year-old nephew, but lots of well-credentialed people who ought to know better — don’t really have a strong grasp of probability. An event with a 20 percent chance is supposed to occur 20 percent of the time, for instance. That’s not anywhere close to being a tail risk; that’s a routine occurrence (20 percent of weekdays are Mondays).
A 20 percent probability is really different from a 2 percent probability, however, and really, really different from a 0.2 percent probability. We’re not talking about some fine point of distinction; lumping them together is like equating a carrot to a cheetah because they’re both organisms that begin with the letter “c.”
The further a model gets out on the tail of the probability distribution, the more evidence you ought to demand from it — and the more you ought to assume it was wrong rather than “unlucky” when a supposedly unlikely event occurs. To say something has only a 0.2 percent chance of occurring is a really bold claim — in some sense, 100 times bolder than saying it has a 20 percent chance. To be credible, claims like these require either a long track record of forecasting success (you’ve made hundreds of similar forecasts and they’ve been well-calibrated) or a lot of support from theory and evidence. In the hard sciences — and sometimes in sports — you often have enough data (and strong enough theoretical priors) to defend such claims. But it’s rarer to encounter those circumstances in electoral politics, economic forecasting and many other applications involving social behavior.
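Checking a track record for calibration is mechanical once the forecasts and outcomes are recorded. The sketch below is one simple way to do it, under stated assumptions: the forecaster is hypothetical, the outcomes are simulated so that they are well-calibrated by construction, and `calibration_table` is an illustrative name.

```python
import random
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts by stated probability and return
    {stated_probability: (number_of_forecasts, observed_frequency)}."""
    buckets = defaultdict(list)
    for prob, happened in zip(forecasts, outcomes):
        buckets[prob].append(happened)
    return {prob: (len(hits), sum(hits) / len(hits))
            for prob, hits in sorted(buckets.items())}

# Synthetic track record: 500 forecasts at each stated probability,
# with outcomes drawn at exactly that probability.
rng = random.Random(0)
stated_probabilities = [0.002, 0.02, 0.2]
forecasts = [p for p in stated_probabilities for _ in range(500)]
outcomes = [rng.random() < p for p in forecasts]

table = calibration_table(forecasts, outcomes)
for prob, (count, rate) in table.items():
    print(f"stated {prob:.1%}: {count} forecasts, happened {rate:.1%} of the time")
```

Note how demanding the tail is: 500 forecasts at 0.2 percent are expected to produce just one hit, so a few events either way swamp the signal. That is why a claim that far out needs either a very long record or strong support from theory.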
So if it seems like a lot of crazy, low-likelihood things are happening, some of it is selective memory — focusing on the highly memorable times the long-shot probability came in and ignoring all the boring ones when the favorite prevailed. And some of it is people plugging things into the narrative that don’t really fit. Some of those crazy happenings, however, were never quite as unlikely as billed.