How FiveThirtyEight’s NCAA Tournament Forecasts Did

We love making predictions at FiveThirtyEight, but we know that they’d be meaningless if we didn’t check to see how accurate they turned out to be. Over the past month, we forecast every NCAA college basketball game for the women’s and men’s tournaments, updating our numbers almost a hundred times as 132 teams were pared down to two champions.

So let’s relive the NCAA tournament not through “One Shining Moment” but through a vetting of FiveThirtyEight’s forecasts.

MEN’S

One way to assess our accuracy is to pit ourselves against Vegas, the gold standard of sports predictions. We took our final probabilities from before each March Madness game — converted to a point spread¹ — and used them to place hypothetical bets on the final Vegas line. For example: Vegas had Texas at -2.5 over Butler, but we gave the Longhorns just a 54 percent chance of winning. This implied a line of -1, so we bet on Butler to cover. (And we won!)
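For readers who want to follow the arithmetic, here is a minimal Python sketch of that conversion and the betting rule. It mirrors the qnorm calculation described in the first footnote (normal distribution, mean 0, standard deviation 10.36, rounded to the nearest half point). The function names `implied_spread` and `pick_side` are hypothetical, chosen for illustration, and the sketch assumes spreads are quoted from one team's perspective, with a negative number meaning that team is favored.

```python
from statistics import NormalDist

# Standard deviation quoted in the article's first footnote.
SPREAD_SD = 10.36


def implied_spread(win_prob, sd=SPREAD_SD):
    """Convert a team's win probability into a point spread for that team,
    rounded to the nearest half point (negative means the team is favored)."""
    raw = -NormalDist(mu=0.0, sigma=sd).inv_cdf(win_prob)
    return round(raw * 2) / 2


def pick_side(model_spread, vegas_spread):
    """Decide a hypothetical bet for the team both spreads are quoted for.

    Returns (side, edge), where edge is the gap in points between the lines.
    Per the article, no bet is placed when the two lines match exactly.
    """
    edge = abs(model_spread - vegas_spread)
    if edge == 0:
        return ("no bet", 0.0)
    # If the model likes the team less than Vegas does, take the opponent.
    side = "opponent" if model_spread > vegas_spread else "team"
    return (side, edge)


# Texas vs. Butler: a 54 percent chance for Texas, Vegas line of Texas -2.5.
texas_line = implied_spread(0.54)
print(texas_line, pick_side(texas_line, -2.5))
```

With the Texas numbers, the implied line comes out at -1, a point and a half shy of Vegas, so the sketch sides with the opponent (Butler) to cover.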

Here’s how our hypothetical bets did through the round of 64. Note that if our implied line matched Vegas’s exactly, we wouldn’t place a hypothetical bet on that game.

A 17-13 record is strong, enough to maybe make some money. More importantly, our model performed well when it differed significantly from Vegas. In seven games, marked in gray above, our model had a perceived edge of three points or more, a sizable gap between the two forecasts. In these games, our bets went 5-2.

After the round of 64, things didn’t go so great:

Through these last 31 games, our spreads tracked Vegas very closely. We disagreed on the favorite just once (correctly picking Notre Dame to beat Wichita State), and there were no games where our perceived edge was three points or greater. In 21 games, our edge was half a point to a point, a relatively small advantage to place actual money on. Luckily we didn’t, as these hypothetical bets went 6-15. When our perceived edge was greater than a point, our bets went 3-3. That gave us an overall March Madness record of 26-31, with 10 no bets.

Betting against Vegas is great, but we’re aware that many of our readers stopped looking at our forecasts the moment they turned in their bracket. This means that the performance of our first, pre-tourney predictions was especially important.

If you built a men’s bracket using only FiveThirtyEight’s initial numbers, you would have gotten 70 percent of the games correct. It’s hard to tell if we should be happy with that number — it’s certainly not going to impress sammyholtz16 or anyone else on top of the ESPN Bracket Challenge leaderboard. A better way to assess probabilistic forecasts is a calculation known as a Brier score, which we can use to compare ourselves to other models like Ken Pomeroy’s, Jeff Sagarin’s,² The Power Rank, ESPN BPI, and numberFire.

A short, but still mathy, example of how Brier scores work: Our initial predictions gave No. 6 seed Xavier a 26.5 percent chance of advancing to the Sweet 16, while Ken Pomeroy’s log5 model gave them a 24.8 percent chance.³ Xavier did advance, so our Brier score for that event was (1 – 0.265)^2, or 0.54, while Pomeroy scored a 0.57. Being closer to zero is better, so we were (slightly) more accurate for that prediction.
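The Xavier example can be checked with a few lines of Python. This is a minimal sketch of the single-event calculation only; the function name `brier` is hypothetical, and the outcome is coded as 1 if the event happened and 0 if it did not.

```python
def brier(prob, outcome):
    """Squared error between a forecast probability and what happened.

    outcome is 1 if the event occurred and 0 if it did not; lower is better.
    """
    return (outcome - prob) ** 2


# Xavier advanced to the Sweet 16, so the outcome is 1 for both forecasts.
ours = brier(0.265, 1)
pomeroy = brier(0.248, 1)
print(round(ours, 2), round(pomeroy, 2))
```

A forecast of 100 percent for something that happens scores 0, and a forecast of 100 percent for something that does not happen scores 1, the worst possible result.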

We calculated each model’s average Brier score for each round of the tournament. To benchmark all the results, we added a “chalk” model that assigned a 100 percent probability to the higher seed winning each game.⁴ Here are the results, including an overall average:
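To show how the chalk benchmark behaves, here is a small Python sketch that averages Brier scores over a round. The four games are made-up numbers for illustration, not our actual forecasts, and the names `mean_brier` and `chalk_brier` are hypothetical. Each game is a pair: the model's probability that the higher seed wins, and whether the higher seed actually won.

```python
def brier(prob, outcome):
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (outcome - prob) ** 2


def mean_brier(games):
    """Average Brier score over (prob_higher_seed_wins, higher_seed_won) pairs."""
    return sum(brier(p, won) for p, won in games) / len(games)


def chalk_brier(games):
    """Average Brier score for a model that gives the higher seed 100 percent.

    Each game scores 0 if the higher seed won and 1 if it lost, so the
    average is simply the share of games that ended in an upset.
    """
    return sum(brier(1.0, won) for _, won in games) / len(games)


# Hypothetical round: (model probability for the higher seed, did it win?)
example_round = [(0.9, 1), (0.7, 1), (0.6, 0), (0.8, 1)]

print(mean_brier(example_round), chalk_brier(example_round))
```

In this made-up round, the probabilistic model averages about 0.125 and the chalk model 0.25: one upset in four games costs the chalk a full point, while a model that hedged at 60 percent loses only 0.36 on that game.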

FiveThirtyEight came in third, and all the models beat the chalk. Our forecasts started strong, with the lowest Brier score for the round of 64. We assigned the highest probabilities to Georgia State’s and Dayton’s upset chances — 24 and 38 percent, respectively. The Dayton pick — calculated before the play-in games began — was a coup for the geographic component of our calculations: Dayton had a home game against Boise State for its play-in, and then had to travel just a few dozen miles for its round of 64 matchup.

NumberFire came in second, leaning heavily on favorites throughout the tournament. This killed numberFire early on (it gave Iowa State a 96 percent chance over UAB), but this confidence paid off as the bracket became chalkier and chalkier. Other models, like Pomeroy’s, consistently underrated the favorites’ probability of winning, leading to weaker Brier scores.

Once Duke started rolling, The Power Rank pulled away. Among the models we looked at, the site gave the Blue Devils the highest probability of advancing to the Sweet 16 (88 percent), Elite Eight (71 percent), Final Four (46 percent), and Championship game (27 percent). It also gave Duke a 12 percent chance to win the title at the tournament’s start. To compare, our model and numberFire each put Duke at six percent, while BPI had it at just three percent. Given that Duke is traditionally undervalued in bracket pools, fans who built their brackets on The Power Rank’s numbers likely had a pretty good tournament.

WOMEN’S

Vegas might be the gold standard of sports predictions over in the men’s tournament, but unfortunately we couldn’t find a reliable source of betting lines for the women’s NCAA tournament (nor could we find them for any games other than the championship). This year was the first time we made March Madness predictions for the women’s tournament, and we knew there would be less data available for making the women’s model.

Because there’s no major alternative we can use to judge our forecast, we were left with an imperfect measure: Did it do better than if we had just picked favorites? It’s a draw! If you built a women’s bracket using only FiveThirtyEight’s initial numbers, you would have gotten 84 percent of games correct, compared to … 84 percent if only favorites won.

But Brier scores can help settle the tie. Here’s what happened when we compared our probabilistic results to the all-favorites picks.

Our model outperformed the chalk model in the opening rounds (as it should have), but in some matchups it did more than just outperform. We gave DePaul, a No. 9 seed, a 77 percent chance of upsetting Minnesota, a No. 8 seed, and the Blue Demons won by seven. In the Spokane region, we gave Gonzaga a 48 percent chance of upsetting George Washington, nearly a coin toss, and sure enough, Gonzaga made a strong run all the way to the third round.

We didn’t predict the Dayton upset of Kentucky (we gave them only a 22 percent chance of making it that far), and by the Sweet 16, the chalk model began to outperform us. Perhaps our model’s biggest oversight was placing too much weight on South Carolina making it to the national championship over Notre Dame, which was due in part to the geographical advantages we assigned to their playing in Greensboro. Both teams were No. 1 seeds, but Notre Dame was the higher 1 seed. We gave South Carolina a 40 percent chance of making it to the finals, and Notre Dame only a 34 percent chance. This was a very chalky tournament to begin with, however, with heavy, heavy favorites like UConn and the rest of the No. 1 seeds dominating the bracket.

In total, however, we outperformed the chalk, and we’re pleased with how our first women’s March Madness predictions did. We’ll have more data to build on for next year, plus a better understanding of the fat-tailed distributions particular to the women’s tournament — and we’ve got a pretty good idea of who our model will favor again.

CORRECTION (March 15, 2016, 12:10 p.m.): An earlier version of the first footnote in this article misstated the equation used to calculate implied point spreads from win probabilities. This calculation used a mean of zero, not one. The numbers that appeared in the article were not affected.

Footnotes

¹ The calculation for converting a win probability to a point spread in R: -qnorm(win_prob, mean = 0, sd = 10.36). That standard deviation is derived from a model Nate Silver built several years ago, based on about six years’ worth of tournament games. We then rounded to the nearest half point.
² Sagarin just produces a power ranking, so for that model we used the converted probabilities that appeared in The New York Times.
³ It’s sort of unfair to compare our forecast to Pomeroy’s, as our model incorporates his numbers as one of our variables. Pomeroy’s ratings and Jeff Sagarin’s power rankings can be used to forecast the tournament, but they were built to be useful throughout the entire college basketball season; they’re intentionally more generalized.
⁴ For games between teams with the same seed, the S-curve was used as a tiebreaker. Not every model predicted the play-in games, so for consistency we used every site’s predictions from before the First Four, but only measured the accuracy of predictions from after those games were played.
