Quarterbacks ruled the 2019 NFL season, with Patrick Mahomes bringing the Lombardi Trophy to Kansas City and Lamar Jackson emerging as the league MVP. Quarterbacks were in control of the FiveThirtyEight prediction model, too, as a key factor in the new version of our Elo rating system, which adjusted for the performance of every starting QB. Now that the season is over, in the spirit of checking our work, we wanted to look back at the 2019 season and see how well the new system did — and whether it improved on our old, simple Elo system from years past.
One simple way to judge prediction accuracy is to look at how close the predicted point spread came to the actual score differential of each game (squaring the errors to give a larger penalty to bad misses). And in that department, new Elo beat old Elo this season, albeit by a smaller margin than we might have expected based on the preceding five seasons.
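To make the spread-based measure concrete, here is a minimal sketch of that calculation. The spreads and final margins below are invented for illustration; they are not actual model output:

```python
# Mean squared error between predicted point spreads and actual margins.
# Squaring the errors gives a larger penalty to bad misses, as described above.

def spread_mse(predicted_spreads, actual_margins):
    """Average squared difference between predicted and actual margins."""
    errors = [(p - a) ** 2 for p, a in zip(predicted_spreads, actual_margins)]
    return sum(errors) / len(errors)

# Three hypothetical games: the model favored the home team by 3, 7 and -2 points,
# and the home team actually won (or lost) by 10, 3 and -4.
predicted = [3.0, 7.0, -2.0]
actual = [10.0, 3.0, -4.0]

print(spread_mse(predicted, actual))  # → (49 + 16 + 4) / 3 = 23.0
```

Comparing this number between the two systems over the same slate of games is how one version of Elo "beats" the other on point spreads.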
But our preferred way to judge the accuracy of a forecast is using Brier scores, which are essentially the average squared error between a probabilistic forecast and what actually happened.1 (Lower Brier scores are better because they mean your prediction was closer to being correct.) And by that standard, our new Elo ratings basically performed as expected. It was a bit of an unpredictable NFL season according to either system, particularly during the playoffs, but the improvement in Brier score from the old version of Elo (0.224) to the new Elo (0.219) by the end of the 2019 season ended up being almost exactly what it had been when it was backtested over the previous five seasons, on average:
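The Brier score itself is simple to compute: for each game, take the forecast probability assigned to one team, score the outcome as 1 if that team won and 0 if it lost, square the difference, and average over all games. A short sketch, with made-up probabilities rather than actual Elo forecasts:

```python
# Brier score: mean squared error between forecast probabilities and outcomes.
# The probabilities below are invented for illustration, not real Elo forecasts.

def brier_score(probs, outcomes):
    """probs: forecast win probability for one team per game;
    outcomes: 1 if that team won, 0 if it lost."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A confident correct pick, a toss-up, and a confident miss.
probs = [0.80, 0.50, 0.75]
outcomes = [1, 1, 0]

print(brier_score(probs, outcomes))  # → (0.04 + 0.25 + 0.5625) / 3 ≈ 0.284
```

A forecast that always said 50-50 would score 0.250 on every game, which is why season-long averages like 0.224 and 0.219 represent a real, if modest, edge over ignorance.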
Using Brier scores, let’s look at how the model’s accuracy evolved over time. Very early in the season, new Elo had an edge, perhaps because it was accounting for the many quarterback injuries that beset teams during the first few weeks. Then things in the league got weird, and the old system — which didn’t adjust for QBs, travel distance or rest days — actually handled the weirdness better for most of the first half of the year. The new model didn’t pull ahead for good in terms of season-long Brier score until Week 11, after which it maintained and even expanded its lead as injuries mounted and teams rested starters in the closing weeks of the schedule.
The playoffs were a bit rough for the new model, primarily because of two games: Seattle at Philadelphia in the wild-card round (where new Elo’s Brier was 0.480, compared with 0.380 for the old model) and Tennessee at Baltimore in the divisional round (new Elo’s Brier was 0.755 — really bad! — compared with 0.582 for the old system). Our backtesting suggested that there are real predictive effects to late-season QB hot and cold streaks, and that favorites tend to play better in the postseason, but both of those factors ended up haunting the new model in that pair of upsets. Overall in the playoffs, new Elo had a worse Brier score (0.272) than the old model did (0.261) — although, as we mentioned earlier, that didn’t really cause it to do worse than expected for the entire season overall. And, of course, it also helped that the new system did much better in the conference championships and the Super Bowl.
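Single-game Brier scores like these can be translated back into what each model thought before kickoff. If the eventual winner was given probability p, that game's Brier is (1 − p)², so p = 1 − √Brier. A quick sketch using the Tennessee-at-Baltimore figures quoted above:

```python
import math

# For a single game scored 1 for the winning team, a Brier score of b
# implies the forecast gave the eventual winner a probability of 1 - sqrt(b).
def implied_winner_prob(single_game_brier):
    return 1 - math.sqrt(single_game_brier)

# Tennessee's upset at Baltimore: new Elo's Brier was 0.755, old Elo's 0.582.
print(round(implied_winner_prob(0.755), 3))  # new Elo gave the Titans ~0.131
print(round(implied_winner_prob(0.582), 3))  # old Elo gave them ~0.237
```

In other words, new Elo was considerably more confident in the Ravens than old Elo was, which is exactly why the upset stung it more.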
Finally, just for fun, let’s look at the games in which the new model had its best and worst picks of the season, relative to the old system:
[Tables: New Elo’s biggest hits and misses relative to old Elo, listing each game’s winner, loser and the winner’s pregame win probability under each version of Elo]
Unsurprisingly, most of these examples revolved around backup quarterbacks, for good or bad — either because the regular starter was knocked out (which old Elo didn’t know about) or because he was returning after a long absence. Sometimes adjusting for this resulted in an overcorrection, such as when Pittsburgh was down to third-string QB Devlin Hodges in Week 6 yet somehow managed to still win. But more often it helped, such as when Mahomes went down and Kansas City lost with Matt Moore at the helm in Week 8.
So overall, we think new Elo had a solid rookie season, and the changes helped the model’s predictions. Although there are a few areas to potentially investigate for improvement over the offseason, it was encouraging that the new system outperformed the old system by almost precisely the margin we expected based on our backtesting. It was also a good sign that the model consistently outpredicted the average reader in our forecast game, “winning” all but two weeks of the season and continuing the old system’s pattern of dominance over the field from previous seasons:
[Table: Week-by-week results of the model against the average reader, listing the number of games and the model’s average net points for each week]
Speaking of which, congrats to Jordan Sweeney, who led all readers in the postseason with 275 points, and to Griffin Colaizzi, who used the Super Bowl to pull ahead and win the full-season contest with 1,126.2 points. And a big thanks to everyone who played all season! We can’t wait to fire up the model again in about six months and try to get that Brier score even lower next year.