We’re forecasting the U.S. Open using a modified version of Elo, a tool that was originally developed for chess but that FiveThirtyEight often uses to generate power ratings and make predictions in sports that consist of head-to-head matches between individuals or teams. Elo is particularly well-suited for evaluating tennis players, and we’ve used it before to examine the place in history of Serena Williams, Novak Djokovic and other current greats.
For the U.S. Open, which is played on hard courts, we’re using two separate Elo ratings — one based on all of a player’s matches and one based only on matches on hard courts — that have been blended to produce initial ratings tailored to this event (more detail on this below).
Here’s how the ratings work: At the start of a match, both players have a rating that is based on their previous match results (all the way back to their first matches on the tour).1 The difference between their ratings is then used to predict the likelihood that each player will win the match.2
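As a rough illustration of how a rating gap becomes a win probability, here is a minimal sketch that assumes the standard logistic Elo curve with a 400-point scale. The article does not state the exact curve it uses, so both the function name `win_probability` and the `scale` value are assumptions.

```python
def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that player A beats player B under the standard logistic Elo curve.

    The 400-point scale is the conventional chess value and is an assumption
    here; the article does not state which scale its model uses.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Two equally rated players are a coin flip; a 400-point favorite wins
# about 10 times in 11 under this curve.
even_match = win_probability(1500, 1500)
big_favorite = win_probability(1900, 1500)
print(round(even_match, 3), round(big_favorite, 3))
```

The two players' probabilities always sum to 1, so only the rating difference matters, not the absolute level of either rating.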
After the match is over, both players’ ratings are updated based on how likely the result was: The more unlikely the result, the more the ratings change to reflect that.
We’re forecasting every match of the 2016 men’s and women’s U.S. Open tournaments.
The update step is where our ratings differ a bit from other Elo variants. All Elo systems take the difference between a player’s match result (0 for a loss or 1 for a win) and the number of wins expected (a number between 0 and 1 that represents the player’s pre-match probability of winning), and then use that difference to determine how to adjust the player’s rating after the match. The crudest approach is to multiply that difference by some constant number, \(K\), where \(K\) is chosen empirically to match the context (e.g., in chess, a common \(K\) for new players is 40, meaning that if a player won a match that she had a 90 percent chance of winning, she would gain 4 points in Elo).3 More nuanced approaches use a \(K\) function that depends on other variables, such as the number of games a player has played.
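The constant-\(K\) update can be sketched in a few lines. This reproduces the chess example from the text (a \(K\) of 40 and a 90 percent favorite); the function name `update_rating` and the starting rating of 1500 are illustrative assumptions.

```python
def update_rating(rating: float, expected: float, result: int, k: float = 40.0) -> float:
    """Return the post-match rating under a constant-K Elo update.

    expected: pre-match win probability, between 0 and 1
    result:   1 for a win, 0 for a loss
    """
    return rating + k * (result - expected)


# A 90 percent favorite gains 40 * (1 - 0.9) = 4 points for winning,
# but loses 40 * (0 - 0.9) = -36 points for an upset loss.
after_win = update_rating(1500.0, expected=0.9, result=1)
after_loss = update_rating(1500.0, expected=0.9, result=0)
print(round(after_win, 1), round(after_loss, 1))
```

The asymmetry is the point: an expected result barely moves the rating, while an upset moves it a lot.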
In our version, the multiplier is determined by a function of the form
\[\frac{K}{(matches + offset)^{shape}}\]
where \(K\) is a constant multiplier much like that used by many other systems, \(offset\) is a small adjustment to keep new players from shooting up or down too much, and \(shape\) tells us what shape to give the curve showing the relationship between the number of matches we have for a player and how much the next match could influence her ratings (essentially, the larger the number, the more stable the ratings for players with lots of matches). Using this framework, we tested to see which parameters performed best over our data set (generally meaning “made the best predictions,” though with safeguards to prevent overfitting and to make sure the predictions are working well for all different types of matches). The values that we settled on are a \(K\) of 250, \(offset\) of 5 and \(shape\) of 0.4. The data we used is from Jeff Sackmann’s GitHub page and his website tennisabstract.com, for a total of more than 265,000 matches at the tour level since 1968.
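To see how these parameters behave, here is a minimal sketch that assumes the multiplier takes the form \(K / (matches + offset)^{shape}\), which is consistent with the parameter descriptions above. The function name `k_factor` and the sample match counts are illustrative, not taken from the model itself.

```python
K = 250.0      # constant multiplier reported in the article
OFFSET = 5.0   # damps the swings for players with very few matches
SHAPE = 0.4    # larger values make veterans' ratings more stable


def k_factor(matches_played: int, k: float = K, offset: float = OFFSET,
             shape: float = SHAPE) -> float:
    """Match-count-dependent multiplier, assuming the form K / (matches + offset) ** shape."""
    return k / (matches_played + offset) ** shape


# With 27 prior matches, (27 + 5) ** 0.4 == 32 ** 0.4 == 4, so the multiplier
# is about 62.5; with 1,019 prior matches it falls to about 15.6.
for matches in (0, 27, 1019):
    print(matches, round(k_factor(matches), 1))
```

Under this assumed form, a debutant's next match can move her rating several times as much as a veteran's, which matches the intent described in the text.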
A player’s rating at the beginning of this year’s U.S. Open is based on all tour-level matches that she has played up to Aug. 28. We use players’ Elo ratings and places in the draws to forecast both the men’s and women’s singles tournaments, based on the bracket structure and conditional probabilities. As our forecasts progress, players’ virtual ratings rise and fall as well, so they reflect some chance that a player gets hot, improves or was underrated to begin with. This allows us to establish for each player a probability of winning the title that covers a lot of scenarios other than just getting lucky. We’ll update our Elo ratings soon after each match in the Open ends and then rerun our forecasts to get fresh probabilities.
And that’s it.
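The forecasting step described above can be sketched as a small Monte Carlo simulation. This is an assumption-laden toy rather than the actual forecast code: it uses the standard 400-point logistic curve, a fixed hypothetical update size of 32 instead of the match-count-dependent multiplier, a made-up four-player draw, and hypothetical player names and ratings.

```python
import random


def win_probability(rating_a, rating_b, scale=400.0):
    """Standard logistic Elo curve (the 400-point scale is an assumption)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def simulate_title_odds(ratings, draw, k=32.0, n_sims=10000, seed=0):
    """Estimate each player's title chance for a single-elimination draw.

    ratings: dict mapping player name to pre-tournament Elo
    draw:    names in bracket order; length must be a power of two
    k:       fixed update size for virtual ratings (hypothetical value)
    """
    rng = random.Random(seed)
    titles = {name: 0 for name in draw}
    for _ in range(n_sims):
        virtual = dict(ratings)  # fresh virtual ratings for every simulated tournament
        alive = list(draw)
        while len(alive) > 1:
            next_round = []
            for i in range(0, len(alive), 2):
                a, b = alive[i], alive[i + 1]
                p_a = win_probability(virtual[a], virtual[b])
                a_wins = rng.random() < p_a
                # Virtual ratings rise and fall with each simulated result.
                delta = k * ((1.0 if a_wins else 0.0) - p_a)
                virtual[a] += delta
                virtual[b] -= delta
                next_round.append(a if a_wins else b)
            alive = next_round
        titles[alive[0]] += 1
    return {name: count / n_sims for name, count in titles.items()}


# Hypothetical four-player draw: A plays B, C plays D, winners meet in the final.
ratings = {"A": 2200.0, "B": 1900.0, "C": 2000.0, "D": 1800.0}
odds = simulate_title_odds(ratings, ["A", "B", "C", "D"])
print(odds)
```

Because a simulated upset winner carries a higher virtual rating into the next round, long shots get slightly better title odds than they would if ratings were frozen at their pre-tournament values.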
We’ve made our modeling choices to be broadly applicable, to be maximally predictive and to carry the lowest risk of overfitting. This includes making some difficult and sometimes counterintuitive decisions. For example, treating Grand Slam match results as more important than low-level tour results would seem fitting — especially for men, whose Slam matches are best-of-5-sets, unlike in other men’s tournaments, where matches are best-of-3. But so far, we’ve found that doing that makes predictions slightly less accurate, so we didn’t. Or, we could calculate ratings by treating every set (or game) as its own separate competition, much like we incorporate margin of victory into our Elo ratings for basketball and football. That would treat a 6-0, 6-0 match result — two sets to none, 12 games to none — as way more significant than a 7-6, 6-7, 7-6 result (two sets to one, 20 games to 19). But the question when switching to a more granular metric than wins and losses is whether you gain more in detail than you lose in accuracy. So far, in this case, switching to sets or games has tested out to be unhelpful predictively.4
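To make the granularity trade-off concrete, the sketch below scores the two example results at the match, set and game level. The function name `outcome_shares` and the idea of using the winner's share as the "result" are illustrative assumptions; the article does not describe how its set-level and game-level tests were scored.

```python
def outcome_shares(score: str) -> dict:
    """Score a result, written from the winner's side, at three granularities.

    score: comma-separated sets such as "7-6, 6-7, 7-6"
    Returns the winner's share of matches, sets and games.
    """
    sets = [tuple(int(games) for games in s.split("-")) for s in score.split(",")]
    sets_won = sum(1 for won, lost in sets if won > lost)
    games_won = sum(won for won, _ in sets)
    games_lost = sum(lost for _, lost in sets)
    return {
        "match": 1.0,                              # a win is a win
        "sets": sets_won / len(sets),              # share of sets won
        "games": games_won / (games_won + games_lost),  # share of games won
    }


blowout = outcome_shares("6-0, 6-0")
nail_biter = outcome_shares("7-6, 6-7, 7-6")
print(blowout)
print(nail_biter)
```

At the match level the two results are identical; at the game level the close match is barely better than a coin flip (20 of 39 games), which is the extra detail a more granular system would try to exploit.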
One adjustment that we have found to be predictively useful, however, is accounting for how good players have tended to be on the various surfaces on which they play. So to come up with players’ starting Elos for the U.S. Open, we combined their overall Elo ratings with an Elo rating based only on their previous hard-court matches.5
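A blend of this kind can be sketched as a weighted average. The actual weights are not given in this section, so the 50-50 default below is purely a placeholder, and `blended_rating` and `surface_weight` are hypothetical names.

```python
def blended_rating(overall: float, surface: float, surface_weight: float = 0.5) -> float:
    """Weighted average of an all-surface Elo and a surface-specific Elo.

    surface_weight is a placeholder assumption, not the weight used in the
    article's model.
    """
    if not 0.0 <= surface_weight <= 1.0:
        raise ValueError("surface_weight must be between 0 and 1")
    return (1.0 - surface_weight) * overall + surface_weight * surface


# A player rated 2000 overall but 2100 on hard courts starts the
# tournament at 2050 under an even blend.
starting_elo = blended_rating(2000.0, 2100.0)
print(starting_elo)
```

Setting `surface_weight` to 0 recovers the plain overall rating, so the blend can be tuned against historical predictions rather than chosen by hand.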
Note that Elo isn’t without limitations. It makes every head-to-head prediction based solely on the two players’ ratings, which, in turn, are affected only by previous match results. This means that a lot of information (such as injuries, past opponents’ results in later matches against other players, and aging) is ignored. But Elo’s ruthlessly Bayesian design — adjusting ratings based on the prior likelihoods of outcomes — makes it surprisingly accurate and extremely flexible. Match by match, it builds a pyramid of greatness, with players moving closer to the top with every win and closer to the bottom with every loss.
Check out our U.S. Open predictions.