As FiveThirtyEight’s staff watched U.S. Open results come in Monday and change the probabilities in our forecast — our first ever for a tennis tournament — we started asking ourselves some questions. Such as: Why did we think Serena Williams and Novak Djokovic looked dominant in our model when betting markets weren’t nearly as confident in the favorites? Since we were chatting about it anyway, we decided to have a chat worth publishing. (All numbers are as of when we talked on Tuesday afternoon, after Djokovic’s first match but before Williams’s.) It’s below, lightly edited.
Kyle Wagner (sports editor): So FiveThirtyEight has U.S. Open predictions for the first time, and I’m sure lots of folks have questions about how they work and, more important, why they’re any good. Someone want to give us the 30-second version of how they’re made?
Ben Morris (writer/researcher): For player strength and individual match win probabilities, we use our tennis Elo ratings system, tailored to a hard-court tournament like the U.S. Open.
Jay Boice (computational journalist): Then we take those Elo ratings and head-to-head win probabilities along with the bracket structure and calculate the chance that each player will reach each round, who their likely opponents would be in that round, and how those opponents would affect their Elo rating. A big tree of conditional probabilities …
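A minimal sketch of the "tree of conditional probabilities" Jay describes, in Python. It assumes the standard logistic Elo win-probability formula with a 400-point scale and a single-elimination bracket listed in draw order; both are assumptions, as are the function names and the made-up ratings in the usage lines. It also holds ratings fixed during the tournament, whereas the forecast described here additionally tracks how opponents would affect a player's Elo.

```python
from typing import List


def elo_win_prob(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard logistic Elo curve (assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def round_probabilities(ratings: List[float]) -> List[List[float]]:
    """reach[r][i] = chance that player i wins at least r rounds.

    `ratings` must be in bracket order and have a power-of-two length.
    """
    n = len(ratings)
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("bracket size must be a power of two")

    reach = [[1.0] * n]  # everyone "reaches" round one
    block = 1            # size of each half-section feeding the current round
    while block < n:
        prev = reach[-1]
        cur = [0.0] * n
        for i in range(n):
            start = (i // (2 * block)) * 2 * block
            if i - start < block:
                opponents = range(start + block, start + 2 * block)
            else:
                opponents = range(start, start + block)
            # Weight each possible opponent by the chance they got this far.
            cur[i] = prev[i] * sum(
                prev[j] * elo_win_prob(ratings[i], ratings[j]) for j in opponents
            )
        reach.append(cur)
        block *= 2
    return reach


# Hypothetical four-player draw; the ratings are invented for illustration.
reach = round_probabilities([2200.0, 1900.0, 2000.0, 1950.0])
title_odds = reach[-1]
```

The last row of `reach` holds each player's title probability, and those values add up to 1, which is a quick sanity check on any bracket fed into the function.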
Kyle Wagner: And that’s basically the same way that we forecast NBA and NFL seasons, yeah?
Ben Morris: Pretty similar, yes. Though Elo works somewhat differently in individual sports like tennis than in league sports like the NBA.
Reuben Fischer-Baum (visual journalist): One big difference: Tennis Elo doesn’t account for margin of victory, plus some other, more technical differences (NBA and NFL are simulation-based).
Ben Morris: This is true, but not necessarily a limitation in my view. Trying to account for margins in tennis often leads to worse predictions. Winning actually matters! Most of the information is carried in who wins and loses, and the information beyond that isn’t super reliable.
Reuben Fischer-Baum: I agree with Ben on that. In other major sports, margin of victory tracks much more nicely with quality (with some weird exceptions, like the 2015-16 Golden State Warriors, who had a surprisingly consistent margin regardless of opponent strength).
Kyle Wagner: What would accounting for matchups look like?
Carl Bialik (writer): One challenge is the sample size: Most players don’t play any other specific opponent all that often. I’ve wondered if you could overcome that by accounting for matchup style: building taxonomies of player types like we’ve done for NFL quarterbacks and see how certain players do against, say, tall players with big serves or small ones with great backhands and speed.
Kyle Wagner: One thing we’ve seen with this projection is that our model likes Serena Williams and Novak Djokovic a lot more than the betting markets. Do you think that’s mainly because of those differences? Or is it something more basic, like the length of a tournament or the fact that Williams’s and Djokovic’s health is uncertain?
Reuben Fischer-Baum: In terms of the betting markets, tennis Elo, like NBA and NFL, isn’t accounting for injuries. That could make a big difference! For reference: Betfair has Djokovic at around 36 percent right now and has Djokovic and Andy Murray more or less neck-and-neck.
Carl Bialik: I agree. The betting markets were showing Djokovic and Williams as odds-on favorites for the U.S. Open after Wimbledon. Their Elo ratings were a little higher then — they both lost early at the Olympics in Rio de Janeiro — but the bigger change is that they both are struggling with injuries. Here, reporters and fans are reporting from their practices — and canceled practices. Djokovic looked rusty early in his first match, better by the end. Elo doesn’t care about any of that. It just knows he survived and advanced.
Reuben Fischer-Baum: If we’re willing to say that the betting market maybe overcorrects for injury/margin/rustiness (and I’m not sure we are), Djokovic’s match on Monday night might be a good one to point to. The announcers couldn’t stop talking about how he looked rusty, and he dropped a set, but he still stomped the guy in the end. The match itself was never really in doubt.
Carl Bialik: I was watching behind another writer who kept turning away from the court and to me to say how terrible Djokovic looked as he hit winners and coasted in the last two sets.
And it was hilarious after the match when ESPN’s Tom Rinaldi tried to get Djokovic on court to say anything specific about his wrist and Djokovic kept changing the subject to the stadium, the crowd, Phil Collins — didn’t want to give those bettors any info to overreact to. Although I won’t be too quick to dismiss the betting markets, not when Djokovic has to win six more matches and they won’t all be against opponents as overmatched as last night’s.
Ben Morris: FWIW, our model is definitely more bullish on Djokovic/Williams than I expected, even before the injury issues. I think this is largely due to the lack of strong second/third tiers that normally grind down the favorites’ chances over the course of a tournament.
Jay Boice: Yeah, Elo really, really likes Williams and Djokovic. For example, Williams is about 260 Elo points better than her nearest competitor (Simona Halep) and the rest of the field. That’s kind of like filling the NBA playoffs with the Warriors and all the teams who didn’t make the playoffs last year.
Reuben Fischer-Baum: Betfair has Williams at 38 percent to win it all, but Angelique Kerber, the second favorite, at just 13 percent, a much bigger gap than the betting odds in the men’s field.
Ben Morris: Generally, if I model something and there’s a small gap with betting markets, I might think, “Yeah, I’m doing it better.” But if there’s a big gap, I think, “There’s probably something my model is missing.”
Reuben Fischer-Baum: I think injuries are a big deal! This is a pretty obvious point, but an injury in tennis means a lot more than an injury in basketball or football because … there’s just one player.
Carl Bialik: Another reason to be a little surprised by the confidence of the model is that players have to win seven matches in a row. Even winning March Madness takes just six. Though favorites at majors get a lot of protection in the draw.
On the other hand: Williams has won nine of the last 17 majors. Djokovic has won six of the last nine. Cherry-picked end points and all that, but more often than not lately, they’ve both won. And when they haven’t, they usually have come really close.
Kyle Wagner: Is the weak second or third rung the case for just this tournament, or would it be the norm if we did predictions for every major?
Carl Bialik: It’s been the norm during this current age of Williams and Djokovic — but particularly here with Roger Federer, Maria Sharapova and Victoria Azarenka out.
Kyle Wagner: Do injuries to players like Federer or suspensions like Sharapova’s — big chunks of the competitive ecosystem — throw off a zero-sum model like Elo in an outsize way, or should the model be able to adjust for that?
Jay Boice: Injuries are just so varied — it’s tough to quantify them and fit them into a model …
Reuben Fischer-Baum: I don’t think players missing the tournament throws off Elo though.
Ben Morris: Well, our Elo isn’t a zero-sum model. Players missing shouldn’t throw it off in any way. Nor should players retiring, etc. In the long run, the points they take off the table get picked up by new players with the more rapid adjustment to their ratings.
Reuben Fischer-Baum: I have a question! So Djokovic has a much higher Elo rating than Andy Murray, which fits how you might think about their two careers, but not necessarily how you’d think about their 2016 performances. Is it possible that part of the difference with the betting markets is that Elo is less reactive?
Ben Morris: Well, with or without matchup style, history between players is relevant information that at least in some circumstances can be predictively useful. That is definitely something that could be incorporated, even if the effect is small. But more is possible.
Reuben, I think you can definitely say that part of the difference is likely that betting markets ARE more reactive than our Elo to recent performances, especially for players with long careers like these two. A very different question, however, is whether that’s right. Our Elo adjusts slowly for grizzled veterans for a reason — because it works.
Carl Bialik: I agree. But also I think markets can make too much of streaks and titles. Murray won 22 in a row recently, but none of those came against Djokovic or Nadal, the two guys we think are the best men besides Murray in the draw. A win over Djokovic is the best way for Murray to gain Elo points and catch up. But he’s lost 13 of the last 15 to him.
Ben Morris: Over the course of a long career, players have hot streaks and cold streaks, and when those come later in a player’s career, they mean less. The function we use to update ratings after matches reflects that, and makes better predictions overall as a result. Or put another way, if Murray’s hot year really reflects a huge jump in his ability, that would be the exception to the historical rule.
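To illustrate Ben's point that late-career streaks move a rating less, here is a sketch of an Elo update whose step size shrinks as a player's match count grows. The chat does not give the actual update function, so the functional form, the constants (`scale`, `offset`, `shape`) and all names below are assumptions for illustration only.

```python
def expected_score(rating: float, opponent: float, scale: float = 400.0) -> float:
    """Standard logistic Elo expectation (assumed)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / scale))


def k_factor(matches_played: int, scale: float = 250.0,
             offset: float = 5.0, shape: float = 0.4) -> float:
    """Hypothetical step size that decays with career matches played."""
    return scale / (matches_played + offset) ** shape


def updated_rating(rating: float, opponent: float, won: bool,
                   matches_played: int) -> float:
    """Move the rating toward the result, by less for veterans."""
    surprise = (1.0 if won else 0.0) - expected_score(rating, opponent)
    return rating + k_factor(matches_played) * surprise


# Same upset win, different career lengths (all numbers invented).
rookie_gain = updated_rating(2000.0, 2170.0, True, matches_played=10) - 2000.0
veteran_gain = updated_rating(2000.0, 2170.0, True, matches_played=700) - 2000.0
```

Because each player's step size depends on his or her own match count, the winner's gain and the loser's loss need not cancel, which is one way a system like this ends up not being zero-sum.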
Reuben Fischer-Baum: This might not be an answerable question, but how far back do you have to go (for Murray and Djokovic) before matches are making negligible impacts on current Elo ratings?
Ben Morris: OK, so perspective: When Andy Murray beat Djokovic at the Rome Masters in May, he gained 13.9 Elo points; that was in the final. When he beat Lucas Pouille in the semifinal, he gained 2.1 points. Elo is unimpressed by beating people you’re supposed to beat.
Carl Bialik: The faster way for Murray to catch up is for Djokovic to lose more to guys like Sam Querrey, who beat him at Wimbledon.
Reuben Fischer-Baum: But Murray’s behind Djokovic by like 170 points!
Ben Morris: Yes.
Reuben Fischer-Baum: So that means that he’d have to beat Djokovic in like 12 straight finals to pass him?
Ben Morris: Djokovic lost 13 points in Rome. So like six straight finals.
Reuben Fischer-Baum: Ah, right. Well a little more, because they’d gain and lose less the closer they get in Elo?
Ben Morris: Yes. Quick — someone run the simulation on a Murray v. Djokovic only tournament!
But there’s also some sense in that. Just because Murray beat Djokovic a bunch of times doesn’t necessarily mean he’s the better player. See, e.g., Nadal and Federer.
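The Murray-versus-Djokovic-only tournament can be sketched in a few lines. This assumes the standard logistic Elo expectation and a fixed step size for each player; the default of 19 is a hypothetical value chosen only so that the first win is worth roughly the 13 to 14 points quoted for Rome. Names and defaults are illustrative, not the model's.

```python
def expected_score(rating: float, opponent: float, scale: float = 400.0) -> float:
    """Standard logistic Elo expectation (assumed)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / scale))


def wins_needed_to_pass(gap: float, k_winner: float = 19.0,
                        k_loser: float = 19.0, max_matches: int = 1000) -> int:
    """Count consecutive head-to-head wins before the chaser's rating leads."""
    chaser, leader = 0.0, float(gap)
    wins = 0
    while chaser <= leader and wins < max_matches:
        upset_size = 1.0 - expected_score(chaser, leader)
        chaser += k_winner * upset_size   # winner gains more for a bigger upset
        leader -= k_loser * upset_size    # loser drops by his own step size
        wins += 1                         # each swing shrinks as the gap closes
    return wins


straight_wins = wins_needed_to_pass(170.0)
```

Under these assumptions the answer comes out above six, in line with Reuben's point that each win is worth a little less as the two ratings converge.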
Carl Bialik: Headline this chat: Federer Is Better Than Nadal Even Though He Always Loses To Him and people will click.
Reuben Fischer-Baum: But Nadal-Federer was a surface thing, right? Or nah?
Ben Morris: Nadal beat Federer more than he was supposed to on every surface.
Carl Bialik: Nadal almost always beats Federer on clay but also is 9-7 against him on hard courts.
Kyle Wagner: That gets into why you have a system, though. If we’re pretty sure that one player is the best in the world and another player beats the shit out of him every time they play, this should inform a prediction on what happens in their next match, no?
Ben Morris: Well, that’s what we were chatting about. It would be a nice feature to add. But I suspect that there are few cases in which it would make a significant difference. Even Nadal vs. Federer — like the most famous example in all of tennis — wasn’t completely outside the realm of variance.
Reuben Fischer-Baum: It certainly wouldn’t boost Murray’s chances in our interactive at the moment.
Carl Bialik: Not against Djokovic or Nadal in a possible final, anyway. Murray owns most guys in his half of the draw.
Kyle Wagner: OK, “should we have a thing that is better” was probably not the right question — but does the fact that we don’t have a mechanism in place that can deal with that mean we think our projections are more effective in a Player vs. The Field scenario than they are in individual matchups?
Ben Morris: No. I think our model is pretty solid for individual matchups, with the caveat that occasionally one player who has dominated another player may be given too low of a chance. But I think those situations are rarer and mean less than people may think.
Carl Bialik: I have a question. How will we judge how well this model did? In addition to whether or not Williams and Djokovic win, which is how everyone else will judge it (validly).
Posted this Monday night:
Jay Boice: A good way to judge the model is just to look at its calibration: Did players with an X percent chance of winning a match actually win X percent of the time?
Reuben Fischer-Baum: Damn, we’re killing it. Next question.
Ben Morris: As with all models that give win percentages, evaluating its performance is tricky. You want the predicted winners to win as often as possible, but you also want people predicted to win 70 percent of the time to win 70 percent of the time, etc.
Incidentally, those goals can sometimes be at odds. What if one model predicts the correct winner 70 percent of the time, while another predicts the correct winner only 69 percent of the time but is better calibrated: Its 80 percent guys win 80 percent, its 50 percent guys win 50 percent, etc.? For simulation purposes, having the second of those is almost certainly better.
Reuben Fischer-Baum: In a much blunter way, our model will sort of inevitably be judged by the performance of Djokovic and Williams at this point — not that that would be our preference!
Jay Boice: You can also throw Brier score in there, Ben, and sometimes that’s also at odds with predicting winners and calibration.
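The three yardsticks mentioned here (picking winners, calibration and Brier score) can be computed from the same list of match forecasts. This is a minimal sketch, assuming each forecast is the predicted probability that a given player wins and each outcome is 1 if that player won and 0 otherwise; the function names, the five-bin default and the sample data are all hypothetical.

```python
from typing import Dict, List, Tuple


def accuracy(probs: List[float], outcomes: List[int]) -> float:
    """Share of matches in which the predicted favorite actually won."""
    hits = sum((p > 0.5) == (o == 1) for p, o in zip(probs, outcomes))
    return hits / len(probs)


def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared gap between forecast and result; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def calibration_table(probs: List[float], outcomes: List[int],
                      n_bins: int = 5) -> Dict[int, Tuple[float, float, int]]:
    """Per bin: (average forecast, observed win rate, number of matches)."""
    bins: Dict[int, List[Tuple[float, int]]] = {}
    for p, o in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(index, []).append((p, o))
    table = {}
    for index, rows in sorted(bins.items()):
        mean_forecast = sum(p for p, _ in rows) / len(rows)
        win_rate = sum(o for _, o in rows) / len(rows)
        table[index] = (mean_forecast, win_rate, len(rows))
    return table


# Invented sample: five 80 percent favorites, four of whom won.
sample_probs = [0.8, 0.8, 0.8, 0.8, 0.8]
sample_outcomes = [1, 1, 1, 1, 0]
table = calibration_table(sample_probs, sample_outcomes)
```

A well-calibrated model has the first two numbers in each bin close together; a model can do that while still picking slightly fewer winners, which is the tension Ben and Jay describe.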
Ben Morris: Yeah, I mean, the odds of one of Djokovic/Williams winning and the other losing are greater than the odds of both of them winning.
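The arithmetic behind that remark, sketched under the assumption that the men's and women's results are independent. The chat does not state the model's title probabilities, so the inputs below are hypothetical and the function names are illustrative.

```python
def both_win(p: float, q: float) -> float:
    """Chance that both favorites win, assuming independence."""
    return p * q


def exactly_one_wins(p: float, q: float) -> float:
    """Chance that one favorite wins and the other does not."""
    return p * (1.0 - q) + q * (1.0 - p)


# Hypothetical title probabilities for two heavy favorites.
split = exactly_one_wins(0.5, 0.6)
sweep = both_win(0.5, 0.6)
```

With two equally likely favorites, a split result is more likely than a sweep whenever each one's title chance is below two-thirds; above that, the sweep becomes the likelier outcome.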
Kyle Wagner: Maybe we should start headlining like that.
Ben Morris: Yet, if we miss one, Twitter will be all, “LOL 538.”
Reuben Fischer-Baum: The lesson is to not predict stuff.
CORRECTION (Aug. 31, 12:10 p.m.): An earlier version of this article gave the incorrect winner of the 2016 Wimbledon men’s singles title. It was Andy Murray, not Novak Djokovic.
Check out our U.S. Open predictions.