Some Do’s And Don’ts For Evaluating Senate Forecasts

How right will our Senate forecast be? There won’t be one correct answer, but there are good ways to approach the question and there are bad ways.

Counting the number of races we got “right” isn’t one of the good ways. FiveThirtyEight called the correct winner in all 50 states in 2012, but that doesn’t really tell us anything. If a “miss” means that the candidate with the greatest chance of winning loses, then a model that is right about how uncertain it is in the closest races should miss two or three races. We know we’ll probably miss some races; we just don’t know which ones.

Imagine a baseball lineup of nine hitters who each have a 25 percent chance of getting a base hit each time they come to bat. For each batter’s first time up, the smartest prediction is that he won’t get a hit. Yet we’d expect, on average, 2.25 hits from the first time through the lineup.

In his “Do the Math” column at Slate, Jordan Ellenberg did the math: He added up the probabilities our model assigned to the trailing candidates in the 12 closest races. Ellenberg figured out that if the election unspooled 100 times, we’d miss 246 calls in total, or an average of 2.46 per election. That number changed slightly after Ellenberg’s column ran: According to our final forecast, we should expect to miss at least 2.37 calls. (I say “at least” because for all calculations here I’m not including races in which the trailing candidate has less than a 1 percent chance of winning, though over enough elections eventually one such seemingly sure thing won’t come through.)
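The arithmetic is just a sum of the trailing candidates’ win probabilities. Here’s a minimal sketch in Python; the probabilities are invented for illustration (not our actual forecast numbers), chosen so the total matches the 2.37 figure above:

```python
# Expected number of missed calls: the sum, over races, of the win
# probability the model assigns to the trailing candidate.
# These probabilities are illustrative placeholders, picked so the
# total matches the 2.37 discussed in the text.
trailing_probs = [0.31, 0.25, 0.48, 0.12, 0.40, 0.22, 0.35, 0.08, 0.16]

# Exclude near-sure things: trailing candidates below a 1 percent chance.
expected_misses = sum(p for p in trailing_probs if p >= 0.01)
print(round(expected_misses, 2))  # 2.37
```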

The story is similar with other forecasts. For instance, based on the Upshot’s forecasts Tuesday, it can expect to miss about 2.7 races; HuffPost Pollster can expect to miss about 3.6 races. Even The Washington Post’s Election Lab, which is 98 percent certain Republicans will take control of the Senate and at least 79 percent confident in every race outside Kansas, can expect to miss about 1.25 races. Of course, that’s assuming all these models — including ours — are properly calibrated.

Counting correct calls also won’t go very far in differentiating our forecast from others’. By Election Day, the seven forecasts tracked by the Upshot agreed on every close race but one: Kansas. And there, everyone calls it essentially a tossup.

Small sample size is a problem here, as it is in all evaluations of national election forecasts. According to our model, the likely runner-up has more than a 0.5 percent chance of winning in just a dozen races. So a purely binary measure — did our most likely winner win? — doesn’t yield much information.

Small sample size also stymies checking the calibration of forecasts. A well-calibrated forecast is “right” about 60 percent of the time when it predicts an outcome with 60 percent confidence, and “right” about 90 percent of the time on 90 percent forecasts. Put another way, 10 percent of all 90 percent favorites should lose. With so few competitive races, we just don’t have enough data to demonstrate that we’re well-calibrated. (If we’re way off, we may have enough data to conclude that we’re not well-calibrated.) We’re predicting just two Senate races with confidence between 70 and 80 percent: Colorado and Louisiana. We’d expect to get about 1.5 of them right. If we get one or two right, we haven’t learned much about the calibration of the model.
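Checking calibration amounts to binning races by the favorite’s forecast probability and comparing each bin to the share of favorites who actually won. A sketch, with every probability and outcome below invented for illustration:

```python
# Calibration check sketch: bucket each favorite's forecast probability
# into 10-point-wide bins, then compare each bin to the fraction of
# favorites who actually won. All (probability, outcome) pairs here
# are hypothetical, not real forecast data.
from collections import defaultdict

forecasts = [  # (prob. the favorite wins, did the favorite win?)
    (0.95, True), (0.92, True), (0.97, True), (0.91, False),
    (0.75, True), (0.72, False), (0.85, True), (0.65, True),
]

bins = defaultdict(list)
for prob, won in forecasts:
    bins[int(prob * 10) / 10].append(won)  # e.g. 0.91-0.97 all land in 0.9

for lo in sorted(bins):
    outcomes = bins[lo]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"{lo:.0%} bin: favorites won {hit_rate:.0%} of {len(outcomes)} races")
```

With only a handful of races per bin, the observed rates bounce around a lot, which is exactly the small-sample problem described above.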

We reached out to several people involved in creating other Senate forecasts to get their opinion on the best way to evaluate everyone’s forecasts. Several agreed that counting correct calls was misguided. “A bad way to evaluate forecasts is to check how many winners were above 50 percent,” said Amanda Cox of the Upshot. And several also pointed out the problems with measuring calibration with a small sample size. “With only 36 elections, and most in the top bin [90 percent or higher] this may not be entirely feasible,” said David Rothschild of PredictWise.

Nearly everyone suggested using something called Brier scores, a common tool for evaluating probabilistic forecasts of events with two possible outcomes, such as: Will it rain tomorrow? Will Democrats win the Senate race in New Hampshire?

The higher the Brier score, the worse a forecast did. For each race, take the probability the forecast assigned to the outcome that actually occurred, subtract it from 1 and square the result; the total Brier score is the sum of those squared differences across races. So, for instance, our model says that Kay Hagan has a 69 percent chance of winning the North Carolina Senate race. If she wins, our score for that race is (1-0.69)^2, or 0.0961. If she loses, our score is (1-0.31)^2, or 0.4761.
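That calculation fits in a few lines. The Kay Hagan numbers come from the paragraph above; the final list of races is hypothetical:

```python
# Brier score sketch for a two-outcome race.
def brier(prob_of_winner):
    """Penalty for one race: squared gap between the probability
    assigned to the eventual winner and 1 (certainty)."""
    return (1 - prob_of_winner) ** 2

# If Hagan (given a 69 percent chance) wins, the penalty is small...
print(round(brier(0.69), 4))  # 0.0961
# ...if she loses, the model gave the actual winner only 31 percent.
print(round(brier(0.31), 4))  # 0.4761

# Total score: sum the per-race penalties. These three probabilities
# assigned to eventual winners are made up for illustration.
total = sum(brier(p) for p in [0.69, 0.55, 0.90])
print(round(total, 4))  # 0.3086
```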

“Those scores take into account both races called correctly and also the confidence inherent in the forecast,” John Sides, a political scientist at George Washington University and co-creator of The Washington Post forecast, said in an email. “Being 100% confident and wrong is penalized the most. I think that’s a more diagnostic statistic than just races called correctly.”

Evaluations also should encompass each model’s forecast history, not just the latest projection, according to Drew Linzer, co-creator of the Daily Kos forecast. “Whatever you do, calculate all these numbers over time, not just what everyone’s predicting on Election Day,” he said.

My colleague Andrew Flowers analyzed how the Brier scores of the seven forecasts compiled by the Upshot look depending on certain Election Day outcomes. The forecasts have all converged to similar predictions. If the outcomes they consider most likely prove true, the models that are most certain about them will do best. (Kansas won’t have much of an effect because everyone sees it as roughly 50-50.) But in September, the models disagreed more sharply, giving rise to a wide range of possible Brier scores.

Several election forecasters also suggested checking the root mean square error of the predicted winning margins. After all, predicting that a candidate will win by 1 percentage point is arguably much worse when she actually wins in a landslide than when she loses by 1 percentage point. Yet the first prediction looks better by Brier scoring, because it called the winner correctly.
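As a sketch, root mean square error on margins looks like this; the margins (in percentage points, positive meaning the favored candidate wins) are invented, with the first race set up as the 1-point-call-turned-landslide described above:

```python
# RMSE of predicted vote margins, with hypothetical numbers.
import math

predicted = [1.0, 4.5, -2.0, 8.0]
actual    = [15.0, 3.0, -1.0, 7.5]  # race 1: predicted +1, actual +15

rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                 / len(predicted))
print(round(rmse, 2))  # 7.06
```

A single badly blown margin dominates the score, since errors are squared before averaging, which is why this metric catches the landslide case that Brier scoring misses.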

It’s not so simple to use this method to compare forecasts, because not every forecaster reports projections for vote margins and “because of how different models handle undecideds and/or minor-party candidates,” Linzer said.

Evaluating vote margins has one big advantage over Brier scoring: It doesn’t depend on knowing who won each race. We may not know the winners of several races for a while, but we should have a pretty good idea of the vote margins by the early morning hours Wednesday.

Carl Bialik was FiveThirtyEight’s lead writer for news.