How FiveThirtyEight’s 2018 Midterm Forecasts Did

Nate Silver reviews the FiveThirtyEight Midterm forecasts

On Nov. 5, the night before last month’s midterms, I got dinner with Sean Trende from RealClearPolitics. Over the years, Sean and I have learned to stare into the abyss and play out various “unthinkable” scenarios in our heads. Sure, it was unlikely, but what if Republicans won the popular vote for the House, as a Rasmussen Reports poll conducted just before the election suggested? Or what if Democrats won it by about 15 percentage points, as a Los Angeles Times poll had it? What if polls were just so screwed up that there were a ton of upsets in both directions?

Instead, the election we wound up with was one where everything was quite … dare I say it? … predictable. Polls and forecasts, including FiveThirtyEight’s forecast, were highly accurate and did about as well as you could expect. So let’s go through how our forecast, in particular, performed: I’ll brag about what it got right, along with suggesting some areas where — despite our good top-line numbers — there’s potentially room to improve in 2020.

But before I do that, I want to remind you that our forecasts are probabilistic. Not only are our forecasts for individual races probabilistic, but our model assumes that the errors in the forecasts are correlated across races — that is, if one party’s chances were overrated in one race, they’d likely be overrated in many or all races. Because errors are correlated, we’re going to have better years and worse ones in terms of “calling” races correctly. This year was one of the better years — maybe the best we’ve ever had — but it’s still just one year. In the long run, we want our forecasts to be accurate, but we also want our probabilities to be well-calibrated, meaning that, for instance, 80 percent favorites win about 80 percent of the time.

I say that because we’ve frequently argued that our 2016 forecasts did a good job because they gave President Trump a considerably higher chance than the conventional wisdom did and because our probabilities were well-calibrated. But Trump did win several key states (Wisconsin, Michigan, Pennsylvania) in which he was an underdog, and he was an underdog in the Electoral College overall. So 2016 was good from a calibration perspective but middling from an accuracy (calling races correctly) perspective. This year was sort of the opposite: terrific from an accuracy perspective, but actually somewhat problematic from a calibration perspective because not enough underdogs won. We’ll get back to that theme in a moment.

First, though, I just want to look at our top-line numbers for the House, the Senate and governorships. Keep in mind that there are three different versions of our forecast: Lite (which uses local and national polls only, making extrapolations in districts that don’t have polling based on districts that do have polling), Classic (which blends the polls with other data such as fundraising numbers) and Deluxe (which adds expert ratings to the Classic forecasts). Classic is the “default” forecast, but we made pretty extensive use of all three versions over the course of our election coverage, so it’s fair to evaluate and critique them all.

Here’s more detail on the numbers in that chart:

The House. Two House races remain uncalled as of this writing: California 21, where Democrat TJ Cox has pulled ahead, overcoming a big deficit on election night, and North Carolina 9, where Republican Mark Harris leads but the vote hasn’t been certified because of possible fraud in absentee ballots. I’m going to assume for the rest of this article that Cox and Harris will indeed prevail in their respective races.¹

If that’s the case, Democrats will wind up with a net gain of 40 House seats. That’s a big number, but it’s actually not that much of a surprise. In fact, it’s quite close to the mean number of seats that our various forecasts projected: Classic had Democrats picking up an average of 39 seats, Lite had 38 seats and Deluxe had 36 seats.

It’s also important to point out that the range of possible seat gains in our forecasts was wide. In the Classic forecast, for instance, our 80 percent confidence interval — that is, everything between the 10th and 90th percentiles of possible outcomes — ran from a Democratic gain of 21 seats all the way to a Democratic gain of 59 seats. We were pretty lucky to wind up only one or two seats off, in other words. With that said, it isn’t as though our model just threw up its hands and didn’t have an opinion about the election. Although they provided for a realistic chance (between a 12 percent and 20 percent chance in the different versions of the model) of Republicans holding the House, our forecasts were more confident about Democrats than the conventional wisdom was; GOP chances of keeping the House were closer to 35 percent in betting markets, for instance. So we think our House model was on the right side of the argument, in terms of being bullish on Democrats.
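To make the interval definition concrete, here is a minimal Python sketch that reads an 80 percent interval off a set of simulated outcomes by taking the 10th and 90th percentiles. The simulated seat gains are made up for illustration (drawn from a normal distribution with an assumed mean and spread); they are not output from the FiveThirtyEight model, and the function name is hypothetical.

```python
import random
import statistics

def interval_80(outcomes):
    """Return the (10th, 90th) percentile pair of a list of simulated outcomes."""
    # With n=10, statistics.quantiles returns the 9 cut points between deciles.
    deciles = statistics.quantiles(outcomes, n=10, method="inclusive")
    return deciles[0], deciles[-1]

# Hypothetical simulations: seat gains drawn from an assumed distribution.
rng = random.Random(538)
simulated_gains = [round(rng.gauss(39, 15)) for _ in range(10_000)]

low, high = interval_80(simulated_gains)
print(f"80% interval: {low:.0f} to {high:.0f} seats")
```

By construction, about 80 percent of the simulated outcomes fall inside the interval, so landing within a seat or two of the mean is a good result rather than a guaranteed one.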

Our forecasts also did a good job of projecting the popular vote for the House. As of Monday afternoon, Democrats led the national popular vote for the House by 8.5 percentage points, but this margin has been rising as additional ballots from California, New York and other states are counted, and Cook Political Report’s Dave Wasserman estimates that it will eventually reach 8.7 points. That’s very close to where the Classic and Deluxe models had the popular vote, showing Democrats winning by 9.2 points and 8.8 points, respectively. (It also exactly matches our final generic congressional ballot average of the Democrats ahead by 8.7 points, but note that the estimate of the popular vote in our forecast incorporates factors other than just the generic ballot.) The Lite forecast was a bit too high on Democrats’ popular vote margin, by contrast, showing them winning it by 10.2 points, largely because it overestimated how well Democrats would do in extremely Democratic districts where there wasn’t a lot of polling.

The Senate. Republicans won a net of two Senate seats from Democrats, well within the middle of the 80 percent confidence intervals of all versions of our model, which showed a range between a two-seat Democratic gain and (depending on what version of the model you look at) a three- to four-seat Republican gain. The mean of our forecasts showed Republicans gaining between 0.5 (in Classic and Deluxe) and 0.7 (in Lite) Senate seats, so they did about one-and-a-half seats better than expected, although that’s a fairly minor difference. That difference is essentially accounted for by Florida and Indiana, where Republicans won despite being modest underdogs in our forecast. (I’ll have a table showing the biggest upsets later on in this column.) Meanwhile, each party won its fair share of toss-ups (e.g., Republicans in Missouri, Democrats in Nevada).

Governorships. Our gubernatorial forecast predicted that Republicans were more likely than not to control a majority of governorships after the election² but that Democrats were heavy favorites to govern over a majority of the population because their strengths were concentrated in higher-population states. The forecast was right about each of those propositions, although it was a close-ish call in the population-based measure. Democrats will hold 23 gubernatorial seats once new governors are seated, close to our mean prediction of 24.0 (Classic), 24.2 (Deluxe) or 24.7 (Lite) seats. These states account for 53 percent of the U.S. population,³ which is within the 80 percent confidence interval for our population forecast but is less than the mean of our projections, which had Democrats predicted to govern about 60 percent of the population. The main culprit for the difference was Florida, which accounts for about 6 percent of the U.S. population. Republican Ron DeSantis won there despite having only about a 20 percent chance of prevailing in our forecast.

But while our top-line numbers were quite accurate, what about in individual races? Those were very good also. Between the House (435 races), Senate (35) and gubernatorial races (36), we issued forecasts in a total of 506 elections. Of those:

The Lite forecast called the winner correctly in 482 of 506 races (95 percent).
The Classic forecast called the winner correctly in 487 of 506 races (96 percent).
And the Deluxe forecast called the winner correctly in 490 of 506 races (97 percent).

Granted, a fair number of those races were layups (only 150 or so of the 506 races might be considered highly competitive). Still, that’s better than we expected to do. Based on the probabilities listed by our models, we’d have expected Lite to get 466 races right (92 percent), Classic to get 472 races right (93 percent) and Deluxe to get 476 races right (94 percent) in an average year. It’s also nice that Deluxe called a few more races correctly than Classic and that Classic called a few more correctly than Lite, since that’s how our models are supposed to work: Lite accounts for less information, which makes it simpler and less assumption-driven, but at the cost of being (slightly) less accurate.

Again, though, it isn’t entirely good news that there were fewer upsets than expected. That’s because it means our forecasts weren’t super well-calibrated. The chart below shows that in some detail; it breaks races down into the various category labels we use such as “likely Republican” and “lean Democrat.” (I’ve subdivided our “toss-up” category into races where the Democrat and Republican were slightly favored.) In most of these categories, the favorites won more often than expected — sometimes significantly more often.⁴

How well our Lite forecast was calibrated

Where Democrats were favored
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up (tilt D)	19	10	12	54%	63%
Lean D	16	11	13	67	81
Likely D	33	29	32	88	97
Solid D	205	204	205	100	100
All races	273	254	262	93	96
Where Republicans were favored
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up (tilt R)	11	6	4	55%	36%
Lean R	22	15	19	70	86
Likely R	64	56	61	87	95
Solid R	136	134	136	99	100
All races	233	211	220	91	94
All races combined
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up	30	16	16	55%	53%
Lean	38	26	32	69	84
Likely	97	84	93	87	96
Solid	341	339	341	99	100
All races	506	466	482	92	95

How well our Classic forecast was calibrated

Where Democrats were favored
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up (tilt D)	13	7	9	56%	69%
Lean D	13	9	10	66	77
Likely D	30	26	29	87	97
Solid D	216	215	216	100	100
All races	272	257	264	95	97
Where Republicans were favored
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up (tilt R)	12	7	7	55%	58%
Lean R	17	12	15	69	88
Likely R	55	47	51	85	93
Solid R	150	149	150	99	100
All races	234	214	223	92	95
All races combined
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up	25	14	16	56%	64%
Lean	30	20	25	68	83
Likely	85	73	80	86	94
Solid	366	364	366	100	100
All races	506	472	487	93	96

How well our Deluxe forecast was calibrated

Where Democrats were favored
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up (tilt D)	8	4	6	53%	75%
Lean D	22	14	17	66	77
Likely D	29	25	28	88	97
Solid D	216	215	216	100	100
All races	275	260	267	94	97
Where Republicans were favored
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up (tilt R)	7	4	3	57%	43%
Lean R	10	7	10	68	100
Likely R	52	45	48	86	92
Solid R	162	161	162	99	100
All races	231	217	223	94	97
All races combined
Category	Races	Expected Wins for Favorite	Actual Wins for Favorite	Expected % Correct	Actual % Correct
Toss-up	15	8	9	55%	60%
Lean	32	21	27	66	84
Likely	81	70	76	87	94
Solid	378	377	378	100	100
All races	506	476	490	94	97
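The columns in tables like these can be built from race-level probabilities: the expected wins in a category are the sum of the favorites' win probabilities, and the actual wins are a simple count. The sketch below shows one way to do that in Python. The category cutoffs (60, 75 and 95 percent), the function names and the sample races are all assumptions for illustration, not the model's actual inputs.

```python
from collections import defaultdict

# Assumed category cutoffs on the favorite's win probability (illustrative).
CATEGORIES = [(0.95, "Solid"), (0.75, "Likely"), (0.60, "Lean"), (0.0, "Toss-up")]

def category(fav_prob):
    """Map a favorite's win probability to a rating label."""
    for cutoff, label in CATEGORIES:
        if fav_prob >= cutoff:
            return label

def calibration_table(races):
    """races: list of (favorite_win_prob, favorite_won) pairs.

    Returns {label: (n_races, expected_wins, actual_wins)}.
    """
    rows = defaultdict(lambda: [0, 0.0, 0])
    for fav_prob, fav_won in races:
        row = rows[category(fav_prob)]
        row[0] += 1              # number of races in the category
        row[1] += fav_prob       # expected wins: sum of probabilities
        row[2] += int(fav_won)   # actual wins: count of favorites that won
    return {label: tuple(row) for label, row in rows.items()}

# Hypothetical races, not actual forecast data.
races = [(0.55, True), (0.58, False), (0.66, True), (0.70, True),
         (0.80, True), (0.90, True), (0.99, True)]
for label, (n, expected, actual) in calibration_table(races).items():
    print(f"{label}: {n} races, expected {expected:.1f}, actual {actual}")
```

A well-calibrated forecast would show expected and actual wins converging in each category over many election cycles, not necessarily in any single one.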

For instance, in races that were identified as “leaning” in the Classic forecast (that is, “lean Democrat” or “lean Republican”), the favorite won 83 percent of the time (25 of 30 races) when they were supposed to win only two-thirds of the time (20 of 30). And in “likely” races, favorites had a 94 percent success rate when they were supposed to win 86 percent of the time. Based on measures like a binomial test, it’s fairly unlikely that these differences arose because of chance alone and that favorites just “got lucky”; rather, they systematically won more often than expected.
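As a rough check on the binomial argument, the sketch below computes the chance of favorites winning at least as often as they did, using the "lean" and "likely" figures from the Classic table. The article does not say exactly which test was run, so this assumes a one-sided exact binomial tail and treats every race in a category as an independent trial with the same win probability.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k favorite wins."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Figures from the Classic "all races combined" table.
lean = binom_upper_tail(25, 30, 0.68)    # 25 of 30 wins, 68% expected
likely = binom_upper_tail(80, 85, 0.86)  # 80 of 85 wins, 86% expected
print(f"lean: {lean:.3f}, likely: {likely:.3f}")
```

Both tail probabilities come out small under these assumptions. The independence assumption is the weak point, which is the caveat the next paragraph raises.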

Here’s the catch, though: As we’ve emphasized repeatedly, polling errors often are systematic and correlated. In many elections, polls are off by 2 or 3 points in one direction or another across the board — and occasionally they’re off by more than that. (The polling error in 2016 was actually pretty average by historical standards; it was far from the worst-case scenario.) In years like those, you’re going to miss a whole bunch of races. This year, however, polls were both quite accurate and largely unbiased, with a roughly equal number of misses in both directions. The probabilities in our forecasts reflect how accurate we expect our forecasts to be, on average, across multiple election cycles, including those with good, bad and average polling. Another way to look at this is that you should “bank” our highly accurate forecasts from this year and save them for a year in which there’s a large, systematic polling error — in which case more underdogs will win than are supposed to win according to our model.
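A small simulation can illustrate why correlated errors produce better and worse years. The Python sketch below compares two hypothetical worlds with the same total error per race: one where every race's error is independent, and one where part of the error is shared across all races. All numbers (the 3-point margins, the 3-point shared error, the 4- and 5-point local errors) are assumptions chosen for illustration; this is not the FiveThirtyEight model.

```python
import random
import statistics

def simulate_correct_calls(margins, shared_sd, local_sd, n_sims=2000, seed=0):
    """Simulate how many races a forecast 'calls' correctly per election.

    margins: forecast margins (positive = Democrat favored), hypothetical.
    shared_sd: std. dev. of an error applied to every race in a simulation.
    local_sd: std. dev. of an independent error drawn per race.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        shared = rng.gauss(0, shared_sd)  # same polling miss in all races
        correct = 0
        for m in margins:
            result = m + shared + rng.gauss(0, local_sd)
            # A call is correct when the result has the same sign as the forecast.
            if (result > 0) == (m > 0):
                correct += 1
        counts.append(correct)
    return counts

# Hypothetical slate: 40 races leaning Democratic, 20 leaning Republican.
margins = [3.0] * 40 + [-3.0] * 20

# Total error variance is 25 in both cases (5^2 = 3^2 + 4^2).
independent = simulate_correct_calls(margins, shared_sd=0.0, local_sd=5.0)
correlated = simulate_correct_calls(margins, shared_sd=3.0, local_sd=4.0)

for label, counts in [("independent", independent), ("correlated", correlated)]:
    print(label, round(statistics.mean(counts), 1), round(statistics.pstdev(counts), 1))
```

Each race has the same chance of being called correctly in both worlds, so the average number of correct calls should be about the same. What changes is the spread: with a shared error, some simulated years have many more misses than others.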

With that said, there are a couple of things we’ll want to look at in terms of keeping those probabilities well-calibrated. One huge benefit to people making forecasts in the House this year was the series of polls conducted by The New York Times’s The Upshot in conjunction with Siena College. These polls covered dozens of competitive House races, and they were extremely accurate. Especially combined with polls conducted by Monmouth University, which also surveyed a lot of House districts, election forecasters benefited from much richer and higher-quality polling than we’re used to seeing in House races. In theory, our forecasts are supposed to be responsive to this — they become more confident when there’s more high-quality polling. But we’ll want to double-check this part of the calculation; it’s possible that the forecast’s probabilities need to be more responsive to the volume of polling in a race.

OK, now for the part that critics of FiveThirtyEight will love, as will people who just like underdog stories. Here’s a list of every upset as compared to our forecasts — every race where any candidate with less than a 50 percent chance of winning (in any one of the three versions of our model) actually won:

The biggest upsets of 2018

Races in which at least one version of the FiveThirtyEight model rated the eventual winner as an underdog

		Winning Party’s Chances
Race	Winner	Lite	Classic	Deluxe
OK-5	D	7.6%	14.3%	6.6%
SC-1	D	20.0	9.4	8.6
FL-Gov	R	18.7	22.8	22.2
CA-21	D*	27.6	21.0	16.1
NY-11	D	25.0	23.7	20.3
FL-Sen	R	27.8	29.6	26.8
IN-Sen	R	29.0	28.2	38.3
VA-2	D	29.4	32.7	40.6
OH-Gov	R	33.5	40.5	38.1
IA-Gov	R	42.2	42.7	36.4
KS-2	R	49.7	38.2	35.9
MN-1	R	51.7	44.2	40.1
TX-7	D	41.0	52.2	44.7
VA-7	D	44.3	43.7	52.0
NM-2	D	44.8	44.4	52.2
MO-Sen	R	46.5	43.1	52.8
KS-Gov	D	53.6	42.8	50.0
GA-6	D	56.8	49.1	40.6
TX-32	D	64.2	37.2	46.3
NC-9	R*	51.8	52.4	45.0
CA-25	D	31.9	63.7	55.9
CA-39	D	44.0	58.2	51.8
FL-26	D	48.8	55.8	50.2
NY-22	D	43.8	52.2	60.4
KY-6	R	47.9	54.3	57.3
CA-48	D	41.7	56.6	62.8
PA-1	R	47.2	57.2	59.4
IL-6	D	54.7	48.6	62.0
VA-5	R	48.6	53.8	70.0
SD-Gov	R	44.6	63.1	65.4

Although DeSantis’s win in the Florida gubernatorial race was the highest-profile (and arguably most important) upset, it wasn’t the most unlikely one. Instead, depending on which version of our model you prefer, that distinction belongs either to Democrat Kendra Horn, who won in Oklahoma’s 5th Congressional District, or to another Democrat, Joe Cunningham, who won in South Carolina’s 1st District. Two other Democratic House upsets deserve an honorable mention: Cox (probably) winning in California 21 and Max Rose winning in New York 11, which encompasses Staten Island and parts of Brooklyn. None of these upsets were truly epic, however. Horn had only a 1 in 15 chance of winning according to our Deluxe model, for instance, making her the biggest underdog to win any race in any version of our model this year. But over a sample of 506 races, you’d actually expect some bigger upsets than that (e.g., a candidate with a 1 in 50 shot winning). Bernie Sanders’s win in the Michigan Democratic primary in 2016, when he had less than a 1 in 100 chance according to our model, retains the distinction of being the biggest upset in FiveThirtyEight history out of the hundreds of election forecasts we’ve issued.

As an election progresses, I always keep a mental list of things to look at the next time I’m building a set of election models. (This is as opposed to making changes to the model during the election year, which we strongly try to avoid, at least beyond the first week or two when there’s inevitably some debugging to do.) Sometimes, accurate results can cure my concerns. For instance, fundraising numbers were a worry heading into election night because they were so unprecedentedly in favor of Democrats, but with results now in hand, they look to have been a highly useful leading indicator in tipping our models off to the size of the Democratic wave.

Here are a few concerns that I wasn’t able to cross off my list, however — things that we’ll want to look at more carefully before 2020.

Underweighting the importance of partisanship, especially in races with incumbents. A series of deeply red states with Democratic incumbent senators — Indiana, Missouri, Montana, North Dakota, West Virginia — presented a challenge for our model this year. On the one hand, these states had voted strongly for Trump in an era of high party-line voting. On the other hand, they featured Democratic incumbents who had won (in some cases fairly easily) six years earlier — and 2018 was shaping up to be a better year for Democrats than 2012. The “fundamentals” part of our model thought that Democratic incumbents should win these races because that’s what had happened historically — when a party was having a wave election, the combination of incumbency and having the wind at its back from the national environment was enough to mean that almost all of a party’s incumbents were re-elected.

That’s not what happened this year, however. Democratic incumbents held on in Montana and West Virginia (and in Minnesota’s 7th district, the reddest congressional district held by a Democratic incumbent in the House) — but those wins were close calls, and the incumbents in the Indiana, Missouri and North Dakota Senate races lost. Those outcomes weren’t huge surprises based on the polls, but the fundamentals part of the model was probably giving more credit to those incumbents than it should have been. Our model accounts for the fact that the incumbency advantage is weaker than it once was, but it probably also needs to provide for partisanship that is stronger than it was even six or eight years ago — and much stronger than it was a decade or two ago.

The house effects adjustment in races with imbalanced polling. Our house effects calculation adjusts polls that have a partisan lean — for instance, if a certain pollster is consistently 2 points more favorable to the Republican candidate than the consensus of other surveys, our adjustment will shift those polls back toward Democrats. This is a longstanding feature of FiveThirtyEight’s models and helps us to make better use of polls that have a consistent partisan bias. This year, however, the house effects adjustment had a stronger effect than we’re used to seeing in certain races — in particular, in the Senate races in Missouri, Indiana and Montana, where there was little traditional, high-quality polling and where many polls were put out by groups that the model deemed to be Republican-leaning, so the polls were adjusted toward the Democrats. In fact, Missouri and Indiana were two of the races where Republicans beat our polling average by the largest amount, so it’s worth taking a look at whether the house effects adjustment was counterproductive. When we next update our pollster ratings, we’ll also want to re-examine how well traditional live-caller polls are performing as compared with other technologies.
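To illustrate the basic idea of a house effects adjustment, here is a deliberately simplified Python sketch. It estimates each pollster's lean as its average difference from the other pollsters' polls of the same race, then subtracts that lean. The actual FiveThirtyEight calculation is not described in detail here and is surely more involved; the function names, data layout and sample polls are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def house_effects(polls):
    """Estimate each pollster's average lean relative to other pollsters.

    polls: list of (pollster, race, margin) tuples, margin = Dem minus Rep.
    Returns {pollster: average difference from the other polls of the same race}.
    """
    by_race = defaultdict(list)
    for pollster, race, margin in polls:
        by_race[race].append((pollster, margin))

    diffs = defaultdict(list)
    for race, entries in by_race.items():
        for pollster, margin in entries:
            others = [m for p, m in entries if p != pollster]
            if others:  # need at least one other pollster for a comparison
                diffs[pollster].append(margin - mean(others))
    return {p: mean(d) for p, d in diffs.items()}

def adjust(polls):
    """Shift each poll by its pollster's estimated house effect."""
    effects = house_effects(polls)
    return [(p, r, m - effects.get(p, 0.0)) for p, r, m in polls]

# Hypothetical polls; pollster "A" runs about 2 points more Republican.
polls = [
    ("A", "race-1", 1.0), ("B", "race-1", 3.0), ("C", "race-1", 3.0),
    ("A", "race-2", -4.0), ("B", "race-2", -2.0), ("C", "race-2", -2.0),
]
print(house_effects(polls))
print(adjust(polls))
```

In this toy data, pollster A's polls are shifted 2 points toward Democrats. The sketch also shows the fragility at issue: the "consensus" is only as good as the other polls in the race, so when few independent polls exist, the estimated lean (and the adjustment) rests on thin evidence.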

CANTOR forecasts in races with little polling. As I mentioned, the Lite version of our model tended to overestimate Democrats’ vote share in deeply blue districts. This overestimation was based on our CANTOR algorithm, which uses polls in races that do have polling to extrapolate what polls would say in races that have little or no polling. This wasn’t a very consequential problem for projecting the number of seats each party would win, since it only affected noncompetitive races. But it did lead the Lite model to slightly overestimate the Democrats’ performance in the popular vote. To be honest, we don’t spend a ton of energy on trying to optimize our forecasts in noncompetitive races — our algorithms are explicitly designed to maximize performance in competitive races instead. But since this was the first year we used CANTOR, it’s worth looking at how we can improve on it, perhaps by using techniques such as MRP, which is another (more sophisticated) method of extrapolating out forecasts in states and districts with little polling.

Implementing a “beta test” period. We did quite a bit of debugging in the first week or two after our House model launched. The most consequential fix was making the generic ballot polling average less sensitive after it was bouncing around too much. None of these involved major conceptual or philosophical reimaginations of the model, and they didn’t change the top-line forecast very much. Still, I think we can do a better job of advertising to you that the initial period after forecast launch will typically involve making some fixes, perhaps by labelling it as a beta period or “soft launch” — and that we should be exceptionally conservative about making changes to the model once that period is over. As much as you might test a model with data from past elections to see how it’s handling edge cases, there’s a certain amount you only learn once you’re working with live data and seeing how the model is reacting to it in real time, and getting feedback from readers (that means you, folks!), who often catch errors and idiosyncrasies.

The election night model. Last but not least, there was our election night forecast, which started with our final, pre-election Deluxe forecast but revised and updated the forecast as results started to come in. Indeed, these revisions were pretty substantial; at one point early on election night, after disappointing results for Democrats in states such as Kentucky, Indiana and Florida, the Democrats’ probability of winning the House deteriorated to only about 50-50 before snapping back to about what it had been originally.

I have some pretty detailed thoughts on all of this, which you can hear on a “model talk” podcast that we recorded last month. But the gist of it is basically four things:

First, to some extent, this was just a consequence of which states happened to report their results first. Amy McGrath’s loss in Kentucky 6 was one of the most disappointing results of the evening for Democrats, and in Senate races, Democrat Joe Donnelly underperformed his polls in Indiana, as did Bill Nelson in Florida. Those were the competitive races where we started to get a meaningful number of votes reported early in the evening. Conversely, it took quite a while before any toss-up House or Senate races were called for Democrats. Maybe our model was too aggressive in reacting to them, but the early results really were a bit scary for Democrats.
Second, election night models are tough because there are risks in accounting for both too little information and too much. Our model mostly waited for states where races had been “called” (projected by the ABC News Decision Desk) or where a large portion of the vote was in, so it was still hung up on Kentucky, Florida and Indiana even after initial returns in other states were more in line with the polls. If we had designed the model to look at county- or precinct-level data in partially-reported states instead of just the top-line results and calls, it might not have shifted to the GOP to the same degree. But the risk in that is that data feeds can break, and the more complicated the set of assumptions in a model, the harder it is to debug if something seems to be going wrong.
Third (and this is a challenge not just for election night models but for all journalists covering the election in real time), early voting and mail balloting can cause the initial results to differ quite a bit from the final tallies. In California and Arizona, for instance, late-reported mail-in ballots tend to be significantly more Democratic than the vote reported on election evening. This didn’t matter much to our model’s swings early in the evening, but it contributed to the model being somewhat too conservative about Democratic seat gains later on in the night.
And fourth, election night models are inherently challenging just because there isn’t any opportunity for debugging — everything is happening very fast, and there’s not really time to step back and evaluate whether the model is interpreting the evidence correctly or instead is misbehaving in some way. Our solution to the model’s oversensitive initial forecasts was to implement a “slowdown” parameter that wasn’t quite a kill switch but that allowed us to tell the model to be more cautious. While this may have been a necessary evil, it wasn’t a great solution; our general philosophy is to leave models alone once they’re launched unless you know something is wrong with them.
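The article does not describe how the “slowdown” parameter worked internally. Purely as an illustration of the general idea of damping a live update, the Python sketch below blends a pre-election probability with a live estimate; the function, its parameters and the sample numbers are all hypothetical.

```python
def damped_update(prior_prob, live_prob, slowdown):
    """Blend a pre-election probability with a live election-night estimate.

    slowdown = 0.0 trusts the live estimate fully; slowdown = 1.0 ignores it
    (a full kill switch). Values in between make the update more cautious.
    This is an illustrative assumption, not the model's actual mechanism.
    """
    if not 0.0 <= slowdown <= 1.0:
        raise ValueError("slowdown must be between 0 and 1")
    return prior_prob + (1.0 - slowdown) * (live_prob - prior_prob)

# Hypothetical numbers: an 85% pre-election chance, a live estimate of 50%.
print(damped_update(0.85, 0.50, slowdown=0.5))
```

A setting between 0 and 1 matches the description of something short of a kill switch: the model still reacts to returns, just less sharply.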

The thing you might notice is that none of these challenges are easy to resolve. That doesn’t mean there can’t be improvements at the margin, or even substantial improvements. But election night forecasts are inherently hard because of the speed at which election nights unfold and the sometimes-uneven quality of returns being reported in real time. The chance that a model will “break” is fairly high, and much higher than for pre-election forecasts. As long as news organizations that sponsor these models are willing to accept those risks, they can have a lot of news value, and even with those risks, they’re probably superior to more subjective ways of evaluating results as they come in on election night. But the risks are real. As in any type of breaking news environment, consumers and journalists need to think of election night reporting as being more provisional and intrinsically and unavoidably error-prone than stories that unfold over the course of days or weeks.

Finally, a closing thought for those of you who have made it this far. The 2018 midterms were FiveThirtyEight’s sixth election cycle (three midterms, three presidential years) — or our ninth if you want to consider presidential primaries as their own election cycles, which you probably should. We actually do think there’s enough of a track record now to show that our method basically “works.” It works in two senses: first, in the sense that it gets elections right most of the time, and second, in the sense that the probabilistic estimates are fairly honest. Underdogs win some of the time, but not any more often than they’re supposed to win according to our models — arguably less often, in fact.

That doesn’t mean there aren’t things to work on; I hope you’ll see how meticulous we are about all of this. We’re interested in hearing critiques from other folks who are rigorous in how they cover elections, whether that coverage is done with traditional reporting, with their own statistical models, or with a technique somewhere in between reporting and modelling like the excellent and very accurate forecasts published by the Cook Political Report.

But we’re pretty tired of the philosophical debates about the utility of “data journalism” and the overwrought, faux-“Moneyball” conflict between our election forecasts and other types of reporting. We probably won’t be as accurate-slash-lucky in 2020 as we were in 2018, especially in the primaries, which are always kind of a mess. But our way of covering elections is a good way to cover them, and it’s here to stay.

Footnotes

¹ I’m not sure that’s a great assumption in the case of Harris (a new election seems fairly likely), but we had that race as a toss-up anyway in our pre-election forecast, so it doesn’t matter much from a forecast-evaluation standpoint.
² Including in states that didn’t hold elections this cycle.
³ Population projections are based on the U.S. Census Bureau’s 2016 American Community Survey and U.S. population clock; we attempted to project each state’s population as of Nov. 6, 2018, based on its recent growth rate.
⁴ Races are categorized by each party’s chances of winning, rather than each candidate’s chances.
