FiveThirtyEight

Here at FiveThirtyEight, we’ve never built a complete back-to-front model of the presidential primaries before. Instead, in 2008, 2012 and 2016, we issued forecasts of individual primaries and caucuses on a piecemeal basis, using polls and demographics. We always thought there were too many complexities involved — how the outcome in one state can affect the next one, for example — to build a full-fledged primary model.

But this year, we’re giving it a shot. We’ve built a forecast that plays out the outcome of the 57 delegate-selection contests (50 states, D.C., five territories and Democrats Abroad) that Democrats will contest this year, simulating polling swings, post-primary “bounces” and candidates dropping out, starting with Iowa on Feb. 3 and ending with the Virgin Islands on June 6. We don’t try to anticipate what would happen in the event of a contested convention or if there are other complications in how delegates are chosen after June 6. But this is still a pretty ambitious project.

Why build a fancy primary model when we hadn’t before? Well, for one thing, there’s actually a lot more data available now than when we launched FiveThirtyEight 12 years ago. The Democratic primaries in 2008 and 2016, and the Republican ones in 2012 and 2016, were all long contests that give us more information on how the latter stages of the primary process play out. Since the current presidential nomination system is a relatively new invention — before 1972, voters had little direct say in how candidates were chosen — the data from these recent elections reduce the degree of difficulty in building a primary model. It’s still pretty hard, but it’s no longer an intractable problem.

Also, I suppose we’re feeling frisky these days. If building a full-fledged primary model presents its share of challenges — some of which I’ll describe here — there are also plenty of problems with publishing a half-assed forecasting product. (Meanwhile, trying to navigate our way through the primaries without any sort of forecasting product would present bigger challenges still.)

Before I run through the steps the model takes, here are a few key things to keep in mind. Even if you read nothing else about our model, please do read these. They’ll likely answer a few questions — or complaints — that you might have later on.

1. Our model is a forecast; it is not an estimation of what would happen in an election held today. Forecasted results in later states reflect “bounces” from earlier states and other contingencies. For example: Upon launch, our model gives former Vice President Joe Biden only about a 60 percent chance of winning Delaware, his home state. Why only 60 percent? Isn’t Biden hugely popular there? Well, yes. And Biden would almost certainly be a massive favorite in Delaware if it were the first state to vote. But in reality, Delaware votes relatively late in the process, on April 28. And there’s the chance that Biden will have dropped out by that time, or that his campaign will otherwise be severely diminished. The model accounts for these possibilities.

2. Our forecast is probabilistic. The degree of uncertainty in the primaries is high, and the process is path-dependent and nonlinear. The nomination process consists of layers of uncertainty piled on top of one another. Just looking ahead at January and February, for instance, there’s the chance the race could shift in the final few weeks before Iowa. Then Iowa itself is not very easy to forecast. Then whatever happens in Iowa will have uncertain effects on New Hampshire. And so forth.

But it’s not as though we’re totally in the dark, either. Candidates who poll well in the run-up to the primaries are much more likely to win the nomination than those who don’t. If you hear things like “the primaries are unpredictable,” what does that mean, exactly? Does it mean that former Rep. John Delaney and author Marianne Williamson are as likely to win the nomination as Biden and Sen. Bernie Sanders? If that’s what you think, you know where to find me for a friendly wager.

In other words — like most things in life — the primaries exist somewhere along the spectrum between predictable and unpredictable. The model’s job is to sort all of this uncertainty out. And we encourage you to take the probabilities we publish quite literally. A 60 percent chance of a candidate winning a particular state means that she’ll win it six out of 10 times over the long run — but fail to do so four out of 10 times. Historically, over 10 years of issuing forecasts, the probabilities that FiveThirtyEight publishes really are quite honest, i.e. our 60 percent probabilities really do occur about 60 percent of the time. With 57 primaries and caucuses to come, there will probably be some big upsets, and it’s likely that a candidate with a 5 percent chance or a 2 percent chance or even a 0.3 percent chance of winning a state will surprise us somewhere along the line.

Because of the path-dependent nature of the primaries — events in one state can affect the results in the next ones — the probability distributions our model generates can be pretty weird-looking. For instance, as of Jan. 7, here’s the range of possible outcomes that our model shows for Sanders in Ohio:

What’s going on here? Why the concentration of outcomes near zero percent? Those cases represent the chance — about one in three, our model figures — that Sanders will drop out at some point before Ohio. If he hasn’t dropped out at that point, Sanders figures to do decently well, on the other hand, most likely winning somewhere between 15 percent and 35 percent of the vote. But there’s also the chance that Sanders will be just one of two or three major candidates left by the time Ohio votes. If that’s the case, Sanders could win 50 percent or 60 percent of the vote there, or more. When you see the probabilities in our model, remember that they reflect this variety of possibilities.

3. Our model forecasts the chance of winning the plurality and majority of pledged delegates — it’s not a forecast of the nomination per se. As I mentioned — but it’s worth repeating — our model is not technically a forecast of the outcome of the nomination process. It does not purport to predict who will accept the Democratic nomination in Milwaukee. Rather, it forecasts the selection of delegates in 57 primaries and caucuses and tallies up the chance that each candidate will have a majority or a plurality of delegates after the last caucusgoers vote on June 6. It does not reflect such contingencies as:

The model does reflect one complication: how certain delegates may be automatically reassigned if a candidate drops out before those delegates are chosen. Specifically, if a state has not yet assigned statewide delegates and a candidate who was entitled to delegates has dropped out, those delegates will be distributed among candidates who remain in the race. (District-level delegates are not reassigned, however.) This gives the leading candidates a little bit of a cushion; they can wind up with a majority of delegates on June 6 even if they wouldn’t have won one based on the delegates as originally assigned.

4. The primary model is complex — which isn’t entirely a good thing. The primary model is probably the most complex one that we’ve ever built at FiveThirtyEight. That’s not quite the same thing as it having the most lines of code or the most component parts. (Our midterms model had about twice as much code, for instance.) Rather, it’s the most complex in a mathematical sense. It involves a lot of path-dependency and a lot of nonlinearity, and the candidates’ performances can interact with one another in fairly complicated ways.

As we see it, we don’t really have much choice in the matter. Our primary model is necessarily complex because the primaries themselves are a complex process.

Still, the complexity of the model means that there’s more structural uncertainty than there might be in our other forecasting products. To put it another way, there’s a higher chance than usual that we’ve gotten the “physics” of the system wrong and mis-designed the model. Bugs in our code — or inaccurate data — could also have larger effects than they would in another model.

Therefore, especially in the first week or two after model launch, we will be on the lookout for potential errors and will make changes to the code if we’ve messed something up. And please let us know if you see something that looks awry.

Alrighty then. What follows is an overview of the major steps the model takes. We’re keeping this high-level in some places as we plan to publish a series of articles on several of these steps, many of which are pretty interesting in their own right and have implications beyond the design of the model itself.

Step 1. Calculate national and state polling averages and translate them into a polling-based forecast

We released our national and state polling averages in December, which you can read about at greater length here. These averages are also the first step in the primary model. Again, go read the article about the polling averages for more, but here are the highlights:

There’s one additional step in translating polls into a polling-based forecast for each race: We need to account for undecided voters. A candidate polling at, say, 17 percent in a certain state is likely to wind up with more than 17 percent of the actual vote there because she can expect to pick up a few undecided voters.

There are two ways that one might do this: Undecideds could be allocated proportionally between the candidates (so a candidate polling at 10 percent of the decided vote would pick up twice as many undecideds as one polling at 5 percent), or they could be divided evenly. Empirically, we find that a proportional allocation works best until candidates reach 20 percent of the vote, at which point a combination of a proportional and an even distribution works best instead. So that’s what’s reflected in the model.
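As a rough illustration of that rule, here is a minimal Python sketch. The function name `allocate_undecided`, the `blend` parameter and the exact way the proportional and even components are mixed above 20 percent are assumptions made for this example; the article does not spell out the formula.

```python
def allocate_undecided(polls, threshold=0.20, blend=0.5):
    """Distribute the undecided share among candidates.

    polls: dict of candidate -> share of all respondents (0 to 1).
    At or below `threshold`, a candidate's pickup is proportional to share.
    Above it, only a fraction `blend` of the excess counts (a hypothetical
    choice), which moves the allocation part of the way toward an even split
    among the candidates who have cleared the threshold.
    """
    undecided = 1.0 - sum(polls.values())
    weights = {}
    for name, share in polls.items():
        if share <= threshold:
            weights[name] = share
        else:
            weights[name] = threshold + blend * (share - threshold)
    total_weight = sum(weights.values())
    if total_weight == 0:
        return dict(polls)  # nobody to allocate to
    return {
        name: share + undecided * weights[name] / total_weight
        for name, share in polls.items()
    }


# Made-up polling numbers; 38 percent of respondents are undecided.
example = allocate_undecided({"A": 0.30, "B": 0.17, "C": 0.10, "D": 0.05})
```

In this toy case the candidate at 10 percent picks up exactly twice as many undecideds as the one at 5 percent, while the candidate at 30 percent picks up less than a purely proportional rule would give.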

Step 2. Calculate a non-polling forecast for each state based on demographics, geography and state fundraising

Many primary states have little or no polling — so we need an alternative way to forecast the outcome in those states. Technically speaking, we actually have two such methods:

Of these, the demographic method is considerably more precise and it receives considerably more weight in the FiveThirtyEight model. A few further notes about each method:

The geographic method

The home-state advantage is fairly large in the primaries. In some cases, in fact, a candidate might be expected to outperform his or her national polling by 20 points or more in his or her home state. However, there are a couple of complications:

If a candidate is commonly associated with more than one state, his or her home state advantage may be divided between any number of primary and secondary home states. For 2020, the following candidates have multiple home states:

In these cases, 75 percent of the candidate’s home-state advantage is assigned to the primary state and 25 percent to the secondary state.

In addition to home-state effects, we also account for a candidate’s home region (Northeast, Midwest, South, West), subregion (e.g. New England, the Great Lakes) and whether or not his or her home state shares a border with a state holding a primary or caucus. These regional effects can be fairly profound; it’s a big help to be from New England if you’re competing in a primary in New England. Some states are divided between multiple regions or subregions, a fun topic about which we’ll have more to say later, but as a preview, here’s the basic schematic we’re using.

Finally, the relative amount of money that a candidate raises in a particular state can tell us something about his or her geographic strengths. By relative amount, I mean what share of the total itemized contributions (among all Democratic candidates) a candidate raised in that state, as compared to how he or she did in other states. Sanders is a prolific fundraiser everywhere, for instance. But if he raised a smaller share of money in, say, Arizona than he did on average nationally, the model would assume Arizona was a below-average state for him.

The demographic method

As I mentioned, the demographic method works by running a series of regressions, where the independent variables are a series of demographic and political variables that are generally predictive of primary outcomes, and the dependent variable is a candidate’s polls or results in each state.

A major challenge is that there are a lot of plausible variables that affect primary and caucus outcomes — but relatively little data to test them out on, especially early in the process when no states have voted and few states have reliable polling. In cases like these, regression analysis is at risk of overfitting, where it may seemingly describe known data very precisely but won’t do a good job of predicting outcomes that it doesn’t know in advance.

Our solution is twofold. First, rather than just one regression, we run a series of as many as 360 regressions using all possible combinations of the variables described below. Although the best-performing regressions receive more weight in the model, all of the regressions receive at least some weight. Essentially, the model is hedging its bets on whether the variables that seemingly describe the results the best so far will continue to do so. Secondly, the model restricts how many of the 360 regressions it runs based on how much data it has. As of launch in January, for instance, because data is relatively sparse, the model will only run regressions containing three or fewer variables. The degree of complexity will increase as more data becomes available.

The variables the regressions evaluate are as follows:

The 360 regressions reflect all possible combinations of which of these categories are included and, if included, how they are specified. Regression No. 286, for example, consists of religiosity, urban-rural and the liberal-conservative balance as specified based on the Clinton-Sanders vote, but doesn’t use the other variables.
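To make the ensemble idea concrete, here is a small Python sketch of two pieces of it: enumerating variable combinations up to a size limit, and turning each regression's error into a weight so that better fits count for more but nothing is ignored. The variable names, the inverse-error weighting and the function names are all assumptions for illustration. The sketch also ignores the alternative specifications within each category, so it does not reproduce the count of 360.

```python
from itertools import combinations


def candidate_specs(variables, max_vars):
    """Return every combination of 1..max_vars variables as a tuple."""
    specs = []
    for size in range(1, max_vars + 1):
        specs.extend(combinations(variables, size))
    return specs


def ensemble_weights(errors, eps=1e-9):
    """Weight each specification by the inverse of its error.

    errors: dict of spec -> positive error score (e.g. out-of-sample RMSE).
    Every spec ends up with a strictly positive weight; weights sum to 1.
    """
    raw = {spec: 1.0 / (err + eps) for spec, err in errors.items()}
    total = sum(raw.values())
    return {spec: value / total for spec, value in raw.items()}


# Hypothetical category names, limited to three variables as at launch.
variables = ["religiosity", "urban_rural", "ideology", "race", "education"]
specs = candidate_specs(variables, max_vars=3)

# Made-up error scores for two specifications.
weights = ensemble_weights({("religiosity",): 2.0, ("race", "education"): 4.0})
```

Raising `max_vars` as more states report would correspond to the model allowing more complex regressions once it has more data.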

Blending the demographic and geographic methods with the polls

Next, the model blends the geographic methods together with the demographic method, with the demographic method receiving considerably more weight.

Finally, the model blends the combined demographic/geographic forecast together with the polling-based forecast from Step 1 (if the state has polling). The model gives more weight to the polling when there is a larger volume of recent polling. Conversely, the model will give more weight to demographics if it deems the demographic analysis to be reliable based on how much data it has and how much the various regressions agree with one another.

In general, however, the model mostly defaults toward polling when there is a decent amount of polling available. In fact, a state’s forecast will be entirely based on polling if there is a lot of recent polling there.
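One way to picture this blend is as a weighted average whose polling weight rises with the volume of recent polls and reaches 100 percent once there is a lot of polling. The sketch below is only an illustration: `full_poll_volume`, `demo_reliability` and the weighting formula are hypothetical stand-ins, not the model's actual parameters.

```python
def blend_forecast(poll_forecast, demo_forecast, poll_volume,
                   demo_reliability, full_poll_volume=10.0):
    """Blend a polling-based and a demographic vote-share forecast.

    poll_forecast: polling-based vote share, or None if the state has no polls.
    poll_volume: hypothetical measure of recent polling (e.g. weighted count).
    demo_reliability: 0..1 score for how much the regressions agree.
    """
    if poll_forecast is None or poll_volume <= 0:
        return demo_forecast
    poll_strength = min(poll_volume / full_poll_volume, 1.0)
    if poll_strength >= 1.0:
        return poll_forecast  # plenty of recent polling: use polls only
    # Demographics gain influence when polls are thin and regressions agree.
    demo_strength = (1.0 - poll_strength) * demo_reliability
    poll_weight = poll_strength / (poll_strength + demo_strength)
    return poll_weight * poll_forecast + (1.0 - poll_weight) * demo_forecast


heavily_polled = blend_forecast(0.25, 0.15, poll_volume=12, demo_reliability=0.8)
lightly_polled = blend_forecast(0.25, 0.15, poll_volume=2, demo_reliability=0.8)
unpolled = blend_forecast(None, 0.15, poll_volume=0, demo_reliability=0.8)
```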

Step 3. Begin simulating the rest of the primary process, starting with day-to-day movement in the polls

Everything up to this point essentially represents the model’s snapshot of the election as it looks today. But, of course, what the election will look like weeks or months from now is another question entirely. So the model simulates the nomination process thousands of times to get an idea of the range of possible outcomes. (As a default, we run 10,000 simulations, although we may run a larger or smaller number in some circumstances.)

The biggest swings in candidates’ standing in the polls tend to come in the form of bounces following major primaries and caucuses, which we cover in Step 5. However, polling swings sometimes do occur for other reasons.

Therefore, the model plays out the rest of the primary process one day at a time in each simulation. It performs three subtle calculations at the end of each day:

Here’s a bit more detail on this process.

Primary polling swings are stochastic and nonlinear

If you examine charts of primary polls, you’ll generally find long periods of stasis punctuated by occasional, relatively abrupt swings. That’s because news events — or at least, the sort of news events that have the potential to affect primary polls — occur in irregular doses. Google searches of primary-related queries, for instance, reveal occasional, news-related spikes where interest in the primary campaign is several times higher than the baseline. These spikes are often associated with debates, but can sometimes occur for other reasons.

To simulate this, the model randomly draws a “news impact multiplier” for each day from a right-tailed distribution where the average day has a multiplier of 1, but the multiplier may be as low as 0.1 (a day where basically nothing happens) or as high as 6 (where a single news event can swing the polls as much as a month’s worth of regular campaigning). In this way, the model accounts for the small but non-zero possibility of shocks to the polls that occur essentially overnight as voters react to major news events.

We also account for debates using this framework. Rather than being occasional, unpredictable news events, debates are tantamount to guaranteed, medium-impact news events. Specifically, our research finds that debates move the polls by roughly as much as one week’s worth of ordinary campaigning, which is equivalent to a news multiplier of about 2.7. Thus, on dates when debates are scheduled to occur, the model adds 2.7 to the news impact multiplier for candidates who are scheduled to participate in the debate. Put another way, debates introduce volatility into the primary process, giving trailing candidates (if they’re able to participate in the debates) a slightly better chance of catching up to the front-runners.
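The daily draw can be sketched in a few lines of Python. The article specifies only the shape of the distribution (right-tailed, averaging about 1, bounded between 0.1 and 6) and the debate adjustment of 2.7; the choice of a clipped lognormal and the `sigma` value below are assumptions made for this example.

```python
import random


def draw_news_multiplier(rng, sigma=0.75, low=0.1, high=6.0):
    """Draw one day's news impact multiplier.

    Uses a lognormal with mean 1 before clipping (mu = -sigma^2 / 2),
    then clips to [low, high]. The lognormal form is an assumption.
    """
    mu = -sigma ** 2 / 2
    return min(max(rng.lognormvariate(mu, sigma), low), high)


def daily_multiplier(rng, debate_today=False, debate_bonus=2.7):
    """Add the debate bonus on days when the candidate is on the debate stage."""
    multiplier = draw_news_multiplier(rng)
    if debate_today:
        multiplier += debate_bonus
    return multiplier


rng = random.Random(538)  # fixed seed so the example is repeatable
draws = [draw_news_multiplier(rng) for _ in range(20000)]
average = sum(draws) / len(draws)
debate_day = daily_multiplier(rng, debate_today=True)
```

Most draws land below 1, a handful land far above it, and the average stays close to 1, which is the "long stasis, occasional abrupt swing" pattern described above.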

How candidates can get stuck in the low single digits

Another important property of primary polls is that the degree of volatility in a candidate’s polls is related to his or her standing. Empirically, for instance, a candidate who is polling at 3 percent sees their polls move by only about one-fifth as much from day to day as one polling at 30 percent.

What this means is that if a candidate is polling in the low single digits, they tend not to see very big swings in the polls. For a candidate to break out of the low single digits and become a front-runner is quite rare in the primaries, therefore. It usually involves multiple stages, that is, the candidate initially breaks out from, say, 3 percent to 8 percent, and then gets some further boost that takes them from 8 percent to 17 percent and makes them a real contender. (For poker geeks out there: The process is somewhat analogous to a poker player with a short stack having to double her chips several times over to have a chance at winning a poker tournament.) Therefore, such breakouts are also rare in our model, although they may occur once per several hundred simulations.
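The one-fifth figure pins down only two points on the curve, so any functional form is a guess. A power law is one simple choice that fits it: if volatility scales as share raised to an exponent, a tenfold difference in share producing a fivefold difference in movement implies an exponent of log 5 / log 10, or about 0.7. The sketch below assumes that form; `reference_sd` is a made-up baseline.

```python
import math

# Exponent implied by "3 percent moves one-fifth as much as 30 percent",
# under the assumed power-law relationship.
EXPONENT = math.log(5) / math.log(10)


def daily_swing_sd(share, reference_share=0.30, reference_sd=0.01):
    """Typical day-to-day poll movement for a candidate at `share`.

    reference_sd is a hypothetical daily standard deviation for a
    candidate polling at reference_share.
    """
    return reference_sd * (share / reference_share) ** EXPONENT


ratio = daily_swing_sd(0.03) / daily_swing_sd(0.30)
```

Under this assumption a candidate at 8 percent moves a bit less than half as much per day as one at 30 percent, which is why a breakout tends to happen in stages rather than all at once.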

Two final technical notes:

Mean-reversion and “the fundamentals”

In general, it’s hard to outguess the polls in the primaries. That doesn’t mean the primaries are terribly predictable or that polls are especially reliable. But the candidates who rise in polls aren’t always the ones that pundits expect. As a case in point, there were a lot of theories — some with fairly solid empirical backing, some not, many of which we ourselves subscribed to — about why President Trump’s lofty poll standing in the early stages of the 2016 Republican primary wouldn’t translate into his winning the nomination. In the end, those polls were accurate and Trump won, of course.

Still, even accounting for Trump’s victory, our analysis finds that there is some additional predictive power provided by quantifiable, non-polling factors. Specifically, the model accounts for:

In testing the forecast, we found that which of these factors best predicted polling movement was sensitive to exactly how we specified the model. That is, under some tests, fundraising was highly predictive of polling movement while endorsements were not, while under other tests, the reverse was true. Therefore, we give the three categories equal weight.

As of Jan. 8 (just before model launch), the fundamentals calculation for each candidate was as follows:

[Table: Biden leads in the combined fundamentals. The Democratic candidates by the fundamentals, as of Jan. 8, 2020]

Note that this calculation can be compared directly to the polls. For instance, if the fundamentals calculation implies that Sen. Amy Klobuchar “should” have 7 percent of the vote and she actually has 4 percent (after allocating undecided voters), that would imply that she’s more likely to rise in the polls than to fall. To reiterate, however, the model takes a gentle hand with the fundamentals, so this has fairly marginal effects on the overall forecast.
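A gentle pull toward the fundamentals can be sketched as a slow exponential decay of the gap between a candidate's polls and her fundamentals-implied share. The `daily_pull` rate below is a hypothetical number chosen to keep the effect small; the article does not give the actual strength.

```python
def expected_share_after(days, poll_share, fundamentals_share, daily_pull=0.005):
    """Expected poll share after `days`, ignoring random swings.

    Each day the gap between the polls and the fundamentals shrinks
    by the (hypothetical) fraction `daily_pull`.
    """
    gap = poll_share - fundamentals_share
    return fundamentals_share + gap * (1.0 - daily_pull) ** days


# The Klobuchar example from the text: polls at 4 percent, fundamentals at 7.
today = expected_share_after(0, 0.04, 0.07)
in_a_month = expected_share_after(30, 0.04, 0.07)
```

With these numbers the expected drift over a month is well under half a percentage point, small next to ordinary polling swings.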

Step 4. Simulate state and district results — accounting for uncertainty — and allocate delegates

As the model runs through each day of the nomination process, it will occasionally encounter something exciting: a day on which one or more states are holding primaries or caucuses. But forecasting the outcome of these races is another challenge.

The biggest problem is that polls of the primaries just aren’t all that accurate. In 2016, the average primary poll missed the margin between the top two candidates by about 10 percentage points! And that’s the average poll; the margin of error is considerably wider. Leads of 15 or sometimes even 20 points may not be entirely safe in the primaries. Nonetheless, however imperfect polls are, it’s usually easier to forecast primary results with polls than without them, so states with little or no polling present even bigger problems than states with lots of potentially flawed polling.

So the model uses a series of empirically derived heuristics to determine whether its forecast is relatively more or relatively less reliable. The forecast is more precise when:

So while there’s always uncertainty in the forecast, a prediction about the Florida primary — a large, diverse state that usually gets a lot of polling — is liable to be more precise than one about the Guam caucus.

Simulating district results and assigning delegates

Roughly two-thirds of the pledged delegates in the Democratic primaries are not actually assigned at the statewide level. Rather, they are distributed based on the results in individual districts. Most states allocate delegates by Congressional district, but there are some exceptions:

The model estimates the results in individual districts using a variation of the regression technique described in Step 2. That is, we’ve already built a demographic forecast, so just as we can forecast how a candidate’s performance in Illinois might differ from that in Indiana, we can also forecast which districts in Illinois or Indiana are likely to be relatively strong or weak for a candidate. (With that said, the district forecasts are not all that precise.)

Once the model simulates the results in each state and district, it then assigns delegates based on these results. Unlike Republicans, Democrats use highly consistent rules from state to state to allocate delegates, and the rules are fairly straightforward. The key thing to keep in mind is the 15 percent threshold — that is, under Democratic rules, a candidate must receive at least 15 percent of the vote in a given state or district to be eligible to receive delegates there. (Unless no candidate gets 15 percent of the vote, in which case the threshold is lower.) This makes a reasonably large difference. In a state where, say, Sanders got 40 percent of the vote and Warren got 30 percent, with the rest of the vote divided among several other candidates, Sanders and Warren could get 100 percent of the delegates (or close to it) despite “only” having 70 percent of the vote combined. This is one reason that a contested convention, while certainly possible, is less likely than some people might assume.
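The threshold arithmetic is easy to check with a short Python sketch. Two details here are assumptions rather than anything stated above: leftover delegates are handed out by largest remainder, and when nobody reaches 15 percent the cutoff falls to half the front-runner's share.

```python
import math


def allocate_delegates(votes, n_delegates, threshold=0.15):
    """Proportionally allocate delegates among candidates who clear the threshold.

    votes: dict of candidate -> vote count (or percentage) in one state or district.
    """
    total = sum(votes.values())
    shares = {name: count / total for name, count in votes.items()}
    top_share = max(shares.values())
    # Assumed fallback: if nobody reaches the threshold, halve the leader's share.
    cutoff = threshold if top_share >= threshold else top_share / 2
    viable = {name: votes[name] for name, share in shares.items() if share >= cutoff}
    viable_total = sum(viable.values())

    quotas = {name: n_delegates * count / viable_total for name, count in viable.items()}
    seats = {name: math.floor(quota) for name, quota in quotas.items()}
    leftover = n_delegates - sum(seats.values())
    # Largest remainder: extra delegates go to the biggest fractional parts.
    by_remainder = sorted(quotas, key=lambda name: quotas[name] - seats[name], reverse=True)
    for name in by_remainder[:leftover]:
        seats[name] += 1
    return seats


# The example from the text, with three made-up also-rans at 10 percent each.
result = allocate_delegates(
    {"Sanders": 40, "Warren": 30, "X": 10, "Y": 10, "Z": 10}, n_delegates=10
)
```

Here Sanders and Warren split all 10 delegates, 6 to 4, even though they hold only 70 percent of the vote between them.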

Ranked-choice voting and caucus viability thresholds

A few states do have special rules, however. Alaska, Hawaii, Kansas and Wyoming use ranked-choice voting, so that votes are reallocated based on voters’ second choices until all remaining candidates have at least 15 percent of the vote. To simulate this, the model imputes voters’ second choices based on what is essentially a nearest neighbors algorithm or what we call a proximity rating, where candidates who are more similar along ideological and other dimensions are assumed to be more likely second choices for voters (see Step 6 for more on this).
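A simplified version of that reallocation loop might look like the Python sketch below. It repeatedly eliminates the weakest candidate under 15 percent and hands their support to the remaining candidates, favoring those rated as closer. The `proximity` table and the rule of weighting by "current share times proximity" are assumptions for illustration, not the model's actual second-choice method.

```python
def reallocate_until_viable(shares, proximity, threshold=0.15):
    """Eliminate non-viable candidates one at a time and move their support.

    shares: dict of candidate -> first-choice vote share.
    proximity: dict of dict, proximity[a][b] = how likely a's voters are to
               pick b next (hypothetical scores; missing pairs default to 1.0).
    """
    remaining = dict(shares)
    while len(remaining) > 1:
        lowest = min(remaining, key=remaining.get)
        if remaining[lowest] >= threshold:
            break  # everyone left is viable
        freed = remaining.pop(lowest)
        weights = {
            name: share * proximity.get(lowest, {}).get(name, 1.0)
            for name, share in remaining.items()
        }
        total = sum(weights.values())
        for name in remaining:
            if total > 0:
                remaining[name] += freed * weights[name] / total
            else:
                remaining[name] += freed / len(remaining)
    return remaining


# Made-up shares; C's voters are assumed three times as likely to move to B.
final = reallocate_until_viable(
    {"A": 0.45, "B": 0.35, "C": 0.12, "D": 0.08},
    proximity={"C": {"A": 1.0, "B": 3.0}},
)
```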

The Iowa and Nevada caucuses take this process one step further by literally having voters physically realign themselves into preference groups until all candidates achieve a viability threshold, which is usually 15 percent of the vote (although it can be higher in precincts where a low turnout is expected). The model simulates this process also, which tends to improve the position of front-runners while making it harder for also-rans to finish with a respectable share of the vote. We’ll discuss the viability process in more detail at some point before Iowa.

Accounting for correlated errors

Although this isn’t nearly as important in the primaries as in the general election, where accounting for the correlations between states was critical to properly assessing Trump’s Electoral College odds, our model does assume that forecast error is partially correlated from state to state along demographic lines. It does this by simulating random perturbations in the demographic groups described in Step 2. In one simulation, for instance, Sanders might be predicted to outperform his polls among African-American voters, which would cause him to gain ground in that simulation in all states and districts with a large percentage of black voters.

In addition, we find that empirically, when there is more than one primary or caucus in a given day, forecast errors tend to be somewhat correlated across all states that day. For instance, if there were a number of races that were projected to be close on Super Tuesday, it wouldn’t be surprising if all or almost all of those races were won by the same candidate. The model accounts for this type of correlation also, which can affect the chances of receiving a large bounce from a multi-state primary day.
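The demographic correlation can be illustrated with a short Python sketch: draw one random shock per demographic group in each simulation, then apply it to every state in proportion to that group's share of the electorate there. The group names, state labels, shares and the `sd` value are all made up for the example.

```python
import random


def simulate_state_shifts(state_demographics, groups, rng, sd=0.05):
    """Draw one shock per group and translate it into a shift for each state.

    state_demographics: dict of state -> {group: share of the electorate}.
    Returns (shifts, shocks) for a single candidate in a single simulation.
    """
    shocks = {group: rng.gauss(0.0, sd) for group in groups}
    shifts = {
        state: sum(mix.get(group, 0.0) * shocks[group] for group in groups)
        for state, mix in state_demographics.items()
    }
    return shifts, shocks


rng = random.Random(2020)
demographics = {
    "state_a": {"black_voters": 0.60},  # hypothetical electorate shares
    "state_b": {"black_voters": 0.03},
}
shifts, shocks = simulate_state_shifts(demographics, ["black_voters"], rng)
```

Because both states share the same underlying shock, a simulation in which the candidate beats his polls with that group moves him in the same direction everywhere, and by far more in the state where the group is a larger part of the electorate.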

Step 5. Simulate bounces (or crashes) from winning (or losing) primaries

Probably the single most important statistical dynamic in the primaries is the bounce that candidates typically receive from winning or otherwise performing strongly in major primaries and caucuses. Likewise, candidates who underperform expectations can see their standing decline.

Bounces are a subject that we’ll frequently explore over the course of the next few months, so I’ll be relatively brief here. But here are a few heuristics the model uses to assign bounces based on our analysis of bounces following dozens of primaries and caucuses since 1980:

[Chart: Which states will produce the biggest bounces? Expected bounce magnitude according to FiveThirtyEight’s primary model]

Following primary and caucus nights, the model will build in anticipated bounces based on the results as best we can evaluate them as of late that evening or early the next morning. The model will then adjust its expectations once polls come in after the primary. The model treats the magnitude of bounces as being fairly uncertain — that is, in some simulations it will apply a relatively large bounce and in other simulations a relatively small one. Nonetheless, this process will potentially produce a lot of volatility following primary and caucus nights if the model overshoots or undershoots the bounce.

Step 6. Forecast the probability that candidates drop out

The final major step in the simulation is accounting for the possibility that candidates quit the race. On the one hand, these dropouts don’t always have an enormous direct effect on the standing of the remaining candidates. Candidates don’t usually drop out unless their situations are fairly desperate and most of their voters have already migrated to other candidates (or they never had many voters in the first place). On the other hand, the winnowing process is essential to the primaries concluding themselves in a tidy fashion without requiring a contested convention. In addition, there can be circumstances where a candidate dropping out works mostly to the benefit of one of the remaining candidates. Although these dynamics can be hard to simulate, the model does assume, for example, that a Warren dropout is more likely to help Sanders than Biden, and a Sanders dropout is more likely to help Warren than Biden, other factors held equal.

In estimating the probability that a candidate will drop out, the model considers the following factors — listed roughly in descending order of importance:

When a candidate drops out in one of the simulations, the model reassigns most of their voters to the other candidates (but returns a small share to the undecided pool). As is the case when the model is assigning undecided voters (see Step 1), the allocation is partly proportional (front-running candidates pick up more of the vote from dropped-out candidates) and partly divided equally among candidates with at least 20 percent of the vote. However, there is also another wrinkle …

Yes, Virginia, there are lanes in the primary — or at least there are proximity ratings

Despite occasional protestations to the contrary, there pretty clearly are “lanes” in the Democratic primary. For instance, Sanders voters are considerably more likely to have Warren and Andrew Yang as second choices than you’d anticipate from chance alone, and considerably less likely to have former South Bend, Indiana, Mayor Pete Buttigieg as a second choice.

If we had comprehensive data on voters’ second (and third, etc.) choices, we could incorporate that directly into the model. Unfortunately, we do not; only a few pollsters publish this data.

As an alternative, we came up with a way to categorize each candidate along four dimensions:

While there is inherently some subjectivity in these ratings, there is plenty of data to guide the characterizations of candidates, ranging from how often the candidates voted with Trump to how many endorsements they have to whether their support comes mostly from college-educated voters. Furthermore, we found that by rating the candidates along these four dimensions, we were able to replicate second-choice data fairly accurately. Finally, we found that in applying these ratings to past primary campaigns, we were better able to predict the dynamics of those primaries, such as how the vote shifted after candidates dropped out.

Here are the ratings we wound up with for the remaining 2020 candidates:

[Table: Proximity ratings for Democrats along four dimensions]

We are quite certain that some of you will have critiques of these ratings. But we think they’re a lot better than nothing in helping us to replicate the polling data, and — this is key — there is no inherent advantage in a candidate having a high score or a low score along any one of these dimensions.

Rather, it’s the proximity of a candidate to other candidates that counts. The model uses these proximity ratings in various, relatively subtle ways. But in general, candidates with high proximity to other candidates have more upside and downside in their polling (they can be everyone’s first choice and become the nominee … or they can be everyone’s third choice and flame out), whereas candidates who have carved out space to themselves have more stable numbers.


And those, ladies and gentlemen, are the major steps in our 2020 primary forecast. As new data comes in, the model will simulate the primary from that day forward and update its results. We’re hopeful, in fact, that there will be a large volume of polling soon after pollsters were largely dormant for the holidays — in which case the forecast could move a lot.

At the time we’re launching the forecast, however, that range of possible outcomes is wide. As of 11 a.m. ET on Jan. 9, for example, Biden is forecasted to win an average of 27 percent of the vote in Iowa. But the range of most likely outcomes covers everything from Biden winning as little as 5 percent to him winning as much as 49 percent. And while Biden is the most likely nominee — with a 42 percent chance of winning a majority of delegates through June 6 — he’s an underdog relative to the field.

The model isn’t “playing it safe” by having such large confidence intervals; it’s simply capturing the uncertainty that exists in the real world. Primaries are far more fluid than general elections. There is still plenty of time before Iowa.

In the meantime, though, please let us know if you see anything amiss. We’ll be on the lookout for bugs and fine-tuning things, especially for the first one or two weeks the model is running. (Did we mention that this is the first time we’re forecasting a nomination race? 😬) We’re also planning on recording a podcast answering reader and listener questions about our primary forecast — so if you have any, please send ’em in.

UPDATE (Jan. 29, 2020, 1:59 p.m.): We found and corrected a bug in how the model was accounting for home-state and home-region effects when projecting candidates’ bounces after each primary or caucus date. The change marginally improved Biden’s overall chances.

