Bernie Sanders’s win in Michigan last week was a massive upset relative to the pre-election polls of the state’s voters, which had shown Hillary Clinton ahead by an average of 21 percentage points. In fact, Sanders may have pulled off the biggest upset in the history of primary polling, eclipsing the previous record from 1984, when Gary Hart beat Walter Mondale in New Hampshire despite having trailed him by 17 percentage points.
When you consider Michigan’s demographics, however, the result wasn’t all that shocking. Michigan Democrats are fairly liberal and the state has a lot of college students — both factors that help Sanders. We aren’t just making this up as we go along; last month, we published state-by-state targets for the Clinton-Sanders race based on a few simple demographic variables in each state: specifically, its racial composition, how liberal or conservative it was, and how rural it was. Those targets had Sanders ahead of Clinton by 4 percentage points in Michigan.
Does that mean we called the upset in Michigan weeks ahead of time? No, we weren’t quite that good or lucky. The targets were based on a hypothetical race in which Clinton and Sanders were each winning about half the vote and half the delegates nationally. Since Clinton is ahead of Sanders nationally, she still would have been favored in our model (although not by the blowout margin that polls suggested).
Either way, the big gap between polls and demographics makes us nervous, especially because three more Midwestern states are voting today, including Ohio, where Clinton leads Sanders by about 11 percentage points in the polls. Historically, a margin like that would be quite safe: hence our polling model’s conclusion that Clinton is a 97 percent favorite. But after what just happened in Michigan? I’d love to drop a few bucks on Sanders if a bookmaker offered 30-to-1 odds against him, as our polling model does.
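The 30-to-1 figure is just a probability restated as odds. A minimal sketch of that conversion in Python follows; the function name `odds_against` is ours, for illustration only.

```python
def odds_against(win_probability: float) -> float:
    """Return fair odds against an outcome, expressed as 'X-to-1'."""
    if not 0.0 < win_probability < 1.0:
        raise ValueError("win_probability must be strictly between 0 and 1")
    return (1.0 - win_probability) / win_probability


# Clinton at 97 percent leaves Sanders at 3 percent.
sanders_probability = 1.0 - 0.97
print(f"{odds_against(sanders_probability):.0f}-to-1 against")
```

A 3 percent chance works out to a bit more than 32-to-1 against, which is where the roughly 30-to-1 figure comes from.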
Fortunately, even if the polls haven’t been great, the conditions¹ are potentially favorable for making demographic forecasts of the Democratic race. In 2008, under similar circumstances, I made demographic-based predictions of the Democratic race (see here for my North Carolina prediction, for example), which often outperformed the polls.
Those predictions in 2008 were based on regression analysis. They took advantage of the fact that Democrats report their vote by congressional district, which makes the sample more robust; by the time North Carolina voted eight years ago, for instance, hundreds of diverse congressional districts had already weighed in. So we’re overdue to apply the same technique this year.
In contrast to the demographic benchmarks we set in February, which were based on polling data, these are based on actual votes so far, aggregated across congressional districts. We can then compare these votes against demographic and attitudinal variables in each congressional district. For a more technical description of the analysis, see the footnotes.² But basically, we’re just looking for sensible variables that have done a good job of explaining the split in the vote between Clinton and Sanders so far. The ones we included in the model are as follows:
- The share of African-Americans is the best predictor of the Democratic vote to date, with Clinton performing significantly better in congressional districts with more black voters.
- Clinton also performs slightly better in districts with more Hispanic voters, although the magnitude of the effect is considerably smaller than that for black voters.
- Sanders performs better in districts that express liberal attitudes on social policy.³
- Sanders performs better in districts with major colleges, as measured by the number of people employed in postsecondary education in each district.
- As other researchers have found, Clinton performs better in the South, even after controlling for other factors.⁴
- Sanders performs better in districts where more voters are in labor-union households.
- Clinton performs better in districts where voters are more in favor of gun control.⁵
- Sanders performs better in caucuses relative to primaries, other factors held equal.
This regression analysis⁶ models the vote by congressional district reasonably well. We can aggregate the congressional district projections to come up with state forecasts. Here’s what they would have said about the states to have voted so far:
[Table: retrodictive vote share based on demographics vs. actual vote share, for each state that has voted so far]
Our demographic “retrodiction”⁷ for Michigan still has Clinton winning, but only barely: by 3 percentage points, compared with the actual 2-point win for Sanders. Especially under the Democrats’ proportional allocation method, that’s a pretty minor difference. The model’s retrodictions in Vermont and Arkansas are also pretty far off, as you can see, but that makes sense given potential home-state effects for Sanders and Clinton in those states.
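The step of rolling district-level projections up into a statewide number could look something like the sketch below. Weighting each district by its expected turnout is our assumption, since the text does not say how districts are weighted, and the district figures are placeholders.

```python
def aggregate_state_share(districts: list[tuple[float, float]]) -> float:
    """Combine district projections into a statewide vote share.

    Each entry is (projected_share, expected_votes). Weighting by expected
    votes is an ASSUMPTION for this sketch, not the article's stated method.
    """
    total_votes = sum(votes for _, votes in districts)
    if total_votes <= 0:
        raise ValueError("expected votes must sum to a positive number")
    return sum(share * votes for share, votes in districts) / total_votes


# HYPOTHETICAL three-district state (invented numbers).
example_state = [(0.60, 50_000), (0.45, 30_000), (0.50, 20_000)]
print(f"statewide share: {aggregate_state_share(example_state):.3f}")
```

Because the larger districts carry more weight, a candidate's statewide number can differ noticeably from a simple average of the district projections.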
Other results are a bit harder to explain. How did Clinton (barely) win the Iowa caucuses when she got crushed in other Midwest caucus states, like Kansas and Minnesota? How did Sanders lose Massachusetts after winning New Hampshire by so much? How did Sanders win Oklahoma by 10 percentage points?
I have my theories — Clinton’s ground game may have saved her in Iowa, for instance — but my goal isn’t to explain away every last bit of variance (in which case I’d be guilty of overfitting my model). Instead, it’s to have reasonably sensible demographic-based projections that pass the smell test when applied to future states. Here are those forecasts, starting with the five states that will vote on Tuesday:
[Table: by date and state, the Clinton share, Sanders share and Sanders win probability under the forecast based on demographics and results in past primaries, alongside the same three figures from the “polls-only” forecast]
The numbers in Ohio jump out, since they suggest — in contrast to the polls — a very close race between Sanders and Clinton. After accounting for the uncertainty in the forecasts, the demographic model gives Sanders a 42 percent chance of winning Ohio, much better than the 3 percent chance that our “polls-only” forecast gives to him.
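One common way to turn a projected margin and its uncertainty into a win probability is to treat the forecast error as normally distributed. The sketch below assumes exactly that. The normal assumption, the function name and the example inputs are ours, not details of the model described here.

```python
from statistics import NormalDist


def win_probability(projected_margin: float, margin_std_dev: float) -> float:
    """Probability that the true margin ends up above zero.

    ASSUMPTION: forecast error is normal with the given standard deviation.
    projected_margin is in percentage points, positive when the candidate leads.
    """
    if margin_std_dev <= 0:
        raise ValueError("margin_std_dev must be positive")
    return 1.0 - NormalDist(mu=projected_margin, sigma=margin_std_dev).cdf(0.0)


# HYPOTHETICAL inputs: a candidate trailing by 5 points, with a 10-point standard deviation.
print(f"{win_probability(-5.0, 10.0):.2f}")
```

Under this kind of setup, a candidate who trails by a modest margin keeps a real chance of winning when the uncertainty is large, while the same deficit with a small standard deviation leaves almost none.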
The news isn’t as good for Sanders in Missouri. There, the demographic model concludes that polls showing the race to be essentially tied are slightly too generous to Sanders; it forecasts Clinton to win by 9 percentage points.
In Illinois, the polls have been all over the place, with recent surveys showing everything from a 42-point lead for Clinton to a 2-point lead for Sanders. Our weighted polling average has Clinton up by 7 points there, and the demographic model is largely in agreement, forecasting a 9-point win for Clinton.
Finally, both polls and demographics imply that Clinton is likely to win by blowout margins in North Carolina and Florida. If Sanders were to win or come close in one of those states, it would be an even bigger upset than Michigan and would suggest that something fundamental had changed in the Democratic race.
For clarity: These are forecasts based on the results so far, as opposed to benchmarks of what might happen in a hypothetical 50-50 race between Clinton and Sanders. If the candidates hit their forecasts on the nose in every state, Clinton would wind up winning by about 10 percentage points nationally. Thus, Sanders needs to substantially beat and not just tie these numbers to have a shot at the nomination. If you like, you can turn them into benchmarks by adding a net of 10 percentage points to Sanders. For instance, while the forecast in Connecticut is Clinton +3, the benchmark would be Sanders +7.
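The forecast-to-benchmark adjustment is simple enough to write down directly. In this sketch, margins are expressed as Sanders minus Clinton in percentage points, a sign convention we chose for the example. The 10-point shift comes from the paragraph above.

```python
NATIONAL_CLINTON_LEAD = 10.0  # points, per the national total implied by the forecasts


def forecast_to_benchmark(sanders_margin: float,
                          national_gap: float = NATIONAL_CLINTON_LEAD) -> float:
    """Shift a forecast margin (Sanders minus Clinton) to a 50-50 benchmark."""
    return sanders_margin + national_gap


def describe(margin: float) -> str:
    """Format a Sanders-minus-Clinton margin as 'Candidate +N'."""
    if margin == 0:
        return "Tie"
    leader = "Sanders" if margin > 0 else "Clinton"
    return f"{leader} +{abs(margin):g}"


connecticut_forecast = -3.0  # Clinton +3 in the forecast
print(describe(connecticut_forecast), "->",
      describe(forecast_to_benchmark(connecticut_forecast)))
```

Running this on the Connecticut example turns the Clinton +3 forecast into the Sanders +7 benchmark.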
Since Sanders has lost ground to Clinton in the states to have voted so far, however, even that would not suffice for him to win the nomination; he’d have to beat these forecasts by something like 15 percentage points instead. It would be pretty shocking — but then again, Sanders has proven he can win when the odds are against him.