We don’t know as much as we think (Big Change #1)

There are two major changes to my methodology, which you already see reflected in the new charts and graphics that are presently on the site. This is the easier of the two to explain, so let’s handle it first.

Andrew Gelman of Columbia University was kind enough to share some of his old national polling data with me. His dataset runs from 1952 through 1992. I took his data from 1988 and 1992 (before 1988, there are only a limited number of polls available), then combined it with the data I already had for 2000 and 2004, and tracked down some 1996 data in a magical place on the internet.

If you had looked back at the polls in June in the five previous election cycles, what would you have found?

In 1988, Michael Dukakis was ahead by an average of 8.2 points in 5 June polls. In November, George Bush won by 7.8 points.

In 1992, George Bush was ahead by an average of 4.9 points in 14 June polls. In November, Bill Clinton won by 5.6 points.

I don’t actually have any June polls for 1996 (if anybody’s sitting on a big stash of Clinton-Dole data, you know where to find me). But in Gallup’s July poll, Bill Clinton led by 17 points. In November, Clinton won by 8.5 points.

In 2000, George W. Bush was ahead by an average of 4.7 points in 14 June polls. In November, Al Gore won the popular vote by 0.5 points.

In 2004, John Kerry was ahead by an average of 0.9 points in 16 June polls (this was pretty much his high-water mark all year). In November, George W. Bush won by 2.4 points.

So in four out of the last five elections, an average of June polls would have incorrectly picked the winner of the popular vote. That’s kind of a problem for anybody who is overly confident about how this election is going to turn out.
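The tally above can be verified with a few lines of Python, using the margins quoted in the text (the 1996 entry uses Gallup's July number, as noted, since no June polls were available):

```python
# The five early-summer poll leads vs. November results quoted above, as
# (cycle, summer leader, summer margin, November winner, November margin).
cycles = [
    (1988, "Dukakis", 8.2, "Bush", 7.8),
    (1992, "Bush", 4.9, "Clinton", 5.6),
    (1996, "Clinton", 17.0, "Clinton", 8.5),  # July Gallup poll
    (2000, "Bush", 4.7, "Gore", 0.5),
    (2004, "Kerry", 0.9, "Bush", 2.4),
]

# Count how often the early-summer leader failed to win the popular vote.
misses = sum(1 for _, leader, _, winner, _ in cycles if leader != winner)
print(f"Wrong popular-vote winner in {misses} of {len(cycles)} cycles.")
# -> Wrong popular-vote winner in 4 of 5 cycles.
```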

Previously, I had modeled the error in our polling averages based on 2004 data (simply because that’s the data I had access to). The issue with that is that the polls were unusually stable in 2004. From April onward, John Kerry never held a lead of more than about 2 points in the Real Clear Politics national average, and George W. Bush never held a lead of more than 6 or 7 points. Those numbers pretty well framed the actual result of Bush +2.4. But as the Gelman data reveals, there was much more fluidity in previous years, and so modeling the error based on 2004 data alone would lead one to underestimate the degree of uncertainty inherent in a general election.

My error estimates are now modeled instead on the 1988-2004 dataset. I do give somewhat more weight to more recent cycles, as in general, the polls have tended to get closer to the actual margin more quickly in recent years. There are a few reasons to think this might not be an accident. For example: (i) the country has tended to get more partisan over time, meaning that there may be fewer truly undecided voters than there used to be; (ii) with the proliferation of the Internet and cable news, voters now have more information about the candidates sooner than they used to; and (iii) the science of polling has probably improved over time. Nevertheless, we are accounting for quite a bit more error than we had been before.
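The post doesn't publish its exact weighting scheme, but the idea of giving "somewhat more weight to more recent cycles" can be sketched with hypothetical weights. The per-cycle misses below are computed from the margins quoted earlier; the 1.5x-per-cycle weight is purely illustrative:

```python
# Absolute miss between the early-summer polling margin and the November
# margin for each cycle, derived from the figures quoted in the text
# (e.g. 1988: Dukakis +8.2 in June vs. Bush +7.8 in November = 16.0 points).
abs_miss = {1988: 16.0, 1992: 10.5, 1996: 8.5, 2000: 5.2, 2004: 3.3}

# Hypothetical recency weights -- NOT the actual scheme, just one way to make
# each cycle count half again as much as the cycle before it.
weights = {year: 1.5 ** i for i, year in enumerate(sorted(abs_miss))}

unweighted = sum(abs_miss.values()) / len(abs_miss)
weighted = sum(abs_miss[y] * weights[y] for y in abs_miss) / sum(weights.values())
print(f"unweighted mean miss: {unweighted:.1f}, recency-weighted: {weighted:.1f}")
```

With these (assumed) weights, the recency-weighted mean miss comes out in the mid-6s, versus 8.7 points unweighted, which is at least consistent with the 6-7 point figure cited below.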

If I look at the total miss for each poll based on the number of days until the election, I get the following, very pretty graph:

There is quite a lot of noise there, but the error can be modeled reasonably well as a function of the square root of the number of days until the election. Specifically, the curve I use looks like this:

Presently, with about 145 days to go until Election Day, we would anticipate that a typical national poll will be off by around 6-7 points. We do not know, unfortunately, which direction that miss is likely to be in. But there is reason to believe that the range of possible outcomes — including scenarios where the election doesn’t turn out to be especially close — is wider than we had been assuming before.
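The fitted curve itself isn't reproduced here, but a minimal sketch of its stated form — error as a function of the square root of days until the election — can be calibrated against the one figure in the text (roughly 6.5 points at 145 days out). The constant K is an assumption; the actual curve may include an intercept or other terms:

```python
import math

# Assumed form: expected miss = K * sqrt(days until election).
# K is calibrated so the curve returns ~6.5 points at 145 days out,
# matching the figure quoted in the text; it is not the post's actual fit.
K = 6.5 / math.sqrt(145)  # ~0.54 points per sqrt(day)

def expected_miss(days_until_election: int) -> float:
    """Expected absolute error of a typical national poll this far out."""
    return K * math.sqrt(days_until_election)

print(f"{expected_miss(145):.1f} points at 145 days out")
print(f"{expected_miss(30):.1f} points a month out")
```

Under this assumed form, the expected miss shrinks smoothly as the election approaches, falling to around 3 points a month out.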

Nate Silver is the founder and editor in chief of FiveThirtyEight.