Are Bad Pollsters Copying Good Pollsters?

The political polling industry is a mess. Fewer and fewer people are willing to respond to telephone surveys, particularly automated ones, and the costs of live interviews are climbing ever higher. Meanwhile, polls have gained prominence in the political media (FiveThirtyEight itself is a part of this trend), and the Internet seems willing to give a home to almost any survey.

Demand is up, and quality supply is down. The result: pollsters that use nontraditional methodologies such as online and automated surveys are getting more press than ever, and they get included in the models of the main polling aggregators, including FiveThirtyEight, HuffPost Pollster and RealClearPolitics.

The problem is that many of these nontraditional polls may be cheating, adjusting their results to resemble higher-quality polls. We can see this by looking at polling from the final three weeks of Senate campaigns since 2006: in races without traditional, live-interview surveys (what we’ll call gold-standard polling), nontraditional polls have had significantly higher errors than they’ve had in races with at least one gold-standard poll. Gold-standard surveys appear to be the LeBron Jameses of the polling world: They make everyone around them better.

That’s how it’s supposed to work in basketball, but not in polling, and it is a major problem for anyone watching 2014’s races. No gold-standard poll has been released to the public at all for Alaska’s Senate race, in the past three months for Arkansas’s or Kentucky’s Senate races, ever for Louisiana’s likely Senate runoff, or in nearly four months for North Carolina’s Senate race. The only polls we can consider in these races were conducted by pollsters who, as a group, have historically fared considerably worse when the gold-standard pollsters weren’t around.

But let’s back up for a moment: What’s a nontraditional poll? One that doesn’t abide by the industry’s best practices.¹ So, a survey is nontraditional if it:

doesn’t follow probability sampling;
doesn’t use live interviewers;
is released by a campaign or campaign groups (because these only selectively release data);
doesn’t disclose (i.e., doesn’t release raw data to the Roper Archives, isn’t a member of the National Council on Public Polls, or hasn’t signed onto the American Association for Public Opinion Research transparency initiative).

Everything else is a gold-standard poll.
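For readers who want to apply this definition to their own data, here is a minimal sketch of the rule in Python. The `Poll` class and its field names are hypothetical labels for the four criteria above, not fields from the FiveThirtyEight database.

```python
from dataclasses import dataclass


@dataclass
class Poll:
    # Hypothetical field names, one per criterion in the list above.
    probability_sample: bool   # follows probability sampling
    live_interviewers: bool    # uses live interviewers
    campaign_release: bool     # released by a campaign or campaign group
    discloses: bool            # meets at least one of the disclosure tests


def is_gold_standard(poll: Poll) -> bool:
    """A poll is gold-standard only if it trips none of the four criteria."""
    return (
        poll.probability_sample
        and poll.live_interviewers
        and not poll.campaign_release
        and poll.discloses
    )


live_public = Poll(True, True, False, True)
automated = Poll(True, False, False, True)
print(is_gold_standard(live_public))  # True
print(is_gold_standard(automated))    # False
```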

The FiveThirtyEight polling database has 865 polls conducted in the final three weeks² of Senate campaigns since 2006 (spread across 122 different elections). Of those, 224 polls were conducted in races without gold-standard polling:

Another 147 were conducted by gold-standard pollsters:

The remaining 494 surveys were conducted by nontraditional pollsters in races where at least one gold-standard poll was conducted in the final three weeks:

This chart, which plots the polling errors for these three groups using a local regression, makes clear the effect of having a gold-standard pollster in the field at the end of the campaign:

It’s easy to see why we have a problem. In races with no gold-standard pollster, the nontraditional pollsters have had individual polling errors about 0.6 to 4.3 percentage points higher³ than when at least one gold-standard pollster is active in the race. Gold-standard pollsters’ error rates were about 1.5 to 3.1 percentage points lower during the same period.
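The article doesn’t say which local-regression variant produced the chart. As a rough sketch of the general idea, the Python below fits a weighted straight line around each day and reads off the fitted value there. It assumes a fixed bandwidth and tricube weights, and it skips the robustness passes that full LOWESS implementations usually add. The function name, the bandwidth and the data are all made up for illustration.

```python
def local_linear(xs, ys, x0, bandwidth):
    """Tricube-weighted straight-line fit around x0, evaluated at x0."""
    # Points farther than `bandwidth` from x0 get weight zero.
    weights = [(1 - min(abs(x - x0) / bandwidth, 1.0) ** 3) ** 3 for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)


# Made-up data: error grows steadily with days until the election.
days_out = list(range(22))
errors = [3.0 + 0.1 * d for d in days_out]
smoothed = [local_linear(days_out, errors, d, bandwidth=7.0) for d in days_out]
```

Because the made-up errors lie exactly on a line, the smoothed curve reproduces them. With real, noisy poll errors, the curve would trace the local trend for each group instead.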

On average, the gold-standard polls in the final 21 days of Senate campaigns had an absolute mean error of about 3.8 percentage points. The nontraditional pollsters in those same races had an average error of 4.3 points. Those are fairly close, but when no gold-standard pollsters were active, the mean error rate for the nontraditional polls shot up to 6 percentage points.⁴
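The three-way comparison above can be sketched in a few lines of Python. This is an illustration only: the polls are invented, and it assumes error is measured as the absolute gap between a poll’s margin and the actual margin, which the article doesn’t spell out.

```python
from collections import defaultdict
from statistics import mean

# Invented rows: (race, is_gold_standard, poll_margin, actual_margin).
polls = [
    ("A", True, 5.0, 2.0),
    ("A", False, 7.0, 2.0),
    ("A", False, -1.0, 2.0),
    ("B", False, 10.0, 3.0),
    ("B", False, -2.0, 3.0),
]


def mean_error_by_group(polls):
    """Mean absolute error for the three groups of polls described above."""
    races_with_gold = {race for race, gold, _, _ in polls if gold}
    errors = defaultdict(list)
    for race, gold, poll_margin, actual_margin in polls:
        if gold:
            group = "gold"
        elif race in races_with_gold:
            group = "nontraditional, gold present"
        else:
            group = "nontraditional, no gold"
        errors[group].append(abs(poll_margin - actual_margin))
    return {group: mean(errs) for group, errs in errors.items()}


result = mean_error_by_group(polls)
```

Note that the grouping depends on the race, not just the pollster: the same nontraditional firm lands in a different bucket depending on whether a gold-standard poll was in the field.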

But can’t you just throw all these nontraditional polls into an average? The error rates above, after all, are for individual surveys. Sites such as FiveThirtyEight, HuffPost Pollster and RealClearPolitics average polls in some fashion in the hope of lowering the error rate. And while this works to a degree, you can’t average the nontraditional polls together to make the accuracy gap between the races where gold-standard pollsters are active and those where they aren’t disappear.

There were 47 Senate elections since 2006 in which only nontraditional pollsters were active in the final three weeks and at least two polls were taken. The mean error of the average of polls in these races was 5.1 percentage points.

On the other hand, there were 55 Senate elections since 2006 in which at least one gold-standard pollster was active in the final three weeks and at least two polls of any type were taken. The mean error of the average of these polls was 3.1 percentage points.

Both of these error rates do suggest that averaging polls leads to lower error rates, but you need better polls in order to get the best predictions.
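A small sketch shows why averaging helps only so much. It uses invented margins and two hypothetical helper functions: when polls miss in opposite directions, the errors cancel in the average; when they all miss the same way, averaging changes nothing.

```python
from statistics import mean


def error_of_average(poll_margins, actual_margin):
    """Absolute error of the polling average for one race."""
    return abs(mean(poll_margins) - actual_margin)


def average_of_errors(poll_margins, actual_margin):
    """Mean absolute error of the individual polls in one race."""
    return mean([abs(m - actual_margin) for m in poll_margins])


# Polls miss in opposite directions: the average lands close to the result.
scattered = [7.0, -1.0]
print(error_of_average(scattered, 2.0), average_of_errors(scattered, 2.0))

# Polls all miss in the same direction: averaging does not help.
skewed = [10.0, 8.0]
print(error_of_average(skewed, 3.0), average_of_errors(skewed, 3.0))
```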

So, why do the nontraditional pollsters seem to do worse in races where there aren’t gold-standard pollsters conducting polls? The above chart looks similar to one produced by Princeton University graduate student Steven Rogers and Vanderbilt University professor of political science Joshua Clinton, who studied interactive voice response (IVR) surveys in the 2012 Republican presidential primary. (IVR pollsters are in our nontraditional group.) Rogers and Clinton found that IVR pollsters were about as accurate as live-interview pollsters in races where live-interview pollsters surveyed the electorate. IVR pollsters were considerably less accurate when no live-interview poll was conducted. This effect held true even when controlling for a slew of different variables. Rogers and Clinton suggested that the IVR pollsters were taking a “cue” from the live pollsters in order to appear more accurate.

My own analysis hints at the same possibility. The nontraditional pollsters did worse in races without a live pollster. But could something other than outright copying be going on here? I checked a few possibilities, but the effect remains even when we account for when the poll was conducted (since gold-standard and nontraditional pollsters could be in the field at different times), the turnout rate (since elections with lower turnout could be more difficult to project than those with higher turnout), and the state in which the race took place (since some states may simply be more difficult to poll than others):

Note how the coefficient for “any gold” is quite negative and the p-value “P>|t|” is well below the traditional 0.01 threshold. That means nontraditional polls taken in states where live pollsters were active had significantly lower error rates even when accounting for these other variables.
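To show what a coefficient like “any gold” measures, here is a bare-bones least-squares sketch in plain Python. It is not the model behind the table: the data are invented and noise-free, the only control is a hypothetical days-before-election variable, and it estimates coefficients without computing p-values.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Columns: intercept, any_gold (1 if a gold-standard poll was in the race),
# days before the election. All values are invented.
X = [[1, 0, 2], [1, 0, 10], [1, 0, 20], [1, 1, 3], [1, 1, 12], [1, 1, 18]]
# Invented errors: 2 points lower when a gold-standard poll is present.
y = [6.0 - 2.0 * gold + 0.1 * days for _, gold, days in X]

intercept, any_gold, days_coef = ols(X, y)
```

Here the `any_gold` coefficient comes back at about -2: holding timing fixed, polls in races with a gold-standard pollster have errors two points lower. Controls for the state would enter the same way, as one 0/1 column per state.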

You might be wondering if the gold-standard pollsters themselves may be checking out one another’s results. I ran those regressions as well. While there may be some effect, the case here is far weaker once you control for the state in which the election was held.

None of this proves guilt, but it does raise the possibility some pollsters may be peeking at their neighbors’ papers. And to be clear, we shouldn’t avoid discussing 2014 races that don’t have gold-standard polling data; nontraditional polls still have some predictive value. But we should acknowledge that the forecasting ability of these polls in these races is considerably worse on average.

Overall, though, Senate polling since 2006 paints a potentially troubling picture for the future of public electoral polling. The gold-standard pollsters aren’t releasing much data this year and probably won’t in the future as the cost of producing top-quality surveys climbs. And the nontraditional pollsters simply haven’t performed as well as a group when gold-standard pollsters aren’t around.

Footnotes

¹ Not every pollster that fails to meet the gold-standard criteria is a bad pollster. There are rigorous pollsters, such as SurveyUSA and YouGov, that don’t use live interviewers. We’ll get more into this part of the story at a later date.
² When judging polling, we typically look at the final three weeks of a campaign because polling conducted earlier may simply reflect changes in the nature of the race.
³ Depending on when in the three weeks the poll is taken.
⁴ When you don’t count the notoriously bad Zogby Interactive, partisan pollsters and shady pollsters such as Research 2000 and Strategic Vision, that average error rate still only falls to 5.3 points.
