Why Aren’t There More Pollster Ratings?

Over the past week, Pollster.com has published roughly five different critiques of my pollster ratings. While some of their arguments are pedantic, many others are well considered, and they may motivate some changes when we publish a new set of pollster ratings, likely in 5-8 weeks. The majority of the critiques, however, do not point toward anything wrong with our model per se, but instead simply toward assumptions which could have been made differently, or toward challenges which are intrinsic to the exercise, or toward what are essentially political problems that have nothing to do with the quality of the research.

One reason that our pollster ratings attract so much attention, undoubtedly, is that they’re really the only product of their kind. People seem to think they require some kind of Herculean effort. Mark writes:

Finally, you simply have to give Nate credit both for the sheer chutzpah necessary to take on the Everest-like challenge of combining polls from so many different types of elections spanning so many years into a single scoring and ranking system. It’s a daunting task.

I have to disagree with this. Building the pollster ratings was not an Everest-like challenge. It was more like scaling some minor peak in the Adirondacks: certainly a hike intended for experienced climbers, but nothing all that prohibitive. I’d guess that, from start to finish, the pollster ratings required something like 100 hours of work. That’s a pretty major project, but it’s not as massive as our Presidential forecasting engine, or PECOTA, or the Soccer Power Index, all of which took literally months to develop, or even something like the neighborhoods project I worked on with New York magazine, which took a team of about ten of us several weeks to put together. Nor is anything about the pollster ratings especially proprietary. For the most part, we’re using data sources that are publicly available to anyone with an Internet connection, and which are either free or cheap. And then we’re applying some relatively basic, B.A.-level regression analysis. Every step is explained very thoroughly.

So I get a little bit perturbed when I see people who clearly have the skill set to develop their own set of pollster ratings instead invest so much time in debating assumptions which we’ve acknowledged are debatable, or in pointing out flaws which we’ve disclosed ourselves. It would not take them orders of magnitude more time to simply develop their own products from scratch. And it would do a lot more good for the community. If, for instance, several different techniques all found that pollster XYZ is bad, but pollster EFG is good, that would give us more confidence about those results. Meanwhile, some techniques might find pollster PDQ to be relatively strong, while others would find it to be relatively weak. There, we’d have to proceed with more caution.

I don’t know if the critics are interested in doing that. Indeed, the title of Blumenthal’s article is: “Rating Pollster Accuracy: How Useful?”. To be fair, given that the article is not really a critique of pollster ratings as an abstraction, but instead of the particular assumptions that we at FiveThirtyEight have made, it should probably instead have been headlined something like “Pollster Ratings Might or Might Not Be Useful, But We Think Nate Silver’s Approach Kind of Sucks.” But that would be neither pithy nor well-mannered.

But take the title of Blumenthal’s article at face value. It is a fairly audacious statement. There are some fields such as earthquake forecasting (which I write about in my book) in which the signal-to-noise ratio is incredibly low. People have been trying to predict earthquakes for millennia and have made essentially no progress. There are legitimate debates about whether this is something that researchers ought to focus their attention on at all.

In polling, the signal-to-noise ratio is lower than we’d like. In our estimation, the best pollsters are about 1.5 or 2.0 points of error better than the worst ones, on average. In contrast, the error according to sample variance alone is something like 3.5 points per poll. But it is not hopelessly low, and the task is not inherently more daunting than, say, trying to determine which baseball players are the most skilled based on their batting averages. There too, the difference between a great hitter and a mediocre one is relatively small in absolute terms, and boils down to getting an extra hit once every ten at-bats or so. But this doesn’t mean that there aren’t real differences in skill among baseball players, nor that batting averages, even over relatively small samples, don’t tell us something. Nobody would think to write an article entitled: “Rating Baseball Players: How Useful?”
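To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It is not part of the ratings model, and its assumptions are made purely for illustration: the 3.5-point figure is treated as the standard deviation of a single poll's error, polls are treated as independent, and a skill gap counts as "visible" once it exceeds two standard errors of the difference between two pollsters' average errors. The function name `polls_needed` and its parameters are hypothetical.

```python
import math


def polls_needed(skill_gap, per_poll_noise=3.5, z=2.0):
    """Polls per pollster needed before a true skill gap stands out from noise.

    Assumes (for illustration only) independent polls whose errors have
    standard deviation `per_poll_noise`. The standard error of the difference
    between two pollsters' n-poll averages is then per_poll_noise * sqrt(2 / n).
    Returns the smallest n for which skill_gap >= z * that standard error.
    """
    n = 2.0 * (z * per_poll_noise / skill_gap) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # The 1.5- and 2.0-point gaps are the figures quoted in the text.
    for gap in (1.5, 2.0):
        print(f"gap of {gap} points: about {polls_needed(gap)} polls per pollster")
```

Under these simplified assumptions, a gap of 1.5 to 2.0 points becomes distinguishable after a few dozen polls per pollster, which is consistent with the signal being weak but not hopeless.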

The question of whether rating pollsters is useful at all would be more salient if, as in the case of earthquake prediction, many different approaches had been attempted, and they were either in gross disagreement with one another or demonstrated very little skill at all. Then we might reasonably conclude that this was intrinsically a quixotic endeavor, and that you might as well just assume that all pollsters were about equally skilled and take a simple average of their figures. But it is premature to pose a question like that when so little serious work has been done, particularly when the person posing the question is manifestly capable of doing it, and has chosen not to.

I doubt that Blumenthal’s reluctance to pursue the question reflects any lack of industriousness on his part. Instead it may reflect a cautiousness motivated by sentiments like these, which he expresses in his article:

Now I want to make clear that I do not question Silver’s motives in regressing to different means. I am certain he genuinely believes the NCPP/AAPOR adjustment will improve the accuracy of his election forecasts. If the adjustment only affected those forecasts — his poll averages — I probably would not comment. But they do more than that. His adjustments appear to significantly and dramatically alter rankings prominently promoted as “pollster ratings,” ratings that are already having an impact on the reputations and livelihoods of individual pollsters.

That’s a problem.

I had two reactions when I read this excerpt.

The first reaction is that the aggregate demand for polling is probably something close to a fixed quantity. If one pollster gets fired, another one gets hired. Someone’s business shrinks, but someone else’s grows.

On what basis are those decisions made now? I don’t mean to sound overly cynical, but it is probably something like: who has the best handshake, who has the best phone voice, who put together the best PowerPoint, who has the nicest-looking website, who knew so-and-so at such-and-such business school, and who is the best bullshitter relative to their price point. To the extent that performance is taken into account, it is probably in specious ways, such as how the pollster fared in the last two or three polls they conducted for the media outlet, or in whichever set of surveys the pollster happened to cherry-pick, which tells you almost nothing. It is extremely arbitrary. In contrast, we offer an evaluation which at least attempts to be empirical, objective, unbiased, comprehensive, and robust. Perhaps it is imperfect (of course it is; any other approach would be too), but it ought to significantly contribute to the market’s ability to spend its money efficiently.

The second reaction is that it is indeed problematic, and even a little dangerous, for any one site like FiveThirtyEight to have that much influence over the market. I don’t know whether we have that kind of influence or not. But it isn’t my problem, except to the extent that I find it a bit creepy. Instead, it’s the “fault” of a marketplace that has failed to develop alternatives to our ratings, in spite of the barriers to entry being relatively low. I’m sure that if Pollster.com developed its own set of pollster ratings, or Real Clear Politics did, they’d also be taken seriously, and deservedly so. The same would be true if some qualified academic did, or even some “outsider” like Tom Tango who had strong skills in applied research.

But this hasn’t happened, for reasons I don’t fully understand. I think part of it is that the overlap of skills and interests required to motivate something like a set of pollster ratings occurs relatively rarely outside of academia, and the culture of the Academy is very conservative in many ways. And outside of academia, deigning to rate anything from restaurants to baseball players usually results in a lot of people being fed up with you; it’s not the best way to make friends. The Pollster Ratings are a product of which I’m immensely proud, but it would be better if they had some competition.
