Pollster Ratings, v3.1

Thanks to some helpful comments on the previous version of these ratings, I have cleaned up several sloppy mathematical assumptions. I have also added new polling data from roughly 12 contests in the 2004 primaries into the database.

I am going to go ahead and duplicate most of the (long) discussion from the previous version, so that this article stands on its own. For those who are just browsing through, however, here are the topline results: our estimate of the amount of error that each pollster introduces as a result of methodological error. Note that these estimates are designed to be applied to general election polling data; the errors would be much higher across the board if we were looking at the primaries.

Full Methodological Description

I presently have a database of 171 different electoral contests since 2000 that have been surveyed by at least three of the 32 pollsters that I include in my study. These include contests for President, Senate and Governor, as well as polls from the presidential primaries. A poll is included if (i) it is the last poll that an agency put out before the election, and (ii) it was released within two weeks of the election date.

My goal is to isolate the instance of methodological error, or what I call Pollster-Introduced Error (PIE). If you look at a poll at any given time before an election, there are essentially three different sources of error:

Total Error = Sampling Error + Temporal Error + PIE

Sampling Error is the error that pollsters typically report in their margin of error calculations — the error intrinsic to sampling only a subset of the entire voting population. All polls have sampling error, though of course a pollster can reduce it by including more interviews in its sample.

Temporal Error is the error introduced by taking a poll weeks or months before an election. Temporal Error is a major consideration now, in April, when we are looking at polls of the November general election; many things can happen between now and then, and (contrary to the common perception) it is not up to the pollster to predict the future. Temporal Error is incorporated into our model in terms of the uncertainty we build into our estimates. For purposes of the pollster ratings, however, we can ignore Temporal Error (that is, assume it to be zero), because we are limiting our evaluation to polls taken very near to the election date.

That leaves us with our final source of error, Pollster-Introduced Error. PIE is what a tennis aficionado might call “Unforced Error”; it is error that results from poor methodology. As a matter of practice, all pollsters have some PIE, which is why the actual margins of error are always larger than those espoused by the pollster. However, this amount varies fairly significantly from agency to agency — which is the impetus for producing these ratings.

To find the PIE, we will first calculate the Total Error for each pollster, and then deduct its Sampling Error. We begin with a calculation called the Raw Total Error (RTE), which is simply the average number of ‘points’ by which a pollster missed the final margin in a given contest, weighted by the number of pollsters that surveyed that contest. We then compare this against what we call the Iterated Average Error (IAE), which is average error for other pollsters that surveyed the same contest, as determined by a multiple iteration method. This step is functionally equivalent to that which we employed in Version 2.0 of our pollster analysis and is described at greater length there. We then subtract the IAE from the RTE to produce a +/- rating. A negative rating means that the pollster outperformed its peers, while a positive rating means that it underperformed. The +/- score for each pollster in our database is indicated below:
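A simplified sketch of the +/- calculation follows. It assumes hypothetical data and replaces the multiple-iteration IAE with a single-pass average of the other pollsters' misses in the same contests, so it illustrates the idea rather than reproducing the actual method. All names and numbers are placeholders.

```python
from collections import defaultdict

# Hypothetical data: (contest, pollster, miss on the final margin in points).
polls = [
    ("contest_a", "Pollster X", 2.0),
    ("contest_a", "Pollster Y", 4.0),
    ("contest_a", "Pollster Z", 6.0),
    ("contest_b", "Pollster X", 1.0),
    ("contest_b", "Pollster Y", 5.0),
    ("contest_b", "Pollster Z", 3.0),
    ("contest_b", "Pollster W", 3.0),
]


def plus_minus(polls, pollster):
    """Weighted own error minus weighted peer error for one pollster.

    Each contest is weighted by the number of pollsters that surveyed it.
    The peer figure here is a plain one-pass average (an assumption), not
    the iterated average described in the article.
    """
    by_contest = defaultdict(dict)
    for contest, name, miss in polls:
        by_contest[contest][name] = miss

    own_total = peer_total = weight_total = 0.0
    for misses in by_contest.values():
        if pollster not in misses or len(misses) < 2:
            continue  # nothing to compare against in this contest
        weight = len(misses)
        peers = [m for name, m in misses.items() if name != pollster]
        own_total += weight * misses[pollster]
        peer_total += weight * sum(peers) / len(peers)
        weight_total += weight

    if weight_total == 0:
        raise ValueError("pollster has no contests shared with other pollsters")
    return own_total / weight_total - peer_total / weight_total


rating_x = plus_minus(polls, "Pollster X")  # negative: beat its peers
rating_z = plus_minus(polls, "Pollster Z")  # positive: trailed its peers
```

In this toy data, the sign convention matches the text: a pollster that misses by less than its peers in the same contests receives a negative rating.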

The next step is to move from Raw Total Error to Adjusted Total Error (ATE); this in turn requires two substeps, each of which is fairly simple.

Firstly, we add back in the average error for all general election polls in our database to the +/- rating. This figure happens to be 3.81 points. Note that, in Version 3.0 of these ratings, the global average error was determined based on a combination of primary and general election polling. This is because I was worried that the average error in primary polls had been especially high thus far in 2008 (7.17 points), and I was concerned that there was something intrinsic to this electoral cycle that was leading to the higher errors. However, I have now added more polls from previous years’ primaries to the database, and determined that the average error was just as high in those primary cycles (in fact, somewhat higher; the average error was 8.27 points over about 16 contests between the 2000 and 2004 primaries). Polling primaries is inherently much, much more difficult than polling general elections for three principal reasons: (i) voters selecting from among several different candidates in their own party often like several of the candidates — it therefore takes less to move them from one candidate to another; (ii) voters have less information about the candidates at the primary stage of the process, and their preferences may change quickly as they obtain more information — they also often make up their minds late; (iii) turnout is much less predictable in primaries than it is in general elections. However, since this is a website dedicated to general election polling, we use general election polls to calibrate our results.

The next step is to regress the results of this calculation to the mean. Contrary to what you might hear from a pundit, who might evaluate a pollster based on the success of its most recent poll, there is a lot of ‘luck’ involved in polling. Looking at just a couple of polls will tell you very little about a pollster, and even with as many as 100 polls, there is still fairly significant regression to the mean.

This regression is based on a calculation of the standard error of the mean; more specifically, we regress to the mean error expected based on the pollster’s sample size. Note, however, that we do not regress to the mean for two agencies: Zogby Interactive and Columbus Dispatch. This is because these two pollsters use unconventional methodologies (Internet-based polls and mail polls, respectively), which evidently have resulted in very poor outcomes. There is no reason to give a pollster credit for regression toward the mean when it uses an untested methodology that should intrinsically be associated with larger methodological errors.

Adding back in the global average error and then regressing to the mean gives us our Adjusted Total Error. We are almost ready to infer the Pollster-Introduced Error, but first we need to determine how much of the ATE is attributable to Sampling Error.

The expected sampling error associated with a given sample size is fairly trivial to determine, and can be inferred by means of a binomial distribution. It is approximated by the formula…

80 * n^(-.5)

…where ‘n’ is the number of respondents in the poll. Technically, the expected sampling error varies according to the intrinsic distribution of candidate preferences throughout the population; the sampling error will be somewhat larger when preferences are divided 50:50 than when they are skewed 80:20 toward one candidate. This can be understood intuitively by the fact that when 100 percent of the population prefers a given candidate — think a Baath Party election — your poll will return a perfect result regardless of its sample size. This effect is so trivial, however, that it may be ignored for purposes of this analysis.
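A minimal sketch of the approximation, using a hypothetical function name. The comment about where the constant 80 might come from is an assumption offered for intuition; the article itself does not derive it.

```python
import math


def expected_sampling_error(n):
    """Approximate expected error on the margin, in points, for n respondents.

    Implements the article's approximation 80 * n^(-.5).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 80 * n ** -0.5


# Assumption (not stated in the article): 80 is close to 100 * sqrt(2 / pi),
# the mean absolute error of a normally distributed margin whose standard
# deviation is 100 / sqrt(n) points in an evenly divided two-way race.
ASSUMED_CONSTANT = 100 * math.sqrt(2 / math.pi)

for n in (500, 600, 1200):
    print(n, round(expected_sampling_error(n), 2))
```

Because the error shrinks with the square root of the sample size, quadrupling the number of respondents only halves the expected sampling error.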

Thus, we simply average the expected sampling error for all polls that we have in our database from a given pollster, weighted based on the number of pollsters that surveyed that contest. This result is known as the Average Expected Sampling Error (AESE). The AESE varies from pollster to pollster because some pollsters habitually include more respondents than others, although all but a few gravitate toward the 500-700 respondent range over the long run.
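The weighted average described above might look something like the following sketch. The sample sizes and pollster counts are hypothetical, and the function name is a placeholder.

```python
def average_expected_sampling_error(polls):
    """Weighted mean of 80 * n^(-.5) across one pollster's polls.

    polls: list of (sample_size, pollsters_in_contest) tuples, where the
    second element (how many pollsters surveyed that contest) is the weight.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sample_size, pollsters_in_contest in polls:
        weighted_sum += pollsters_in_contest * 80 * sample_size ** -0.5
        weight_total += pollsters_in_contest
    if weight_total == 0:
        raise ValueError("no polls supplied")
    return weighted_sum / weight_total


# Hypothetical pollster with three final polls in the database.
example_polls = [(500, 3), (600, 5), (700, 4)]
aese = average_expected_sampling_error(example_polls)
print(round(aese, 2))
```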

The PIE — Pollster Introduced Error — can then be determined by deducting the AESE from the ATE. However, this is not a simple, linear subtraction; instead the errors are related by the sum-of-squares formula. (This was one of those steps that we had gotten wrong in Version 3.0 of this analysis). Specifically:

ATE = sqrt(AESE^2 + PIE^2)

…and therefore:

PIE = sqrt(ATE^2 - AESE^2)

The PIEs for each pollster (the same ones we listed in the very first table in this article) are provided below:

The basic way to interpret the PIE is that it is the amount of error that a pollster introduces because of imperfect methodology, in addition to any error that results from its finite sample size. A SurveyUSA poll, for instance, adds only about six-tenths of a point of error more than a methodologically perfect pollster would, while a Marist poll adds more than two-and-a-half points of error. We also apply a generic PIE of +2.19 to all unknown pollsters. This is slightly higher than the PIE of the average pollster because we have found that there is some relationship between the number of polls that a pollster produces and its average error: pollsters that release a lot of polls tend to be better than average (no doubt because they get more repeat business).
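The sum-of-squares relationship between ATE, AESE and PIE can be sketched as below. The function names are placeholders, and clamping at zero when the ATE falls below the AESE is an assumption made only to keep the sketch well-defined.

```python
import math


def total_error(aese, pie):
    """Combine sampling error and pollster-introduced error by sum of squares."""
    return math.sqrt(aese ** 2 + pie ** 2)


def pollster_introduced_error(ate, aese):
    """Back out PIE from the Adjusted Total Error and the AESE.

    Assumption: if ATE is smaller than AESE, return 0 rather than taking
    the square root of a negative number.
    """
    return math.sqrt(max(ate ** 2 - aese ** 2, 0.0))


# Round trip with illustrative figures: 3.58 points of sampling error
# combined with 1.22 points of PIE, then recovered again.
ate = total_error(3.58, 1.22)
recovered_pie = pollster_introduced_error(ate, 3.58)
print(round(ate, 2), round(recovered_pie, 2))
```

Note that the errors do not add linearly: 3.58 and 1.22 combine to roughly 3.78, not 4.80.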

The essential lesson here is not to give deference to ‘name-brand’ pollsters, or those that are associated with large news organizations. Two of our top three are Internet-friendly operations that use the IVR (‘robocall’) method, while several other highly-rated pollsters are boutique or academic shops that limit their polling to a given region. The strongest pollsters associated with a large news organization are Mason-Dixon, which conducts polls for The McClatchy Company and MSNBC, and Market Shares, which polls for Tribune Company. However, other major media polling shops, such as FOX/Opinion Dynamics and CBS/New York Times, have considerably underperformed over time. Likewise some ‘brand name’ pollsters like Zogby and Gallup have had quite poor results.

The Final Step: Translating Ratings into Weightings

However, we are not quite done yet. While these numbers are interesting in the abstract, they do not tell us how to weight polls when we have more than one poll for a given contest. This final step is resolved by figuring what I call the Effective Sample Size.

Suppose that we have a Rasmussen poll with 500 respondents. We have specified Rasmussen’s long-run PIE to be 1.22 points, whereas the average sampling error associated with a poll of 500 respondents is 3.58 points. The total expected error for this poll is determined by the sum-of-squares of these two figures, or 3.78 points.

To determine the Effective Sample Size, we must answer the following question: how many respondents would a methodologically perfect pollster have to include for its poll to have an expected error of 3.78 points? That can be determined by the following equation:

Effective Sample Size = 6400 * (Total Expected Error^-2)
448 = 6400 * (3.78 ^ -2)

In this case, the answer is 448. That is, we would be indifferent between a Rasmussen poll consisting of 500 respondents and a theoretically perfect poll consisting of 448 respondents. Thus, the Effective Sample Size for this poll is 448. Rasmussen is a strong pollster, and so this is a strong result. By contrast, a Zogby poll consisting of 500 respondents would have an Effective Sample Size of 353, whereas a Columbus Dispatch poll of 500 respondents would have an Effective Sample Size of just 93!
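The Rasmussen example can be reproduced with a short sketch. The function name is hypothetical; the arithmetic follows the two formulas given above (80 * n^(-.5) for sampling error and 6400 * error^-2 for the Effective Sample Size).

```python
def effective_sample_size(n, pie):
    """Respondents a methodologically perfect poll would need to match
    the total expected error of an n-respondent poll with the given PIE."""
    if n <= 0:
        raise ValueError("n must be positive")
    sampling_error_sq = 6400 / n              # (80 * n^(-.5)) squared
    total_error_sq = sampling_error_sq + pie ** 2
    return 6400 / total_error_sq


# Rasmussen example from the text: 500 respondents, long-run PIE of 1.22.
rasmussen_ess = effective_sample_size(500, 1.22)
print(round(rasmussen_ess))
```

With a PIE of zero the function returns the actual sample size, which is what "methodologically perfect" means in this framework.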

The weight we assign to each poll is directly proportional to the Effective Sample Size. For aesthetic purposes, we represent the weighting as a ratio, taken relative to a poll of 600 respondents by a pollster of average quality (long-run PIE of 1.96). Such a poll would have an Effective Sample Size of 441. Thus, the weighting for the Rasmussen poll would be 448 / 441 = 1.02, for the Zogby poll would be 353 / 441 = 0.80, and for the Columbus Dispatch poll would be 93 / 441 = 0.21. Note that this is before applying any penalties for the recentness of a poll. The recentness rating is still applied as described in the FAQ, and multiplied by the Effective Sample Size calculation to produce the final weighting.
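A sketch of the weighting ratio, before any recentness penalty. The function and constant names are placeholders; the baseline figures (600 respondents, PIE of 1.96) come from the paragraph above.

```python
def effective_sample_size(n, pie):
    """Effective Sample Size for an n-respondent poll with the given PIE."""
    return 6400 / (6400 / n + pie ** 2)


# Baseline: 600 respondents from a pollster of average quality (PIE 1.96).
BASELINE_ESS = effective_sample_size(600, 1.96)


def poll_weight(n, pie):
    """Weight of a poll relative to the baseline, before recentness."""
    return effective_sample_size(n, pie) / BASELINE_ESS


rasmussen_weight = poll_weight(500, 1.22)
print(round(BASELINE_ESS), round(rasmussen_weight, 2))
```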

This method has one very interesting quality — namely, that the extent to which we weight one poll relative to another depends on the amount of data we have. When the sample sizes are small, the Sampling Error is large relative to the Pollster-Introduced Error, and therefore there is relatively little difference in the weights assigned to pollsters of varying quality. As sample sizes approach infinity, however, the Sampling Error dwindles toward zero, and nearly the entire difference in the quality of different polls is determined by the Pollster-Introduced Error.

This can be seen in the graph below, where we compare the Effective Sample Sizes of a ‘good’ pollster (SurveyUSA) and a ‘bad’ pollster (American Research Group) at various levels of actual sample sizes.

As you can see, the Effective Sample Size for the ARG poll levels off much more quickly than the Effective Sample Size for the SurveyUSA poll. If we have just a little bit of data from each pollster, our primary concern is simply reducing the effects of Sampling Error, and so we are relatively indifferent about which pollsters we use. However, once we have a lot of data, we have the luxury of being far more discriminating. We’ll take nearly as much SurveyUSA data as we can get our hands on, for instance, but for ARG, we are much more cautious, because no matter how much ARG data we have, it is still going to be subject to ARG’s methodological deficiencies. In fact, even if we had an infinite amount of ARG data, it would still max out at an Effective Sample Size of 1,113! (SurveyUSA’s theoretical maximum, by contrast, goes all the way up to 17,335.)
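The leveling-off can be illustrated numerically. As the sample size grows, the sampling term vanishes and the Effective Sample Size approaches 6400 / PIE^2. The PIE values below are not taken from the ratings table; they are backed out from the two theoretical maxima quoted above, which is an assumption of this sketch.

```python
import math


def effective_sample_size(n, pie):
    """Effective Sample Size for an n-respondent poll with the given PIE."""
    return 6400 / (6400 / n + pie ** 2)


def max_effective_sample_size(pie):
    """Limit of the Effective Sample Size as n grows without bound."""
    return 6400 / pie ** 2


def pie_from_max(max_ess):
    """Invert the limit: the PIE implied by a quoted theoretical maximum."""
    return 80 / math.sqrt(max_ess)


arg_pie = pie_from_max(1113)      # implied by the ARG maximum in the text
susa_pie = pie_from_max(17335)    # implied by the SurveyUSA maximum

for n in (500, 1200, 5000, 50000):
    print(n,
          round(effective_sample_size(n, susa_pie)),
          round(effective_sample_size(n, arg_pie)))
```

The implied SurveyUSA figure is consistent with the "about six-tenths of a point" mentioned earlier.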

You might wonder why I have the graph extending all the way out to 5,000 respondents when in practice it is very rare to find a poll with more than about 1,200 or 1,500 respondents. The reason is that while we might not find any one poll with a very large number of respondents, we may get effectively the same thing if we have several different polls from the same agency. For example, in the run-up to the Pennsylvania primary, Strategic Vision conducted general election trial heats of 1,200 respondents in each of five consecutive weeks. In many respects, this is similar to one poll of 6,000 respondents. And from the standpoint of our pollster weightings, we treat it largely the same way.

Specifically, when we have more than one poll from a given agency in a given state, we aggregate the sample sizes, and perform the Effective Sample Size calculation on this basis. We then subtract out the Effective Sample Sizes of any more recent surveys from that pollster to determine the Marginal Effective Sample Size, or MESS (yes, I know you are getting sick of all these acronyms). For example, for a set of polls like the Strategic Vision polls, the calculation would work as follows:

                    Sample Sizes
Date       Actual   Cumulative   Effective   MESS
April 19   1200     1200         697         697
April 12   1200     2400         982         285
April 5    1200     3600         1138        156
March 29   1200     4800         1235        97
March 22   1200     6000         1302        67

What we are doing here is giving priority to a pollster’s most recent poll in a given state. So Strategic Vision’s most recent (April 19th) poll gets credit for its full Effective Sample Size, or 697 people. However, because we aggregate the sample sizes, further polling from Strategic Vision is penalized. If we aggregate the two most recent polls from Strategic Vision (April 19 and April 12), we get a combined Effective Sample Size of 982 respondents. But since we already spent 697 of that 982-person budget on the April 19th poll, we have only 285 people left over for the April 12th poll. This is the Marginal Effective Sample Size (MESS) for that poll.
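The aggregation can be sketched as follows. The PIE of 1.96 used for Strategic Vision is an assumption (roughly the value implied by the figures in the table, not one stated in the text), so the output should land within a point or two of the table rather than match it exactly.

```python
def effective_sample_size(n, pie):
    """Effective Sample Size for an n-respondent poll with the given PIE."""
    return 6400 / (6400 / n + pie ** 2)


def marginal_effective_sample_sizes(sample_sizes, pie):
    """Return (cumulative n, cumulative ESS, MESS) for each poll.

    sample_sizes must be ordered from most recent to oldest, so that the
    newest poll gets its full Effective Sample Size and each older poll
    gets only what it adds to the running total.
    """
    rows = []
    cumulative = 0
    previous_ess = 0.0
    for n in sample_sizes:
        cumulative += n
        ess = effective_sample_size(cumulative, pie)
        rows.append((cumulative, ess, ess - previous_ess))
        previous_ess = ess
    return rows


# Five weekly polls of 1,200 respondents; assumed PIE of 1.96.
rows = marginal_effective_sample_sizes([1200] * 5, pie=1.96)
for cumulative, ess, mess in rows:
    print(cumulative, round(ess), round(mess))
```

By construction the marginal figures sum to the Effective Sample Size of the pooled 6,000 respondents, so no data is counted twice.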

The central concept is pretty intuitive: the more data we get from a given pollster, the more we encounter diminishing returns.

Note that in addition to the above calculation, we also penalize older polls based on their recentness rating. So effectively we have two different ways to penalize redundant polling from the same agency. But it is a mistake to throw out old polls from the same agency completely; getting SurveyUSA’s sloppy seconds may be as good as getting virgin results from a lot of pollsters.

Nate Silver is the founder and editor in chief of FiveThirtyEight.
