For a couple of weeks now, I’ve been working from a new version of my pollster ratings, which account in a more sophisticated way for the relationship between the quality of the pollster and the sample size of its polls. The Pennsylvania primary provides as good an excuse as any for me to refresh those ratings, as well as to do a more thorough job of explaining them.
And be forewarned … the explanation that follows is very thorough, and fairly technical in some places. Eventually, I will endeavor to do a quicker-and-dirtier version of this, but I want to have the detail on the record.
I presently have a database of 159 different electoral contests since 2000 that have been surveyed by at least three of the 32 pollsters that I include in my study. This includes contests for President, Senate and Governor, as well as polls from the presidential primaries (almost all of which are from 2008).
The central difference between the approach I took originally and the one I’m taking this time around is that I’m trying to specifically isolate the incidence of methodological error, or what I call Pollster-Introduced Error (PIE). If you look at a poll at any given time before an election, there are essentially three different sources of error:
Total Error = Sampling Error + Temporal Error + PIE
Sampling Error is the error that pollsters typically report in their margin of error calculations — the error intrinsic to sampling only a subset of the entire voting population. All polls have sampling error, though of course a pollster can reduce it by including more interviews in its sample.
Temporal Error is the error introduced by taking a poll weeks or months before an election. Temporal Error is a major consideration now, in April, when we are looking at polls of the November general election; many things can happen between now and then, and (contrary to the common perception) it is not up to the pollster to predict the future. Temporal Error is incorporated into our model in terms of the uncertainty we build into our estimates. For purposes of the pollster ratings, however, we can ignore Temporal Error (that is, assume it to be zero), because we are limiting our evaluation to polls taken very near to the election date.
That leaves us with our final source of error, Pollster-Introduced Error. PIE is what a tennis aficionado might call “Unforced Error”; it is error that results from poor methodology. As a matter of practice, all pollsters have some PIE, which is why the actual margins of error are always larger than those espoused by the pollster. However, this amount varies fairly significantly from agency to agency — which is the impetus for producing these ratings.
For purposes of our study, PIE is inferred by taking the total error, and subtracting out the Sampling Error. For example, in its recent poll of Pennsylvania, Mason-Dixon predicted Hillary Clinton to beat Barack Obama by 5 points. In actuality, Hillary Clinton beat Barack Obama by 9.1 points. That is a total error of 4.1 points. However, some of this error is not the pollster’s “fault”, but instead results from its finite sample size. Specifically, Mason-Dixon surveyed 625 people for its poll; on average, a poll of 625 respondents will miss the final margin by 3.2 points because of sampling error alone. Thus, the Pollster-Introduced Error for this poll is 4.1 points less 3.2 points, or 0.9 points. If Mason-Dixon had surveyed more people, we would attribute less of its error to Sampling Error, and more to Pollster-Introduced Error.
(Note: the average error associated with a given sample size is fairly trivial to determine, and can be inferred by means of a binomial distribution. It is approximated by the formula…
80 * n^(-.5)
…where ‘n’ is the number of respondents in the poll. Technically, the average sampling error varies according to the intrinsic distribution of candidate preferences throughout the population; the sampling error will be somewhat larger when preferences are divided 50:50 than when they are skewed 80:20 toward one candidate. This can be understood intuitively by the fact that when 100 percent of the population prefers a given candidate — think a Baath Party election — your poll will return a perfect result regardless of its sample size. This effect is small enough, however, that it may be ignored for purposes of this analysis.)
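To make the arithmetic concrete, here is a minimal sketch in Python of the decomposition described above. The 80 / sqrt(n) approximation and the Mason-Dixon figures come from the text; the function names are my own.

```python
# A minimal sketch of the error decomposition: Total Error = Sampling Error + PIE
# (Temporal Error is taken to be zero for polls conducted just before the election).

def sampling_error(n):
    """Average error on the margin attributable to sample size alone: 80 / sqrt(n)."""
    return 80 * n ** -0.5

def pollster_introduced_error(predicted_margin, actual_margin, n):
    """PIE = total error minus the portion explained by sampling error."""
    total_error = abs(actual_margin - predicted_margin)
    return total_error - sampling_error(n)

# Mason-Dixon in Pennsylvania: Clinton +5 predicted, Clinton +9.1 actual, 625 respondents
print(round(sampling_error(625), 1))                       # 3.2
print(round(pollster_introduced_error(5.0, 9.1, 625), 1))  # 0.9
```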
In this way, we calculate the PIE for every poll in our database. (In some cases, the PIE for an individual poll will be less than zero. While it is impossible for a pollster to have a PIE less than zero over the long run, we retain any sub-zero result for purposes of calculating the pollster’s average PIE.) The PIE is then compared, by means of an iteration method, against that of other pollsters which surveyed the same contest, a result that we describe as Iterated Average Error (IAE). This step is functionally equivalent to that which we employed in the previous version of our pollster analysis; as before, we weight the results by the number of pollsters that surveyed a given state.
The PIEs and IAEs for each pollster in our database are indicated below:
The salient number in this table is the +/- rating. This is the extent to which the pollster overperformed or underperformed the other pollsters that surveyed the same races. For example, a SurveyUSA poll has introduced, on average, 1.13 points less error (PIE) than an average pollster’s poll, while a CBS/New York Times poll has introduced 0.81 points more error than average.
In order to translate these numbers into something that we can apply as pollster weightings, we need to take three additional steps. Two of these steps are trivial, while the third is relatively complicated.
The first step is to regress the +/- rating to the mean. This is based on a straightforward calculation of the standard error of the mean. However, note that we do not regress to the mean for two agencies: Zogby Interactive and Columbus Dispatch. This is because these two pollsters use unconventional methodologies — Internet-based polls and mail polls, respectively, which evidently have resulted in very poor outcomes. There is no reason to give a pollster credit for regression toward the mean when it uses an untested methodology that should intrinsically be associated with larger methodological errors.
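The text does not spell out the exact shrinkage formula, so the sketch below shows just one plausible way of regressing a +/- score toward the mean using the standard error of the mean; the variance figures and the weighting scheme here are my own illustrative assumptions, not the calculation actually used for the ratings.

```python
# Illustrative only: a generic shrinkage toward the mean (zero, i.e. the average
# pollster), weighted by how noisy the pollster's own average is.  The actual
# formula behind the ratings may differ.

def regress_to_mean(raw_plus_minus, n_polls, per_poll_sd, between_pollster_var):
    """Shrink a raw +/- score toward zero in proportion to its sampling noise."""
    sem_squared = per_poll_sd ** 2 / n_polls          # squared standard error of the mean
    shrinkage = between_pollster_var / (between_pollster_var + sem_squared)
    return shrinkage * raw_plus_minus

# Hypothetical inputs chosen only for illustration: 60 polls, a 3-point
# per-poll standard deviation, and a between-pollster variance of 0.35.
print(round(regress_to_mean(-1.13, 60, 3.0, 0.35), 2))   # -0.79 (the text's actual regressed figure for SurveyUSA is -0.77)
```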
The second step is to add back in the average PIE for all pollsters. For example, a SurveyUSA poll might be 1.13 points better than an average poll (or 0.77 points better following regression to the mean), but what is its PIE in absolute terms? The average PIE for all polls in our sample is approximately 1.80 points; however, this includes a disproportionate number of primary election polls, which inherently tend to be associated with larger errors, whereas this site is focused on general election polls. To somewhat reduce these effects, we take the average PIE from among the five election cycles that we have in our database, which works out to 1.49 points:
Cycle               PIE
2000 - General      1.35
2002 - General      1.24
2004 - General      0.30
2006 - General      0.64
2008 - Primaries    3.90
Global Average      1.49
You should note that we still give some weight to the primary election polls from 2008 in calculating our average PIE. This is because the polls in this election cycle have been much, much less accurate than polls in previous (general election) cycles. While my suspicion is that this has mostly to do with the nature of polling primaries, rather than the nature of the 2008 electoral cycle, I cannot say this for certain because I have very little data from previous years’ primaries to look at. Therefore, we hedge our bets a little bit (this should be considered a conservative assumption).
Adding the global average PIE to the +/- score for our pollster (following regression to the mean) produces its long-run PIE. For example, SurveyUSA’s raw +/- score is -1.13, or -0.77 following regression to the mean. Adding 1.49 to this figure produces 0.72. That is, SurveyUSA introduces, on average, 0.72 more points of error than a methodologically perfect poll would (i.e., one having only sampling error and no pollster-introduced error). This, as it happens, is an exceptionally strong result — the best in our database. The long-run PIEs for all 32 pollsters are provided below.
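For the record, here is how the global average and the long-run PIE fall out of the figures above; a quick sketch, with variable names of my own choosing.

```python
# Cycle-level PIEs from the table above; the global average is their simple mean.
cycle_pie = {
    "2000 general":   1.35,
    "2002 general":   1.24,
    "2004 general":   0.30,
    "2006 general":   0.64,
    "2008 primaries": 3.90,
}
global_average_pie = sum(cycle_pie.values()) / len(cycle_pie)
print(round(global_average_pie, 2))           # 1.49

# Long-run PIE = regressed +/- score plus the global average PIE
print(round(-0.77 + global_average_pie, 2))   # 0.72 -- SurveyUSA
```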
The essential lesson here is not to give deference to ‘name-brand’ pollsters, or those that are associated with large news organizations. Of our six strongest pollsters, the first two are Internet-friendly operations that use the IVR (‘robocall’) method, while the next four are boutique or academic shops that limit their polling to a given region. The strongest pollster associated with a large news organization is Mason-Dixon, which conducts polls for The McClatchy Company and MSNBC. However, other major media polling shops, such as FOX / Opinion Dynamics and CBS/New York Times, have considerably underperformed over time. Likewise, some ‘brand name’ pollsters, like Zogby and Gallup, have had quite poor results.
The Final Step: Translating Ratings into Weightings
However, we are not quite done yet. While these numbers are interesting in the abstract, they do not tell us how to weight polls when we have more than one poll for a given contest. This final step is resolved by figuring what I call the Effective Sample Size.
Suppose that we have a Rasmussen poll with 500 respondents. We have specified Rasmussen’s long-run PIE to be 0.88 points, whereas the average sampling error associated with a poll of 500 respondents is 3.58 points. The total expected error for this poll is the sum of these two figures, or 4.46 points.
Total Expected Error = Sampling Error + Long-Run PIE
4.46 = 3.58 + 0.88
To determine the Effective Sample Size, we must answer the following question: how many respondents would a methodologically perfect poll have to include for it to have an expected error of 4.46 points? That can be determined by the following equation:
Effective Sample Size = 6400 * (Total Expected Error^-2)
322 = 6400 * (4.46 ^ -2)
In this case, the answer is 322. That is, we would be indifferent between a Rasmussen poll consisting of 500 respondents and a theoretically perfect poll consisting of 322 respondents. Thus, the Effective Sample Size for this poll is 322. Since Rasmussen is a strong pollster, this is actually a fairly good result. By contrast, a Zogby poll consisting of 500 respondents would have an Effective Sample Size of 228, whereas a Columbus Dispatch poll of 500 respondents would have an Effective Sample Size of just 64!
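Putting the pieces together, here is a minimal Python version of that calculation; the function names are mine, and the Rasmussen inputs are the ones quoted above.

```python
# Effective Sample Size: the number of respondents a methodologically perfect
# poll would need in order to match this poll's total expected error.

def sampling_error(n):
    return 80 * n ** -0.5

def effective_sample_size(n, long_run_pie):
    total_expected_error = sampling_error(n) + long_run_pie
    return 6400 * total_expected_error ** -2

# Rasmussen: 500 respondents, long-run PIE of 0.88
print(round(effective_sample_size(500, 0.88)))   # 322
```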
The weight we assign to each poll is directly proportional to the Effective Sample Size. For aesthetic purposes, we represent the weighting as a ratio, taken relative to a poll of 600 respondents by a pollster of average quality (long-run PIE of 1.49). Such a poll would have an Effective Sample Size of 283. Thus, the weighting for the Rasmussen poll would be 322/283 = 1.14, for the Zogby poll would be 228 / 283 = 0.81, and for the Columbus Dispatch poll would be 64 / 283 = 0.23. Note that this is before applying any penalties for the recentness of a poll. The recentness rating is still applied as described in the FAQ, and multiplied by the Effective Sample Size calculation to produce the final weighting.
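And the weighting itself is just a ratio of Effective Sample Sizes. A short, self-contained sketch (again, the names are mine), using the 600-respondent, average-quality reference poll described above:

```python
def effective_sample_size(n, long_run_pie):
    return 6400 * (80 * n ** -0.5 + long_run_pie) ** -2

# Reference poll: 600 respondents from a pollster of average quality (PIE = 1.49)
reference_ess = effective_sample_size(600, 1.49)          # ~283

def poll_weight(n, long_run_pie):
    """Weight before any recentness penalty is applied."""
    return effective_sample_size(n, long_run_pie) / reference_ess

print(round(poll_weight(500, 0.88), 2))   # 1.14 -- the Rasmussen example
```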
This method has one very interesting quality — namely, that the extent to which we weight one poll relative to another depends on the amount of data we have. When the sample sizes are small, the Sampling Error is large relative to the Pollster-Introduced Error, and therefore there is relatively little difference in the weights assigned to pollsters of varying quality. As sample sizes approach infinity, however, the Sampling Error dwindles toward zero, and nearly the entire difference in the quality of different polls is determined by the Pollster-Introduced Error.
This can be seen in the graph below, where we compare the Effective Sample Sizes of a ‘good’ pollster (SurveyUSA) and a ‘bad’ pollster (American Research Group) at various levels of actual sample sizes.
As you can see, the Effective Sample Size for the ARG poll levels off much more quickly than the Effective Sample Size for the SurveyUSA poll. If we have just a little bit of data from each pollster, our primary concern is simply reducing the effects of Sampling Error, and so we are relatively indifferent about which pollsters we use. However, once we have a lot of data, we have the luxury of being far more discriminating. We’ll take nearly as much SurveyUSA data as we can get our hands on, for instance, but for ARG, we are much more cautious, because no matter how much ARG data we have, it is still going to be subject to ARG’s methodological deficiencies. In fact, even if we had an infinite amount of ARG data, it would still max out at an Effective Sample Size of 2,053! (SurveyUSA’s theoretical maximum, by contrast, goes all the way up to 12,499).
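Those ceilings follow directly from the formula: as the sample size goes to infinity, the sampling-error term vanishes, leaving a maximum Effective Sample Size of 6400 divided by the square of the long-run PIE. A quick check (the small discrepancy with the 12,499 quoted above presumably reflects an unrounded PIE):

```python
# As n -> infinity, sampling error goes to zero, so the ceiling is 6400 / PIE**2.
def max_effective_sample_size(long_run_pie):
    return 6400 / long_run_pie ** 2

print(round(max_effective_sample_size(0.72)))   # ~12,346 with the rounded SurveyUSA PIE
```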
You might wonder why I have the graph going all the way out to 5,000 respondents when in practice it is very rare to find a poll with more than about 1,200 or 1,500 respondents. The reason is that while we might not find any one poll with a very large number of respondents, we may get effectively the same thing if we have several different polls from the same agency. For example, in the run-up to the Pennsylvania primary, Strategic Vision conducted general election trial heats of 1,200 respondents in each of five consecutive weeks. In many respects, this is similar to one poll of 6,000 respondents. And from the standpoint of our pollster weightings, we treat it largely the same way.
Specifically, when we have more than one poll from a given agency in a given state, we aggregate the sample sizes, and perform the Effective Sample Size calculation on this basis. We then subtract out the Effective Sample Sizes of any more recent surveys from that pollster to determine the Marginal Effective Sample Size, or MESS (yes, I know you are getting sick of all these acronyms). For example, for a set of polls like the Strategic Vision polls, the calculation would work as follows:
Date        Actual sample   Cumulative sample   Effective sample   MESS
April 19             1200                1200                452    452
April 12             1200                2400                672    220
April 5              1200                3600                824    150
March 29             1200                4800                941    117
March 22             1200                6000               1035     94
What we are doing here is giving priority to a pollster’s most recent poll in a given state. So Strategic Vision’s most recent (April 19th) poll gets credit for its full Effective Sample Size, or 452 people. However, because we aggregate the sample sizes, further polling from Strategic Vision is penalized. If we aggregate the two most recent polls from Strategic Vision (April 19 and April 12), we get a combined Effective Sample Size of 672 respondents. But since we already spent 452 of that 672-person budget on the April 19th poll, we have only 220 people left over for the April 12th poll. This is the Marginal Effective Sample Size (MESS) for that poll.
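For concreteness, here is a sketch of that bookkeeping. Strategic Vision’s long-run PIE is not quoted in the text, so the code backs one out (about 1.45) from the first row of the table; expect the output to differ from the table by a point or two of rounding.

```python
# Marginal Effective Sample Size (MESS): aggregate a pollster's samples in a
# state, compute the cumulative Effective Sample Size, and credit each older
# poll only with the increment it adds beyond the more recent polls.

def effective_sample_size(n, long_run_pie):
    return 6400 * (80 * n ** -0.5 + long_run_pie) ** -2

def marginal_ess(sample_sizes, long_run_pie):
    """sample_sizes is ordered newest-first, as in the Strategic Vision table."""
    results, cumulative_n, already_credited = [], 0, 0.0
    for n in sample_sizes:
        cumulative_n += n
        cumulative_ess = effective_sample_size(cumulative_n, long_run_pie)
        results.append(round(cumulative_ess - already_credited))
        already_credited = cumulative_ess
    return results

# Long-run PIE of ~1.4535 is inferred from the table's first row (452 effective
# respondents from an actual 1200); it is not stated directly in the text.
print(marginal_ess([1200] * 5, 1.4535))   # [452, 220, 152, 117, 95] -- vs. 452, 220, 150, 117, 94 in the table
```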
The central concept is pretty intuitive: the more data we get from a given pollster, the more we run into diminishing returns. (Note that, in addition to the above calculation, we also penalize older polls based on their recentness rating. So effectively we have two different ways to discount redundant polling from the same agency.)
* * *
I recognize that this is an awful lot to digest, so set this aside and come back later with an aperitif. In the meantime, perhaps we’ll try another Q&A session later today.