Rating pollsters is at the core of FiveThirtyEight’s mission and forms the backbone of our forecasting models. But it has been two years since we last revised our ratings. Here, at last, is an update: we have both substantially increased the amount of data we are evaluating and significantly refined our methodology.
Before we proceed, an important point of context. These ratings are designed to form an objective assessment of pollster quality with respect to one particular function: their aptitude in accurately forecasting election outcomes when they release polls into the public domain in the period immediately prior to an election. The emphasis is ours, of course. The ratings may not tell you very much about how accurate a pollster is when probing non-electoral public policy questions, in which case things like proper question wording and ordering become much more important. They may not tell you very much about how accurate a pollster is far in advance of an election, when definitions of things like “likely voters” are much more ambiguous. And they may not tell you very much about how accurate pollsters are when acting as internal pollsters on behalf of campaigns, when their work is kept private; some of the pollsters with the lowest-rated publicly released polling are among the first I’d hire if I were a campaign director.
Nevertheless, this is a pretty massive undertaking. Not only have I updated the database with results from the 2008 General Election (as well as elections to date in 2009 and 2010), but I’ve also backfilled the data going back to 1998, substantially improving coverage and quality. The database now contains 4,670 distinct polls from 264 distinct pollsters covering 869 distinct electoral contests — about treble what it did before.
Which Polls are Included
The database covers all minimally competitive [1] general elections for the offices of President, U.S. Senate, U.S. Representative and Governor in the period from November 3, 1998 through June 1, 2010, including run-offs, “jungle primaries” and special elections, in which at least one poll was released into the public domain in the final 21 days of the campaign. It also covers all Presidential primary elections during this period. Senate and gubernatorial primaries are included as of 2010 (but not before); primaries for the U.S. House are not included. Although the vast majority (93 percent) of the database covers polling at the state or district level, national polls for the office of President are also included, as is generic ballot polling for the U.S. House, which is compared against the outcome of the national popular vote. [2]
Polls are included if their median field date was within 21 days of the election. But, as will be described below, the closer a poll is taken to the election, the higher the standards we have for it; that is to say, “newer” polls are penalized, and older polls receive a bonus. There are a handful of exceptions to the 21-day rule. For instance, no Presidential primary polls are included from states other than Iowa until the Iowa caucus has been conducted, and no polls are included for states other than New Hampshire until the New Hampshire primary has been conducted. In addition, a later cut-off date (that is, a narrower window) may be applied in cases where a candidate who had a tangible chance of winning the election drops out of the race prematurely.
All polls from a particular pollster are included, even if it has surveyed the race multiple times within the 21-day window. [3] Polls, especially those from weaker pollsters, sometimes bounce around a lot before “magically” falling in line with the broad consensus of other pollsters. In the 2008 Democratic primary in Wisconsin, for instance, which Barack Obama won by 17 points, American Research Group released a poll on the Saturday prior to the election showing Obama losing to Hillary Clinton by 6 points; it then released a new poll 48 hours later showing Obama beating Clinton by 10 points. (It is very unlikely that there was in fact such dramatic late movement toward Obama, as most other pollsters had shown him well ahead the whole time.) In any event, our model now accounts for this sort of issue. The only exception is tracking polls, for which results must be non-overlapping. For example, in the case of a 4-day tracking poll, we’d take the final poll that the pollster released prior to the election, but then go back 4 days to include the poll before that, and so forth.
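To make the tracking-poll rule concrete, here is a minimal sketch of how the non-overlapping windows might be selected; the field names and the select_tracking_polls helper are illustrative, not FiveThirtyEight’s actual code.

from datetime import date, timedelta

def select_tracking_polls(polls, window_days):
    """Walk backward from the last pre-election release, keeping only tracking
    polls whose field windows do not overlap. `polls` is a list of dicts with
    an 'end_date' key (a datetime.date); the field names are illustrative."""
    releases = sorted(polls, key=lambda p: p["end_date"], reverse=True)
    selected = []
    cutoff = None  # latest end date allowed for the next poll we keep
    for poll in releases:
        if cutoff is None or poll["end_date"] <= cutoff:
            selected.append(poll)
            # the next kept poll must have ended at least `window_days` earlier
            cutoff = poll["end_date"] - timedelta(days=window_days)
    return selected

# Example: a 4-day tracking poll released daily; only every fourth release is kept.
polls = [{"end_date": date(2008, 11, 3) - timedelta(days=i)} for i in range(10)]
print([p["end_date"].isoformat() for p in select_tracking_polls(polls, 4)])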
We flag polls which meet our definition of a partisan poll, which is quite narrow and quite specific: “a poll … conducted [on behalf of] any current candidate for office, a registered campaign committee, a Political Action Committee, or a 527 group.” Nevertheless, they are included in the ratings. If a pollster releases a poll into the public domain, we assume that they are interested in doing their best and most accurate work, regardless of whom the poll is conducted for.
The quality of the data is generally fairly high. However, it is compiled from eight different data sources, and some minor mistakes are unavoidable. We cross-checked all polls which appeared to show abnormally large errors to ensure that they were recorded properly in the database. For a small portion (under 10 percent) of the database, the sample size was not known and was estimated from the margin of error listed by the pollster, or from the sample size that the pollster generally uses. For about 5 percent of the database, the field dates of the poll were unknown and were estimated instead from the release date.
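As an illustration of the sample-size back-fill, here is a minimal sketch, my own rather than FiveThirtyEight’s actual procedure, which inverts the standard 95-percent margin-of-error formula for a proportion near 50 percent; it assumes simple random sampling and no design effect.

import math

def sample_size_from_moe(moe_pct, z=1.96, p=0.5):
    """Invert the simple-random-sampling margin-of-error formula.
    moe_pct is the pollster's reported margin of error in percentage points.
    Assumes MoE = z * sqrt(p * (1 - p) / n), i.e. no design effect."""
    moe = moe_pct / 100.0
    return round((z ** 2) * p * (1 - p) / (moe ** 2))

# A reported margin of error of +/- 4.0 points implies roughly 600 respondents.
print(sample_size_from_moe(4.0))  # -> 600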
It is sometimes tricky to figure out whom to attribute the poll to. Many polls are conducted on behalf of media organizations, e.g. Zogby International on behalf of the Toledo Blade. In general, we are more inclined to assign the poll to the firm which conducted the field work, rather than the sponsor. But some large media organizations conduct their own field work, or otherwise put a strong methodological imprint upon the poll. There are simply some judgment calls that have to be made, the principle being to assign the poll to the organization which is likely to have added the most value to it (or subtracted the most value from it).
There are some cases in which a media organization has switched field providers; we would generally place these polls into separate groups. For example, in 1998-2000, CNN used Yankelovich to conduct its polls; these polls are classified as “Yankelovich”. In 2002 and 2004, it partnered with Gallup; these polls (along with all polls Gallup has conducted on behalf of itself or USA Today) are classified as “Gallup”. Since 2006, it has partnered with Opinion Research; these polls are classified as “CNN / Opinion Research”. To take another example, the Los Angeles Times recently discontinued its in-house polling operation and now sponsors polls conducted by USC; these polls will be classified as “USC”, and the pollster rating historically associated with “Los Angeles Times” polls will not be applied to them.
Another tricky case is that of CBS/New York Times and ABC/Washington Post, as these organizations very often collaborate on polls but occasionally release polls with just one of the two brand names attached. We have traditionally grouped polls conducted by these organizations together and will continue to do so. In the two cases of a firm which releases both telephone-based and Internet-based polls (these are Harris Interactive and Zogby), the telephone and Internet operations are accounted for separately.
Calculation of ‘Raw’ Ratings
Some types of elections are significantly easier to forecast than others. For instance, among all the polls in our database, the average miss on the outcome of the U.S. presidential national popular vote was just 2.8 points, which approaches the theoretical limit of accuracy given the error owing to sampling variation alone. [4] On the other hand, polls of Presidential primaries have missed by an average of 7.5 points, and polls of Senate and gubernatorial primaries by 7.8 points. Moreover, these differences can vary from year to year. For instance, in 2000, there was a fair amount of late movement toward Al Gore, whereas the polling was quite stable during the last three weeks of the 2004 and 2008 presidential campaigns.
Average Error, All Polls in FiveThirtyEight Database (1998-Present)
The idea, therefore, is to rate a pollster’s accuracy in the context of the “degree of difficulty” of the races that it is covering. A pollster deserves more credit for correctly forecasting the balloting in a difficult primary, or for a U.S. House seat in an obscure district, than it would for calling the outcome of the Presidential popular vote (and it deserves more tolerance when it is wrong).
The procedure that we use to accomplish this consists of a regression analysis which attempts to explain the error associated with a poll in terms of (i) the pollster in question; (ii) the election, or type of election; and (iii) other characteristics of the poll, such as its recency and its sample size. The measure of error we use is the so-called ‘Mosteller 5’ metric, which evaluates the margin between the top two candidates in the race. [5] For instance, if a poll shows the Democrat beating the Republican by 4 points, and the Republican instead beats the Democrat by 2 points, that is considered a 6-point error.
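Here is a minimal sketch of that error metric as described above; the function name is mine, and it assumes both the poll and the result are expressed as two-way margins (Democrat minus Republican, in percentage points).

def mosteller5_error(poll_margin, actual_margin):
    """Absolute error on the margin between the top two finishers.
    Both arguments are margins in percentage points: a poll showing the
    Democrat up 4 is +4; a Republican win by 2 is -2."""
    return abs(poll_margin - actual_margin)

# The example from the text: Democrat +4 in the poll, Republican +2 on
# Election Day, counts as a 6-point error.
print(mosteller5_error(4, -2))  # -> 6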
In particular the regression analysis accounts for the following variables:
(a) A series of dummy variables to correspond to each pollster in the database. The coefficients associated with these variables can be thought of as reflecting the amount of error that a particular pollster adds or subtracts, all other things being equal — that is, it reflects the pollster’s skill.
(b) Variables to indicate the recency of the poll, as measured by the number of days between the poll’s median field date and the election; this enters the regression as the square root of the number of days. Separate variables are used for primary and general elections, as primary polling is considerably more time-sensitive than general election polling.
(c) A variable to indicate the sample size of the poll. It is desirable to strip out the sample variation from pollster skill, as our formula for weighting polls accounts for the sample size separately. With that said, for a variety of reasons which are explained in the footnotes, this variable is no more than marginally statistically significant. [6]
(d) A series of dummy variables to indicate the type of election and the cycle (for instance, “Gubernatorial Elections in 2000”). [7] In the case of elections to nominate a candidate, there is a separate variable to indicate whether the election was a primary or a caucus. (Polls of caucuses, especially outside of Iowa, are associated with significantly higher errors.)
(e) A series of dummy variables to indicate the particular race under consideration (for instance — “Ohio Senate general election, 2000”), in cases where the polling for that election is deemed to be robust. [8]
In other words, in cases where there is a significant amount of polling in an election, we compare the performance of a poll against other polls of the same election. In cases where there is not very much polling (sometimes, in fact, only one pollster will have surveyed a race) we compare its performance against polls of “similar” elections, i.e., elections for the same type of office in the same cycle. [9]
The regression analysis is weighted based on the number of surveys that the pollster conducted for that particular election, and based on how recently the election occurred; polling from the 2008 election receives about twice as much weight as polling from the 1998 election. [10]
The goal of this regression analysis, again, is to determine the coefficients associated with each of the 264 pollsters, which form our measure of pollster skill. These coefficients, which are re-calibrated such that the average score is zero, are what we call a pollster’s rawscore. For instance, SurveyUSA’s rawscore is -0.82, meaning that a SurveyUSA poll has about eight-tenths of a point less error than average. American Research Group’s rawscore is +0.72, meaning that its polls have about seven-tenths of a point more error than average.
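The following is a minimal sketch, on made-up data, of how a regression of this general shape might be set up; the column names and the weighted-least-squares solver are my own illustration rather than the model FiveThirtyEight actually runs, and several of the variables described above (recency, sample size, race-level dummies) are omitted for brevity.

import numpy as np
import pandas as pd

# Toy data: each row is one poll, with its Mosteller 5 error in points.
polls = pd.DataFrame({
    "pollster": ["A", "A", "B", "B", "C"],
    "race_type": ["sen_2008", "gov_2008", "sen_2008", "pres_primary_2008", "gov_2008"],
    "error": [3.1, 4.0, 6.2, 9.5, 5.0],
    "weight": [1.0, 1.0, 0.9, 0.9, 0.8],   # recency / polls-per-race weighting
})

# Dummy variables for each pollster and for each election type/cycle.
X = pd.concat([
    pd.get_dummies(polls["pollster"], prefix="pollster"),
    pd.get_dummies(polls["race_type"], prefix="race"),
], axis=1).astype(float)
y = polls["error"].to_numpy()
w = polls["weight"].to_numpy()

# Weighted least squares via row-scaling plus an ordinary lstsq solve. The two
# dummy sets are collinear, so lstsq returns the minimum-norm solution, which
# is fine for illustration once the pollster coefficients are recentered.
sw = np.sqrt(w)
coefs, *_ = np.linalg.lstsq(X.to_numpy() * sw[:, None], y * sw, rcond=None)
coefs = pd.Series(coefs, index=X.columns)

# Recenter the pollster coefficients so that the average 'rawscore' is zero.
pollster_cols = [c for c in X.columns if c.startswith("pollster_")]
rawscores = coefs[pollster_cols] - coefs[pollster_cols].mean()
print(rawscores)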
Reversion Toward the Mean
Technically speaking, our goal is not to evaluate how accurate a pollster has been in the past, but rather to anticipate how accurate it will be going forward (as these ratings are principally used in conjunction with our electoral forecasting). This requires us to regress the raw scores toward the mean, since there is a very significant amount of luck involved in polling over the short run. Consider that a poll of 600 voters will miss by an average of about 3.5 points on the margin between the two candidates owing to sampling variance alone. If the very best pollsters shave only about 1 point off that error (as seems to be the case), and the worst ones add only about 1 point to it, we would need to look at a large number of elections before we could separate out the skill from the luck. This is especially so because elections are not entirely independent of one another; e.g., a pollster’s skill in forecasting the outcome of the 2008 Presidential election in Pennsylvania is probably correlated with its skill in forecasting the same election in Ohio. Be very suspicious of pollsters who claim to be superior on the basis of only having called one election correctly. [11]
Although our previous version of the ratings accounted for reversion to the mean, it did so in a somewhat ad hoc way. For this version, I analyzed the appropriate degree of mean-reversion more robustly. Specifically, I ran one version of the raw score calculation that accounted for polling through 2006 only, and another version that accounted for polling in 2008 only, and then ran a further analysis to regress the latter raw scores on the former. In other words, we were trying to see how accurately we could predict a pollster’s performance in 2008 based on its performance through 2006.
The regression analysis found that the optimal degree of mean-reversion is proportional to the square root of the number of polls in the sample. Specifically, it dictated the following formula:
reversionparameter = 1 - (0.06 * sqrt(n))
where n is the number of previous polls in the sample from that pollster. For instance, with a pollster which had conducted 25 polls:
reversionparameter = 1 - (0.06 * sqrt(25))
reversionparameter = 1 - (0.06 * 5)
reversionparameter = 1 - (.30)
reversionparameter = .70
In other words, we would regress this pollster’s rawscore 70 percent of the way toward the mean, and “keep” 30 percent of it. (In plain English, at a sample size of 25 polls, we are still mostly looking at noise rather than skill.) So if, say, this pollster had compiled a rawscore of -0.50, it would be regressed back to -0.15.
The minimum amount of mean-reversion is set at 10 percent. What this implies is that after 225 polls, the amount of weight which we attach to a pollster’s previous track record is essentially maxed out. In addition, for reasons that are explained in the footnotes [12], rawscores better (that is, more negative) than -2.22 are truncated to -2.22.
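Putting those rules together, a minimal sketch of the mean-reversion step might look like the following; the function names are mine, and the constants (0.06, the 10 percent floor, the -2.22 truncation) are taken straight from the description above.

import math

RAWSCORE_FLOOR = -2.22  # rawscores better (more negative) than this are truncated

def reversion_parameter(n_polls):
    """Share of the rawscore that is regressed toward the group mean.
    Follows 1 - 0.06 * sqrt(n), with a minimum of 10 percent reversion,
    which is reached at 225 polls."""
    return max(0.10, 1.0 - 0.06 * math.sqrt(n_polls))

def truncated_rawscore(rawscore):
    """Apply the truncation described in footnote [12]."""
    return max(rawscore, RAWSCORE_FLOOR)

# The worked example from the text: 25 polls keeps 30 percent of the rawscore,
# so a -0.50 rawscore regressed toward a mean of zero becomes roughly -0.15.
rp = reversion_parameter(25)
print(round(rp, 2), round(-0.50 * (1 - rp), 2))  # -> 0.7 -0.15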
But toward which mean do we regress?
Although we have reached a fairly technical place in our discussion, the following point is essential and constitutes a major change from our previous version of the pollster ratings.
Is a pollster’s prior track record (regressed toward the mean as appropriate) the only thing which is useful in forecasting its future performance? Or are other, more qualitative features of the pollster also worth considering?
It turned out, when I ran my analyses, that the scores of polling firms which have made a public commitment to disclosure and transparency hold up better over time. If they were strong before, they were more likely to remain strong; if they were weak before, they were more likely to improve.
The variable I used to denote disclosure/transparency is a dummy indicating whether, as of June 1, 2010 [13], the polling firm was either a member of the National Council on Public Polls or had committed to the AAPOR Transparency Initiative [14]. (Neither NCPP nor AAPOR endorses these ratings in any implicit or explicit way, although NCPP President Evans Witt provided me with a current membership list.) This ncppaapor variable was statistically significant at approximately the 95 percent confidence level in predicting performance in 2008, when placed into a regression analysis along with past pollster performance; in fact, it was almost as important in predicting future performance as past performance itself.
In a separate (although related) analysis, the ncppaapor variable is statistically significant at approximately the 95 percent level in predicting current raw scores, when it is placed into a regression analysis along with variables indicating whether the polling firm was partisan [15], and whether its polling was conducted over the Internet. [16]
Therefore, the ratings for polling firms which were members of NCPP, or which had signed onto the AAPOR Transparency Initiative, as of June 1, 2010, are regressed toward a different mean than those which hadn’t. Essentially, then, polling firms are rewarded for having made a public commitment to disclosure and transparency, but the basis for rewarding them is statistical rather than ideological. Note, however, that this will mostly affect firms with relatively few polls; it will barely make a difference for firms like Rasmussen Reports or SurveyUSA, which have surveyed hundreds of elections and whose ratings are therefore subject to only a minimal degree of mean-reversion.
(Internet-based polls are also regressed toward a different mean. We do not regress the scores of campaign pollsters toward a different mean, even though they clearly perform worse, because we only include their polls in our projections when they have not been conducted on behalf of a campaign.)
For the time being, I’m going to pass on the question of whether firms which join NCPP or the AAPOR Transparency Initiative subsequent to today will receive the “bonus”; it is a sticky issue, and one which I simply haven’t decided upon. Certainly no firm is going to get rewarded if it fails to be transparent in actual practice as well as in theory. What we can say is that firms which have already joined one of these efforts tend to be more accurate than those which haven’t.
The Final Step: Pollster-Introduced Error
As mentioned above, different types of polls have their rawscores regressed toward different means. Polling firms which are members of NCPP or the AAPOR Transparency Initiative have their scores regressed toward a mean of -0.50. Those which aren’t, but which conduct their polls by telephone, are regressed toward a mean of +0.26. And those which conduct polling by means of the Internet are regressed toward a mean of +1.04. [17] We can call this variable groupmean.
A pollster’s adjusted score, or adjscore, is then calculated as:
adjscore = (rawscore * (1 - reversionparameter)) + (groupmean * reversionparameter)
Adjusted scores, like rawscores, may be either positive or negative, with negative scores indicating superior performance and positive scores inferior performance. Traditionally, however, we have expressed pollster skill by means of a positive variable called Pollster-Introduced Error, or PIE, which is a measure of how much avoidable error the pollster introduces. That is, it measures how much error a polling firm introduces other than sampling error and other than temporal error (i.e., the error introduced by the necessity of conducting polls some number of days in advance of the election). PIE is calculated simply by taking the adjscore and adding 2 points to it. [18]
PIE = adjscore + 2
The lower the PIE, the better the pollster; the minimum possible PIE is zero.
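Tying the last few steps together, here is a minimal end-to-end sketch, with the group means hard-coded from the values quoted above; the function and variable names are mine, not FiveThirtyEight’s, and the blended +0.28 case described in footnote [17] is omitted.

import math

# Group means toward which rawscores are regressed, per the text above.
GROUP_MEANS = {
    "ncpp_aapor": -0.50,
    "phone_other": 0.26,
    "internet": 1.04,
}

def pollster_pie(rawscore, n_polls, group):
    """Compute Pollster-Introduced Error from a rawscore: truncate at -2.22,
    regress toward the group mean by 1 - 0.06*sqrt(n) (never less than 10
    percent), then add 2 points."""
    rawscore = max(rawscore, -2.22)
    reversion = max(0.10, 1.0 - 0.06 * math.sqrt(n_polls))
    adjscore = rawscore * (1 - reversion) + GROUP_MEANS[group] * reversion
    return adjscore + 2.0

# A hypothetical NCPP/AAPOR firm with a rawscore of -0.82 over 400 polls.
print(round(pollster_pie(-0.82, 400, "ncpp_aapor"), 2))  # -> 1.21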
The new pollster ratings will be posted in a separate thread in approximately an hour.
Notes
[1] Very few elections fail to meet the threshold of ‘minimally competitive’. The exceptions would be a primary election after all viable candidates have dropped out, or a general election in which a major-party candidate is not opposed by a candidate from the other major party, and the third party candidates are not viable. These elections are rarely polled and only about 5-10 polls were excluded for this reason.
[2] A slight adjustment is made to the national House popular vote calculation to account for the fact that some states (Arkansas, Florida and Louisiana) do not tally ballots when a candidate runs unopposed; we assume that such candidates would have received votes equal to 75 percent of the average district-level turnout in that year’s Congressional elections.
[3] One of our eight data sources includes only the last poll by each pollster in advance of the election; the other seven include all polls. However, there is significant redundancy in our coverage, so only about 5 percent of the races are impacted by this.
[4] Note, however, that the pollsters may have been a bit fortunate in this respect — particularly in 2004 and 2008, where the polling trend was very flat in the final few weeks of the campaign. There are some other relatively recent elections, like 1980 and 1996, in which some of the national polls did quite badly.
[5] That is, the candidates who actually finished in first and second place — not necessarily the candidates who were predicted to finish in first and second place by the consensus of pollsters.
[6] The t-value on the sample size variable in our model — which is actually expressed as the square root of the sample size — is only 1.30, indicating that it is statistically significant only at about the 80 percent confidence level. However, this should probably not be interpreted literally. The problem is that pollsters tend to use about the same sample sizes for the same types of elections, e.g., a Research 2000 general election poll will usually consist of 600 people. This makes it difficult to segregate out the effects of sample size, since they may be captured instead in the large series of dummy variables used to denote the pollster and the election or type of election. What we’d ideally want to have is instances where, for instance, Quinnipiac released a poll of 2,500 people in advance of an election, and another poll of just 500 people later on for the same election. But such cases rarely occur in practice. With that said, it probably is also the case that sample sizes are in fact less important than they theoretically should be, because true sample sizes and margins of error are significantly impacted by things like demographic weighting and cluster sampling, and the pollsters rarely account for this when disclosing their margins of error.
[7] Odd-year elections (e.g. 2005) are grouped with the next even-numbered year (e.g. 2006). Senate and gubernatorial primaries are placed into the same grouping.
[8] Polling is deemed to be ‘robust’ if:
(a) At least three distinct nonpartisan pollsters have surveyed the contest, and each of those pollsters has surveyed at least three different contests throughout the database, and
(b) At least two of the three pollsters make a ‘short list’ of prolific pollsters, which (i) have surveyed at least three distinct races in each of at least three distinct states; (ii) are not campaign pollsters; (iii) are not Internet-based pollsters; and (iv) are not Strategic Vision, since Strategic Vision’s polling was probably fake.
The idea here is that performance relative to other pollsters is only meaningful to the extent that we know a fair amount about those other pollsters. If we don’t know very much about them, there is nothing to anchor to, and it is probably better to evaluate error relative to similar types of elections in other states or districts.
[9] Careful readers will note that the series of variables used to designate the type of election is effectively redundant in cases where the polling for a particular election is deemed to be robust [8] and is given its own variable. Effectively, then, the election-type variables reflect the error in all elections within that cycle in which polling was not robust. This distinction is important because elections with non-robust polling tend to be associated with larger errors (across individual polls, not just in the polling average) than elections with robust polling. This may be because the pollsters are avoiding these types of elections for a reason; there is some evidence, for instance, that lopsided elections are actually harder to forecast than close elections. But it may also be that it helps pollsters to see what one another are doing and to have some ‘guidance’ when doing things like constructing likely voter models or demographic weightings. In any event, the way that our regression is structured, it implicitly accounts for this ‘Wisdom of Crowds’ effect; that is, a pollster will not be punished (over the long run) for being adventurous and releasing polls in elections that few other pollsters are evaluating.
[10] The weighting parameter contains two components, which are multiplied together. The first component is 1/sqrt(n), where n is the number of polls that a firm has conducted of a particular race. The second component weights the poll based on its date, where a weight of 0.5 is given to polls conducted on 11/3/1998 (Election Day 1998) and a weight of 1.0 is given to polls conducted on 11/4/2008 (Election Day 2008), with all other dates scaled accordingly. Neither of these weightings makes an especially material difference in the results.
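A minimal sketch of that weighting scheme, under my reading of the description (a linear scaling of the date component), might look like this; the helper name is mine.

from datetime import date

def poll_weight(n_polls_in_race, poll_date,
                start=date(1998, 11, 3), end=date(2008, 11, 4)):
    """Regression weight for one poll: 1/sqrt(n) times a date component that
    scales linearly from 0.5 on Election Day 1998 to 1.0 on Election Day 2008
    (dates beyond those endpoints continue along the same line)."""
    frac = (poll_date - start).days / (end - start).days
    date_weight = 0.5 + 0.5 * frac
    return (1.0 / n_polls_in_race ** 0.5) * date_weight

# A firm's second poll of a race, fielded just before Election Day 2004.
print(round(poll_weight(2, date(2004, 11, 2)), 3))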
[11] I write: “Be very suspicious of pollsters who claim to be superior on the basis of only having called one election correctly.” But the converse is less true; if a pollster is really terrible, you may not need very many elections to determine this. This is because accuracy is bounded on the upside (you can’t do better than calling the margin exactly right — having no error), but effectively unbounded on the downside (a pollster could miscall the election by 30 or more points).
[12] The reason is that the theoretical minimum average error is on the order of 3.5 points, which is the average error owing to sampling variance when conducting a poll of 600 people. On the other hand, the average error for all polls in our database is about 5.5 points. Thus, the maximum possible skill, over the long run, is probably about 2 points, and skill in excess of 2 points can reasonably be assumed to be noise or luck. The reason we set the threshold at -2.22 points rather than -2.00 points is that we separately regress rawscore toward the mean by a minimum of 10 percent, and so a pollster with a rawscore of -2.22 would wind up at about -2.00 anyway after the mean-reversion step of the analysis. This sets things up in such a way that the minimum possible Pollster-Introduced Error (PIE) is 0.00.
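For what it’s worth, a simplified back-of-the-envelope version of that 3.5-point figure (my own calculation, assuming a pure 50/50 two-way race under simple random sampling, with no undecideds or design effects) lands in the same ballpark:

import math

# Standard error of the two-way margin for a 600-person simple random sample,
# and the mean absolute error of a normally distributed miss (sigma * sqrt(2/pi)).
n, p = 600, 0.5
sigma_margin = 2 * math.sqrt(p * (1 - p) / n) * 100   # in percentage points
mean_abs_error = sigma_margin * math.sqrt(2 / math.pi)
print(round(sigma_margin, 2), round(mean_abs_error, 2))  # roughly 4.08 and 3.26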
[13] Although it is problematic to run a regression analysis on 2008 data based on a condition which was not in existence until 2010, I also ran the analysis based on old NCPP membership lists (which would have been salient as we entered the 2008 cycle), and got very similar results.
[14] Mere membership in AAPOR is not enough, since AAPOR (unlike NCPP) is an organization of individuals rather than polling firms. However, commitments to AAPOR’s Transparency Initiative were made at the firm level, so it is suitable to be used in conjunction with NCPP membership.
[15] For these purposes, a firm was designated as ‘partisan’ if a majority of its polls included in our database were conducted on behalf of a candidate, campaign committee, PAC or 527 group.
[16] You may note that the coefficient on the Internet variable is not actually statistically significant. Indeed, assessing the accuracy of Internet polls is problematic, as Zogby’s interactive polling has been abominable, whereas Harris Interactive and YouGov have had fairly decent results. (A fourth Internet-based pollster, Knowledge Networks, has only three surveys in our database). For the time being, I nevertheless think it is wise to regress Internet-based polls toward a different mean.
[17] Note that two firms, Harris Interactive and Knowledge Networks, both meet the NCPP/AAPOR disclosure test and conduct polling by means of the Internet. They are regressed to the average of the NCPP/AAPOR and Internet means, which is +0.28.
[18] The reason why 2 points, rather than some other number, is added to adjscore to produce PIE is explained in note [12]: this is the difference between the theoretical minimum error owing to sampling variation and the average error observed among all polls in our database.