This entry will focus on the methodology we use to generate our regression numbers in our brand spankin’ new Senate projections. The other parts of the model — everything has been retooled to make it Senate-specific — will be explained sometime later this week.
Building a regression model for the Senate is inherently more challenging than doing so for the Presidential race. Rather than 50 manifestations of one race between the same two candidates, we instead have 35 different races and 70 different candidates to deal with. At the same time, having a regression model is arguably a more important feature of doing Senate projections than Presidential projections, as the polling data tends to be much sparser — many races, in fact, have not been polled at all.
Fortunately, I was able — not before extensive trial-and-error — to design a relatively simple four-variable model that does a pretty darn good job, with an r-squared in the high 70s. As a quick reminder, this regression model is not formally speaking trying to predict the outcome of the race. Rather, it’s attempting to predict what the polling results “should” be in a given state. In this respect, it’s the same as our general election regression model, but not the same as our model from the primaries, which was attempting to regress against actual voting results. The four variables are as follows:
1. Incumbency. Well, duh. This is a dummy variable that takes on the value 1 if the Democratic candidate is an incumbent, -1 if the Republican candidate is an incumbent, and 0 if there is no incumbent.
2. Incumbent Standardized Approval Rating. The good thing about approval ratings is that they are extremely abundant, including for races where there is little or no head-to-head polling. The bad thing about approval ratings is … pretty much everything else. They are extremely dependent on question wording, and therefore vary significantly from source to source. So, we work from a standardized version that attempts to account for these house effects.
The first step in this process is simply to gather a Senator’s approval scores. Actually, there are (at least) two familiar forms of this question: approve/disapprove and favorable/unfavorable. The dataset includes the most recent numbers for all organizations that have released approve/disapprove or favorable/unfavorable ratings for a Senator at some point in the calendar year 2008. This is typically two to four ratings for each incumbent; in cases where no data has been released on the candidate in 2008, we take the most recent approval poll available. The scores are then standardized to a 0-to-100 scale by reallocating all neutral/undecided/don’t know responses evenly between approve and disapprove. So, a Senator whose ratings are 50 percent approve, 40 percent disapprove, 10 percent neutral will have a standardized score of 55 (taking all of his ‘approves’ and half of his neutrals).
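For anyone who wants to see the reallocation step spelled out, here is a minimal Python sketch. The function name is mine, and the rescaling for numbers that don't sum to exactly 100 is an assumption that the post doesn't address.

```python
def standardized_approval(approve, disapprove, neutral):
    """Put an approval reading on a 0-to-100 scale.

    Neutral/undecided/don't-know responses are split evenly between
    approve and disapprove, as described above. Inputs are percentages.
    """
    total = approve + disapprove + neutral
    if total <= 0:
        raise ValueError("percentages must sum to a positive number")
    # Assumption (not from the post): rescale by the total so that
    # readings that sum to 99 or 101 because of rounding still land
    # on a 0-to-100 scale.
    return 100.0 * (approve + neutral / 2.0) / total


# The example from the text: 50 approve, 40 disapprove, 10 neutral.
print(standardized_approval(50, 40, 10))
```

With the 50/40/10 example from the text, this gives the same 55 quoted above.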
The next step is the tricky part. The vast majority of our approval scores come from one of three organizations: Rasmussen Reports, SurveyUSA, and Research 2000; everything else can be lumped together into an “other” category. It turns out that if you look at cases where the same Senator is tested by more than one of these organizations, there are some very substantial systematic differences from agency to agency. In particular, SurveyUSA and Research 2000 tend to produce low approval scores, and Rasmussen Reports and “other” tend to produce high ones. The differences are not trivial: a Senator who polls at 50-50 in a Research 2000 favorability question can expect to poll at about 57-43 in Rasmussen.
Nobody is right or wrong on this, by the way; the difference arises because Rasmussen’s categories allow people on the fence to hedge with a ‘Somewhat Favorable’ rating, whereas Research 2000 provides only ‘Favorable’ and ‘Very Favorable’. In any event, the process is to translate the approval scores to a neutral standard. So SurveyUSA and Research 2000 scores are bumped up a bit, and Rasmussen and ‘Other’ scores are bumped downward.
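The post doesn't give the exact adjustment, so the following Python is only a sketch of one plausible way to do it, not necessarily the calculation behind these ratings. It assumes a house effect can be estimated as a pollster's average deviation from the cross-pollster mean for senators tested by more than one organization, and that the neutral standard is reached by subtracting that effect. All function names are hypothetical.

```python
from collections import defaultdict


def estimate_house_effects(readings):
    """Estimate each pollster's average lean.

    readings: iterable of (senator, pollster, standardized_score).
    Only senators tested by at least two different pollsters are used,
    since a single reading says nothing about differences between houses.
    """
    by_senator = defaultdict(list)
    for senator, pollster, score in readings:
        by_senator[senator].append((pollster, score))

    deviations = defaultdict(list)
    for polls in by_senator.values():
        if len({pollster for pollster, _ in polls}) < 2:
            continue  # no cross-pollster comparison available
        mean = sum(score for _, score in polls) / len(polls)
        for pollster, score in polls:
            deviations[pollster].append(score - mean)

    # Positive effect = this house tends to score senators high.
    return {p: sum(d) / len(d) for p, d in deviations.items()}


def to_neutral(score, pollster, effects):
    """Remove a pollster's estimated lean; unknown pollsters are left as-is."""
    return score - effects.get(pollster, 0.0)
```

Under this sketch, a lone senator who scores 50 with Research 2000 and 57 with Rasmussen would come out at the midpoint from either source, with the Research 2000 score bumped up and the Rasmussen score bumped down.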
Standardized approval ratings for all incumbent senators running for re-election this year are as follows:
Pryor AR 74.9
Johnson SD 71.8
Cochran MS 70.3
Sessions AL 69.5
Rockefeller WV 68.7
Alexander TN 67.8
Enzi WY 67.4
Collins ME 63.8
Wicker MS 63.8
Landrieu LA 62.6
Baucus MT 61.9
Reed RI 61.9
Durbin IL 61.8
Roberts KS 60.7
Harkin IA 60.1
Graham SC 59.7
Cornyn TX 59.5
Biden DE 59.4
Dole NC 59.3
Levin MI 59.3
Kerry MA 59.0
McConnell KY 58.7
Chambliss GA 58.1
Smith OR 55.5
Inhofe OK 55.3
Coleman MN 52.2
Lautenberg NJ 49.8
Stevens AK 48.0
Sununu NH 47.0
Wyoming Senator John Barrasso, who took over for Craig Thomas last year, has never had an approval poll conducted on him, so we assign him Roger Wicker’s score of 63.8, as the two candidates are in somewhat analogous positions (noncontroversial but reliably conservative senator taking over a reliably conservative seat in mid-term). The standardized approval ratings are used for incumbents only. To make the math work properly, Republican incumbents are assigned the negative of whatever their score is, and Democrats the positive.
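Here is a small Python sketch of how the sign convention lines up with the incumbency variable. The function names and the "D"/"R"/None party codes are mine, and treating an open seat as a zero approval term is an assumption; the post only says the ratings are used for incumbents.

```python
def incumbency(incumbent_party):
    """+1 for a Democratic incumbent, -1 for a Republican one, 0 for an open seat."""
    codes = {"D": 1, "R": -1, None: 0}
    if incumbent_party not in codes:
        raise ValueError(f"unexpected party code: {incumbent_party!r}")
    return codes[incumbent_party]


def signed_approval(score, incumbent_party):
    """Standardized approval with the sign attached.

    Assumption: races with no incumbent contribute 0 for this variable.
    """
    sign = incumbency(incumbent_party)
    if sign == 0:
        return 0.0
    return sign * score


print(signed_approval(74.9, "D"))  # Pryor, a Democratic incumbent
print(signed_approval(47.0, "R"))  # Sununu, a Republican incumbent
```

So Pryor enters the regression with a positive 74.9 and Sununu with a negative 47.0, which keeps both variables pointing in the Democrat's direction when they are positive.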
3. Fundraising Share. Fundraising data, somewhat annoyingly, is only available on a quarterly basis for Senate candidates, and so we are using figures from the FEC’s last filing on 3/31. On the other hand, we aren’t trying to learn anything all that sophisticated from this metric. Basically, we want to know: (i) in a race against an incumbent, does a challenger have the resources to put up a serious fight, and (ii) in a race without an incumbent, who is winning the money race? The specific version of this metric I prefer is the percentage of funds raised by the Democrat out of the total funds raised by both candidates. This formulation produced more statistically significant results than any other variant.
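In code, the metric is just a ratio. This is a minimal Python sketch with a hypothetical function name; the even split when neither candidate has reported any money is my assumption, since the post doesn't cover that case.

```python
def fundraising_share(dem_raised, rep_raised):
    """Fraction of the two-candidate fundraising total raised by the Democrat."""
    if dem_raised < 0 or rep_raised < 0:
        raise ValueError("fundraising totals cannot be negative")
    total = dem_raised + rep_raised
    if total == 0:
        # Assumption: with no money reported on either side,
        # treat the money race as even.
        return 0.5
    return dem_raised / total
```

A challenger who has raised one dollar for every three raised by the incumbent ends up with a 25 percent share, which is the sort of gap this variable is meant to pick up.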
4. Highest Elected Office Achieved. As recently as the first version of the Senate ratings that went up this morning, I had been attempting to use a variable for ideology, designed by building my own liberal-conservative scores for each candidate. The idea was to find out which candidate was closest to the median point of the electorate in his state, and assign him some bonus points as a result. This actually turned out to produce a statistically significant result — understanding ideology is a very important part of understanding Senate races. But it introduced an awful lot of work and an awful lot of subjectivity into the model.
One reason ideology is important is that not all candidates are particularly serious, and a good litmus test for whether a candidate is running a serious race is whether he is somewhere reasonably close to the middle of the state’s electorate. Someone like Bob Tuke in Tennessee, for instance, looks like a serious enough candidate until you realize that he’s trying to run as a strong progressive in a not-so-progressive state; Christopher Reed in Iowa does not strike one as a particularly serious candidate, but even less so when you see his platform. By contrast, someone like Ronnie Musgrove in Mississippi or Dick Zimmer in New Jersey has a platform that will be tolerated by most of his state. So one thing we are trying to do here is to filter out the wackos.
Another way to vet the candidates, however, is to see the highest elected office they have achieved; if you have been elected by some large number of constituents, odds are that you will have at least some feeling for the popular will in your state. The metric we use to evaluate this is the size of the population governed:
(i) A candidate whose highest elected office is governor gets credit for the entire population of his state;
(ii) A candidate whose highest elected office is senator gets credit for half the population of his state (since each state has two senators);
(iii) A candidate whose highest elected office is the U.S. House of Representatives gets credit for the population in his Congressional District;
(iv) A candidate whose highest elected office is mayor gets credit for the population of his city;
(v) A candidate whose highest elected office is in the state house or state senate gets credit for his state’s population, divided by the number of legislators in that particular body. For example, since Iowa has 50 state senators, a candidate whose highest elected office was serving in that chamber would get credit for one-fiftieth of Iowa’s population (about 59,000 persons).
This system tends to produce fairly intuitive results. The one square peg is John N. Kennedy, the Louisiana Republican whose only elected office is State Treasurer. We treat him as equal in influence to a typical member of the U.S. House and assign an average-sized Congressional District as his population base.
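The rules above translate into a short lookup. This Python sketch uses office labels and parameter names of my own choosing, and it assumes a candidate who has never held elected office gets zero credit, which the post implies but does not state outright.

```python
def population_credit(office, state_pop=0, local_pop=0, chamber_size=1):
    """Population a candidate gets credit for, by highest elected office.

    state_pop:    population of the candidate's state
    local_pop:    population of the Congressional District or city
    chamber_size: number of members in the state house or state senate
    """
    if office == "governor":
        return state_pop
    if office == "senator":
        return state_pop / 2  # two senators per state
    if office in ("us_house", "mayor"):
        # A one-off like a State Treasurer can be handled by passing an
        # average-sized Congressional District as local_pop.
        return local_pop
    if office == "state_legislator":
        return state_pop / chamber_size
    if office == "none":
        return 0  # assumption: no elected office, no credit
    raise ValueError(f"unknown office: {office!r}")
```

For the Iowa example, a state population of roughly 2.95 million (the figure implied by the 59,000 above) and a 50-member chamber gives 59,000.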
Then, just as we did with the fundraising numbers, we take the share of the total jurisdiction governed by the Democrat as a fraction of the Democrat + Republican total. So, for example, in New Jersey, Frank Lautenberg’s highest elected office is senator, for which he is given credit for governing 4.2 million people (half of New Jersey’s population), whereas Dick Zimmer’s highest elected office was as a member of the U.S. House in a Congressional District of about 650,000 persons. So Lautenberg’s share of the highest-elected-office variable is
4,200,000 / (4,200,000 + 650,000).
This works out to about .866; Lautenberg holds about 87 percent of the experience points in his election against Zimmer.
And that’s it. Those four variables account for the entire regression analysis that you see reflected here. A couple of stray notes on particular states:
Minnesota: The regression model sees this race tightening — Norm Coleman is not popular, and Al Franken has nearly matched him in fundraising — but one way to look at why Franken hasn’t quite taken off is that he’s never held elected office before. If he had run for the House before stepping up to the Senate, for instance, he might have emerged as a more vetted candidate, and something like his tax return issue would have been less likely to distract him.
Maine: This is the one state where I think giving up on the ideological variable somewhat hurts the model, as in spite of Maine being a fairly liberal state, Susan Collins is much closer to the median of the electorate than Tom Allen.
South Carolina and Montana: On the other hand, these states were causing problems with the ideological variables because the opposition candidates hold a series of idiosyncratic positions that don’t fit neatly onto the political spectrum but might give the appearance of centrism if averaged together. Bob Conley in South Carolina, for instance, is running as a sort of family values version of Ron Paul; he is quite conservative by the standards of a Democratic candidate, but not in such a way that is really designed to cater to South Carolina’s electorate. Bob Kelleher in Montana, meanwhile, is running as a Republican this year, but has previously run as a member of the Green Party and is strongly in favor of single-payer health care; this might be one of the first cases in history where the Republican candidate was demonstrably more liberal than the Democrat (the relatively centrist Max Baucus). Neither Kelleher nor Conley has ever held elected office, however, so our model is able to recognize them as not-very-serious candidates.
Alaska: Mark Begich’s highest elected office is as the mayor of Anchorage, but almost 40 percent of Alaska’s population is contained within Anchorage’s city limits. So Begich is actually not very far behind Ted Stevens in this metric, as Stevens is given credit for presiding over half of a very small state.
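To put a rough number on the Alaska case, here is a quick Python check. It rounds "almost 40 percent" to 0.40, which is my simplification, and expresses both candidates' credit as a fraction of the state's population, since the state total cancels out of the share. The function name is hypothetical.

```python
def experience_share(dem_credit, rep_credit):
    """Democrat's share of the combined population-governed credit."""
    total = dem_credit + rep_credit
    if total == 0:
        return 0.5  # assumption: neither candidate has held office
    return dem_credit / total


begich = 0.40   # mayor of Anchorage: roughly 40 percent of Alaska (rounded)
stevens = 0.50  # senator: half of the state

print(f"{experience_share(begich, stevens):.2f}")
```

Under that rounding Begich holds about 44 percent of the experience points, compared with the roughly 13 percent that Zimmer holds against Lautenberg.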
That should hopefully give everybody some idea of how all of this is working. We’ll follow up and cover the other parts of the projection model at some point later this week.