Regression Model Q&A

I was composing a long reply in the original thread and it got somewhat unwieldy, so let’s move things over here:

Q. Is there any way to see how the regression changes have affected the prediction at the state level?

Even just the states most affected by the changes for each candidate?

For Obama, the most significant movers were as follows:

Gainers
Delaware       +2.9 --> +12.0
Georgia       -11.9 --> -5.3
Hawaii         +9.7 --> +22.9
Indiana        -9.6 --> -7.1
Iowa           +1.8 --> +8.5
Maryland       +6.4 --> +13.3
Michigan       -1.9 --> +4.3
Minnesota      +2.1 --> +4.6
Nevada         +0.2 --> +5.1
N Carolina    -11.1 --> -4.8
S Carolina    -15.4 --> -9.3
Texas         -14.4 --> -9.7
Wisconsin      +2.4 --> +7.7

Decliners
California    +10.9 --> +8.1
Colorado       +4.5 --> +1.3
Florida        -7.3 --> -10.6
Idaho          -8.9 --> -19.2
Kansas        -11.5 --> -7.4
Kentucky      -20.4 --> -29.7
Maine         +12.3 --> +6.3
New Hampshire  +9.1 --> +2.8
Pennsylvania   +3.6 --> +0.1
Rhode Island   +7.3 --> +2.3
Tennessee     -15.8 --> -22.9
Vermont       +27.0 --> +21.2
Wyoming       -10.4 --> -21.8

We see the effects of the ‘American’ and Catholic variables in a lot of the states where he’s lost ground. On the other side of the coin, the African-American variable is now showing up as more significant for him than it had been before, which is why he gains ground in places like Georgia (where we could definitely use some more polling) and North Carolina. He also gained a lot of ground in Hawaii because the new version of the fundraising numbers seems to do a considerably better job of accounting for home-state effects.

For Clinton, the most significant movers were as follows:

Gainers
Florida         -5.6 --> -2.4
Indiana        -16.6 --> -10.1
Kentucky        -6.8 --> -4.4
Massachusetts  +12.4 --> +15.8
New Jersey      +2.4 --> +6.0
Ohio            -1.8 --> +1.9

Decliners
Arizona        -17.9 --> -20.8
Georgia        -11.8 --> -15.3
Hawaii          +7.7 --> +3.0
Idaho          -30.3 --> -35.7
Louisiana       -5.5 --> -10.4
Maine           +5.5 --> -0.3
Maryland        +7.2 --> +2.5
Michigan        +2.3 --> -1.4
Mississippi    -11.4 --> -23.8
North Carolina  -6.7 --> -11.7
Wyoming        -33.2 --> -39.5

Clinton definitely lost ground in more places than she gained ground … but the places where she gained ground tend to be larger/more important states like Ohio and (probably because of the introduction of the ‘Seniors’ variable) Florida. She also tended to gain ground in the Appalachian/Highlands region, while losing some ground elsewhere in the South where there are higher African-American populations.

Q. Another variable worth considering is the current economic situation of the state, as measured by unemployment or foreclosure rates.

I actually did look at unemployment rates, and they turned out not to have a statistically significant impact, although the arrow pointed upward for both Democrats (they were running relatively better where unemployment is higher). One limitation I face is that MS-EXCEL can only handle 16 independent variables at any point in time, so I did have to pick and choose a little bit. Fortunately, I can test as many variables as I want in my backend statistical package (STATA), so I can make an informed decision about my picking and choosing. By the way, why do I use EXCEL (which generally sucks for any kind of high-powered statistical analysis) at all? Because I can program it to re-run the regression automatically based on new polling data that comes in, whereas otherwise I’d have to go back to my statistical package every time I updated the site. I will continue to test out new variables in STATA from time to time, and it’s possible that unemployment rate will make a re-appearance.

Q. Obama does better with Hispanics than Clinton? Or is it that the Clinton coefficient is higher, but not statistically significant?

This result is definitely … interesting … but it’s best to avoid the temptation to compare the Clinton and Obama models directly. What the models are really doing is comparing the respective candidates’ performances against John Kerry. While Obama might perform worse among Hispanics than Hillary Clinton would, there is no evidence that he would perform worse among this group than Kerry, who won that vote by only about 3:2. Also, there are interaction effects between the different variables. For example, most Hispanics are Catholic, but it may be that Obama only has an issue with white Catholics rather than Hispanic Catholics. If that is the case, the Hispanic variable might be necessary in his model to counteract these effects. On the other hand, Clinton is overperforming John Kerry among lower-education voters. Since Hispanics presently tend to have lower levels of educational achievement than whites do, she may already be getting credit for her strength among Hispanics in those numbers.

Q. The education t-score for Obama seems to show that it’s just barely above statistical significance.

And this is another thing that we have to watch out for … there are also interrelationships between the various demographic variables we have and the fundraising numbers. For example, there is quite a strong correlation between education levels and states where Obama has had the most success in his fundraising, so that may make ‘education’ look like a less important factor in Obama’s numbers than the model is actually giving him credit for.
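To make the point about correlated predictors concrete, here is a minimal Python sketch. The state-level numbers are invented for illustration and are not the model's actual inputs, and `pearson_r` and `variance_inflation` are hypothetical helper names. In a regression with two predictors, the variance of each coefficient is inflated by 1 / (1 - r^2), where r is the correlation between the predictors, so a strong education/fundraising correlation can pull the education t-score down even if education matters.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def variance_inflation(r):
    """Variance inflation factor for a regression with two predictors."""
    return 1.0 / (1.0 - r ** 2)

# Invented values for six hypothetical states (not real data).
college_share = [22.0, 25.0, 27.0, 30.0, 34.0, 38.0]      # % with a college degree
funds_per_capita = [0.40, 0.55, 0.50, 0.80, 1.10, 1.30]   # dollars raised per resident

r = pearson_r(college_share, funds_per_capita)
vif = variance_inflation(r)
print(f"correlation between predictors: {r:.2f}")
print(f"coefficient variance inflated {vif:.1f}x; "
      f"standard error inflated {math.sqrt(vif):.1f}x")
```

A larger standard error means a smaller t-score for the same coefficient, which is why a variable can sit near the significance threshold while still doing real work in the model.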

Q. I was just wondering, if Puerto Rico had electoral votes, what the regression model would show for it.

We can’t really say because we have neither Kerry numbers nor fundraising numbers for Puerto Rico, but I think it’s safe to conclude that Puerto Rico would be a hugely Democratic state. In New York, where a lot of the Hispanic population is Puerto Rican or Dominican, Kerry won the Hispanic vote about 3:1.

Q. What the hell is that t-score?

It’s a measure of statistical significance; see here.
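For readers who want the arithmetic: a coefficient's t-score is the estimate divided by its standard error, and with a reasonably large sample, values beyond roughly plus or minus 2 are conventionally treated as significant at about the 5% level. The sketch below works this out for a one-variable regression. The data are invented, and `slope_t_score` is a hypothetical helper, not anything from the actual model.

```python
import math

def slope_t_score(x, y):
    """Return (slope, standard error, t-score) for a one-variable OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (slope and intercept).
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_err, slope / std_err

# Invented values for eight hypothetical states (not real data).
college_share = [20.0, 22.0, 25.0, 27.0, 30.0, 33.0, 35.0, 38.0]
poll_margin = [-6.0, -1.0, -4.0, 2.0, 1.0, 6.0, 3.0, 9.0]

slope, std_err, t = slope_t_score(college_share, poll_margin)
print(f"slope = {slope:.3f}, standard error = {std_err:.3f}, t = {t:.2f}")
```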

Q. What happened with Maine and Rhode Island? Maine is a Toss-Up for Clinton and Rhode Island close for Obama.

In Rhode Island, you have a lot of relatively downscale white Catholics. Its strongly partisan Democratic lean will probably prevent it from becoming competitive, but it would not shock me if we saw some polling where McCain was close to Obama. Another way to look at this is that the Massachusetts polling has been fairly poor for Obama, and if Massachusetts is bad for him, then Rhode Island should be worse. His fundraising numbers there have also been quite poor.

Clinton, for her part, has raised very little money in Maine, and the model is now more sensitive to the fundraising stuff, so she has been harmed there. Maine is also rural as opposed to suburban, and Clinton’s strength tends to be in the suburbs. Maine has a substantial centrist/independent streak, and I actually can imagine McCain being competitive there if he is able to seize the political center in a McCain-Clinton matchup.

Q. Funny how every time you adjust your model, it favors Obama. Taking cues from PPP?

Actually, the last adjustment I made to the model — weighting it based on the amount of polling data in a state — took a couple of points off of Obama’s numbers.
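One common way to weight a regression by the amount of polling in each state is weighted least squares, where each state's squared error counts in proportion to its weight. The sketch below illustrates that general technique only; it is an assumption, not a description of the site's actual spreadsheet. The predictor, margins, poll counts, and the `weighted_fit` helper are all invented.

```python
def weighted_fit(x, y, w):
    """Weighted least-squares line: returns (intercept, slope)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented data: one predictor, a poll margin, and a poll count per state.
predictor = [1.0, 2.0, 3.0, 4.0]
margin = [1.0, 2.0, 3.0, 10.0]   # the last state looks like an outlier...
poll_count = [5, 5, 5, 1]        # ...and it has been polled only once

_, slope_equal = weighted_fit(predictor, margin, [1, 1, 1, 1])
_, slope_weighted = weighted_fit(predictor, margin, poll_count)
print(f"slope with equal weights:       {slope_equal:.2f}")
print(f"slope weighted by poll count:   {slope_weighted:.2f}")
```

With these made-up numbers, the thinly polled outlier pulls the weighted fit much less than it pulls the equal-weight fit, which is the intent of weighting by polling volume.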

But more importantly, the reason I added in all these new variables is that I’m at much greater risk of biasing the results if I leave something out. For example, what justification do I have for looking at education levels but not income levels, or looking at the effects of the Evangelical population but not the Catholic population, or ignoring the effects of something like age? I don’t have any justification at all for doing that. And while there are some risks of overspecification if I include this many variables, overspecification does not lead to biased estimates, whereas underspecification can.
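The bias argument can be checked with a small simulation. This is a sketch on synthetic data, not the polling model: the variables `x1` and `x2`, the true coefficients (1 and 2), and the helper functions are all invented. It leaves out a relevant variable that is correlated with an included one, then separately includes an irrelevant one, and compares the estimated coefficient on `x1` with its true value of 1.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope_one(x, y):
    """OLS slope of y on a single predictor x (intercept included)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes_two(x1, x2, y):
    """OLS slopes of y on x1 and x2 (intercept included), via normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
# x2 is built to be correlated with x1 (correlation about 0.8).
x2 = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x1]

# Case 1: x2 truly matters (true coefficients 1 and 2) but is left out.
y_both = [a + 2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
under = slope_one(x1, y_both)          # drifts toward 1 + 2 * 0.8 = 2.6
full, _ = slopes_two(x1, x2, y_both)   # stays near the true value 1

# Case 2: x2 is irrelevant (true coefficient 0) but is included anyway.
y_x1_only = [a + rng.gauss(0, 1) for a in x1]
over, _ = slopes_two(x1, x2, y_x1_only)  # still near 1, with a larger standard error

print(f"x1 coefficient, x2 omitted (underspecified):  {under:.2f}")
print(f"x1 coefficient, correct model:                {full:.2f}")
print(f"x1 coefficient, extra variable (overspecified): {over:.2f}")
```

The underspecified fit absorbs the omitted variable's effect into the `x1` coefficient, while the overspecified fit only pays a price in precision.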

Q. What is the data you’re fitting the models on so far? The state poll results so far this year?

Yes, exactly.

Q. What about veteran population?

There actually do seem to be some effects here on the fringes of statistical significance — Obama overperforms slightly in states with strong ties to the military apparatus, perhaps because they have the most concerns about the war in Iraq. But for the time being, I had to drop this variable because of the limitations that I have in MS-EXCEL.

Q. Would you consider rerunning the “win percentage history” graph when you make changes to your methodology?

I realize this is not ideal, but I can’t re-do the entire history because that would take hours and hours to do. Keep in mind that we are still months and months away from the general election. I am trying to get all the model changes out of the way now, so that there aren’t changes in the numbers once they start to be worth paying more attention to.

FiveThirtyEight
