Mark Blumenthal has a good article up at the National Journal about that controversial USA Today/Gallup poll released last week that showed John McCain with a 4-point lead in its likely voter sample. It turns out that only 10 percent of the likely voter sample consisted of voters between ages 18 and 29. By contrast, teens and twentysomethings represented somewhere between 16 and 18 percent of the electorate in 2004. And that number is likely to go up rather than down this time around, as youth turnout in the primaries increased by 52 percent as a share of the Democratic electorate.

Now, I appreciate that Gallup is willing to disclose so much about their methodology — it certainly opens them up to more criticism. Having said that, winding up with a sample that understates the youth vote by perhaps 30-50 percent is pretty much prima facie evidence that something has gone awry. Indeed, I’m really not a fan of the Gallup likely voter model at all.

What Gallup does is essentially as follows. Suppose that the entire electorate consists of five voters. Gallup has an algorithm by which they estimate each voter’s likelihood of participating. So what you might get is something like this:

Voter A – 70% chance of voting

Voter B – 50% chance of voting

Voter C – 90% chance of voting

Voter D – 60% chance of voting

Voter E – 80% chance of voting

For my money, the most logical way to handle this — if you’re going to apply some sort of likely voter model at all — would be to multiply each voter’s response by their likelihood of participating. So voter A would be counted at 70 percent weight, voter B at 50 percent weight, and so forth.
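Here is a rough sketch of that weighting idea in Python. The turnout probabilities come from the five-voter example; the candidate preferences and the function name `weighted_shares` are invented purely for illustration, so treat the resulting numbers as a toy, not as anything from the poll.

```python
# Turnout probabilities from the five-voter example.
turnout = {"A": 0.70, "B": 0.50, "C": 0.90, "D": 0.60, "E": 0.80}

# Hypothetical candidate preferences, made up for this sketch only.
preference = {"A": "McCain", "B": "Obama", "C": "McCain", "D": "Obama", "E": "Obama"}


def weighted_shares(turnout, preference):
    """Count each respondent's answer in proportion to their chance of voting."""
    totals = {}
    for voter, prob in turnout.items():
        candidate = preference[voter]
        totals[candidate] = totals.get(candidate, 0.0) + prob
    total_weight = sum(totals.values())
    return {candidate: weight / total_weight for candidate, weight in totals.items()}


shares = weighted_shares(turnout, preference)
for candidate, share in sorted(shares.items()):
    print(f"{candidate}: {share:.1%}")
```

Every respondent contributes something here; nobody is counted at full weight and nobody is dropped outright.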

What Gallup does instead is to rank the voters in order, and to set an arbitrary cutoff point for how many voters they want to have in their sample. Assuming, for instance, they’re targeting 60 percent turnout, that might look something like this:

Voter C – 90% chance of voting

Voter E – 80% chance of voting

Voter A – 70% chance of voting

———————————

Voter D – 60% chance of voting

Voter B – 50% chance of voting

Voters A, C and E would be included in the likely voter sample (and counted at full weight); voters B and D are dropped.

I think that this winds up throwing away good information. We know that Voter A isn’t that much more likely to participate than Voter D, but Voter A is counted at full weight, and Voter D isn’t included at all.
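To make the contrast concrete, here is a minimal sketch of the rank-and-cutoff step as I have described it. This follows the simplified description in this post, not Gallup's actual scoring procedure, and the function name `cutoff_weights` is my own invention.

```python
# Turnout probabilities from the five-voter example.
turnout = {"A": 0.70, "B": 0.50, "C": 0.90, "D": 0.60, "E": 0.80}


def cutoff_weights(turnout, target_turnout):
    """Give full weight to the top-ranked voters and zero weight to the rest."""
    ranked = sorted(turnout, key=turnout.get, reverse=True)
    keep = round(len(ranked) * target_turnout)  # 60% of 5 voters -> keep 3
    included = set(ranked[:keep])
    return {voter: (1.0 if voter in included else 0.0) for voter in turnout}


cutoff = cutoff_weights(turnout, 0.60)

# Compare the weight each voter gets under the cutoff with their turnout estimate.
for voter in sorted(turnout, key=turnout.get, reverse=True):
    print(f"Voter {voter}: turnout estimate {turnout[voter]:.0%}, cutoff weight {cutoff[voter]:.0f}")
```

A 10-point gap in estimated turnout between Voter A and Voter D becomes the difference between a weight of 1 and a weight of 0.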

A larger problem arises, however, if there is some kind of systematic pattern in which voters tend to wind up in which buckets. For example, suppose that Gallup’s scoring is such that you wind up with something like this:

Voter M1 — Mature Voter — 65% chance of voting

Voter M2 — Mature Voter — 65% chance of voting

Voter M3 — Mature Voter — 65% chance of voting

Voter Y1 — Young Voter — 55% chance of voting

Voter Y2 — Young Voter — 55% chance of voting

In this case, the three mature voters would all be included in the model, while the two young voters would be dropped — even though there is a rather small difference in their respective likelihood of voting.
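The same contrived example can be run through both approaches to see what happens to the young voters' share of the sample. This is only a sketch: the function names are hypothetical, and the cutoff of three voters simply mirrors the example, where the three mature voters make it in.

```python
# The five contrived voters: (name, group, estimated chance of voting).
voters = [
    ("M1", "mature", 0.65),
    ("M2", "mature", 0.65),
    ("M3", "mature", 0.65),
    ("Y1", "young", 0.55),
    ("Y2", "young", 0.55),
]


def young_share_cutoff(voters, keep):
    """Share of young voters when only the top `keep` voters are retained."""
    ranked = sorted(voters, key=lambda v: v[2], reverse=True)[:keep]
    return sum(1 for v in ranked if v[1] == "young") / len(ranked)


def young_share_weighted(voters):
    """Share of young voters when each voter is weighted by turnout probability."""
    total = sum(v[2] for v in voters)
    return sum(v[2] for v in voters if v[1] == "young") / total


cutoff_share = young_share_cutoff(voters, keep=3)
weighted_share = young_share_weighted(voters)
print(f"Young share, cutoff:   {cutoff_share:.0%}")
print(f"Young share, weighted: {weighted_share:.0%}")
```

Young voters are 40 percent of the raw sample. Weighting by turnout probability trims that to roughly 36 percent; the cutoff takes it to zero.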

That was just a contrived example — but Gallup’s methodology could prove to be very problematic if there is any sort of Long Tail effect in voting patterns.

That is, suppose you have a small group of core voters who are nearly certain to vote, coupled with a larger group of non-core voters, any one of whom might not be all that likely to participate, but who collectively will make up a fairly large fraction of the electorate. If the voters toward the head of the curve tend toward one party (say, the Republicans), and the voters toward the tail of the curve tend toward another (say, the Democrats), you’re going to wind up with a skewed sample if you set an arbitrary cutoff point somewhere in between.
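A toy version of that scenario, with numbers I have made up entirely for illustration: 30 core voters who each have a 90 percent chance of voting and all lean Republican, and 70 non-core voters who each have a 30 percent chance and all lean Democratic. Real electorates are nowhere near this tidy; the point is only to show the direction of the skew.

```python
# Hypothetical electorate, invented for this sketch: (party, turnout probability, count).
groups = [("R", 0.90, 30), ("D", 0.30, 70)]
voters = [(party, prob) for party, prob, count in groups for _ in range(count)]

# Expected number of people who actually vote: 30 * 0.9 + 70 * 0.3 = 48.
expected_voters = round(sum(prob for _, prob in voters))


def republican_share_cutoff(voters, keep):
    """Republican share when only the `keep` most likely voters are retained."""
    ranked = sorted(voters, key=lambda v: v[1], reverse=True)[:keep]
    return sum(1 for party, _ in ranked if party == "R") / len(ranked)


def republican_share_weighted(voters):
    """Republican share when each voter is weighted by turnout probability."""
    total = sum(prob for _, prob in voters)
    return sum(prob for party, prob in voters if party == "R") / total


cutoff_share = republican_share_cutoff(voters, keep=expected_voters)
weighted_share = republican_share_weighted(voters)
print(f"Republican share, cutoff:   {cutoff_share:.1%}")
print(f"Republican share, weighted: {weighted_share:.1%}")
```

Even with the cutoff set at exactly the expected turnout, all 30 core voters get in at full weight while only 18 of the 70 tail voters do, so the cutoff sample comes out 62.5 percent Republican against about 56 percent under probability weighting.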