Mark Rank and Thomas Hirschl recently published an estimate that 50% of American kids are on food stamps at some point during their first twenty years of life. Their estimate is based on an analysis of data from the Panel Study of Income Dynamics, from 1968 through 1997.

This news article by Lindsey Tanner provides a good overview.

The findings are important–for one thing, they give a sense of how people’s economic status can fluctuate. But what I want to focus on here are some statistical issues, in particular the question of what makes a statistical estimate more or less trustworthy.

In the political media, and especially at 538 with the work of Nate and his colleagues, we see polls and economic analyses coming at us every week, and there’s always the question of how to build confidence in our numbers. On one hand, raw data from polls or elsewhere can be too raw to be useful (just ask President Kerry); on the other, as our data analysis steps become more complicated, a legitimate worry arises that we’re extrapolating too far.

OK, back to the food stamp study.

The Panel Study of Income Dynamics followed up families annually, so there are kids in the study who were observed at ages 1, 2, . . ., 20. For those kids you can simply count the proportion who were never on food stamps, the proportion who were on food stamps for exactly one year during the first 20 years of their lives, the proportion who were on food stamps for exactly two years, etc.

Rank and Hirschl don’t quite do this; instead they use all their data to estimate the probability of being on food stamps at age 1; then they use all the kids who were in the study for ages 1-2 to estimate the probability of being on food stamps at age 2, given that they were not on food stamps at age 1; . . . and for their last step, they use the subset of kids who were in the study continuously for ages 1-20 to estimate the probability of being on food stamps at age 20, for kids who were not on food stamps for the first 19 years. Multiply together the probabilities of staying off food stamps at each age, subtract from 1, and you have the probability of ever having food stamps.
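To make the step-by-step logic concrete, here is a minimal sketch in Python of this kind of conditional-probability calculation. It is my own illustration, not Rank and Hirschl’s code: the function names, the data layout (one list of yearly true/false values per child, starting at age 1), and the toy data are all assumptions made up for the example.

```python
def estimate_hazards(histories, n_ages=20):
    """Estimate, for each age, P(on food stamps at that age | never on before).

    histories: one list of booleans per child, starting at age 1, with one
    entry per consecutive year observed (True = on food stamps that year).
    This layout is hypothetical, chosen only for illustration. Children
    observed for fewer than n_ages years still contribute to the ages at
    which they were observed.
    """
    hazards = []
    for age in range(n_ages):
        # Children observed at this age who were never on food stamps before it.
        at_risk = [h for h in histories if len(h) > age and not any(h[:age])]
        events = sum(1 for h in at_risk if h[age])
        # If nobody is at risk, fall back to 0.0 (an arbitrary choice for the sketch).
        hazards.append(events / len(at_risk) if at_risk else 0.0)
    return hazards


def prob_ever(hazards):
    """Combine the age-specific conditional probabilities into P(ever on food stamps)."""
    never = 1.0
    for h in hazards:
        never *= 1.0 - h  # probability of staying off through this age
    return 1.0 - never


# Made-up toy data covering ages 1-2 only; two children drop out after age 1.
toy_histories = [[True], [False, True], [False, False], [False]]
toy_hazards = estimate_hazards(toy_histories, n_ages=2)
toy_estimate = prob_ever(toy_hazards)
```

In the toy data, one of four kids is on food stamps at age 1, and one of the two kids still observed and still at risk is on at age 2, so the estimate is 1 - (3/4)(1/2) = 0.625. Note how the kids who drop out after age 1 still inform the age-1 probability; that is the efficiency gain.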

This is all fine–it’s an efficient use of the data they have–but I’d be a bit more confident in Rank and Hirschl’s estimates if they would cross-check by doing some raw-data calculations based on the subset of kids who were in the study continuously for ages 1-20. That’s a crucial component in any applied statistical analysis–the continuous thread connecting the raw numbers to the final estimate–and I always like to see it, especially for a politically charged subject such as this one. But really this isn’t much different from my comment on the basketball halftime study: I’ll believe the fancy analysis a lot more if I see the connection to the data.
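The raw-data cross-check I have in mind is nothing more than counting. A rough sketch, again under my own assumed data layout (one list of yearly true/false values per child, starting at age 1) and with invented toy numbers rather than anything from the study:

```python
from collections import Counter


def raw_summary(histories, n_ages=20):
    """Raw counts using only children observed for all n_ages years.

    Returns (number of complete cases, proportion ever on food stamps,
    {years on food stamps: proportion of complete cases}).
    The data layout is hypothetical, for illustration only.
    """
    complete = [h for h in histories if len(h) == n_ages]
    if not complete:
        raise ValueError("no children were observed for all ages")
    n = len(complete)
    # sum() over booleans counts the years each child was on food stamps.
    years_on = Counter(sum(h) for h in complete)
    ever = 1.0 - years_on[0] / n
    distribution = {k: years_on[k] / n for k in sorted(years_on)}
    return n, ever, distribution


# Made-up toy data for ages 1-3; the last child is dropped as incomplete.
toy = [
    [False, False, False],
    [True, False, False],
    [True, True, False],
    [False, False, False],
    [True],
]
n_complete, raw_ever, raw_distribution = raw_summary(toy, n_ages=3)
```

Setting a number like `raw_ever` next to the model-based estimate, along with the count of complete cases behind it, is the sort of connecting thread I’m asking for. The two need not agree exactly, since kids who stay in a panel for 20 years may differ from those who don’t, but a large gap would want explaining.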

Here are the key results from the study:

(from Table 1): 12% of newborns were on food stamps. 49% of kids were on food stamps for at least one year between ages 1-20. 23% of kids were on food stamps for at least 5 years.

(from Table 2): 8% of white newborns and 33% of black newborns were on food stamps. 37% of white kids and 90% of black kids were on food stamps for at least one year between ages 1-20.

(from Table 3): Among the black kids of unmarried parents where the head of household did not graduate from high school, 99.6% were on food stamps.

Again, I don’t know how much to believe these numbers, but I assume that they’re not too far from what was really happening in those years. I’m not at all trying to say that Rank and Hirschl’s numbers are wrong, just that they’d be more believable if accompanied by a clear path connecting them to the raw data.

Also, whassup with those superfluous decimal places? “22.8%” and all the rest? Doesn’t anybody teach these people about sampling variation and significant digits? (I guess I should let them off the hook, given that the entire economics profession seems to have this problem too.)
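For a back-of-the-envelope sense of why the extra digit is noise, here is the textbook standard error of a proportion. The sample size of 1000 is purely hypothetical (I am not quoting the study’s actual sample size), and the formula assumes simple random sampling, which ignores both the panel design and the modeling steps.

```python
import math


def proportion_se(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical n = 1000; NOT the study's sample size.
se = proportion_se(0.228, 1000)
```

Under those assumptions the standard error is roughly 1.3 percentage points, so the “.8” in “22.8%” is well within the noise, and reporting “23%” loses nothing.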

P.S. To clarify: I’m not saying that a raw-data calculation would be *better* than Rank and Hirschl’s model-based analysis. What I’m saying is that I’d like to see the raw calculation, along with an explanation of any ways the estimate changed when the model was put in. The model may very well correct for biases and reduce variance; I’d just like to understand how that’s happening, rather than just have to take the numbers on faith.