It is no secret that the United States has a weight problem. Roughly 30 percent of American adults are clinically obese, or have a body mass index of at least 30. That’s more than 175 pounds for someone who’s 5 foot 4, the average height of an American woman; or more than 203 pounds for someone who’s 5 foot 9, the average height of an American man. Obesity is associated with a whole host of health issues — diabetes, sleep apnea, stroke, heart attack and on and on — meaning that for many of us, diet is the real killer.
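
The weight figures above follow from the standard body mass index formula in imperial units (BMI = 703 × pounds / inches²). A minimal sketch to check the arithmetic; the function name `weight_at_bmi` is just an illustrative label:

```python
def weight_at_bmi(bmi: float, height_in: float) -> float:
    """Weight in pounds that corresponds to a given BMI at a given height in inches."""
    # Rearranged from BMI = 703 * weight_lb / height_in**2
    return bmi * height_in ** 2 / 703.0


for label, inches in [("5 ft 4 in", 64), ("5 ft 9 in", 69)]:
    print(f"{label}: BMI 30 is about {weight_at_bmi(30, inches):.0f} lb")
```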

Obesity rates have risen over time, especially in children. The disease we now refer to as Type 2 diabetes used to be called “adult-onset diabetes.” One of the main reasons for the name change is that the disease became increasingly common in children, driven by obesity.

With this backdrop, the headlines last month about declines in childhood obesity were remarkable and encouraging. Here’s how The New York Times described it: “Obesity Rate in Young Children Plummets 43% in a Decade.” This and other articles went on to describe a huge decrease — from 13.9 percent to 8.4 percent — in the obesity rate for children between the ages of 2 and 5. The articles cited all sorts of reasons for the drop, including that kids are drinking less soda and that more mothers are engaged in breastfeeding. First lady Michelle Obama’s push for kids to exercise more and eat healthier foods also got credit.

You could probably challenge all of these explanations, but as an economist trained in examining health care statistics, I was curious to answer a more basic question: Is this decline in childhood obesity real? Are toddlers in the U.S. any less obese than they were a decade ago?

The original paper, which was published in JAMA, The Journal of the American Medical Association, is a relatively straightforward analysis of the most recent National Health and Nutrition Examination Survey, or NHANES. (If you’re curious, the raw data are publicly available.) Conducted every two years, the federal health survey directly measures respondents’ weight and height, as opposed to relying on self-reported data, which can be unreliable. The JAMA paper’s authors summarize obesity rates for adults and children from the 2011/2012 survey, and compare them to rates from an earlier 2003/2004 survey.

The overall obesity rate among those surveyed over the 10-year period was unchanged. The rates were also flat or increasing in virtually all subgroups: unchanged among children overall (ages 2 to 19); unchanged among school-age children (6 to 19); unchanged among working-age adults (20 to 59), and slightly increased among older adults (over 60). The only group that showed a notable decrease was children ages 2 to 5.

This paints a slightly different and perhaps less encouraging picture than the one the headlines suggested. At the same time, this very young cohort may be one we care a lot about, since habits of young children can last a lifetime, and the change in this group was large. But again, was it real?

To answer this question, it’s useful to understand the data underlying these results. Imagine you wanted to know exactly how much obesity had changed in the U.S. from 2003 to 2011. The only way you could know for sure would be to weigh every person — all 300-some million of us — in 2003 and then again in 2011. With this information, you could say how much obesity had changed overall, and in each age group.

But of course, this approach is impossible. Instead, the way we judge trends in obesity (or unemployment, or the size of the labor force, or lots of other things) is to survey a random sample of Americans and then draw conclusions about the overall U.S. population. The NHANES survey, the data on which this study is based, covers about 10,000 individuals.

The value of choosing the sample randomly is that, on average, the experience of the sample will be the same as the experience of the overall U.S. population. But because the sample is, after all, still a sample and doesn’t include everyone, it has some noise — some error. Even if there is no change in the overall weight of Americans, there will always be some small changes in the observed sample.
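
To see what that noise looks like, here is a small simulation sketch. It assumes a made-up population whose true obesity rate is fixed at 30 percent and draws five independent random samples of 10,000 people. The rate, the seed and the simple random sampling are all assumptions for illustration; this is not how NHANES actually selects respondents.

```python
import random


def sample_rate(true_rate: float, n: int, rng: random.Random) -> float:
    """Observed obesity rate in one simple random sample of n people."""
    obese = sum(rng.random() < true_rate for _ in range(n))
    return obese / n


rng = random.Random(0)  # fixed seed so the run is repeatable
# The true rate never changes, yet each sample reports a slightly different number.
rates = [sample_rate(0.30, 10_000, rng) for _ in range(5)]
print([f"{r:.3f}" for r in rates])
```

Each printed rate should land close to 0.30 but not exactly on it; the spread between samples is the sampling error.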

Researchers often use something called a “t-test” to examine how much of a studied trend is attributable to real changes in the population, as opposed to noise and chance in the sample. The t-test produces another statistic called a “p-value,” which tells you the statistical significance of the observed trend. Typically p-values below .05 suggest statistical significance.
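
As a rough illustration of how a p-value comes out of a comparison like this, the sketch below uses a two-proportion z-test, a large-sample stand-in for the t-test mentioned above. It is not the survey-weighted procedure the JAMA authors used, and the counts are invented for the example.

```python
import math


def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for the difference between two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p1 - p2) / se
    # Two-sided tail area of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: 30 of 100 obese in the first survey, 15 of 100 in the second.
p = two_proportion_p_value(30, 100, 15, 100)
print(f"p-value: {p:.3f}", "(below .05)" if p < 0.05 else "(not below .05)")
```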

The statistics up to this point are pretty straightforward, but when you start looking at smaller groups within the data — in this case, say, 1-year-olds, 2-year-olds and so on — things get a little hairier.

If you look at each age group separately, you’re likely to find large changes in some of them. That’s because as you select smaller and smaller groups within a data set, your problem with noise worsens. For example, say you survey 81 people who are 34 years old in 2003, and 85 people age 34 in 2011. What if you just happen to pick a particularly obese group in 2003 and a thinner group in 2011? With just 80-some people, it’s easy to see how this could happen, and yet you might inappropriately conclude that 34-year-olds in the broader U.S. population have become less obese.
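
A quick simulation sketch of that scenario: both groups of 34-year-olds are drawn from the same hypothetical population, so nothing has really changed, and we count how often the two samples still differ by 10 percentage points or more. The 30 percent true rate, the 10-point cutoff and the seed are assumptions chosen for illustration.

```python
import random


def observed_gap(true_rate: float, n_first: int, n_second: int, rng: random.Random) -> float:
    """Difference in observed obesity rates between two samples from the same population."""
    first = sum(rng.random() < true_rate for _ in range(n_first)) / n_first
    second = sum(rng.random() < true_rate for _ in range(n_second)) / n_second
    return first - second


rng = random.Random(1)  # fixed seed so the run is repeatable
# 81 people in the first survey and 85 in the second, as in the example above.
gaps = [observed_gap(0.30, 81, 85, rng) for _ in range(2_000)]
big = sum(abs(g) >= 0.10 for g in gaps) / len(gaps)
print(f"share of simulated survey pairs with a gap of 10+ points: {big:.2f}")
```

With groups this small, a sizable share of the simulated pairs should show a double-digit gap even though the underlying rate is identical.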

One way to prevent yourself from drawing this erroneous conclusion is to apply a more rigorous t-test to the data, one that takes into account the fact that you looked at many subsets. In the JAMA paper, the authors did not do this. They tested for statistical differences within their smaller age groups, but interpreted their p-values relative to the standard significance level of .05. In the case of the 2- to 5-year-olds, the authors reported a p-value of .03. If this were an analysis of the overall NHANES, that p-value *should* be taken to indicate a significant change. But since the authors are slicing the data and doing multiple tests — they consider at least six different age groups — the statistics need to be adjusted to account for that. With that adjustment, they’d need a p-value of less than .008 to conclude an effect is significant.^{1}
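
One common adjustment that yields a threshold of this size is the Bonferroni correction, which divides the significance level by the number of tests. Assuming that is the adjustment meant here, the arithmetic with six age groups looks like this:

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / num_tests


threshold = bonferroni_threshold(0.05, 6)  # six age groups, per the text above
reported_p = 0.03                          # p-value reported for 2- to 5-year-olds

print(f"adjusted threshold: {threshold:.4f}")
print("significant at .05:", reported_p < 0.05)
print("significant after adjustment:", reported_p < threshold)
```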

There’s another way to figure out if your conclusions reflect real trends in the population: Get more data. Again, let’s say you saw a big change in obesity among 34-year-olds surveyed in 2003 and 2011. If this change were real — if it really reflected changes in obesity among 34-year-olds in the entire U.S. population — you should see it persist when surveying another group of people two years later, in 2013. On the other hand, if the change weren’t real — if it just reflected statistical error — then you should see the obesity rate revert toward the 2003 level in 2013.

So to judge whether the change in the obesity rate was in fact statistically significant, I replicated what the researchers did in their JAMA paper with NHANES data from an earlier period: the 1999/2000 and 2007/2008 surveys. Then I could see whether any trends during that period persisted in the subsequent 2009/2010 survey. I analyzed changes in obesity rates, either up or down, for all four-year age groups (2- to 5-year-olds, 6- to 9-year-olds, and so on, up to 79- to 82-year-olds) over this decade.

I then identified two groups in the data that showed changes in obesity rates by the .05 significance standard applied in this paper.^{2} Both groups — 6- to 9-year-olds and 18- to 21-year-olds — showed increases in obesity over the 10-year period.

But what happens to their obesity rates in the following 2009/2010 survey? They decline. The change in obesity rates between the 1999/2000 and 2009/2010 surveys is not statistically significant in either group. The apparent increases in the earlier period were just noise in the data, produced by the fact that I went through and picked two groups with significant changes.^{3}

So what does this have to do with the decline in toddler obesity that the JAMA paper’s authors reported last month? In light of the discussion above, we have to ask whether it’s just one of the spurious significant changes you’re *always* going to find if you look at small enough groupings within a data set.
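
To get a feel for how often spurious "significant" changes turn up, the sketch below simulates 20 age groups in which the true obesity rate never changes and tests each group at the .05 level. It uses a large-sample two-proportion z-test as a stand-in for the actual analysis; the group size of 300, the 30 percent rate and the seed are assumptions for illustration.

```python
import math
import random


def p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(num_groups: int, n: int, true_rate: float,
                          rng: random.Random, alpha: float = 0.05) -> int:
    """Number of groups flagged as 'significant' when nothing actually changed."""
    flagged = 0
    for _ in range(num_groups):
        before = sum(rng.random() < true_rate for _ in range(n))
        after = sum(rng.random() < true_rate for _ in range(n))
        if p_value(before, n, after, n) < alpha:
            flagged += 1
    return flagged


rng = random.Random(2)  # fixed seed so the run is repeatable
runs = [count_false_positives(20, 300, 0.30, rng) for _ in range(200)]
average = sum(runs) / len(runs)
at_least_one = sum(r > 0 for r in runs) / len(runs)
print(f"average falsely significant groups per analysis: {average:.2f}")
print(f"share of analyses with at least one: {at_least_one:.2f}")
```

At a .05 threshold, roughly one group in 20 should be flagged by chance alone, so most simulated analyses turn up at least one "significant" change that isn't real.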

Unfortunately, it seems possible that it is. Although my analysis certainly doesn’t mean the study’s conclusions are wrong, it does raise questions about whether the drop in obesity in the sample is representative of a drop in obesity among 2- to 5-year-olds in the U.S. overall. To believe that this is a consistent, long-lasting change, we would want to subject the data to the more rigorous statistical testing I describe above. And by that standard, it fails. It may be that the health of American toddlers is improving, and I hope it is, but based on the NHANES data, there’s no strong reason to think this is the case.