little p-value
what are you trying to say
of significance?
— Stephen Ziliak, Roosevelt University economics professor
How many statisticians does it take to ensure at least a 50 percent chance of a disagreement about p-values? According to a tongue-in-cheek assessment by statistician George Cobb of Mount Holyoke College, the answer is two … or one. So it’s no surprise that when the American Statistical Association gathered 26 experts to develop a consensus statement on statistical significance and p-values, the discussion quickly became heated.
It may sound crazy to get indignant over a scientific term that few lay people have even heard of, but the consequences matter. The misuse of the p-value can drive bad science (there was no disagreement over that), and the consensus project was spurred by a growing worry that in some scientific fields, p-values have become a litmus test for deciding which studies are worthy of publication. As a result, research whose p-values clear an arbitrary threshold is more likely to be published, while studies of equal or greater scientific importance may remain in the file drawer, unseen by the scientific community.
The results can be devastating, said Donald Berry, a biostatistician at the University of Texas MD Anderson Cancer Center. “Patients with serious diseases have been harmed,” he wrote in a commentary published today. “Researchers have chased wild geese, finding too often that statistically significant conclusions could not be reproduced.” Faulty statistical conclusions, he added, have real economic consequences.
“The p-value was never intended to be a substitute for scientific reasoning,” the ASA’s executive director, Ron Wasserstein, said in a press release. On that point, the consensus committee members agreed, but statisticians have deep philosophical differences about the proper way to approach inference and statistics, and “this was taken as a battleground for those different views,” said Steven Goodman, co-director of the Meta-Research Innovation Center at Stanford. Much of the dispute centered around technical arguments over frequentist versus Bayesian methods and possible alternatives or supplements to p-values. “There were huge differences, including profoundly different views about the core problems and practices in need of reform,” Goodman said. “People were apoplectic over it.”
The group debated and discussed the issues for more than a year before finally producing a statement they could all sign. They released that consensus statement on Monday, along with 20 additional commentaries from members of the committee. The ASA statement is intended to address the misuse of p-values and promote a better understanding of them among researchers and science writers, and it marks the first time the association has taken an official position on a matter of statistical practice. The statement outlines some fundamental principles regarding p-values.
Among the committee’s tasks: Selecting a definition of the p-value that nonstatisticians could understand. They eventually settled on this: “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.” That definition is about as clear as mud (I stand by my conclusion that even scientists can’t easily explain p-values), but the rest of the statement and the ideas it presents are far more accessible.
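The definition is easier to see in action than to parse. Here's a minimal sketch (our own illustration, not part of the ASA statement, using made-up numbers and only Python's standard library) of a permutation test. The "specified statistical model" is the null assumption that group labels are interchangeable, the "statistical summary" is the difference in group means, and the p-value is the fraction of label shufflings that produce a difference at least as extreme as the one observed:

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a p-value for the difference in group means by permutation.

    Under the null model (group labels are interchangeable), the p-value is
    the fraction of random relabelings whose mean difference is at least as
    extreme as the difference actually observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # one random relabeling of the pooled data
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / len(group_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Two hypothetical samples with nearly identical means: a difference this
# small is unsurprising under the null model, so the p-value comes out large.
p = permutation_p_value([5.1, 4.9, 5.3, 5.0, 5.2], [5.0, 5.4, 4.8, 5.1, 5.3])
```

Note what the number does and doesn't say: a large p here means the observed difference is unremarkable if the null model holds, nothing more.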
One of the most important messages is that the p-value cannot tell you if your hypothesis is correct. Instead, it’s the probability of your data given your hypothesis. That sounds tantalizingly similar to “the probability of your hypothesis given your data,” but they’re not the same thing, said Stephen Senn, a biostatistician at the Luxembourg Institute of Health. To understand why, consider this example. “Is the pope Catholic? The answer is yes,” said Senn. “Is a Catholic the pope? The answer is probably not. If you change the order, the statement doesn’t survive.”
A common misconception among nonstatisticians is that p-values can tell you the probability that a result occurred by chance. This interpretation is dead wrong, but you see it again and again and again and again. The p-value only tells you something about the probability of seeing your results given a particular hypothetical explanation — it cannot tell you the probability that the results are true or whether they’re due to random chance. The ASA statement’s Principle No. 2: “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.”
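A little back-of-the-envelope arithmetic shows how far apart the two probabilities can be. The numbers below are illustrative assumptions of ours (not the ASA's): suppose only 10 percent of the hypotheses a field tests are true, studies detect true effects 80 percent of the time, and everyone uses the conventional 0.05 threshold. Then well over a third of "significant" findings are false, even though each one cleared p < 0.05:

```python
# Illustrative (assumed) numbers, in the spirit of standard base-rate examples:
prior_true = 0.10  # fraction of tested hypotheses that are actually true
power = 0.80       # P(significant result | hypothesis true)
alpha = 0.05       # P(significant result | hypothesis false)

# Bayes' rule: among results that came out "significant," what fraction
# correspond to a true hypothesis?
p_significant = prior_true * power + (1 - prior_true) * alpha
p_true_given_significant = prior_true * power / p_significant
# Here that fraction is 0.64 -- so 36 percent of "significant" findings are
# false positives, despite every one of them having p < 0.05.
```

The 0.05 in the analysis is a statement about the data given the null hypothesis; the 64 percent is a statement about the hypothesis given the data, and it depends on the prior, which the p-value alone cannot supply.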
Nor can a p-value tell you the size of an effect, the strength of the evidence or the importance of a result. Yet despite all these limitations, p-values are often used as a way to separate true findings from spurious ones, and that creates perverse incentives. When the goal shifts from seeking the truth to obtaining a p-value that clears an arbitrary threshold (0.05 or less is considered “statistically significant” in many fields), researchers tend to fish around in their data and keep trying different analyses until they find something with the right p-value, as you can see for yourself in a p-hacking tool we built last year.
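The incentive problem is easy to demonstrate. Here's a sketch of ours (pure-noise data, standard-library Python, a simple z-test rather than any particular published analysis) showing that if a researcher measures enough unrelated outcomes, the 0.05 threshold reliably hands out a few "significant" findings even when there is nothing to find:

```python
import math
import random

def z_test_p_value(sample, null_mean=0.0, sigma=1.0):
    """Two-sided p-value for a z-test of the sample mean (sigma known)."""
    n = len(sample)
    z = (sum(sample) / n - null_mean) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

rng = random.Random(42)
n_outcomes = 100  # one noise-only "outcome" per analysis attempt
false_positives = sum(
    z_test_p_value([rng.gauss(0, 1) for _ in range(30)]) < 0.05
    for _ in range(n_outcomes)
)
# Every effect here is pure noise, yet with a 0.05 cutoff we expect roughly
# 5 of the 100 outcomes to come out "statistically significant" by chance.
```

Report only the outcomes that cleared the bar and the published record looks like a string of discoveries; that selection, not the p-value itself, is what Mayo's commentary calls a biasing selection effect.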
Indeed, many of the ASA committee’s members argue in their commentaries that the problem isn’t p-values, just the way they’re used — “failing to adjust them for cherry picking, multiple testing, post-data subgroups and other biasing selection effects,” as Deborah Mayo, a philosopher of statistics at Virginia Tech, puts it. When p-values are treated as a way to sort results into bins labeled significant or not significant, the vast efforts to collect and analyze data are degraded into mere labels, said Kenneth Rothman, an epidemiologist at Boston University.
The 20 commentaries published with the ASA statement present a range of ideas about where to go from here. Some committee members argued that there should be a move to rely more on other measures, such as confidence intervals or Bayesian analyses. Others felt that switching to something else would only shift the problem around. “The solution is not to reform p-values or to replace them with some other statistical summary or threshold,” wrote Columbia University statistician Andrew Gelman, “but rather to move toward a greater acceptance of uncertainty and embracing of variation.”
If there’s one takeaway from the ASA statement, it’s that p-values are not badges of truth and p < 0.05 is not a line that separates real results from false ones. They’re simply one piece of a puzzle that should be considered in the context of other evidence.
This story began with a haiku from one of the p-value document’s companion responses; let’s end it with a limerick by University of Michigan biostatistician Roderick Little.
In statistics, one rule did we cherish:
P point oh five we publish, else perish!
Said Val Johnson, “that’s out of date,
Our studies don’t replicate,
P point oh oh five, then null is rubbish!”
Even the Supreme Court has weighed in, unanimously ruling in 2011 that statistical significance does not automatically equate to scientific or policy importance.

CORRECTION (March 7, 11:05 a.m.): An earlier version of this article misstated the university where Deborah Mayo is a professor. She teaches at Virginia Tech, not the University of Pennsylvania.