Last week, I made the case that science isn’t broken; it’s just hard. Today in the journal Science, a group of more than 350 researchers published a paper demonstrating the challenges of a core step in scientific inquiry: replication.
The project, spearheaded by the Open Science Collaboration, aimed to replicate 100 studies published in three high-profile psychology journals during 2008. The idea arose amid a growing concern that psychology has a false-positive problem: In recent years, important findings in the field have been called into question when follow-up studies failed to replicate them, hinting that the original studies may have mistaken spurious effects for real ones.
“The idea was to see whether there was a reproducibility problem, and if so, to stimulate efforts to address it,” project leader Brian Nosek told me. In total, 270 co-authors and 86 volunteers contributed to the effort.
This wasn’t a game of “gotcha.” “Failing to replicate does not mean that the original study was wrong or even flawed,” Nosek told me, and the objective here wasn’t to overturn anyone’s results or call out particular studies. The project was designed to conduct fair and direct replications, Nosek said. “Before we began, we tried to define a protocol to follow so that we could be confident that every replication we did had a fair chance of success.” Before embarking on their studies, replicators contacted the original authors and asked them to share their study designs and materials. Almost all complied.
Researchers who conducted the replication studies also asked the original authors to scrutinize the replication plan and provide feedback, and they registered their protocols in advance, publicly sharing their study designs and analysis strategies. “Most of the original authors were open and receptive,” project coordinator Mallory Kidwell told me.
Despite this careful planning, less than half of the replication studies reproduced the original results. While 97 percent of the original studies produced results with a “statistically significant” p-value of 0.05 or less,1 only 36 percent of the replication studies did the same. The mean effect sizes in the replicated results were less than half those of the original results, and 83 percent of the replicated effects were smaller than the original estimates.
These replication studies can’t explain why any particular finding was not reproduced, but there are three general possibilities, Nosek said. The originally reported result could have been a false positive, the replication attempt may have produced a false negative (failing to find an effect where one does exist), or the original study and the replication could both be correct but arrive at disparate results because of differences in methodology or conditions that weren’t apparent.
The best predictor of replication success, Nosek told me, was the strength of the original evidence, as measured by factors such as the p-value. Yes, the p-value — that notoriously misleading statistic. This study suggests that p-values can provide useful information, Nosek said. “If it was good for nothing, it wouldn’t have shown any predictive value at all for reproducibility.”
At the same time, this project’s results serve as a stark reminder that the 0.05 threshold for p-values is arbitrary. “What it suggests is that when we get a p-value of 0.04, we should be more skeptical than when we get a lower value,” Nosek said.
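One way to see the intuition behind Nosek’s point is a quick simulation. The sketch below is my own hypothetical illustration, not part of the project’s analysis: it mixes a pool of null effects with real but modest ones, publishes only the “significant” originals, and then checks how often barely significant results (p between 0.01 and 0.05) replicate compared with strongly significant ones. All the parameters — the 80 percent null rate, the effect size of 0.5, the sample size of 30 — are made up for illustration.

```python
import math
import random
import statistics

# Hypothetical simulation (not from the project's data): compare how often
# "barely significant" originals replicate versus strongly significant ones.
random.seed(2)

def one_sample_p(xs):
    """Two-sided p-value for a one-sample z-test against a mean of zero."""
    z = statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(len(xs)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def study(effect, n=30):
    """Run one study: n observations scattered around the true effect size."""
    return one_sample_p([random.gauss(effect, 1) for _ in range(n)])

weak, strong = [], []
for _ in range(4000):
    # Assume 80 percent of tested ideas are truly null; the rest are modest.
    effect = 0.0 if random.random() < 0.8 else 0.5
    p = study(effect)
    if p < 0.05:  # only significant originals get "published" and replicated
        replicated = study(effect) < 0.05  # same design, fresh data
        (weak if p > 0.01 else strong).append(replicated)

print(f"replication rate, 0.01 < p < 0.05: {sum(weak) / len(weak):.2f}")
print(f"replication rate, p <= 0.01:       {sum(strong) / len(strong):.2f}")
```

Under these made-up assumptions, the barely significant results replicate noticeably less often, because the p-just-under-0.05 pool is more contaminated with false positives — which is exactly why a p-value of 0.04 deserves more skepticism than a much smaller one.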
And, of course, the study provides insights into the prevalence of false positives in psychology research. “I think this is probably about the best estimate that we have at this point,” said Norbert Schwarz, a psychologist at the University of Southern California who was not involved in the project and calls himself a “skeptical observer of the ‘replication movement.’” Schwarz has criticized some previous replication attempts but says this one is “a much more credible and serious attempt than these often haphazard replications driven by a certain vigilante spirit.”
While the results don’t suggest that psychology is a bunch of rubbish, the project does document that psychology has a problem, University of Oxford psychologist Dorothy Bishop told me. “We shouldn’t be sitting around saying who’s to blame, but taking steps to counteract the difficulties.”
Registered reports2 are one approach to reducing reproducibility problems that has been adopted at more than a dozen journals. In a registered study, protocols, methods and data analysis plans are registered and peer-reviewed before data is collected.
“It does make your science much better, because you’re forced to think through your protocol in advance, and you get refereeing and all these useful suggestions before it’s too late,” Bishop said. Committing to methods in advance can also cut down on p-hacking — adjusting the parameters of the analysis until you get a statistically significant p-value — and other intentional or inadvertent post hoc tinkering that may lead to false positives.
This replication project followed a registered reports model. Getting a clear picture of what the original authors did was the biggest hurdle that replicators faced, project coordinator Johanna Cohoon told me, so adopting more transparent methods could make future replication efforts easier.
While some people may look at the high rate of failure to replicate in this study and wonder whether psychology is bunk, the researchers I spoke with took a different view. “Good colleagues that I respect a lot think the sky is falling down, but I don’t agree,” said Elizabeth Gilbert, a University of Virginia graduate student who conducted one of the replication studies that failed to reproduce the original result. Some published results probably are false, she said, but in many cases the original study may simply have overestimated the effect size, or the phenomenon may emerge only under particular conditions that haven’t yet been identified. Digging into these problems doesn’t weaken psychology, she said; it strengthens it. “I’m really proud of our field for trying to push forward to make our science better.”