Psychology’s Replication Crisis Has Made The Field Better

In 2012, psychologists Will Gervais and Ara Norenzayan published a paper in the journal Science reporting a series of experiments that suggested engaging in analytical thinking could reduce someone’s religious belief. It sounded vaguely plausible, but five years later, another group of researchers attempted to replicate the finding.¹ They used a sample size about two and a half times larger and found no evidence that analytic thinking caused a decrease in religious belief.

“Is it fun to find out that a study you published in a high profile outlet back in the day does not hold up well to more rigorous scrutiny? Oh hell no,” Gervais wrote in a blog post. He recommended that other researchers avoid the experience by using rigorous methods upfront. “CHECK YOURSELF BEFORE YOU WRECK YOURSELF while being open to REVISING YOUR BELIEFS.”

Being open to revising beliefs in the face of new evidence is, of course, a central tenet of the scientific enterprise. But research is done by fallible human beings who don’t always live up to scientific principles. When psychology first entered a period of upheaval commonly referred to as the “replication crisis,” not everyone in the field shared Gervais’s openness to updating. But as the field has reckoned with replication issues, it has been forced to follow its own rules more closely.

The replication crisis arose from a series of events that began around 2011, the year that social scientists Uri Simonsohn, Leif Nelson and Joseph Simmons published a paper, “False-Positive Psychology,” that used then-standard methods to show that simply listening to the Beatles song “When I’m Sixty-Four” could make someone younger. It was an absurd finding, and that was the point. The paper highlighted the dangers of p-hacking — adjusting the parameters of an analysis until you get a statistically significant p-value (a difficult-to-understand number often misused to imply a finding couldn’t have happened by chance) — and other subtle or not-so-subtle ways that researchers could tip the scales to produce a favorable result. Around the same time, other researchers were reporting that some of psychology’s most famous findings, such as the idea that “priming” people by presenting them with stereotypes about elderly people made them walk at a slower pace, were not reproducible.
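To make the p-hacking problem concrete, here is a minimal simulation sketch in Python. It is an illustration only, not the analysis from “False-Positive Psychology.” It assumes a simplified setup: two groups with no real difference between them, outcome measures that are independent and normally distributed with a known standard deviation of 1, and a researcher who measures several outcomes and reports whichever one clears p < 0.05. The function names and default values are hypothetical.

```python
import math
import random

def two_sided_p(group_a, group_b):
    """Two-sided p-value from a z-test, assuming equal group sizes and known SD of 1."""
    n = len(group_a)
    diff = sum(group_a) / n - sum(group_b) / n
    z = diff / math.sqrt(2.0 / n)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    return math.erfc(abs(z) / math.sqrt(2.0))

def false_positive_rate(n_outcomes, n_per_group=20, n_studies=2000, seed=0):
    """Share of simulated null studies in which at least one outcome reaches p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: there is no true effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(two_sided_p(a, b))
        # The "p-hacked" study counts as a success if any outcome is significant.
        if min(p_values) < 0.05:
            hits += 1
    return hits / n_studies

print("one outcome, chosen in advance:", false_positive_rate(1))
print("best of five outcomes:", false_positive_rate(5))
```

Under these assumptions, a single outcome chosen in advance should produce a false positive about 5 percent of the time, while picking the best of five independent outcomes should do so roughly 1 - 0.95^5 of the time, or about 23 percent.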

A lot has happened since then. I’ve been covering psychology’s replication problem for FiveThirtyEight since 2015, and in that time, I’ve seen a culture change. “If a team of research psychologists were to emerge today from a 7-year hibernation, they would not recognize their field,” Nelson and his colleagues wrote in the journal Annual Reviews last year. What has changed? Authors are voluntarily posting their data, replication attempts are published in top journals, and researchers are increasing their sample sizes and committing to data collection and analysis plans in advance.

An approach called preregistration helps to prevent p-hacking and other flexible analysis practices by forcing researchers to commit to a methods-and-analysis plan in advance. Instead of playing around with different ways of parsing the data to find what they’re looking for, researchers have to specify (and justify) how they’ll analyze the data before they collect it. If the results aren’t what they were hoping for, so be it. “Preregistration is not a magic bullet, but it’s pretty close if you’re worried about p-hacking,” Simonsohn said.

In a related approach called registered reports, journals assess preregistrations and accept (or reject) a research group’s paper based on the rigor of its methods and analysis plan, rather than on what the researchers found. More than 140 journals now use registered reports in some form, up from zero just six years ago, and at least 150 registered reports have been published so far. An analysis released on the PsyArXiv repository in October looked at more than 100 registered reports and found that overall 61 percent of them were null results, meaning they didn’t support the hypothesis being tested. That’s “in stark contrast to the estimated 5-20% of null findings in the traditional literature,” the researchers wrote. Although that suggests that registered reports might cut down on the bias toward publishing only positive results, it’s only one analysis and far from definitive.

Along with a renewed sense of openness — researchers now regularly make their data available for others to assess — has come what some psychologists have dubbed a “cooperative revolution” and the advent of a new type of study: large-scale collaborations across multiple labs and multiple countries. An example of this is the Psychological Science Accelerator, which runs studies simultaneously at labs around the globe to produce bigger data sets with a more diverse pool of study subjects than if a study were done in just one place.

Large-scale replication projects like Estimating the Reproducibility of Psychological Science, ManyLabs and the Reproducibility Project: Cancer Biology are helping researchers identify sources of irreproducibility and improve their research methods. One of the latest such studies showed that so-called “hidden moderators” — unidentified differences between how various labs conducted a study or differences between study populations — were less important than some researchers had argued. The large-scale replication project published over the summer that failed to replicate Gervais’s study also suggested that statistical power (whether a study had a sample size large enough to detect an effect if it were there) was an important factor in predicting whether a study could be reproduced, lending support to the idea that small studies that report statistically significant results are prone to being wrong.
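For a rough sense of how sample size drives statistical power, the sketch below uses a normal approximation for a two-sided, two-group comparison. Everything in it is an assumption made for illustration rather than a figure from the replication projects: the true effect of 0.3 standard deviations, the 0.05 significance threshold, the group sizes and the hypothetical function name `approx_power`.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the true difference between group means in SD units;
    both groups are assumed to have n_per_group subjects and a known SD.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the z-statistic when the effect is real.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability the observed z lands beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (20, 50, 200):
    print(n, "per group:", round(approx_power(0.3, n), 2))
```

Under these assumptions, a study with 20 people per group would detect the effect only about 16 percent of the time, and one with 50 per group about 32 percent of the time. When power is that low, the few small studies that do reach significance are more likely to be flukes or to overstate the effect.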

Although there’s been an uptick in collaborations like these, it hasn’t all been smooth sailing. Back when he was working on the first large-scale replication project, Brian Nosek received anonymous emails asking him not to do the project. Nosek is a psychologist at the University of Virginia and co-founder of the Center for Open Science, which provides tools and infrastructure for researchers to preregister studies and share their data openly with one another. “We had some people respond with hostility,” he said. Nosek didn’t receive any such pushback during his group’s most recent replication studies, but there are still some researchers who aren’t happy with the new emphasis on methodology and the public airing of problems within the field. It’s human nature to feel defensive when your work is called into question, and there’s been disagreement, some of it aired on social media, over where to draw the line between constructive criticism and meanness.

“One of the most important changes I’ve seen in the field as a collective has been that it’s become more self-reflective,” said Farid Anvari, a psychology postdoctoral researcher at Eindhoven University of Technology in the Netherlands. “People are asking the question: How can we do better?”

These kinds of norm changes are inching the field forward, and it’s not surprising that there has been some conflict along the way, said Katie Corker, a psychologist at Grand Valley State University and a member of the Society for the Improvement of Psychological Science. “The open science movement is also a social/political movement,” she said. “Changing norms and incentives means changing the structure of how people do their work and get ahead in their jobs. We are changing the process of how science is done, which is no small thing.” Nosek said he’s seeing a real sense of intellectual humility. “We’re not here to be right,” he said. “We’re here to get it right.”

It’s too soon to know for sure whether the changes that have happened so far have made the science more reliable, said Alison Ledgerwood, a psychologist at the University of California, Davis, and the founder of PsychMAP, a Facebook group where researchers discuss methodology. Still, she’s optimistic: “My utopia is that we come to see science as a process, not an answer.”

Footnotes

¹ As part of the large-scale reproducibility project that included this replication, researchers also asked others in the field to forecast which studies would replicate. The forecasters correctly predicted that the Gervais and Norenzayan study would fail to replicate.
