How To Tell Good Studies From Bad? Bet On Them

A few years ago, psychologist Brian Nosek and two graduate students, Jeffrey Spies and Matt Motyl, conducted an experiment that measured people’s ability to differentiate shades of gray across a gradient. Nearly 2,000 participants of various political bents took part, and “the results were stunning,” the researchers wrote in a 2012 paper about the project.

Moderates more accurately perceived the various shades of gray than did extremists on either the left or right of the political spectrum. An analysis showed that the differences they found were statistically significant, with a p-value of 0.01,¹ and the researchers concluded that “political extremists perceive the world in black and white figuratively and literally.”

Motyl was about to enter the job market, and he now had a headline-worthy result to boost his chances of landing a career-advancing position. But before he claimed fame and glory, he and his colleagues paused to consider the possibility that their result was wrong.

With a p-value that low, the result hardly screamed “false positive” like a barely significant one of, say, 0.05 might, but they decided to take the unusual step of repeating the experiment while they wrote up the study for publication. They’d been thinking about the issue of reproducibility (the question of whether scientific results will stand up to further studies), and their methodology allowed them to easily collect data online. In the second trial, the results evaporated, and the researchers published the experience as a cautionary tale. Nosek and Spies went on to found the Center for Open Science to promote transparency and reproducibility in science.

Although replication is essential for verifying results, the current scientific culture does little to encourage it in most fields. That’s a problem because it means that misleading scientific results, like those from the “shades of gray” study, could be common in the scientific literature. Indeed, a 2005 study claimed that most published research findings are false.

There’s a growing recognition of the problem, but replication takes time and money, and with funding for science at a premium, there’s an urgent need to prioritize.

Today in the Proceedings of the National Academy of Sciences, Nosek and an international team of researchers present a tool for doing that — betting. They found that compared to simply asking experts to predict the likelihood that studies will be reproduced, asking them to bet money on the outcomes improved the accuracy of the guesses.

The researchers began by selecting some studies slated for replication in the Reproducibility Project: Psychology — a project that aimed to reproduce 100 studies published in three high-profile psychology journals in 2008. They then recruited psychology researchers to take part in two prediction markets. These are the same types of markets that people use to bet on who’s going to be president. In this case, though, researchers were betting on whether a study would replicate or not.

Before each prediction market began, participants (47 actively took part in the first market, 45 traded in the second) were asked two questions: How likely do you think it is that each hypothesis in this market will be replicated, and how well do you know this topic?

They were then given 10,000 points each, worth a total of $100, to bet on whether the studies in their prediction market would replicate. A replication was considered successful if it produced a result, with a p-value of less than 0.05, in the same direction as the original result. Players could buy and sell contracts for each hypothesis. If a replication succeeded, then its contract paid 100, but if the replication failed, then it paid nothing. “If you believe the result will be replicated, you buy the contract, which increases the price,” said the study’s lead author, Anna Dreber, an economist at the Stockholm School of Economics. “If you don’t believe in a study, then you can short-sell it.”
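The payoff rule can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the example belief and price are assumptions, not figures from the study, and it ignores the way a trader's own purchase would move the price.

```python
def expected_profit_per_contract(belief: float, price: float, payout: float = 100.0) -> float:
    """Expected profit from buying one contract at `price`.

    belief: the trader's subjective probability (0 to 1) that the study replicates.
    price:  the current contract price, on the same 0-100 scale as the payout.
    payout: what the contract pays if the replication succeeds (100 in the article);
            a failed replication pays nothing.
    """
    return belief * payout - price


# Hypothetical trader who thinks a study has an 80 percent chance of replicating
# while its contract trades at 60.
buy_edge = expected_profit_per_contract(belief=0.80, price=60.0)

# Short-selling is the mirror-image bet, so its expected profit is the negative.
short_edge = -buy_edge

print(buy_edge, short_edge)
```

A trader whose belief is higher than the price expects to gain by buying, and one whose belief is lower expects to gain by short-selling, which is why trading should nudge the price toward what participants collectively believe.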

A study’s final share price when the market closed was akin to an estimated probability that the study would replicate successfully, Nosek said. “A price of 75 indicates that the market perceived a 75 percent likelihood of replication success,” he said.

The prediction market correctly called nearly three-quarters (71 percent) of the attempted replications, 39 percent of which succeeded in the reproducibility project. By comparison, the survey conducted before the market began correctly predicted the result of only 58 percent of the replication studies. The prediction market anticipated a finding’s reproducibility better than asking the same bunch of experts to put their best guesses in a hat.
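One way to score forecasts like these is to treat a final price (or a survey average) above 50 as a call that the study will replicate, then count how often that call matches the outcome. The 50-point cutoff, the function name and the numbers below are assumptions made up for illustration; they are not the study's data or necessarily its scoring method.

```python
def share_called_correctly(forecasts, outcomes, cutoff=50.0):
    """Fraction of studies where 'forecast above cutoff' matches the actual outcome."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    hits = sum((f > cutoff) == replicated for f, replicated in zip(forecasts, outcomes))
    return hits / len(forecasts)


# Made-up final prices (0-100) and replication outcomes for five hypothetical studies.
final_prices = [75.0, 30.0, 62.0, 45.0, 20.0]
replicated = [True, False, False, False, False]

accuracy = share_called_correctly(final_prices, replicated)
print(f"{accuracy:.0%} of replications called correctly")
```

The same function could be applied to the pre-market survey answers, which would put the two forecasting methods on a common scale.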

Why? “The beauty of the market is that we allow people to be Bayesian,” Dreber said. People come in with some prior belief, but they can also follow prices to see what other people believe and may update their beliefs accordingly. While the survey required everyone to provide an estimate for every study, participants in the market could focus their bets on the studies they felt most sure of, and as a result, rough guesses didn’t skew the averages as much. Finally, putting money at stake motivated people to try harder to find the right answer and reveal what they really think. “It’s really putting your money where your mouth is,” Dreber said. “You want to see what people do, not what they say.”

Prediction markets can’t provide a final word on a scientific study’s validity, but they might provide a useful gut-check to identify studies that are unlikely to hold up. And that could help researchers and funders prioritize where to focus replication efforts, Nosek said.

The current study is a nice demonstration of the market concept, said George Mason University economist Robin Hanson, whose 25-year-old paper “Could Gambling Save Science?” inspired the current project. But for prediction markets to get a foothold and make a difference in science, they’ll need to be implemented in a way that gets scientists’ attention, he said. “It has to connect back to the things that academics care about — jobs, publications and grants,” he said.

For example, a journal might decide to use markets to vet studies before publication. “If the market says, ‘Yeah, that’s cute, but that’s probably bogus,’ then that’s probably not something you should publish,” Hanson said. Since it’s too costly and complicated to replicate every study, another approach might be to enter 100 studies into a prediction market and then select 10 at random to replicate. Presumably, the threat of a replication would create an incentive for researchers to be more careful.

Read more: Science Isn’t Broken

Footnotes

¹ A p-value is the probability of getting a result at least as extreme as the one you saw if there is no real effect (that is, if the null hypothesis is true). By convention, a p-value less than or equal to 0.05 is considered “statistically significant” in psychology. All else being equal, a lower p-value is stronger evidence against the null hypothesis.
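As a rough illustration of how a p-value is calculated, the sketch below works out an exact one-sided p-value for a made-up coin-flipping experiment, in which "no real effect" means the coin is fair. The function name and the numbers are hypothetical and have nothing to do with the studies discussed in the article.

```python
from math import comb


def one_sided_binomial_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """Probability of at least `successes` in `trials`, assuming the null hypothesis holds."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: 9 heads in 10 flips of a coin assumed fair under the null.
p_value = one_sided_binomial_p_value(successes=9, trials=10)
print(round(p_value, 4))
```

Here the p-value comes out just above 0.01, so a result this lopsided would be rare, though not impossible, if the coin really were fair.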

FiveThirtyEight
