At first blush, the studies look reasonable enough. Low-intensity stretching seems to reduce muscle soreness. Beta-alanine supplements may boost performance in water polo players. Isokinetic strength training could improve swing kinematics in golfers. Foam rollers can reduce muscle soreness after exercise.
The problem: All of these studies shared a statistical analysis method unique to sports science. And that method is severely flawed.
The method is called magnitude-based inference, or MBI. Its creator, Will Hopkins, is a New Zealand exercise physiologist with decades of experience — experience that he has harnessed to push his methodology into the sports science mainstream. The methodology allows researchers to find effects more easily compared with traditional statistics, but the way in which it is conducted undermines the credibility of these results. That MBI has persisted as long as it has points to some of science’s vulnerabilities — and to how science can correct itself.
MBI was created to address an important problem. Science is hard, and sports science is particularly so. If you want to study, say, whether a sports drink or training method can improve athletic performance, you have to recruit a bunch of volunteers and convince them to come into the lab for a battery of time- and energy-intensive tests. These studies require engaged and, in many cases, highly fit athletes who are willing to disrupt their lives and normal training schedules to take part. As a result, it’s not unusual for a treatment to be tested on fewer than 10 people. Those small samples make it extremely difficult to distinguish the signal from the noise and even harder to detect the kind of small benefits that in sport could mean the difference between a gold medal and no medal at all.
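The small-sample problem can be made concrete with a rough power calculation. The sketch below is an illustration, not something taken from the studies discussed here. It assumes a two-group comparison with a known standard deviation of 1, a true benefit of 0.2 standard deviations, and a conventional two-sided test at the 5 percent level. The function name and all of the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def power_two_group(n_per_group, true_effect=0.2, alpha=0.05):
    """Chance that a two-sided z-test flags a real effect.

    Illustrative assumptions: two equal groups, outcomes with a known
    SD of 1, and a true difference of `true_effect` SDs.
    """
    # Standard error of the difference between two group means.
    se = sqrt(2 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    # Probability that the test statistic lands in either rejection region.
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


for n in (10, 50, 400):
    print(n, round(power_two_group(n), 2))
```

Under these assumptions, a study with 10 people per group would detect the effect less than one time in 10, and it would take roughly 400 per group to detect it about four times in five.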
Hopkins’s workaround for all of this, MBI, has no sound theoretical basis. It is an amalgam of two statistical approaches (frequentist and Bayesian) and relies on opaque formulas embedded in Excel spreadsheets into which researchers can input their data. The spreadsheets then calculate whether an observed effect is likely to be beneficial, trivial or harmful and use statistical calculations such as confidence intervals and effect sizes to produce probabilistic statements about a set of results.
In doing so, those spreadsheets often find effects where traditional statistical methods don’t. Hopkins views this as a benefit because it means that more studies turn up positive findings worth publishing. But others see it as a threat to sports science’s integrity because it increases the chances that those findings aren’t real.
A 2016 paper by Hopkins and collaborator Alan Batterham makes the case that MBI is superior to the standard statistical methods used in the field. But I’ve run it by about a half-dozen statisticians, and each has dismissed the pair’s conclusions and the MBI method as invalid. “It’s basically a math trick that bears no relationship to the real world,” said Andrew Vickers, a statistician at Memorial Sloan Kettering Cancer Center. “It gives the appearance of mathematical rigor,” he said, by inappropriately combining two forms of statistical analysis using a mathematical oversimplification.
When I sent the paper to Kristin Sainani, a statistician at Stanford University, she got so riled up that she wrote a paper in Medicine & Science in Sports & Exercise (MSSE) outlining the problems with MBI. Sainani ran simulations showing that what MBI really does is lower the standard of evidence and increase the false positive rate. She details how this works in a 50-minute video; the chart below shows how these flaws play out in practice.
To highlight Sainani’s findings, MSSE commissioned an accompanying editorial, written by biostatistician Doug Everett, that said MBI is flawed and should be abandoned. Hopkins and his colleagues have yet to provide a sound theoretical basis for MBI, Everett told me. “I almost get the sense that this is a cult. The method has a loyal following in the sports and exercise science community, but that’s the only place that’s adopted it. The fact that it’s not accepted by the wider statistics community means something.”
How did this problematic method take hold among the sports science research community? In a perfect world, science would proceed as a dispassionate enterprise, marching toward truth and more concerned with what is right than with who is offering the theories. But scientists are human, and their passions, egos, loyalties and biases inevitably shape the way they do their work. The history of MBI demonstrates how forceful personalities with alluring ideas can muscle their way onto the stage.
The first explanation of MBI in the scientific literature came in a 2006 commentary that Hopkins and Batterham published in the International Journal of Sports Physiology and Performance. Two years later, it was rebutted in the same journal, when two statisticians said MBI “lacks a proper theoretical foundation” within the common, frequentist approach to statistics.
But Batterham and Hopkins were back in the late 2000s, when editors at Medicine & Science in Sports & Exercise (the flagship journal of the American College of Sports Medicine) invited them and two others to create a set of statistical guidelines for the journal. The guidelines recommended MBI (among other things), but the nine peer reviewers failed to reach a unanimous decision to accept the guidelines. Andrew Young, then editor in chief of MSSE, told me that their concerns weren’t only about MBI — some reviewers “felt the recommendations were too rigid and would be interpreted as rules for authors” — but “all reviewers expressed some concerns that MBI was controversial and not yet accepted by mainstream statistical folks.”
Young published the group’s guidelines as an invited commentary with an editor’s note disclosing that although most of the reviewers recommended publication of the article, “there remain several specific aspects of the discussion on which authors and reviewers strongly disagreed.” (In fact, three reviewers objected to publishing them at all.)
Hopkins and Batterham continued to press their case from there. After Australian statisticians Alan Welsh and Emma Knight published an analysis of MBI in MSSE in 2014 concluding that the method was invalid and should not be used, Hopkins and Batterham responded with a post at Sportsci.org, “Magnitude-Based Inference Under Attack.” They then wrote a paper contending that “MBI is a trustworthy, nuanced alternative” to the standard method of statistical analysis, null-hypothesis significance testing. That paper was rejected by MSSE. (“I put it down to two things,” Hopkins told me of MBI critics. “Just plain ignorance and stupidity.”) Undeterred, Hopkins submitted it to the journal Sports Medicine and said he “groomed” potential peer reviewers in advance by contacting them and encouraging them to “give it an honest appraisal.” The journal published it in 2016.
Which brings us to the last year of drama, which has featured a preprint on SportRxiv criticizing MBI, Sainani’s paper and more responses from Batterham and Hopkins, who dispute Sainani’s calculations and conclusions in a response at Sportsci.org titled “The Vindication of Magnitude-Based Inference.”
Has all this back and forth given you whiplash? The papers themselves probably won’t help. They’re mostly technical and difficult to follow without a deep understanding of statistics. And like researchers in many other fields, most sports scientists don’t receive extensive training in stats and may not have the background to fully assess the arguments getting tossed around here. Which means the debate largely turns on tribalism. Whom are you going to believe? A bunch of statisticians from outside the field, or a well-established giant from within it?
For a while, Hopkins seemed to have the upper hand. That 2009 MSSE commentary touting MBI that was published despite reviewers’ objections has been cited more than 2,500 times, and many papers have used it as evidence for the MBI approach. Hopkins gives MBI seminars, and Victoria University offers an Applied Sports Statistics unit developed by Hopkins that has been endorsed by the British Association of Sport and Exercise Sciences and Exercise & Sports Science Australia.
“Will is a very enthusiastic man. He’s semi-retired and a lot older than most of the people he’s dealing with,” Knight said. She wrote her critique of MBI after becoming frustrated with researchers at the Australian Institute of Sport (where she worked at the time) coming to her with MBI spreadsheets. “They all very much believed in it, but nobody could explain it.”
These researchers believed in the spreadsheets because they believed in Hopkins — a respected physiologist who speaks with great confidence. He sells his method by highlighting the weaknesses of p-values and then promising that MBI can direct them to the things that really matter. “If you have very small sample sizes, it’s almost impossible to find statistical significance, but that doesn’t mean the effect isn’t there,” said Eric Drinkwater, a sports scientist at Deakin University in Australia who studied for his Ph.D. under Hopkins. “Will taught me about a better way,” he said. “It’s not about finding statistical significance — it’s about the magnitude of the change and is the effect a meaningful result.” (Drinkwater also said he is “prepared to accept that this is a controversial issue” — and perhaps will go with traditional measures such as confidence limits and effect sizes rather than using MBI.)
It’s easy to see MBI’s appeal beyond Hopkins, too. It promises to do the impossible: detect small effects in small sample sizes. Hopkins points to legitimate discussions about the limits of null-hypothesis significance testing as evidence that MBI is better. But this selling point is a sleight of hand. The fundamental problem it’s trying to tackle — gleaning meaningful information from studies with noisy and limited data sets — can’t be solved with new statistics. Although MBI does appear to extract more information from tiny studies, it does this by lowering the standard of evidence.
That’s not a healthy way to do science, Everett said. “Don’t you want it to be right? To call this ‘gaming the system’ is harsh, but that’s almost what it seems like.”
Sainani wonders, what’s the point? “Does just meeting a criteria such as ‘there’s some chance this thing works’ represent a standard we ever want to be using in science? Why do a study at all if this is the bar?”
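To see how a looser bar changes the error rate, consider a simplified decision rule of the “some chance this works” kind. The sketch below is neither Hopkins’s spreadsheet nor Sainani’s simulation. It is a hypothetical stand-in that assumes a two-group study with a known standard deviation of 1, treats the sampling distribution of the observed difference as if it were a probability distribution for the true effect, and calls a treatment beneficial when the apparent chance of benefit exceeds 25 percent and the apparent chance of harm is below 0.5 percent. Those thresholds, the 0.2 “smallest important effect” and every name in the code are illustrative assumptions. Because the model is this simple, the false positive rate (the chance of declaring a benefit when the true effect is zero) can be computed exactly rather than simulated.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def lenient_false_positive_rate(n_per_group, smallest_effect=0.2,
                                min_benefit=0.25, max_harm=0.005):
    """False positive rate of a hypothetical 'some chance of benefit' rule.

    Illustrative assumptions: two equal groups, known SD of 1, true effect
    of zero, and thresholds chosen only for the sake of the example.
    """
    se = sqrt(2 / n_per_group)  # standard error of the observed difference
    # Apparent chance of benefit exceeds min_benefit once the observed
    # difference is above this cutoff.
    cutoff_benefit = smallest_effect - se * STD_NORMAL.inv_cdf(1 - min_benefit)
    # Apparent chance of harm falls below max_harm once the observed
    # difference is above this cutoff.
    cutoff_harm = -smallest_effect - se * STD_NORMAL.inv_cdf(max_harm)
    # Both conditions must hold, so the larger cutoff is the binding one.
    cutoff = max(cutoff_benefit, cutoff_harm)
    # With a true effect of zero, the observed difference is N(0, se).
    return 1 - STD_NORMAL.cdf(cutoff / se)


def conventional_false_positive_rate(alpha=0.05):
    """False claims of benefit allowed by a two-sided test at level alpha."""
    return alpha / 2


for n in (10, 50, 200):
    print(n, round(lenient_false_positive_rate(n), 3),
          conventional_false_positive_rate())
```

Under these assumptions the error rate is not fixed. It shifts with sample size, and at 50 people per group it comes out at more than twice the 2.5 percent that a conventional two-sided test allows for false claims of benefit.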
Even without statistical issues, sports science faces a reliability problem. A 2017 paper published in the International Journal of Sports Physiology and Performance pointed to inadequate validation that surrogate outcomes really reflect what they’re meant to measure, a dearth of longitudinal and replication studies, the limited reporting of null or trivial results, and insufficient scientific transparency as other problems threatening the field’s reliability and validity.
All the back-and-forth arguments about error rate calculations distract from even more important issues, said Andrew Gelman, a statistician at Columbia University who said he agrees with Sainani that the paper claiming MBI’s validity “does not make sense.” “Scientists should be spending more time collecting good data and reporting their raw results for all to see and less time trying to come up with methods for extracting a spurious certainty out of noisy data.” To do that, sports scientists could work collectively to pool their resources, as psychology researchers have done, or find some other way to increase their sample sizes.
Until they do that, they will be engaged in an impossible task. There’s only so much information you can glean from a tiny sample.
CORRECTION (May 16, 2018, 12:30 p.m.): A previous version of this article said Hopkins submitted his defense of MBI to Sports Science. That paper was submitted to the journal Sports Medicine, which published it in 2016.