Stats Can’t Tell Us Whether Mike Trout Or Josh Donaldson Should Be MVP

As MLB award season arrives, no prize looms larger than Most Valuable Player. From a statistical perspective, the best guide to the MVP award is undoubtedly wins above replacement, and some voters build their MVP ballots at least in part on WAR. Going just by WAR, the American League MVP would go to the Blue Jays’ Josh Donaldson (7.1 WAR) over the Angels’ Mike Trout (6.8 WAR), and in the National League, Bryce Harper (8.0 WAR) of the Nationals looks like an obvious choice. (All stats are current through Sept. 13.)

The problem is that we don’t know who truly has the most WAR in each league.

WAR looks like a single easy-to-understand stat, but it’s the product of a complex model. That model integrates information on all the ways a player provides value: his hitting, fielding, baserunning and (for pitchers) pitching. The WAR you find at, say, FanGraphs or Baseball-Reference.com is an estimate of all those categories combined.

However, like all statistical estimates, WAR calculations come with uncertainty. Here’s where things get pretty statsy, so bear with me: You’re about to get a crash course in confidence intervals. The true value of a player varies from what you find on the leaderboard — but we’re not sure by how much. That’s because of sample size. Although a whole season of baseball seems like a lot, it still doesn’t provide enough data to allow us to be completely sure of each player’s value. So Harper’s 8 WAR could be 6 or it could be 10, but the number on the leaderboard represents our best guess.

When the uncertainty about a player is small, we can be more sure that the player who looks like the best really is the best. If the uncertainty increases, though, we become less able to distinguish his performance from those of his competitors. Trying to determine the magnitude of this uncertainty is tricky, but it’s an important part of good statistical practice.

Confidence intervals help us establish how uncertain we are about our measurement. A confidence interval is a range that, based on statistical analysis, is thought to contain the true value of a player a certain percentage of the time.¹ Building one requires knowing exactly how a WAR model works, but the creators of only one model have made their full methodology public. Their model is, appropriately, called openWAR. I used openWAR to generate 1,000 fictional seasons that resemble the current year² but randomize the events in it.

Imagine that a player’s season (hitting, fielding and baserunning) is made up of just four plays, which we number 1 through 4. When we run the simulations, all of a player’s plays are randomized, so sometimes a season consists of plays 1 (a line drive the player hits for a single), 2 (a ball he commits an error on in the field), 3 (a stolen base) and 1 again; sometimes 1, 2, 3 and 3, etc. Each fictional season arises by picking at random from the plays that have happened, allowing the same play to be picked twice or more. From that, we can ask how often, in these imaginary seasons, a given player produced more or less WAR than it appears he did in the season we’ve just lived through in real life.³
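To make the resampling idea concrete, here is a minimal Python sketch of the four-play example. It is not openWAR's code: the run values attached to each play, the function names and the fixed random seed are all made up for illustration, and the interval is a simple percentile cut of the simulated totals.

```python
import random

def bootstrap_totals(play_values, n_seasons=1000, seed=0):
    """Build fictional seasons by drawing plays at random, with replacement.

    Each fictional season has as many plays as the real one, so the same
    play can show up twice or more, or not at all.
    """
    rng = random.Random(seed)
    n_plays = len(play_values)
    return [sum(rng.choices(play_values, k=n_plays)) for _ in range(n_seasons)]

def percentile_interval(totals, level=0.95):
    """Approximate interval holding the middle `level` share of the totals."""
    ordered = sorted(totals)
    tail = (1 - level) / 2
    low = ordered[int(tail * (len(ordered) - 1))]
    high = ordered[int((1 - tail) * (len(ordered) - 1))]
    return low, high

# Hypothetical values for plays 1 through 4: a single, an error,
# a stolen base and a neutral play. These numbers are assumptions.
plays = [0.5, -0.4, 0.2, 0.0]

totals = bootstrap_totals(plays)
low, high = percentile_interval(totals)
print(f"95 percent of fictional seasons fall between {low:.2f} and {high:.2f}")
```

With only four plays the interval is very wide relative to the season total; a real season has thousands of plays, which narrows the range but does not eliminate it.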

Here’s what those simulations produced:

In 53.5 percent of the fictional seasons, Donaldson exceeds Trout. But in the other 46.5 percent, Trout betters Donaldson. With such an even split, it’s difficult to say for sure whether Trout or Donaldson has contributed more to his team’s success. If we use WAR as our guide, we have little basis to determine which player deserves the MVP more. The confidence interval on Donaldson’s WAR is simply too large to be sure that he’s the more valuable of the two. That’s not to say that Donaldson isn’t our best guess, only that we are not confident in that guess; Trout is a perfectly defensible choice as well.
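The head-to-head percentages come from counting, season by season, how often one player's simulated WAR lands above the other's. A rough sketch of that tally follows. The simulated values here are stand-in draws from a bell curve with an assumed spread of 1 WAR, not openWAR output, so the share it prints should not be expected to match the 53.5 percent figure.

```python
import random

def share_ahead(sims_a, sims_b):
    """Fraction of paired fictional seasons in which player A tops player B."""
    if len(sims_a) != len(sims_b):
        raise ValueError("both players need the same number of simulated seasons")
    wins = sum(a > b for a, b in zip(sims_a, sims_b))
    return wins / len(sims_a)

rng = random.Random(42)
ASSUMED_SD = 1.0  # assumption: the article does not report the spread

# Centers come from the leaderboard numbers in the article; the draws
# themselves are illustrative stand-ins for real simulated seasons.
donaldson = [rng.gauss(7.1, ASSUMED_SD) for _ in range(1000)]
trout = [rng.gauss(6.8, ASSUMED_SD) for _ in range(1000)]

share = share_ahead(donaldson, trout)
print(f"Donaldson finishes ahead in {share:.1%} of fictional seasons")
```

The closer that share sits to 50 percent, the less the leaderboard order tells us about who was actually more valuable.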

In the NL, the gap is more pronounced between the WAR leader and the runner-up. Harper’s sublime offensive season has earned him the most WAR (8.0) in all of baseball. Meanwhile, the Mets’ Yoenis Cespedes comes in second, with a relatively paltry total of 7.1 WAR. Yet, despite the larger gap between Cespedes and Harper than between Trout and Donaldson, in 33.6 percent of the simulated seasons, Cespedes’s estimated WAR exceeds Harper’s.

Much of the uncertainty in WAR comes from our imperfect measurements of defense. Because the vast majority of defensive plays are routine, a player’s true defensive skill can be seen only on the few plays that are between the impossible and the everyday. A given player might see only about 100 such plays per season. Even in that subset of plays, the difference between a Web Gem and a hit can be as little as a couple of feet, the result of a lucky step or the wind changing at the right moment.

Even if we had much more data on defense than we do, our best tools for measuring it are falling behind front-office strategy. We watch baseball in the age of the defensive shift; teams are becoming more and more savvy about positioning both their infield and outfield to maximize the chance of getting outs. Without detailed data on how defenders are positioned before the play starts, most of our metrics are confounded by the front office’s ability to instruct its players on where to stand.

WAR itself is not complete. Although all versions of WAR available today cover the basics of player value (hitting, fielding, baserunning and pitching),⁴ no current version picks up on some of the more esoteric skills in baseball, such as a catcher’s pitch framing. Framing can contribute up to 3 wins to good catchers, so any MVP debate guided by WAR will underrate players like Buster Posey who contribute substantially in that arena.

Despite all of its flaws, WAR is still the best available tool for judging value, and it certainly beats older alternatives such as RBIs and pitcher wins. At a minimum, WAR can tell us who the MVP isn’t. For example, we know Matt Kemp’s 2.0 WAR isn’t likely to best Harper’s even with all the uncertainty baked into WAR.⁵ But by old-school metrics, Kemp’s 94 RBIs as of Sept. 13 exceeded Harper’s 85. Because of sabermetrics, we know that gap has more to do with luck and context than value.

But like any tool, WAR has limitations. After a full season of baseball, our best measurement can’t tell us who the single most valuable player in the league is with any substantial degree of confidence. It could be Harper, or Cespedes, or potentially even Joey Votto. In the absence of definitive evidence, let the debate continue.

Footnotes

¹ For example, a 95 percent confidence interval is thought to include the true value 95 percent of the time.
² They are sampled from the current year, with replacement.
³ Again, all numbers are current through Sept. 13.
⁴ It is also worth considering that the different versions of WAR differ slightly in terms of how they measure and value these skills. For example, openWAR rewards contextual performance more than other versions of WAR, which arguably makes it more suitable for judging the MVP race. Although all versions agree in broad strokes and tend to give rise to similar leaderboards, differences between models can generate significant gaps between players (up to 2 to 3 WAR).
⁵ Such a scenario unfolds only three times in 1,000 seasons.

FiveThirtyEight
