The Imperfect Pursuit of a Perfect Baseball Forecast

As you navigate through the sports media this week, chances are you’ll see at least one set of predictions for the upcoming Major League Baseball season. This preseason forecasting game was being played as far back as 1929, and the prediction options fans have at their disposal today are more plentiful — and sophisticated — than ever before.

But they can never be perfect. There’s a statistical limit to how accurate any projection about a team can be in the long run. Years ago, sabermetrician Tom Tango researched the amount of talent and luck that go into team winning percentages and found that chance explains one-third of the difference between two teams’ records. That makes it hard to predict how many times a team will win over a season. The smallest possible root-mean-square error (a mathematical way of testing a prediction’s accuracy) for any projection system over an extended period of time is 6.4 wins. In a single season, forecasters can — and do — beat an RMSE of 6.4. But whenever that happens, it’s due to luck. The amount of random variance that goes into team records makes the 6.4 barrier literally impossible to beat over a large number of seasons.¹ Over time, no forecaster’s system can ever do better.
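One way to see where a floor near 6.4 wins could come from is a simple coin-flip model. The sketch below rests on an assumption the article does not spell out: a perfectly average team plays 162 independent games with a 50 percent chance of winning each. Under that assumption, the standard deviation of the win total is the error a forecaster would face even with perfect knowledge of talent. The function names are illustrative, and the streak calculation assumes a 50 percent chance of beating the floor in any one season, which is what the footnoted 1-in-4 figure implies.

```python
import math


def luck_sd(games: int = 162, win_prob: float = 0.5) -> float:
    """Standard deviation of a win total from chance alone (binomial model, an assumption)."""
    return math.sqrt(games * win_prob * (1.0 - win_prob))


def streak_probability(seasons: int, per_season: float = 0.5) -> float:
    """Chance of beating the floor in every one of `seasons` straight seasons,
    assuming independent seasons with the same per-season probability."""
    return per_season ** seasons


print(f"Luck-only standard deviation: {luck_sd():.1f} wins")
print(f"Two straight seasons under the floor: {streak_probability(2):.2f}")
print(f"Eight straight seasons under the floor: {streak_probability(8):.4f}")
```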

But baseball fans still clamor for the projections, and so the cottage industry of preseason forecasts lives on. Modern projections come from three different directions: analytics, expertise and the wisdom of the crowd. All three are more or less equally accurate despite their vastly divergent inner workings (and even motivations), and none gets anywhere close to that 6.4-win mark over multiple seasons. Not that it’s stopped folks from trying.

The sabermetricians who attempt to divine a team’s success in the coming season are relying on algorithms to do most of the heavy lifting. What makes up those algorithms varies. The general idea of a computer projection system such as Baseball Prospectus’ PECOTA² is to take a player’s past performance,³ regress it towards the mean to account for the fact that statistics are an imperfect measurement of talent, and adjust it for aging effects. A simple system that performs these three basic tasks will be practically as accurate as ones with far greater complexity. The competition comes in the margins: PECOTA, for instance, adds the extra wrinkle of basing its aging curves for any given player on the career paths of comparable historical players.
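The three basic tasks described above (start from past performance, regress toward the mean, adjust for age) can be sketched in a few lines. This is a toy illustration, not PECOTA or any published system: the function name, the regression weight, the peak age and the aging rate are all hypothetical values chosen only to make the mechanics visible.

```python
def project_rate(observed_rate: float, playing_time: float, league_rate: float,
                 age: int, regression_weight: float = 500.0,
                 peak_age: int = 27, aging_per_year: float = 0.005) -> float:
    """Toy projection of a rate statistic (all default constants are hypothetical)."""
    # Tasks 1 and 2: blend past performance with the league average.
    # The less playing time behind the observed rate, the harder it is
    # pulled toward the mean.
    regressed = (
        (observed_rate * playing_time + league_rate * regression_weight)
        / (playing_time + regression_weight)
    )
    # Task 3: a crude linear aging adjustment that nudges players younger
    # than the assumed peak up and older players down.
    age_factor = 1.0 + aging_per_year * (peak_age - age)
    return regressed * age_factor


# A hitter at .300 over 500 plate appearances in a .250 league, at ages 27 and 30.
print(round(project_rate(0.300, 500, 0.250, age=27), 3))
print(round(project_rate(0.300, 500, 0.250, age=30), 3))
```

A system like the one the article describes would replace the linear aging term with curves drawn from comparable historical players; the regression step would stay conceptually the same.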

However elaborate the projection model, it should spit out a set of forecasted statistics for each individual player, which can then be fed into team-depth-chart projections to generate win predictions for every team. In PECOTA’s case those predictions have come within an RMSE of 8.9 wins, 2.5 wins away from perfection.⁴
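For readers who want to see how a root-mean-square error is computed, here is a minimal sketch. The three predicted and actual win totals are made up purely for illustration and do not come from PECOTA or any real season; only the 8.9 and 6.4 figures are from the article.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual win totals."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two equal-length, non-empty sequences")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: three teams, predicted vs. actual wins.
example_predicted = [90, 81, 75]
example_actual = [96, 78, 69]
print(f"Example RMSE: {rmse(example_predicted, example_actual):.2f} wins")

# Gap between the article's PECOTA figure and the theoretical floor.
print(f"Gap to the floor: {8.9 - 6.4:.1f} wins")
```

Because the errors are squared before averaging, one badly missed team hurts the score more than several small misses of the same total size.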

Those who trust algorithmic projections say they do so because of the projections’ empiricism: While plenty of simplifications and assumptions are being made throughout the process, those hypotheses are at least applied consistently across every team. Of course, computer projections are far from infallible. Predicting injuries, for example, is a crucial — yet cruelly unscientific — aspect of team forecasting. To a certain extent, durability can be predicted from a player’s prior history, but the degree to which we can reliably forecast injuries is still quite modest. A more qualitative assessment of a player’s current talent can take into account injuries, mechanical changes, managerial whims and other more nuanced factors that are too fine-grained for pure data-processing methods to capture.

These are the types of appraisals potentially more suited to a process like the one employed by Sports Illustrated’s team of preseason forecasters. Every year, SI, the most widely read sports magazine in the United States, produces a baseball preview that reaches more than 3 million readers.

Its baseball editor, Stephen Cannella, said there’s nothing algorithmic about the magazine’s prediction process. “It’s more a combination of statistics, scouting reports and ‘boots on the ground’ reporting,” he said. Essentially, Cannella’s system taps into the wisdom of a very baseball-savvy crowd. He starts with a basic straw poll of his writers and asks them to rank the relative strength of every major league club. Later, Cannella and his team convene to debate the results and come to a consensus about each team’s projection.

Cannella said he does use statistical forecasts like PECOTA as a sanity check once his team finishes constructing its rankings. “Projections are most worthwhile as a comparison,” he said. According to Cannella, this type of synthesis using both the objective and subjective “reflects SI’s whole baseball approach.”

It turns out that approach is just as good as the wholly quantitative one. I evaluated the accuracy of Sports Illustrated’s divisional picks⁵ using a ranked square-error method and found that, since 2005, Sports Illustrated’s forecasts have been — statistically speaking⁶ — no less accurate about where a team finishes in its division than PECOTA or Las Vegas’ over/under win totals.⁷ Cannella’s approach may not be the most rigorously mathematical, but it works.
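The article does not define its ranked square-error method in detail. One plausible reading, offered here only as an assumption, is to compare each team's predicted place in its division with its actual place and average the squared differences. The function name and the five-team division below are hypothetical.

```python
def rank_squared_error(predicted_order, actual_order):
    """Mean squared difference between predicted and actual division ranks.

    Both arguments list the same teams, ordered from first place to last.
    This is one possible interpretation of a ranked square-error score.
    """
    if sorted(predicted_order) != sorted(actual_order):
        raise ValueError("both orders must contain the same teams")
    actual_rank = {team: place for place, team in enumerate(actual_order, start=1)}
    squared = [
        (place - actual_rank[team]) ** 2
        for place, team in enumerate(predicted_order, start=1)
    ]
    return sum(squared) / len(squared)


# Hypothetical division: two adjacent pairs of teams finish in swapped order.
picks = ["A", "B", "C", "D", "E"]
final = ["B", "A", "C", "E", "D"]
print(rank_squared_error(picks, final))
```

A lower score means the picks were closer to the final standings; a perfect set of picks scores zero.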

Speaking of those Vegas over/unders, they’re the most crowdsourced projections we have. Every spring, various sportsbooks release baseline win totals for each team in the upcoming season, inviting bettors to wager on whether a team will win more or fewer games than the oddsmakers think. These over/unders are not technically predictions (their aim is different from PECOTA’s or even Sports Illustrated’s), but they’re hugely predictive because of the stakes involved.

To get a sense of how Vegas sets these over/unders, I spoke with Ed Salmons, head oddsmaker for the Las Vegas Hotel & Casino, which cleared nearly 1 million sports bets in 2013. Like Cannella, Salmons doesn’t use a strictly algorithmic approach, although he is well-versed in sabermetrics. “I look at computerized [projections] before putting numbers out,” he said. “You’d be foolish not to.”

In addition to using public systems like PECOTA to audit its over/unders, the LVH also has several internal computer models dedicated to prediction. But Salmons isn’t necessarily trying to maximize pure predictive accuracy. “You can’t let the computer go crazy,” he said. “I also have to put out a number I think is good for the marketplace.”

Salmons’ pet example is the 2014 Houston Astros. He hinted that his in-house algorithms call for Houston to win between 65 and 70 games this year (a range also in accordance with PECOTA and FanGraphs’ Steamer), but the Astros are coming off a historically terrible season in which they won just 51 games. The casual bettor is much more likely to see the Astros as a god-awful team than a typically bad one. As Salmons put it, “The public doesn’t even know what ‘regression to the mean’ is.”

So Salmons undershot what he himself felt was most accurate and set the Astros’ over/under at 63.5 wins. That’s in keeping with his guiding philosophy of putting “the highest number out there that the wise-guys will stay away from.” Salmons knew that if he made the projection as high as he wanted, he’d have the wrong balance of bettors. He never wants more than 30 percent of people who bet to be professional sports bettors — otherwise it’s possible he’ll lose too much money. So he had to drop Houston’s win total lower than his computers predicted.⁸

Despite a professed goal that, in some cases, runs counter to maximizing predictive accuracy, Vegas’ forecasting track record is quite strong. In addition to the aforementioned study in predicting division placement since 2005, Vegas’ over/under win totals nearly matched PECOTA for accuracy over the same span — an RMSE of 9.1 wins, or 2.7 wins away from the best possible result of 6.4.

This means a purely statistical system, a market-based method and a hybrid approach each came to the same level of accuracy from three different directions. As Salmons noted when I asked him about the effect of sabermetrics on the betting game, “The market has gotten a lot stronger than it used to be.”

But no matter how strong they get, projections will never reach that 6.4 RMSE mark. Even if we somehow knew beyond a shadow of a doubt that a team had precisely 81-win talent, there would still be about a 5 percent chance it would finish with 70 wins or fewer, and about the same chance it would finish with 92 or more, by pure randomness alone.
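The tail probabilities for a true 81-win team can be checked with an exact binomial calculation. The sketch below assumes each of the 162 games is an independent coin flip, which is a simplification rather than anything the article states; the function name is illustrative.

```python
from math import comb


def prob_at_most(wins: int, games: int = 162, win_prob: float = 0.5) -> float:
    """Probability of finishing with `wins` or fewer under a binomial model
    (independent games with a fixed win probability, an assumption)."""
    return sum(
        comb(games, k) * win_prob ** k * (1.0 - win_prob) ** (games - k)
        for k in range(wins + 1)
    )


low_tail = prob_at_most(70)          # 70 wins or fewer
high_tail = 1.0 - prob_at_most(91)   # 92 wins or more
print(f"P(70 or fewer wins): {low_tail:.3f}")
print(f"P(92 or more wins):  {high_tail:.3f}")
```

Under this model the two tails are equal by symmetry, and each should come out close to the 5 percent figure cited above.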

Chance is always a confounding factor. Perfection is impossible, even as it beckons forecasters to try to reach it.

Footnotes

¹ There’s only a 1-in-4 chance that a given forecast is closer than 6.4 wins in two straight seasons, even if it had perfect knowledge of player talent, injuries, etc. There’s a 0 percent chance of it happening in eight straight seasons.
² As I’m sure many of you know by now, PECOTA was originally created by FiveThirtyEight editor-in-chief Nate Silver, although its current form differs slightly from the model he originally conceived.
³ Preferably after adjusting for park and league.
⁴ PECOTA isn’t the only stats-based projection method, of course; there’s also Steamer, ZiPS, and a whole host of other arcanely named systems. For our purposes, PECOTA is a representative stand-in because it has the largest archive to measure its predictions against.
⁵ Sports Illustrated didn’t make win predictions for every season, so I can only test the predictions about a team’s divisional ranking, not its win total.
⁶ For sticklers: I ran a one-way ANOVA between the mean-squared errors of each division-ranking prediction; the p-value was 0.81, meaning any differences between the errors of each projection were nowhere near being significant.
⁷ Vegas over/under win totals used for this study: 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013.
⁸ This is an instructive example of what the Vegas sportsbooks are trying to accomplish. Many people believe that bookmakers are trying to set their odds such that equal money comes in on both sides of a bet, but that’s not always the case.
