For the past few months, we’ve heard a lot about statistical models and their predictions of what might happen as the COVID-19 pandemic rages on. Their estimates have varied widely, in part because almost every piece of data that goes into the models is imperfect. But just what is an epidemic model, anyway, and how does one work? How do we know which ones to trust? At a time when lives depend on the accuracy of these models, it’s more important than ever to understand their strengths and limitations.

I am a university professor, a scientist, and an infectious disease modeler. I’ve invented some COVID-19 models. My colleagues and I have spent years teaching doctoral students to derive new and better models. When I teach, I emphasize that models are a *tool for making our ideas clear*.

For instance, take our struggle to understand how the novel coronavirus is transmitted. The basic hope of our mitigation efforts is that testing and isolation of contagious cases can limit the spread of COVID-19. We also need to worry about the contagiousness of an infection, how many superspreaders there are, and how long a person is infectious before they begin to show symptoms. But we’re not quite sure about how all those variables interact. This is where models come in.

Infectious disease modelers represent the process of transmission as a *dynamical process*, which is just a fancy way of saying that the variables change in response to each other. Exactly *how* modelers represent this process differs according to the tastes and talents of whoever’s designing the model. Some modelers write down equations and think about the interaction between an infectious person and a susceptible person like two reagents in a chemical reaction. Other modelers think that approach simplifies things too much. It’s better, they say, to write a computer program that represents all the people in a population, like characters in a computer game, and then let the game play out according to some algorithm to see what happens. In either case, the scientist has posed what we call a *mechanistic* model that is governed by certain rules.
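To make the two styles concrete, here is a minimal sketch of each in Python. Both use the textbook susceptible-infectious-recovered (SIR) setup, which is my choice for illustration and not any particular COVID-19 model. The function names, the parameter values and the one-day time step in the second function are all assumptions.

```python
import random


def sir_equations(beta, gamma, s0, i0, days, dt=0.1):
    """Equation style: track group sizes and step the SIR equations forward (Euler method)."""
    s, i, r = float(s0), float(i0), 0.0
    n = s + i + r
    history = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            # Infectious and susceptible people "react" like two reagents.
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


def sir_agents(beta, gamma, n, i0, days, seed=0):
    """Computer-game style: every person has a state, updated once per day by chance."""
    rng = random.Random(seed)
    states = ["I"] * i0 + ["S"] * (n - i0)
    history = [(n - i0, i0, 0)]
    for _ in range(days):
        infectious = states.count("I")
        # Assumed rule: each infectious person infects a given susceptible
        # person with probability beta / n per day.
        p_infect = 1 - (1 - beta / n) ** infectious
        next_states = []
        for state in states:
            if state == "S" and rng.random() < p_infect:
                next_states.append("I")
            elif state == "I" and rng.random() < gamma:
                next_states.append("R")
            else:
                next_states.append(state)
        states = next_states
        history.append((states.count("S"), states.count("I"), states.count("R")))
    return history


# Illustrative values only: 1,000 people, 10 of them infectious at the start.
print(sir_equations(0.4, 0.2, 990, 10, days=60)[-1])
print(sir_agents(0.4, 0.2, 1000, 10, days=60)[-1])
```

The first function always gives the same answer for the same inputs. The second gives a different epidemic for every random seed, which is part of why some modelers prefer it and others find it harder to reason about.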

There’s a challenge with mechanistic models, however. To fully specify a mechanistic model, we have to give actual values to *every* single feature of the model. If a COVID-19 model factors in isolation and recovery (and it must, or else every person who gets infected would go on to infect others for all eternity), then the modelers must specify the *isolation rate* and *recovery rate*, and also the transmission rate. Actually, they need a value for every single aspect of the world they decide to represent. (And they want to represent all the ones they think are important, so there might be a lot of them.)
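A small sketch shows why none of these values can be left blank. It assumes the simplest possible rule: an infectious person stops spreading at a constant combined rate of isolation plus recovery, so the average time spent spreading is one over that combined rate. The class and function names are hypothetical, and the numbers are placeholders.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParameters:
    transmission_rate: float  # onward infections per infectious person per day
    isolation_rate: float     # per day
    recovery_rate: float      # per day


def mean_days_spreading(p):
    """Average days a person keeps spreading, assuming constant removal rates."""
    removal_rate = p.isolation_rate + p.recovery_rate
    # With no isolation and no recovery, a person spreads "for all eternity".
    return math.inf if removal_rate == 0 else 1 / removal_rate


def expected_onward_infections(p):
    """Rough early-epidemic count, assuming nearly everyone is still susceptible."""
    return p.transmission_rate * mean_days_spreading(p)


with_removal = ModelParameters(transmission_rate=0.5, isolation_rate=0.2, recovery_rate=0.1)
without_removal = ModelParameters(transmission_rate=0.5, isolation_rate=0.0, recovery_rate=0.0)

print(mean_days_spreading(with_removal), expected_onward_infections(with_removal))
print(mean_days_spreading(without_removal), expected_onward_infections(without_removal))
```

Change any one of the three numbers and the answer changes, which is why the modeler has to commit to a value for each.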

Where do we get these values from? There is no periodic table where we can look up the isolation and transmission rates of all infectious diseases. In fact, those isolation and transmission rates are almost certainly highly variable from one community to another because things differ so much between places. Think, for example, of health care infrastructure, the availability of tests, and the average number of individuals people come into contact with daily.

There is no single way that infectious disease modelers arrive at reliable (and sometimes unreliable) numbers for these values. Some of these things we can measure. For instance, we can measure the speed at which people isolate by asking people who are hospitalized when they first began to feel symptoms and when they were first isolated (hospitalized or quarantined). If the delay is five days on average, in some population or other, then the isolation rate for that population is ⅕ per day, the reciprocal of the average delay.
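The arithmetic is short enough to write out. In this sketch the list of delays is made up for illustration, and the function name is hypothetical.

```python
from statistics import mean


def isolation_rate_from_delays(delays_in_days):
    """Isolation rate per day, taken as one over the average symptom-to-isolation delay."""
    return 1 / mean(delays_in_days)


# Hypothetical survey answers: days from first symptoms to isolation.
reported_delays = [3, 5, 7, 4, 6]  # averages to 5 days

print(isolation_rate_from_delays(reported_delays))  # 0.2, i.e. one fifth per day
```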

Another way we can arrive at a number for an unknown quantity is by “fitting” our model. Let’s imagine a super-simple model as an example, one with only two unknown quantities: the transmission rate and the recovery rate. When fitting a model we take some real-world data (say, the number of newly discovered cases in a population) and we tune those two unknowns from one extreme to the other to find the intermediate values at which the model most looks like the actual, real-life data. In the case of the novel coronavirus, the data could be the number of cases, hospitalizations or deaths.
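Here is a minimal sketch of that tuning process. It sweeps both rates across a grid and keeps the pair whose simulated daily cases are closest to the data, with closeness measured by the sum of squared differences. That measure, the toy day-by-day model and every name here are assumptions of mine. The "observed" data are generated from the model itself, so we can check that the fit recovers the values we put in.

```python
def daily_new_cases(beta, gamma, n=10_000, i0=10, days=30):
    """Toy model: new infections per day for given transmission and recovery rates."""
    s, i = n - i0, float(i0)
    cases = []
    for _ in range(days):
        new_infections = beta * s * i / n
        s -= new_infections
        i += new_infections - gamma * i
        cases.append(new_infections)
    return cases


def squared_error(model, data):
    """One way to score how much the model 'looks like' the data (smaller is closer)."""
    return sum((m - d) ** 2 for m, d in zip(model, data))


def fit_by_grid_search(data, grid):
    """Try every (beta, gamma) pair on the grid and keep the closest match."""
    best = None
    for beta in grid:
        for gamma in grid:
            err = squared_error(daily_new_cases(beta, gamma, days=len(data)), data)
            if best is None or err < best[0]:
                best = (err, beta, gamma)
    return best[1], best[2]


grid = [k / 20 for k in range(1, 21)]  # 0.05, 0.10, ..., 1.00

# Synthetic stand-in for real surveillance data, made with known rates.
observed = daily_new_cases(beta=0.4, gamma=0.2)

beta_hat, gamma_hat = fit_by_grid_search(observed, grid)
print(beta_hat, gamma_hat)
```

Real data are noisy and the true rates rarely sit exactly on the grid, so in practice the best-fitting pair is an estimate with uncertainty around it, not an exact recovery like this one.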

Of course, to say something scientifically, we have to be precise about what we mean by “most looks like the actual, real-life data.” We do this with yet another model, typically a statistical model, that calculates the probability of observing the data we actually have, given different assumptions. We say that those models for which our data are relatively probable have high *likelihood*, and there’s a whole statistical framework for comparing the likelihoods of different models.
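As a sketch of what such a calculation can look like, suppose we assume that each day's observed case count is a Poisson draw around the number the epidemic model expects. The Poisson choice is an assumption for illustration, as are the made-up counts and the two candidate models below. The log-likelihood is then a sum of one term per day.

```python
import math


def poisson_log_likelihood(observed_counts, expected_counts):
    """Log-probability of the observed counts, assuming Poisson noise around the model.

    Every expected count must be greater than zero.
    """
    total = 0.0
    for k, mu in zip(observed_counts, expected_counts):
        # log of the Poisson probability: mu**k * exp(-mu) / k!
        total += k * math.log(mu) - mu - math.lgamma(k + 1)
    return total


# Hypothetical daily case counts and two hypothetical model predictions.
observed = [12, 15, 22, 30, 41]
model_a = [11.0, 16.0, 23.0, 32.0, 45.0]  # tracks the rise in cases
model_b = [20.0, 20.0, 20.0, 20.0, 20.0]  # predicts a flat epidemic

ll_a = poisson_log_likelihood(observed, model_a)
ll_b = poisson_log_likelihood(observed, model_b)
print(ll_a, ll_b)
```

The model with the larger (less negative) log-likelihood is the one under which the data we actually saw were more probable.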

People looking at models — and people using them to make policy decisions — should be asking additional questions too. For instance: How all-encompassing are the model’s predictions? If the epidemic doesn’t play out the way the model said it should, why is that? Does that mean the model was wrong? Is a wrong model a bad model? What is a prediction anyway? What is the nature of the relationship between the model and *reality*? (Being an infectious disease researcher can get existential in a hurry.)

So, even though models are for making ideas clear, the path to making one isn’t always so clear. Yet for all their imprecisions, they’re still better than nothing. Either we can use models, trying to stay cautious about how much they’re really telling us, or we can rely on conjecture, gut reactions and expert opinion alone. Only the first path is a transparent one with a built-in mechanism for self-correction.