Dear Mona, What’s The Most Common Name In America?

Dear Mona,

What are the most common first- and last-name combinations in the United States? Is John Smith really the most common name?

Kieron George, 21, North Yorkshire, U.K.

Dear Kieron,

Well, this is a real head-scratcher — not least because if you want a list of the first and last names of Americans, you’d better have either a lot of time and money or work for the NSA. Unfortunately, I don’t fit either description, so I’m going to try to piece together two separate databases — one for first names and one for surnames. But as I’ll explain, those data sets can’t be stitched together so easily. If they could, we’d be able to say that “John Smith” is the third most common full name in America, but in reality it probably doesn’t even make it into the top 10.

To get you an answer, my colleague Andrew Flowers and I tried a more sophisticated technique that reached a different conclusion: We think the most common name in America might very well be James Smith.

Here’s how we got there.

First off, first names. The Social Security Administration (SSA) has a database of all first names back to 1880. That date should tip you off to a problem — sadly none of the 1,746 babies christened “Minnie” in 1880 is still alive today. To take deaths into account, we looked at the number of babies born each year since 1910 and, using actuarial data on life expectancy, tried to figure out how many of them are still around. (Our boss, Nate Silver, used a similar methodology when he looked at the typical age of Americans with various names.)

Factoring in life expectancy also corrects for the varying popularity of first names over time. For instance, women named Brittany tend to be younger than, say, those named Ethel.

But a considerable chunk of people who live in the U.S. — 13 percent of the population — wasn’t born here, and their names aren’t included in the SSA’s database of baby names. We needed a way to count them, too, and started by focusing on immigrants who are Hispanic or Latino, because they make up almost half the country’s foreign-born population.

We took the top 1,000 most-common first names from the adjusted SSA data, found out how common they were in each state as of 2013, and then calculated a “correction” factor based on how much more common the name was in states with higher Hispanic populations. (More methodological detail is in the footnotes.¹)

This way, we were able to make sure Hispanic and Latino immigrants were better reflected in the data. As a result, “Maria” moved up from the 97th most popular first name in America (according to the unadjusted SSA data) to 73rd.

After those calculations, we had the number of Americans with just about any first name.²

You can see that nine of the top 10 names are labeled as male names in the data. That’s because the distribution of female names tends to be more diffuse (or, to use less statistical jargon, parents tend to be more imaginative when they name their baby girls). As a whole, American first names aren’t very diverse — almost 30 percent of Americans have a given name that appears in the top 100 list.

Next up, surnames. This was a bit easier because the Census Bureau has a more straightforward database of the number of Americans with each surname.³ It was last published in 2000, which presents a problem because American nomenclature could have changed a lot in the past 14 years. To account for this, we looked at the ethnic breakdown of the U.S. population then and now, and the ethnic breakdown of individuals with those surnames. We were then able to adjust the number of instances of each surname by the growth in the racial/ethnic groups of those having that name.⁴ So, for example, the surname Smith is about 74 percent white, while the surname Garcia is 91 percent Hispanic. We then assumed the number of white Smiths grew since 2000 at the same rate as the overall white population (which was just over 1 percent). And ditto for Garcia — the Hispanic share of those named Garcia grew at the overall Hispanic rate. This caused a major reshuffling of the top surnames, because the Hispanic population grew a lot faster (by more than 50 percent) than the white population over this period. In the end, the proportional share of Garcias in the population goes up, while the share of Smiths goes down.

Even after that calculation, it was clear that although many first names come in and out of fashion, surnames tend to change at a much slower rate. America’s most common surname by a mile is Smith — 2.5 million Americans have it, ahead of 2 million with the surname Johnson.

Now the tricky bit. We turned those numbers on first names and surnames into probabilities by dividing them by the total number of people in the country. That showed that the chance of being called “Michael” is 1.1 in 100, and that the chance of having the surname “Smith” is 0.8 in 100. To figure out the probability of being called “Michael Smith,” we just multiplied those two to get 0.008 in 100.

Here’s something you’ll rarely hear me say in this column: This methodology is crap (and it’s still crap even after all those adjustments for death rates, immigration and population growth). It assumes that you can treat a person’s first name and last name as independent variables, i.e. as though they have nothing to do with each other. But guessing the number of Americans with a particular first- and last-name combination isn’t like rolling two dice — both names are closely related to each other.

To illustrate the point, let’s use the 20 most common first and last names in the country. If you just multiply the probabilities for each of those, you get a neat little (and highly misleading) chart that looks like this:

The first and most important reason this is wrong is that basic demographic facts tie certain first and last names together. We know that surnames aren’t equally distributed among Americans because the Census Bureau publishes those surnames with a breakdown by racial and ethnic group. It shows that in 2000, 99.9 percent of Americans with the surname “Heimerl” were white, and 99.0 percent with the surname “Rezendiz” were Hispanic.

Those facts affect first names, too. Intuitively, you probably think that there are a lot more Americans called “Maria Martinez” than “Maria Miller.” Last year, Lee Hartman, an associate professor emeritus of Spanish at Southern Illinois University, tried to test that intuition. Hartman had a simple idea: Look up first- and last-name combinations on the White Pages website and see whether simple probabilities captured reality. He found those basic probabilities were way off the mark.

Hartman didn’t choose the top names for his table using SSA and census records, so it looks a little different from ours. But it does show just how strong those racial and ethnic factors are.

Take the first name “Thomas” and the last name “Rodriguez” as examples. If you assume the likelihood of having either name is independent, then the chance of an American being named “Thomas Rodriguez” is about 0.000019 percent, and there would be about 5,940 Americans called “Thomas Rodriguez.” Searching on WhitePages.com, though, Hartman found that the number of people in the phonebook with that name was 84 percent lower than the basic probabilities would suggest. You can see just how much simple probabilities over- and underestimate the prevalence of any particular name using Hartman’s matrix below (click to expand):

There are other gems in this chart that reveal just how bad an idea it is to assume names can go together in any order. The number of Americans called “John Johnson,” “David Davis,” “Thomas Thomas” or “William Williams” is (thankfully) far lower than simple probabilities would suggest. Even less bold alliterations, such as “Mark Martin” and “Daniel Davis,” appear to be unpopular among parents picking names.

If you were to extend the list beyond the top 20, you’d probably notice some other improbable name combinations that a computer, even one wired to factor in ethnicity, simply couldn’t detect. As Andrew Flowers put it, “No matter how pretty the name, I don’t plan on calling a daughter of mine Rose.” And in the highly unlikely event that I were to change my surname, I probably wouldn’t go for “Lott” (no offense to the 25,118 Americans with the surname).

“John Smith” (of which there are almost 24 percent fewer in the phone book than simple probabilities would suggest) appears to suffer a double blow from being not only a huge figure in history, but also a placeholder name that’s become a clichéd byname for blandness.

So we took one last statistical step to account for the fact that the chances of having a certain first name (like John) isn’t independent of the chances of having a certain last name (let’s say Smith). We adjusted our simple estimates of the most likely name combinations by taking into account Hartman’s research on which combinations are more and less likely than basic probability would suggest.

As a result, “Michael Smith” dropped down from the most likely name in America to second place and was bypassed by “James Smith.” There were more dramatic changes, too. For example, according to independent probabilities, “Maria Garcia” was expected to be the 354th most common full name (and “Maria Smith” would rank as No. 74). But according to Hartman’s data set, “Maria” and “Garcia” correlate nearly 700 percent more than you’d expect. That means “Maria Garcia” skyrockets to the 15th most common name combination overall. (We’ve posted all the data used in our analysis to our GitHub page.)

Hope the numbers help, Kieron George (incidentally, our probabilities suggest that there are 0.24 people with your name in the United States, so if you were to move over here, I reckon you’d be unique).

Mona (and Andrew)

Have a question you would like answered here? Send it to DearMona@fivethirtyeight.com or @DataLab538.

Footnotes

Specifically, we ran 1,000 regressions, one for each of the top 1,000 most-common first names. The variables used in the regression were, on the one hand, the frequency that that name appeared in each state (this was the dependent variable), and on the other hand was the share of the state’s population that is Hispanic (this is the independent variable). This latter figure comes from the Census Bureau and accounts for immigration, including undocumented immigrants. After running these 1,000 regressions, what resulted was Hispanic correction coefficients specific to each first name, indicating how Hispanic each was. From those coefficients, a “correction” (either up or down) was calculated for that first name by using the ratio of the national Hispanic population share to the foreign-born population share. However, the correction for each name is bounded by the ratio of the Hispanic population percentage and the foreign-born population percentage: by 0.9496 on the low end, and by 1.3460 on the high end. Plotted below are the correction factors for top 1,000 first names, ordered by how Hispanic they were calculated to be.
I say “just about” because the SSA only publishes data for names that recur at least five times in a single year to protect privacy, so even data on the native-born population isn’t perfect. According to our calculations, about 92,000 first names cover about 80 percent of the 2013 population, or about 255 million people.
Again though, it’s clipped at names that appear at least 100 times, in the interests of privacy.
The Census Bureau breaks down surnames by five categories: white, black, Hispanic, Asian, two races and American Indian/Pacific Islander. The growth rates in those five categories were then used to update the number of instances of each surname.

FiveThirtyEight

Footnotes

Comments