I recently studied the representation of women in science by counting how many papers on the arXiv, a website where scientists have posted nearly a million papers, were written by women. I deduced the gender of the papers’ authors using each author’s first name. But I ran into a problem: It’s not always possible to infer the gender of a scientist from his or her name. For example, some scientists publish under their initials, and we might expect female scientists to be particularly likely to do this in order to avoid gender discrimination.
So, how can we figure out the gender split of the first-initial authors when we don’t know their full names?
We can’t tell whether a particular author who uses only a first initial is male or female. But because names for men and women tend to start with different letters, we can see whether the distribution of first initials more closely resembles the pattern for male names or female names. It turns out that these patterns look quite different:
(I set aside names beginning with “Q” and “X”, which were used too rarely to allow accurate frequency estimates. I also used only papers written by a single author, since that author would have full control over whether to use his or her full name.)
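As a rough illustration of how such a first-initial pattern might be computed, here is a minimal Python sketch. The function name `initial_distribution` and the sample names are hypothetical, not taken from the original analysis; the only details borrowed from the text are that each name is reduced to its first letter and that "Q" and "X" are set aside.

```python
from collections import Counter

# Letters set aside in the text because they occur too rarely.
EXCLUDED = {"Q", "X"}

def initial_distribution(first_names):
    """Return the share of names starting with each letter (hypothetical helper)."""
    counts = Counter()
    for name in first_names:
        name = name.strip()
        if not name:
            continue  # skip empty entries
        letter = name[0].upper()
        if letter not in EXCLUDED:
            counts[letter] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize counts so the shares sum to 1.
    return {letter: n / total for letter, n in sorted(counts.items())}

# Made-up sample, for illustration only.
sample = ["Maria", "Mei", "Anna", "Xena"]
dist = initial_distribution(sample)
print(dist)
```

Running the same function separately on names known to be male and names known to be female would give the two reference patterns that are compared in the text.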
Then we can look at the pattern for authors who use only first initials. It looks more like the male pattern (the gray line more closely follows the blue line), reflecting the fact that most authors who use initials, like most scientists in general, are men:
By modeling the first-initial pattern as a mixture of the male pattern and the female pattern, we can estimate what fraction of authors who use only first initials are female. For example, if the first-initial pattern is a mixture of 80 percent male pattern and 20 percent female pattern, our model will estimate that 20 percent of the authors who use only first initials are female.
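To make the mixture idea concrete, here is a minimal sketch in Python. It fits the mixing fraction by least squares, which is one simple choice and an assumption on my part; the fitting procedure behind the numbers reported here may differ. The function name `estimate_female_fraction` and the four-letter frequency tables are made up for illustration.

```python
def estimate_female_fraction(initial_freq, male_freq, female_freq):
    """Least-squares estimate of p in: initial = p * female + (1 - p) * male."""
    letters = sorted(set(male_freq) & set(female_freq))
    num = 0.0
    den = 0.0
    for letter in letters:
        # How much the female pattern differs from the male pattern at this letter.
        diff = female_freq[letter] - male_freq[letter]
        # How far the observed pattern sits from the male pattern.
        offset = initial_freq.get(letter, 0.0) - male_freq[letter]
        num += offset * diff
        den += diff * diff
    if den == 0.0:
        raise ValueError("male and female patterns are identical; p is not identifiable")
    # A mixing fraction has to lie between 0 and 1.
    return min(1.0, max(0.0, num / den))

# Hypothetical letter frequencies, not real data.
male = {"A": 0.10, "J": 0.40, "M": 0.30, "S": 0.20}
female = {"A": 0.30, "J": 0.15, "M": 0.25, "S": 0.30}

# Build a first-initial pattern that is 80 percent male and 20 percent female.
mixed = {k: 0.8 * male[k] + 0.2 * female[k] for k in male}

estimate = estimate_female_fraction(mixed, male, female)
print(round(estimate, 3))
```

Because the synthetic first-initial pattern is an exact 80/20 blend, the estimate recovers the 20 percent female share. With real data the fit will not be exact, and the leftover mismatch is one way to judge how well the mixture assumption holds.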
The model estimates that the fraction of women among authors who use only first initials is considerably higher than the fraction of women among scientists as a whole: 16.2 percent versus 11.8 percent. This suggests that female scientists may be disproportionately concealing their gender. A gap of 4.4 percentage points may not seem big, but it means that the share of women among the first-initial authors is more than a third higher than we would expect. (The disparity becomes even larger, 14.0 percent versus 8.3 percent, when we look at authorships rather than authors; in other words, when we count a scientist twice if she’s written two papers.)
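The gap can be expressed either as an absolute difference or as a relative increase, and the two give quite different impressions. The short Python sketch below works through both using only the four percentages quoted above; the function and variable names are mine.

```python
def gap(observed_pct, expected_pct):
    """Return (absolute gap in percentage points, relative increase as a fraction)."""
    absolute = observed_pct - expected_pct
    relative = absolute / expected_pct
    return absolute, relative

# Per author: 16.2 percent among first-initial authors vs. 11.8 percent overall.
author_abs, author_rel = gap(16.2, 11.8)

# Per authorship: 14.0 percent vs. 8.3 percent.
authorship_abs, authorship_rel = gap(14.0, 8.3)

print(f"authors: {author_abs:.1f} points, {author_rel:.0%} more than expected")
print(f"authorships: {authorship_abs:.1f} points, {authorship_rel:.0%} more than expected")
```

The relative figures are the ones that matter for the argument: a difference of a few points is large when the baseline itself is only about one in ten.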
There are caveats to this analysis. It assumes that the first-initial pattern is a perfect mixture of the male and female patterns, but there are reasons this might not be the case. If scientists in certain countries, or certain fields, are disproportionately likely to use their first initials, that will lead to imperfect model fits. But the model fits the data quite well:
And it’s plausible that female scientists would be disproportionately likely to conceal their gender. Female authors in non-scientific fields have been doing so for centuries. The fact that papers by women receive fewer citations, as multiple studies have found, also provides an incentive for women to use only their initials.