Does a statistical property named Benford’s Law point toward fraud in the Iranian elections? That’s one possible reading of a new paper (.pdf) by Boudewijn Roukema of Nicolaus Copernicus University in Toruń, Poland. I think the paper is intriguing, but like Andrew (yes, we’re both writing on the same subject), I also have one or two reservations.

First, let me explain in a bit more detail what Benford’s Law is. Or actually, let me let Wikipedia explain:

Benford’s law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 almost one third of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty. This distribution of first digits arises logically whenever a set of values is distributed logarithmically […]

This counter-intuitive result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature).

The specific distribution of first digits (in the number 2,684, two is the first digit) that Benford’s law forecasts is as follows:
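For reference, the standard statement of Benford’s Law gives the probability of a leading digit d as log10(1 + 1/d). A minimal Python sketch (the function name is my own) that prints those percentages:

```python
import math

def benford_expected():
    """Return {digit: probability} for leading digits 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print the expected share of each leading digit as a percentage.
for digit, p in benford_expected().items():
    print(f"{digit}: {p:.1%}")
```

The nine probabilities sum to exactly 1, since the terms telescope to log10(10).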

Wikipedia calls this property counter-intuitive, but I don’t know that it’s entirely so. For instance, think about the number of daily visitors to the millions of websites that are out there in the world, which classically follows a power-law distribution. There are a lot more websites that have 1,000-some visitors a day than 9,000-some visitors a day, and there are a lot more websites that have 100-some visitors a day than 900-some visitors a day. For that matter, there are a lot more websites that have 1 visitor a day than 9 visitors a day. Website traffic very probably obeys Benford’s law or something approaching it.
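To make the website intuition concrete, here is a rough simulation. It assumes, purely for illustration, that daily visitor counts are spread evenly on a logarithmic scale between 1 and 1,000,000; real traffic data would differ, and the function names are hypothetical.

```python
import random
from collections import Counter

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

def simulate_traffic(num_sites=100_000, seed=0):
    """Share of each leading digit among simulated daily visitor counts."""
    rng = random.Random(seed)
    # Assumption: visitor counts are log-uniform between 1 and 1,000,000.
    visits = [int(10 ** rng.uniform(0, 6)) for _ in range(num_sites)]
    counts = Counter(leading_digit(v) for v in visits)
    return {d: counts[d] / num_sites for d in range(1, 10)}

shares = simulate_traffic()
for d, share in shares.items():
    print(f"{d}: {share:.1%}")
```

Under that assumption, the simulated share of sites whose traffic begins with 1 should land close to 30 percent, with each larger digit less common than the one before.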

Or, to give you an example where I actually have some numbers to show you, let’s look at the first digit for all places (cities and down) in California as of the 2000 Census.

This distribution obeys Benford’s Law almost perfectly.

Benford’s Law is sometimes useful in detecting fraud. For example, suppose you have a company policy that requires all expenses over $100 to be approved by the HR department. Chances are that you’ll have a lot of employees magically finding things which cost $99 or $90 or $87 to expense, and relatively few that cost $102 or $110. This would radically violate Benford’s Law and could be easily detected by it; of course, it could be detected in a lot of other ways too. But even when you don’t have a specific constraint like the $100 threshold I described above, Benford’s law is sometimes useful in these cases, because human beings intuitively tend to distribute the first digits about evenly when they’re making up “random” strings of numbers, when in fact many real-world distributions will be skewed toward the smaller digits. Something to keep in mind when you’re cheating on your taxes!
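One common way to put a number on such a violation is a chi-square goodness-of-fit statistic against the Benford shares. The sketch below is only an illustration: the expense amounts are made up, and the function name is hypothetical.

```python
import math
from collections import Counter

def benford_chi_square(amounts):
    """Chi-square statistic comparing leading digits of `amounts` with Benford's Law."""
    digits = [int(str(int(a))[0]) for a in amounts if a >= 1]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one amount of 1 or more")
    observed = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (observed[d] - expected) ** 2 / expected
    return stat

# Made-up expense claims clustered just under a $100 approval threshold.
suspicious = [99, 90, 87, 95, 98, 89, 92, 97, 85, 99] * 10
print(benford_chi_square(suspicious))
```

With nine digit categories there are 8 degrees of freedom, so a statistic above roughly 15.5 would be significant at the 5 percent level. The made-up claims above blow far past that.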

What Roukema set out to do, then, is to test the distributions of the vote totals for the four candidates across the 366 reporting units that Iran’s Interior Ministry has published numbers for. Here’s what he found when looking at Mehdi Karroubi’s vote totals:

As you can see from the graph, there are a lot more totals beginning in ‘7’ than you’d anticipate from Benford’s Law. The odds of this occurring by chance alone are extremely remote — about 10,000 to 1 against. The odds of an anomaly of this magnitude occurring for any of the nine leading digits for any of the four candidates are also quite remote — about 140 to 1 against.
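For a sense of how a figure like this can be computed, here is a sketch of a one-digit binomial tail probability. It is not necessarily the test Roukema used, and the count of 40 is a placeholder I made up, not the number from his paper; only the 366 reporting units come from the data described above.

```python
import math

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_seven = math.log10(1 + 1 / 7)        # Benford share of leading 7s, about 5.8%
n_units = 366                          # reporting units mentioned in the text
expected_sevens = n_units * p_seven    # roughly 21 totals expected to begin with 7

# HYPOTHETICAL count for illustration only, not taken from the paper.
hypothetical_count = 40
print(f"expected about {expected_sevens:.1f} sevens")
print(f"P(at least {hypothetical_count}) = {binomial_tail(n_units, hypothetical_count, p_seven):.2e}")
```

Testing nine digits for each of four candidates gives 36 chances to find an oddity somewhere, which is why the odds for "any digit, any candidate" are much less extreme than the odds for this one digit.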

Mahmoud Ahmadinejad’s vote totals also look a little funny — there are more numbers beginning with ‘2’ and ‘3’ and fewer beginning with ‘1’ than you’d anticipate — although the level of statistical significance is not nearly as high as in Karroubi’s case.

Roukema speculates that Iranian officials could have been taking cases where Ahmadinejad’s vote total began with a ‘1’ and switching it to a ‘2’ — for instance, in some town where he received 1,954 votes, they would report his having received 2,954 votes. This hypothesis may take on more meaning in light of new and as yet unverified allegations that reported turnout exceeded 100 percent in several dozen Iranian towns.

The reason I’m holding back from fully endorsing this is that it’s not clear to me whether this particular distribution should obey Benford’s Law in the first place. For instance, let’s take a look at the distribution of votes for Al Franken (before the recount) in last November’s Senate race in the 4,131 precincts in Minnesota.

This hugely violates Benford’s Law — there are not nearly enough totals beginning in 1 and too many beginning in numbers like 5, 6 and 7. The odds of these anomalies having occurred by chance alone are greater than a quadrillion to one against.

Of course, some people think the election in Minnesota was rigged too, so perhaps this is a poor example to use! But the reason this pattern emerges is that precinct sizes in Minnesota are not truly random: once a precinct has to serve more than a couple thousand voters, it is liable to become too crowded to do so adequately and a new one will be created. There seems to be a particularly large number of precincts in Minnesota that are designed to serve between 1,000 and 2,000 voters; since Franken won about 42 percent of the votes statewide, this leads to a relatively high number of instances where his vote totals are three-digit numbers beginning with a middle or high digit (672, 704, 588, etc.).
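A toy calculation shows how strong this effect can be. It assumes one precinct at every size from 1,000 to 1,999 votes cast, each giving the candidate exactly 42 percent; real precincts vary in both size and vote share, so treat this as a caricature.

```python
from collections import Counter

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

# Assumption: one precinct at each size from 1,000 to 1,999 votes cast,
# with the candidate taking exactly 42 percent (rounded down).
totals = [size * 42 // 100 for size in range(1000, 2000)]
counts = Counter(leading_digit(t) for t in totals)

for d in range(1, 10):
    print(d, counts[d])
```

Every total falls between 420 and 839, so the leading digits run only from 4 through 8 and not a single total begins with 1. No fraud is needed to break Benford’s Law when the units are sized this way.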

It’s not clear to me whether the voting units that Iran’s Interior Ministry reported on behave more like towns, in which case we might expect the vote distributions to obey Benford’s Law, or more like precincts, in which case we probably wouldn’t. The units are described in the spreadsheet I’m working from as “city/county”, which implies that sufficiently large cities are treated as their own units, whereas smaller ones (it looks to me like perhaps those that have fewer than about 15,000 people) have their results aggregated at some level resembling American counties. If there are these sorts of artificial constraints placed on the size of the reporting units, we might expect some anomalies from a Benford’s Law perspective.

Still, I don’t know that we’d expect the particular anomaly where a lot of Karroubi’s vote totals begin with 7’s; we’d probably expect to see something more like the Franken vote distribution where there are a lot of 7’s but also a lot of 6’s and 8’s.

Then again, I’m not sure what particular strategy is accomplished by taking one of the minor candidates’ vote totals and having them begin with 7’s. Perhaps, if there was tampering, what Iranian officials feared was not precisely Ahmadinejad losing but his winding up with less than 50 percent of the vote, which would send things to a run-off, presumably against Mousavi. Since Karroubi’s voters are, somewhat self-evidently, less well organized than Mousavi’s, perhaps it was easier to take votes from them to accomplish this goal.

Overall, I’m a bit less skeptical than Andrew, in part because, as we’ve been reporting on, Ahmadinejad performed oddly well in areas where Karroubi had been strong in 2005. But I still consider Roukema’s evidence to be somewhat circumstantial — it’s far from clear to me that Iranian voting totals should be obeying Benford’s Law in the first place.