The United Nations-funded report finding that 191,369 people were reported killed in the Syrian conflict between March 2011 and the end of April 2014 built upon the efforts of several groups that have undertaken the hard, morbid work of documenting each death.
The major challenge the U.N.-funded researchers grappled with was how to determine which deaths were counted by two or more groups, and to remove duplicates from the count. At the same time, they knew the count — based on reports from individuals, hospitals, mosques and others — would miss tens of thousands (perhaps hundreds of thousands) of unreported deaths.
“Our party trick is to integrate databases,” Patrick Ball, an author on the report, said in a telephone interview Friday.
That trick is no simple one, Ball said. He and his colleagues at the Human Rights Data Analysis Group (HRDAG), a nonprofit that received $30,000 from the U.N. to count Syria deaths, identified five different groups in Syria that have been compiling death reports: the Syrian government and four nongovernmental groups. When two or more groups counted the same person who died, they didn’t always spell the person’s name consistently. “Perfect matches are a pretty small subset of duplicates,” said Ball, executive director of HRDAG.
First machines, then people, took their turn trying to merge the data sets. HRDAG researchers trained a machine-learning method called random forests to rate the likelihood that a pair of deaths from among the 318,910 reports were matches. Then experts reviewed the pairs and decided which were matches.
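To make the matching step concrete, here is a minimal sketch of how pairs of reports can be scored and merged. It is not HRDAG's code: a simple string-similarity score with a fixed cutoff stands in for the trained random forest and for the human reviewers, and the records, the `match_score` function and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical death reports: (source, name, date of death, location).
reports = [
    ("group_a", "Ahmad al-Khatib", "2012-07-14", "Homs"),
    ("group_b", "Ahmed Alkhatib", "2012-07-14", "Homs"),
    ("group_c", "Ahmad al-Khatib", "2012-07-14", "Homs"),
    ("group_a", "Samir Haddad", "2013-01-02", "Aleppo"),
    ("group_b", "Samir Hadad", "2013-01-02", "Aleppo"),
    ("group_b", "Layla Nasser", "2013-01-02", "Aleppo"),
]

THRESHOLD = 0.8  # assumed cutoff; the real project used a trained model plus expert review


def match_score(a, b):
    """Score a pair of reports: name similarity, but only if date and place agree."""
    if a[2] != b[2] or a[3] != b[3]:
        return 0.0
    return SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()


def find(parent, i):
    """Union-find lookup with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def count_unique(records, threshold=THRESHOLD):
    """Merge every pair scoring at or above the threshold, then count the clusters."""
    parent = list(range(len(records)))
    for i, j in combinations(range(len(records)), 2):
        if match_score(records[i], records[j]) >= threshold:
            parent[find(parent, i)] = find(parent, j)
    return len({find(parent, i) for i in range(len(records))})


unique_deaths = count_unique(reports)
print(unique_deaths)
```

In this toy data the six reports collapse to three people, even though only one pair of names matches letter for letter. That is the point Ball makes about perfect matches being a small subset of duplicates.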
Megan Price, Ball’s colleague at HRDAG and lead author of the report, discussed the group’s method in a talk in February.
The human review was time-consuming and, in a statistical sense, somewhat futile for getting a count. The machine-learning stage probably was good enough to get within a few thousand of the right total — it got the “low-hanging fruit,” as Ball put it. That’s generally good enough for industry applications, Ball said, and probably good enough for the purposes of Navi Pillay, the U.N. high commissioner for human rights, who wielded the figure Friday in saying, in a U.N. statement, “It is scandalous that the predicament of the injured, displaced, the detained, and the relatives of all those who have been killed or are missing is no longer attracting much attention, despite the enormity of their suffering.”
“She would have been happy, for this purpose, to have this list a couple of months ago,” Ball said.
But getting the number right and reporting it to six significant figures was important nonetheless, Ball said. His team intended the number to be an exact count of all reported deaths recorded by the five groups, and to withstand scrutiny. It’s up to the U.N. whether to publish the list. If it does, “and a few days later, someone says, ‘I found 10 duplicates,’ people would say, ‘This is garbage. You can’t trust this list.’ ”
The need for an exact count worries Ball, because there is no shortcut in the human-review stage. That makes it difficult to scale for other projects. “We’re going to need to improve the machine-learning side. A lot,” he said.
Before researchers removed duplicates, they removed 51,953 death reports that omitted the date of death, the location of death or the person’s name.
The final death total is 98,468 higher than the count as of April 2013. Most of those deaths occurred in the 12 months through April 2014. But many occurred earlier and had not yet been reported or confirmed when the earlier count was made.
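The figures quoted in this article can be tied together with a little arithmetic. The sketch below assumes that the 318,910 reports are the total before incomplete records were set aside, which the article implies but does not state outright; the variable names are ours.

```python
# Figures quoted in the article.
total_reports = 318_910            # all death reports gathered from the five groups
incomplete_reports = 51_953        # missing a name, date or location
unique_deaths = 191_369            # final count after duplicates were removed
increase_since_april_2013 = 98_468

# Derived figures (assuming total_reports is the pre-filter total).
usable_reports = total_reports - incomplete_reports            # 266,957
duplicates_removed = usable_reports - unique_deaths            # 75,588
count_april_2013 = unique_deaths - increase_since_april_2013   # 92,901

print(usable_reports, duplicates_removed, count_april_2013)
```

Under that assumption, more than a quarter of the usable reports turned out to describe a death that another report had already recorded.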
New reports of old deaths come in all the time, Ball said, making it tough to maintain a database. The duplicate-removal process means “it’s a lot like redoing the whole project each time,” he said.
The Syria death count implicitly places a great deal of trust in the five sources of Syrian death reports. “These are all really terrific groups,” Ball said. His experience working in Syria and in other conflict zones has taught him that even groups that aren’t terrific don’t engage in fraud. It’s just too hard to fake a large number of deaths, in his view; people in the area remember deaths and would object to made-up ones. “It’s really hard to fabricate,” he said.
That doesn’t mean the groups have no agenda. “You don’t do this work because you’re disinterested about the world,” Ball said. “You’re doing the work because you’re outraged people are being killed. These are people who, notwithstanding their motivations, do this work in a clear, consistent, professional way.”
His group is no exception. HRDAG is nonpartisan and nonpolitical — “but we are not neutral,” Ball said. “We do this work because we are in favor of human rights.”
The harder work that stems from this latest Syria death count is estimating how many deaths it excludes. Doing so also makes use of the death reports HRDAG integrated for this report. The group is using a method called multiple systems estimation, which is similar to the capture/recapture method used to count some animals in the wild. The idea, as Ball explained it, is akin to estimating the size of a room by listening to how often squeaky, bouncy balls thrown into the room collide with each other. The more they do, the smaller the room. Similarly, the more often two groups count the same death, the smaller the estimated total number of deaths, reported and unreported. Just 40 percent of the deaths the researchers have documented, though, come from more than one source (not counting the deaths reported solely by the Syrian government). And just 1 in 40 deaths was reported by all four of the nongovernmental groups.
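The bouncing-ball intuition can be written down in its simplest form, the two-list Lincoln-Petersen estimator from capture/recapture studies. This is a textbook simplification and not the model HRDAG is fitting, which draws on more than two lists; the list sizes below are invented for illustration.

```python
def lincoln_petersen(n_a, n_b, n_both):
    """Estimate the total population from two lists and their overlap.

    n_a and n_b are the sizes of the two lists; n_both is how many
    records appear on both. Assumes the lists are independent.
    """
    if n_both == 0:
        raise ValueError("no overlap between lists; the estimate is undefined")
    return n_a * n_b / n_both


# Hypothetical lists: group A documents 400 deaths, group B 500, and 200 are on both.
n_a, n_b, n_both = 400, 500, 200
estimated_total = lincoln_petersen(n_a, n_b, n_both)   # 1000.0
documented = n_a + n_b - n_both                        # 700 distinct documented deaths
estimated_unreported = estimated_total - documented    # 300.0

# More overlap between the same two lists implies a smaller total.
smaller_total = lincoln_petersen(n_a, n_b, 400)        # 500.0

print(estimated_total, documented, estimated_unreported, smaller_total)
```

The second call shows the "smaller room" effect: with the list sizes held fixed, doubling the overlap halves the estimate.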
The estimation project faces several challenges. Among them: Some of the groups have shared data with one another, which could artificially inflate the number of data “collisions” and understate the total number of deaths. Also, preliminary results suggest the number of deaths that go unreported varies wildly by region — much more so than in Colombia, another country in which HRDAG applied this method — and “it varies in a nonlinear way over time and space,” Ball said. “It’s not like one region has a predictably low coverage rate, and another’s is predictably high. It bounces all over the place.”
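The data-sharing problem can be seen in the same two-list simplification. The sketch below uses invented numbers and the basic Lincoln-Petersen formula, not HRDAG's model: it supposes that one group copies 100 records from the other and then recomputes the estimate.

```python
def lincoln_petersen(n_a, n_b, n_both):
    """Two-list capture/recapture estimate of the total; assumes independent lists."""
    return n_a * n_b / n_both


# Hypothetical independent lists.
n_a, n_b, n_both = 400, 500, 200
independent_estimate = lincoln_petersen(n_a, n_b, n_both)   # 1000.0

# Suppose group B copies 100 of group A's records that it had not collected itself.
copied = 100
shared_estimate = lincoln_petersen(n_a, n_b + copied, n_both + copied)  # 800.0

print(independent_estimate, shared_estimate)
```

The copied records add overlap without adding any independent evidence, so the estimated total drops from 1,000 to 800 even though nothing about the underlying deaths has changed.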
“We’re being super cautious about checking everything and thinking really hard about how we control for these complexities,” Ball said. “A year ago, I thought we were two months from finishing. Now I think we’re six months away.”