We Used Broadband Data We Shouldn’t Have — Here’s What Went Wrong

Over the summer, FiveThirtyEight published two stories on broadband internet access in the U.S. that were based on a data set made public by academic researchers who had acquired data from Catalist, a well-known political data firm. After further reporting, we can no longer vouch for the academics’ data set. The preponderance of evidence we’ve collected has led us to conclude that it is fundamentally flawed. That’s because:

1) The academics’ data does not provide an accurate picture of broadband use at the county level relative to other sources.
2) Some of the data that the academic researchers received from Catalist originated with a third-party commercial source, and Catalist acknowledged that it did not vet that data itself. The researchers and Catalist also disagree about what Catalist said the data represents and what it could be used for.

If we’d known then what we know now, we would not have relied on the data set — which attempts to estimate the share of a county that has broadband internet at home, for nearly every county in the nation — in writing the two articles. For the first article, we identified the county with the lowest broadband rate in the data set (Saguache, in Colorado) and profiled it while also detailing how rural areas of the country can struggle to find a broadband connection. For the second, we used the data set to identify an urban area with limited broadband use — Washington, D.C. — and then highlighted disparities in Internet access among residents of the city. The idea behind the stories was to demonstrate that broadband is not ubiquitous in the U.S. today, even as more of our lives and the economy go online. We stand by this sentiment and the on-the-ground reporting in the two stories even though we have lost confidence in the data set.

We should have been more careful in how we used the data to help guide where to report out our stories on inadequate internet, and we were reminded of an important lesson: A data set is not necessarily reliable just because it comes from reputable institutions.

The flawed data set that we used for both stories came from researchers at Arizona State University and the University of Iowa. The goal of their research was to try to fill gaps in broadband data and to highlight usage disparities between different geographic areas, like cities and less populous counties. For more populous geographic areas — like states, metropolitan areas and bigger counties — the researchers relied on data from the U.S. Census Bureau. But reliable estimates of internet access in more sparsely populated areas are not readily available.¹ So to get information that would allow them to estimate broadband use in counties of all sizes, the researchers turned to a third party: Catalist.

Catalist is best known for its political data. It has provided information on voting-age Americans to progressive organizations — it helped Barack Obama in the 2008 election and counts Emily’s List, the Sierra Club and other well-known groups among its clients. Academic institutions use Catalist data too, particularly for research on voting behavior and elections. Its website claims that the firm’s “national database contains more than 240 million unique voting-age individuals.” The data is compiled from sources such as public voter files, the U.S. Census Bureau, the Federal Reserve, the Association of Religion Data Archives and commercial data providers.

For their broadband work, the researchers from Arizona State and Iowa purchased a 1 percent sample of the 240-million-person file, which provides information on demographics and voting behavior, among many other topics, for individuals in the sample.

Catalist’s and the researchers’ accounts of the sale differ. Caroline Tolbert, a researcher from the University of Iowa who spoke to FiveThirtyEight on behalf of the research group, said in an interview that Catalist had assured the academic researchers that a variable in the data set would be a good proxy for broadband use. Tolbert said the researchers trusted Catalist’s reputation in the academic world.

Catalist declined to make its data scientists available to speak to FiveThirtyEight on the record but provided an emailed statement from its CEO. In it, Catalist chief executive Laura Quinn said Catalist has “no record or recollection of describing this as a ‘proxy for broadband usage’” and that the way the academic researchers used the data they purchased from Catalist was inappropriate.

1) An inaccurate data set

After the articles were published, FiveThirtyEight was alerted to possible problems with the broadband data. We looked into it and found that the data set we used had a fundamentally different understanding of broadband access than other sources did.

We compared the data published by the researchers from Arizona State and Iowa with data on broadband access across the country from the U.S. Census Bureau’s American Community Survey and the Federal Communications Commission. It was clear that the ASU/Iowa number for broadband use in Washington, D.C., was quite different from the other sources’ numbers.² That was true for many other counties as well. (We limited our analysis to the 820 counties that all three sources have in common.)

According to the ASU/Iowa data, only 28.8 percent of Washington, D.C., had broadband internet at home in 2015-16. (Because of the way the researchers’ data set was presented and because we don’t have access to the data they received from Catalist, we can’t say for sure whether that refers to the share of the District’s population or the share of the District’s households.³) But the corresponding numbers from the ACS⁴ and FCC, both for 2016, are 70.3 percent and 70.1 percent, respectively.⁵ This means the other measures show at least twice as much broadband use as ASU/Iowa did for Washington.
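The "at least twice" comparison is simple arithmetic on the three Washington, D.C., figures quoted above. A quick check, as a minimal sketch (the variable names are ours, chosen for illustration):

```python
# Broadband rates for Washington, D.C., as quoted in this article (percent).
asu_iowa = 28.8  # ASU/Iowa estimate, 2015-16
acs = 70.3       # American Community Survey, 2016
fcc = 70.1       # FCC, 2016 (estimate based on the median of each tract's range)

# How many times larger is each comparison source than the ASU/Iowa figure?
acs_ratio = acs / asu_iowa
fcc_ratio = fcc / asu_iowa

print(f"ACS is {acs_ratio:.2f} times the ASU/Iowa figure")
print(f"FCC is {fcc_ratio:.2f} times the ASU/Iowa figure")
```

Both ratios come out to roughly 2.4, which is consistent with the statement that the other measures show at least twice as much broadband use.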

In addition to the discrepancies in the estimates for individual counties, we found that the distribution of the ASU/Iowa data looks quite different from the distribution of both the ACS and FCC data. The variation among counties is much lower in the ASU/Iowa⁶ data set than in the other two sources.⁷

But the biases in the data set aren’t consistent across counties. For some, the ASU/Iowa data has a low estimate relative to the other sources, and for others, it has a higher estimate.

Another cause for concern is that the ASU/Iowa data fails some common-sense checks.⁸ If the ASU/Iowa data were really capturing home broadband rates, we would expect the researchers’ measure to be correlated with household income. But it isn’t. For example, San Francisco County’s median household income is $87,701,⁹ but the ASU/Iowa data says only 46.6 percent of that county has home broadband. Now consider Apache County in Arizona — it has a median household income of $32,460 and a 57.4 percent home broadband rate according to the ASU/Iowa data.

The correlation between broadband access as measured by ASU/Iowa and median household income is 0.27, indicating a fairly weak relationship. In contrast, the correlations between broadband access and income in the ACS and FCC data sets are 0.70 and 0.62, respectively.¹⁰
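For readers who want to see what such a check involves, here is a minimal sketch of a Pearson correlation between county income and a broadband measure. The article does not say which tool or exact method produced the figures above, so this is an illustration only, and every number in the example is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# HYPOTHETICAL county values, invented for illustration only.
# They are not taken from the ASU/Iowa, ACS or FCC data.
median_income = [30_000, 45_000, 60_000, 75_000, 90_000]
broadband_tracks_income = [40.0, 52.0, 61.0, 74.0, 85.0]   # rises with income
broadband_ignores_income = [55.0, 48.0, 57.0, 50.0, 54.0]  # little relation to income

strong = pearson(median_income, broadband_tracks_income)
weak = pearson(median_income, broadband_ignores_income)

print(f"measure that tracks income:  r = {strong:.2f}")
print(f"measure that ignores income: r = {weak:.2f}")
```

A measure that really captures home broadband would be expected to behave more like the first series than the second.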

When presented with the findings from our analysis, the ASU/Iowa researchers provided a statement in which they disagreed with our contention that we should see a connection between their broadband data and median income, calling that variable “a poor predictor of broadband or internet use.” However, several studies suggest otherwise. A recent study by the Brookings Institution found median income to be highly correlated with broadband subscription rates. And the FCC’s 2016 Broadband Progress Report shows places without access to broadband have lower median household incomes.

One small part of the explanation behind the disparities between the ASU/Iowa data set and the other sources may be the differences in how each entity defines broadband use or subscription. The ACS measures it through surveys that ask: “Do you or any member of this household have access to the Internet using a broadband (high speed) Internet service such as cable, fiber optic, or DSL service installed in this household?” The FCC relies on data from service providers and counts the total number of residential fixed internet access service connections per 1,000 households by census tract.¹¹ According to the researchers’ data file, the ASU/Iowa data set uses the data from Catalist to estimate the percentage of the population with a home computer and home broadband, as measured by a subscription with an internet service provider.

The ASU/Iowa researchers told us in their statement that they expected the Catalist-derived data to be consistently different from other sources of broadband data because of the difference in how it was collected. However, the researchers said that they no longer have confidence in the data set’s estimate for broadband use in Washington, D.C. “Upon further examination, Washington DC, which was highlighted by FiveThirtyEight, appeared to be an outlier in the data,” Tolbert and Karen Mossberger, one of the researchers from ASU, said in the statement. And while it’s true that different methods of data collection can produce different outcomes, if all the sources are trying to measure the same underlying phenomenon of at-home broadband access, they should yield similar results.

After reviewing the quantitative differences in the ASU/Iowa data set, we were concerned. We lost further trust in it as we learned there were differing accounts of what Catalist said the data could be used for.

2) Problems with the Catalist-provided data

Based on our analysis, the ASU/Iowa data set’s problems stem in large part from the original data itself, although we don’t have access to it to test our hypothesis. Neither the academic researchers nor Catalist would share the purchased data with FiveThirtyEight.

The ASU/Iowa researchers purchased the 1 percent Catalist sample to obtain a handful of key variables. One of those, called HTIA, was used to create the county-level estimates of broadband use. Catalist’s codebook (a document that includes descriptions of the variables in the Catalist data) — which the ASU/Iowa researchers provided to FiveThirtyEight — explains HTIA this way: “Denotes interest in ‘high tech’ products and/or services as reported via Share Force. This would include personal computers and internet service providers. Blended with modeled data.” In an interview, Tolbert said the researchers were told by Catalist that the measure was a good proxy for broadband access. “We wouldn’t have spent $20,000 — which for us is a ton — if we weren’t told by Catalist that this was very good proxy for us of high-speed internet access,” Tolbert said. “I think we knew exactly what we were buying.”

Catalist disputes this version of the sale. “We have no record or recollection of describing this as a ‘proxy for broadband usage,’” Quinn said in her statement. “If there is any written evidence of anyone on our staff having made the claim that this was an appropriate proxy measure of broadband use, we have not seen it from our internal review nor have we been provided it by FiveThirtyEight.”

The HTIA variable that the researchers used came from a commercial source, InfoUSA, a company that tracks consumer habits and preferences for businesses. Quinn described HTIA as “a variable that we license from a commercial data provider (InfoUSA).” She said Catalist clients typically use commercial data like the HTIA variable “as part of a large suite of data to inform individual-level marketing efforts.”

“While we ‘stress test’ to evaluate how useful the data is for these types of efforts, we do not validate every piece of data for every possible use case,” she said. “For the HTIA variable, aggregate analysis is not the primary use case, so we did not stress test it for this use.” To create their data set, the researchers aggregated the individual-level responses for HTIA to the county level.
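To make the aggregation step concrete, here is a rough sketch of turning individual-level rows into county-level shares. We do not know how HTIA is coded or exactly how the researchers aggregated it, so the 0/1 indicator, the unweighted share and the function name `county_shares` are all assumptions made for illustration, and the rows are hypothetical.

```python
from collections import defaultdict

def county_shares(records):
    """Return {county: percent of sampled individuals flagged 1}.

    `records` is an iterable of (county, flag) pairs, where flag is 0 or 1.
    ASSUMPTION: a simple unweighted share; the researchers' actual method
    is not described in the article.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for county, flag in records:
        total[county] += 1
        flagged[county] += flag
    return {county: 100.0 * flagged[county] / total[county] for county in total}

# HYPOTHETICAL individual-level rows: (county name, 0/1 indicator).
sample = [
    ("County A", 1), ("County A", 0), ("County A", 1), ("County A", 1),
    ("County B", 0), ("County B", 1),
]

shares = county_shares(sample)
for county, pct in sorted(shares.items()):
    print(f"{county}: {pct:.1f} percent")
```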

In the course of our reporting, we were unable to confirm what goes into HTIA. InfoUSA declined to comment on that question. Quinn said “basic statistical checks and examinations of the data’s properties” should have been done ahead of any analysis. “Comparing the average HTIA value to historical county-level data from the Census would have clearly and quickly revealed that HTIA was not an appropriate choice for this research,” Quinn said.

Adie Tomer, who is a researcher with the Brookings Institution’s Metropolitan Policy Program and worked on a recently released report on broadband availability and subscription in U.S. neighborhoods, said that it was important to be skeptical when buying data and to ask sellers for a confidence interval — a statistical range that accounts for the uncertainty of estimates. “If they cannot tell you how they calibrate and validate, it is like the ultimate red flag,” Tomer said.

The reason that ASU/Iowa consulted Catalist in the first place is that information on broadband use in the U.S. is less than ideal. “What this discussion highlights is the need for better data on broadband adoption and use for both research and policy,” Tolbert and Mossberger wrote in their statement. “As of 2018, we do not have precise or accurate estimates of broadband adoption and use for the population.”

Tomer agreed. He said government data often leaves researchers with nothing more than a “hazy picture” of broadband usage.

FCC data is zoomed out, by nature. Rather than provide information on a household level, it offers data for census tracts. If you’re studying city neighborhoods — say you’re trying to figure out how internet use in poor neighborhoods is different from in rich ones — this lack of granularity can be a problem.

Steven Rosenberg, chief data officer for the FCC’s wireline competition bureau, explained that the commission does collect more granular information on broadband speed and what kinds of technologies are used to deploy internet — fiber-optic cables or fixed wireless dishes, for example — but doesn’t release it. That’s because the commission is sensitive to internet providers’ competitive interests. The commission is wary of “one carrier learning about another carrier’s market share or where their customers are,” Rosenberg said.

Because there is no official mandate that all Americans have access to high-speed broadband, Tomer said, internet service providers don’t have to be rigorous in their reporting of data. “There’s an extreme interest for the ISPs to be hiding their hand,” he said.

Tomer said the lack of data clarity from the government in this area of the economy means that researchers are unable to piece together an accurate picture of what kind of internet access Americans, rich and poor, have. “What we have to do, to be frank, we as researchers have a duty to flag where there are market failures that are impacting the American economy,” he said.

Footnotes

¹ The U.S. Census Bureau only makes information publicly available for areas with fewer than 20,000 people when it has five-year estimates.
² Washington, D.C., is a “county equivalent” entity, according to the U.S. Census Bureau.
³ The glossary that the researchers provide with their data set describes the data as an estimate of “the percent of the population with a home computer and high speed Internet access.” And in the course of our reporting, Tolbert described this data as measuring what share of individuals in a county have broadband internet at home, but the Catalist variable on which the researchers’ work is based suggests that it is a measure of household access.
⁴ We used the 2016 ACS one-year estimates.
⁵ The FCC doesn’t provide county-level data. Instead, it provides information on household broadband connections in every census tract. But rather than providing a single percentage for every tract, it gives a range, assigning each tract to a quintile (e.g., 0 to 20 percent, 20 to 40 percent, etc.). Following the Brookings Institution’s methodology in this analysis, we used those ranges to make three estimates of broadband access for each county: The first assumed that the lowest point of each census tract’s range was the correct percentage, another assumed the highest point was, and the third assumed the median point of the range was. We then aggregated the census tract numbers within each type of estimate (low, median, high) to get a range of possible broadband subscription rates for each county, weighting by the population size for each tract. The range for Washington, D.C., is 61.2 percent to 80.1 percent. The estimate of broadband access in Washington, D.C., using the median of census tract quintiles is 70.1 percent. All references in this article to FCC values are based on the median calculations.
⁶ s.d.=5.5; min/max=28.8/67.9.
⁷ ACS: s.d.=9.6; min/max=25.3/87.9. FCC (median quintile values): s.d.=14.3; min/max=7.9/89.6.
⁸ Social scientists often refer to this kind of common-sense evaluation as construct validity.
⁹ According to the ACS’s 2016 five-year estimates.
¹⁰ All correlation coefficients are significant at the p<0.01 level.
¹¹ The FCC has two data sets on broadband connections. For our analysis, we used the data specifying connections with at least 10 Mbps downstream and 1 Mbps upstream.
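
The county-level FCC estimates described in the footnote on census tract ranges can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: we read "median point of the range" as the midpoint between the range's low and high ends, the function name `county_range` is ours, and the tracts are hypothetical.

```python
def county_range(tracts):
    """Population-weighted low, midpoint and high broadband estimates for a county.

    `tracts` is a list of (population, range_low_pct, range_high_pct) tuples,
    one per census tract, where the range is the band the tract falls in.
    """
    total_pop = sum(pop for pop, _, _ in tracts)
    low = sum(pop * lo for pop, lo, _ in tracts) / total_pop
    high = sum(pop * hi for pop, _, hi in tracts) / total_pop
    # ASSUMPTION: the "median point" of a range is its midpoint.
    mid = sum(pop * (lo + hi) / 2 for pop, lo, hi in tracts) / total_pop
    return low, mid, high

# HYPOTHETICAL tracts for one county: (population, range low, range high).
tracts = [
    (4_000, 60, 80),
    (3_000, 80, 100),
    (3_000, 40, 60),
]

low, mid, high = county_range(tracts)
print(f"low {low:.1f}, midpoint {mid:.1f}, high {high:.1f} percent")
```

Because each tract is only known to fall within a 20-point band, the low and high estimates bracket the county's possible subscription rate, and the midpoint estimate serves as the single headline figure.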
