The Datasets We’re Looking At This Week
You’re reading Data Is Plural, a weekly newsletter of useful/curious datasets. Below you’ll find the July 27, 2022, edition, reprinted with permission at FiveThirtyEight.
Wildfires around the world, hospital price lists, monkeypox strains, startup factories and shark incidents.
Wildfires around the world. The Global Wildfire Information System, expanding on the work of the European Forest Fire Information System, uses satellite data to provide weekly and annual estimates of the number of fires and area burned in 200-plus countries. Its bulk data indicates monthly burned hectares by country, sub-country unit and land type from 2002 to 2019, as well as the boundaries of individual fires from 2001 to 2020. It also publishes gridded spatial data relating to fire danger forecasts, active fires, emissions and more. As seen in: El Diario’s analysis of forest fires in Spain. [h/t Olaya Argüeso Pérez]
Hospital price lists. Since January 2021, the U.S. government has required hospitals to publish machine-readable files listing the standard charges for all items and services they provide. But there’s no standard format for these price lists (also known as chargemasters) and no official central repository of them, and compliance has been lacking. Seeing those problems, the versioned-data platform DoltHub ran a paid crowdsourcing campaign earlier this year that pulled nearly 300 million prices from the published lists of roughly 1,800 hospitals into a single database. Related: Thanks to an earlier price transparency rule, California posts chargemasters for hundreds of hospitals, with records going back to 2011.
Monkeypox strains. Nextstrain, “an open-source project to harness the scientific and public health potential of pathogen genome data,” has begun analyzing genetic sequences from hundreds of monkeypox virus samples, the vast majority from infections in the past few months. The project provides metadata on each sample, including the date, country, variant and mutation metrics, as well as detailed sequencing data from NCBI Virus. Previously: Coronavirus variant data from outbreak.info (DIP 2021.03.10). [h/t Karsten Johansson]
Startup factories. Venture studios are firms that build and launch startups. Jim Moran’s Venture Studio Index tracks 260-plus of them, as well as 1,200-plus of the startups they’ve launched. The dataset, “collected manually by a team of researchers familiar with venture capital and the technology startup ecosystem,” includes founding years, locations, employee counts, relevant URLs and more.
Shark bites. Madeline Riley et al. describe the Australian Shark-Incident Database, which contains details about 1,100-plus shark bites (and attempted shark bites) between 1791 and early 2022, gathered by the Taronga Conservation Society using “questionnaires provided to shark-bite victims or witnesses, media reports,” and information from state agencies. Read more: “New dataset shows shark bites in Australia are increasing and researchers want to know why” (The Guardian).
Dataset suggestions? Criticism? Praise? Send feedback to email@example.com. Looking for past datasets? This spreadsheet contains them all. Visit data-is-plural.com to subscribe and to browse past editions.