The Datasets We're Looking At This Week
You’re reading Data Is Plural, a weekly newsletter of useful/curious datasets. Below you’ll find the Nov. 30, 2022, edition, reprinted with permission at FiveThirtyEight.
Pills, per-pupil spending, travelers’ coronavirus variants, Indonesia earthquake intensities and more roadkill.
Pills. From its launch in 2009 until its retirement last year, the National Library of Medicine’s Pillbox project collected and created 8,600-plus photographs of medical pills. The images, which are still available to download, are accompanied by a dataset that provides information on 83,000-plus pills’ shape, size, color, markings, dosage and other characteristics derived from drug labels. Related: The library’s DailyMed service provides frequently updated images and data from 140,000-plus labels submitted to the FDA for drugs and other regulated products. As seen in: Jon Keegan’s Pillbox overview in Beautiful Public Data. [h/t Giuseppe Sollazzo]
Per-pupil spending. The National Education Resource Database on Schools (“NERD$”) describes itself as the “first-ever national dataset of public K-12 spending by school.” Its researchers, based at Georgetown University, aggregate and standardize the expenditure disclosures that the Every Student Succeeds Act requires states to publish. You can explore and download the data they’ve processed for fiscal year 2019, including spending totals, enrollment counts and normalized figures that facilitate cross-state comparisons. For 2020 to 2022, you can access “the raw files we obtain from states while our team conducts validation checks and norms the data.” As seen in: “How much money do states spend on education?” (USAFacts). [h/t Douglas Hummel-Price]
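For readers who download the school-level files, the most basic normalized figure is spending divided by enrollment. The sketch below shows only that arithmetic. The field names and the three records are made up for illustration; they are not the NERD$ schema, and the project's own normalization for cross-state comparisons may involve more than this simple division.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchoolRecord:
    # Hypothetical fields; the real files may name and structure these differently.
    state: str
    school: str
    total_spending: float  # dollars for the fiscal year
    enrollment: int        # student count


def per_pupil_spending(record: SchoolRecord) -> Optional[float]:
    """Return dollars spent per enrolled student, or None if enrollment is not positive."""
    if record.enrollment <= 0:
        return None
    return record.total_spending / record.enrollment


# Invented example rows, not real data.
records = [
    SchoolRecord("AA", "Example Elementary", 4_800_000.0, 400),
    SchoolRecord("BB", "Sample Middle", 9_000_000.0, 600),
    SchoolRecord("BB", "Empty Annex", 150_000.0, 0),
]

for r in records:
    value = per_pupil_spending(r)
    label = "n/a" if value is None else f"${value:,.0f}"
    print(f"{r.state} {r.school}: {label} per pupil")
```

Returning None for schools with zero reported enrollment keeps those rows from producing a division error or a misleading figure.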
Travelers’ coronavirus variants. In the past year, the CDC’s Traveler-Based Genomic Surveillance program has collected over 60,000 voluntary nasal swabs from people disembarking from international flights at four major U.S. airports. The agency uses the samples as an “early warning system” to detect emerging SARS-CoV-2 variants and publishes weekly metrics that include participation counts, positivity rates (per pooled sample) and variant distributions. Read more: An interview with two private industry experts working on the program, by the COVID-19 Data Dispatch’s Betsy Ladyzhets.
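To make the two headline metrics concrete: a positivity rate "per pooled sample" is the share of pools that test positive, not the share of individual swabs, and a variant distribution is the share of sequenced results assigned to each lineage. The sketch below assumes those plain definitions. The function names, the input shapes and the placeholder values are hypothetical and are not taken from the CDC's published files.

```python
from collections import Counter
from typing import Dict, List, Optional


def pooled_positivity_rate(pool_results: List[bool]) -> Optional[float]:
    """Share of pooled samples that tested positive; None if no pools were tested."""
    if not pool_results:
        return None
    # True counts as 1 and False as 0, so the sum is the number of positive pools.
    return sum(pool_results) / len(pool_results)


def variant_shares(lineages: List[str]) -> Dict[str, float]:
    """Share of sequenced results assigned to each lineage label."""
    counts = Counter(lineages)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Invented inputs for illustration only.
week_pools = [True, False, True, False]
week_lineages = ["lineage_a", "lineage_a", "lineage_b", "lineage_c"]

print(pooled_positivity_rate(week_pools))
print(variant_shares(week_lineages))
```

Because each pool combines swabs from several travelers, a pool-level rate cannot be read directly as the share of infected individuals.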
Indonesia earthquake intensities. Gempa Nusantara, a database compiled by Stacey S. Martin et al., uses historical documents to catalog over 7,300 “macroseismic effects” of 1,200 earthquakes near Indonesia during a four-century span, from 1546 to 1950. It provides summaries of the local reports and categorizes the effects according to the European Macroseismic Scale, which focuses on the intensity of ground-shaking and potential impacts on buildings and terrain.
More roadkill. Florian Heigl et al. have compiled a pair of datasets containing 15,000-plus reports of vertebrate roadkill from 2014 to 2020, submitted by 900-plus people through a phone app. The datasets differ in identification confidence, but both provide locations, dates and taxonomic classifications. Although the records span 40-plus countries, the majority come from Austria, where the project is now focused. Previously: Andean roadkill (DIP 2021.07.07).