The Datasets We’re Looking At This Week
You’re reading Data Is Plural, a weekly newsletter of useful/curious datasets. Below you’ll find the Aug. 10, 2022, edition, reprinted with permission at FiveThirtyEight.
Young adult migration, social capital, CPUs and GPUs, trade in post-unification Italy and “Star Trek” computer talk.
Young adult migration. Researchers at Harvard University and the U.S. Census Bureau have linked federal tax filings, census records and other government data to track the migration patterns of young U.S. residents. Specifically, for each person born in the U.S. between 1984 and 1992, the researchers compared where they lived at age 16 to where they lived at age 26. The project’s public dataset counts the approximate number who moved to/from each pair of commuting zones — overall and disaggregated by race/ethnicity and parental income level. Read more: A reporting recipe from Brent Jones and Eric Schmid, who analyzed the data for St. Louis Public Radio.
Social capital. Using data on billions of Facebook connections and group memberships, Raj Chetty et al.’s Social Capital Atlas calculates three metrics for U.S. counties, ZIP codes, high schools and colleges: economic connectedness (friendships between low-income and high-income users), cohesiveness (how often users’ friends are also friends with one another) and civic engagement (membership in volunteer groups). Read more: The Upshot explores and explains the project’s findings. Previously: Measurements of social connectedness (DIP 2020.09.30) and economic mobility (DIP 2019.06.12) from some of the same researchers. [h/t Johannes Stroebel]
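To make the cohesiveness idea concrete, here is a minimal sketch of one common way to measure "how often users’ friends are also friends with one another": the local clustering coefficient. This is an illustration only and may not match the researchers’ exact calculation. The function name, the toy friendship graph and the people in it are all hypothetical.

```python
from itertools import combinations


def local_cohesiveness(friends, user):
    """Share of `user`'s friend pairs who are also friends with each other.

    `friends` maps each person to the set of people they are friends with
    (an undirected graph, so every friendship appears in both directions).
    """
    pairs = list(combinations(sorted(friends[user]), 2))
    if not pairs:
        return 0.0  # fewer than two friends, so there are no pairs to check
    linked = sum(1 for a, b in pairs if b in friends[a])
    return linked / len(pairs)


# Hypothetical toy graph, not drawn from the Social Capital Atlas.
friends = {
    "ana": {"ben", "cy", "dee"},
    "ben": {"ana", "cy"},
    "cy": {"ana", "ben"},
    "dee": {"ana"},
}

for person in friends:
    print(person, round(local_cohesiveness(friends, person), 3))
```

In this toy graph, only one of ana’s three friend pairs (ben and cy) are themselves friends, so her score is one-third. Averaging such scores over the users in an area is one plausible way to get a place-level figure.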
CPUs and GPUs. Yifan Sun et al., seeking to test Moore’s Law and Dennard scaling, “have collected data for all CPU and GPU products (to our best knowledge) that have been released by Intel, AMD […] and NVIDIA since January 1st, 2000.” The authors’ dataset and charting tool, describing 4,800-plus processors through early 2021, use information gathered from TechPowerUp, WikiChip and company websites. They identify each product’s vendor, release date, transistor count, base frequency and other details. [h/t matt_d]
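As a rough illustration of what testing Moore’s Law against release dates and transistor counts can look like, the sketch below fits a straight line to log2(transistor count) versus year and inverts the slope to get a doubling time. It is not the authors’ method, and the data points are synthetic (built to double every two years), not taken from their dataset. The function name is hypothetical.

```python
import math


def doubling_time_years(points):
    """Estimate years per doubling from (year, transistor_count) pairs.

    Fits an ordinary least-squares line to log2(count) against year;
    the slope is doublings per year, so its inverse is the doubling time.
    """
    xs = [year for year, _ in points]
    ys = [math.log2(count) for _, count in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return variance / covariance


# Synthetic example: counts that double exactly every two years.
synthetic = [(2000 + 2 * i, 1_000_000 * 2**i) for i in range(6)]

print(round(doubling_time_years(synthetic), 2))
```

With real product data, the fitted doubling time would depend heavily on which processors are included and how release dates are bucketed, so any single number should be read with caution.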
Trade in post-unification Italy. The Lost Highway project, a collaboration between researchers at four Italian universities, aims “to test a number of broad historical conjectures about the long-term shortcomings of the Italian development path by collecting as much quantitative evidence as possible.” Its Bankit-FTV database provides annual import and export totals for 1862 to 1939, by product and trading partner, with 6,000-plus product descriptions standardized into approximately 600 commodity groupings. [h/t Francesco Piccinelli Casagrande]
“Tea, Earl Grey, hot.” Combing the full transcripts of “Star Trek: The Next Generation,” Benett Axtell and Cosmin Munteanu found more than 1,000 lines of dialogue between the show’s characters and the starship Enterprise’s computer. Their dataset of these interactions lists each line’s phrasing, character, interaction type, stage directions and more. [h/t Christian A. Gebhard + Sara Stoudt + Tidy Tuesday]
Dataset suggestions? Criticism? Praise? Send feedback to email@example.com. Looking for past datasets? This spreadsheet contains them all. Visit data-is-plural.com to subscribe and to browse past editions.