Predicting Race and Ethnicity From the Sequence of Characters in a Name (1805.02109v2)
Abstract: To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to do so is to rely on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: (1) it contains only last names, (2) it includes only popular last names, and (3) it is updated only once every 10 years. To generalize better, and to achieve higher accuracy when first names are available, we model the relationship between the characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory (LSTM) works best, with an out-of-sample accuracy of 0.85. The best-performing last-name model achieves an out-of-sample accuracy of 0.81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.
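To make the idea of modeling "the sequence of characters in a name" concrete, here is a minimal sketch of how a name could be turned into a fixed-length integer sequence of the kind a sequence model such as an LSTM consumes. The use of character bigrams, the function names (`char_bigrams`, `build_vocab`, `encode`), the padding length, and the example names are all illustrative assumptions, not details taken from the abstract.

```python
from collections import Counter


def char_bigrams(name):
    """Split a name into overlapping two-character chunks (lowercased)."""
    s = name.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def build_vocab(names, min_count=1):
    """Map each bigram seen at least min_count times to an integer id.

    Id 0 is reserved for padding and id 1 for bigrams not in the vocabulary.
    """
    counts = Counter(bg for n in names for bg in char_bigrams(n))
    kept = sorted(bg for bg, c in counts.items() if c >= min_count)
    return {bg: i + 2 for i, bg in enumerate(kept)}


def encode(name, vocab, max_len=20):
    """Turn a name into a list of bigram ids, truncated or zero-padded to max_len."""
    ids = [vocab.get(bg, 1) for bg in char_bigrams(name)][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical training names in "last first" order (an assumed convention).
names = ["smith john", "garcia maria"]
vocab = build_vocab(names)
encoded = encode("smith john", vocab, max_len=12)
```

Each encoded name could then be passed through an embedding layer and a recurrent layer, with a softmax over race and ethnicity categories as the output. Unlike a lookup in a fixed surname list, an encoding like this still produces an input for a name that never appeared in the training data.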