
Predicting Race and Ethnicity From the Sequence of Characters in a Name (1805.02109v2)

Published 5 May 2018 in stat.AP and stat.ML

Abstract: To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to infer race and ethnicity from names is by relying on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: 1. it only contains last names, 2. it only includes popular last names, and 3. it is updated once every 10 years. To provide better generalization, and higher accuracy when first names are available, we model the relationship between characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory works best with out-of-sample accuracy of .85. The best-performing last-name model achieves out-of-sample accuracy of .81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.


Summary

  • The paper introduces novel character-based sequence models, including LSTM and Transformers, which integrate first and last names to enhance race and ethnicity prediction.
  • The paper compares these sequence models with KNN, random forests, and gradient boosted trees, all trained on names converted to a normalized form, to address the coverage limitations of census surname lists.
  • The paper demonstrates practical applications by estimating the racial composition of political contributions and of news coverage.

Predicting Race and Ethnicity from Names: An Analysis of Character-Based Models

Studying racial inequality and fairness in domains such as healthcare, political contributions, and news coverage often requires inferring race and ethnicity from names. Traditional methods that rely on the Census Bureau's list of popular last names have three limitations: they cover last names only, they include only popular names, and the list is updated once every 10 years. This paper predicts race and ethnicity from both first and last names, using character-based models to improve prediction accuracy.

The research draws on several datasets, chiefly the Florida Voter Registration data, which provides self-reported race information for approximately 15 million voters. The paper also uses Census Popular Last Name data and North Carolina Voter Registration data for modeling. Compared with previous approaches, the models generalize better to names outside the popular-name lists and are more accurate, particularly when first names are included.

Methodology

The paper explores multiple models, ranging from simple K-nearest neighbor (KNN) approaches to complex Long Short-Term Memory (LSTM) and Transformer models. Specifically:

  • KNN models: Use edit distance and cosine distance to find similar names, which helps when a name is not covered in popular-name databases or may be misspelled.
  • Random Forests & Gradient Boosted Trees: Use a 'Bag of Characters' representation of each name as input features.
  • LSTM: Reads a name as a sequence of characters, so predictions can depend on character order.
  • Transformer models: Also operate on character sequences, using positional encodings to represent character order.
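
To make the difference between these inputs concrete, the following is a minimal sketch, not the paper's implementation. It shows a KNN classifier over Levenshtein edit distance (cosine distance is omitted), plus a bag-of-characters encoding next to an ordered character encoding. The function names, the toy names, the placeholder labels "A" and "B", and the use of single-character counts are all assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def knn_predict(name, labeled_names, k=3):
    """Majority label among the k labeled names closest in edit distance."""
    nearest = sorted(labeled_names,
                     key=lambda pair: edit_distance(name, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def bag_of_characters(name):
    """Character counts; order is discarded (unigram counts are an assumption)."""
    return Counter(name)


def char_sequence(name):
    """Ordered integer codes, as a sequence model would consume (0 = unknown)."""
    return [ALPHABET.find(c) + 1 for c in name]


# Hypothetical toy data with placeholder labels, for illustration only.
TRAIN = [("smith", "A"), ("smyth", "A"), ("smithe", "A"),
         ("garcia", "B"), ("garza", "B"), ("garcias", "B")]

# A misspelled name still lands near its correctly spelled neighbors.
print(knn_predict("smiht", TRAIN))
# Same bag of characters, different sequences.
print(bag_of_characters("adam") == bag_of_characters("dama"))
print(char_sequence("adam"), char_sequence("dama"))
```

The last two lines illustrate why sequence models can separate names that a pure bag-of-characters model cannot: the two strings contain the same characters but in a different order.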

Before modeling, names are converted to a normalized format and grouped by last name or by full name. Each model then predicts the modal (most common) race for a name, which simplifies the modeling task.
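
The grouping step can be sketched as follows. This is an illustration under stated assumptions: the summary does not give the exact normalization rules, so lowercasing and whitespace collapsing are stand-ins, and the helper names and placeholder labels are hypothetical.

```python
from collections import Counter, defaultdict


def normalize(name: str) -> str:
    """Lowercase, trim, and collapse whitespace (assumed normalization rules)."""
    return " ".join(name.lower().split())


def modal_label_by_name(records):
    """Map each normalized name to its most frequent label.

    Ties go to the label seen first, which is how Counter.most_common
    orders elements with equal counts.
    """
    counts = defaultdict(Counter)
    for name, label in records:
        counts[normalize(name)][label] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}


# Hypothetical toy records with placeholder labels.
RECORDS = [("  Smith ", "A"), ("SMITH", "A"), ("smith", "B"), ("Lee", "B")]
print(modal_label_by_name(RECORDS))
```

Collapsing each name to a single modal label turns the problem into ordinary multi-class classification, at the cost of discarding how mixed the labels for a given name are.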

Results

The LSTM models perform best, with out-of-sample accuracy of 0.85 for the full-name model and 0.81 for the last-name model. Including first names adds predictive value: accuracy in identifying Non-Hispanic Blacks rises to 74%, compared to 21% when only last names are used. These models outperform the traditional census-based method and the other machine learning approaches evaluated.
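
Overall accuracy and per-group accuracy can differ sharply when groups are imbalanced, which is why the per-group figure is reported separately. The sketch below treats "accuracy in identifying" a group as per-group recall, which is an interpretation rather than something the summary states. The group labels and toy predictions are invented for illustration.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_group_recall(y_true, y_pred):
    """For each true group, the fraction of its members predicted correctly."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return {g: hits[g] / totals[g] for g in totals}


# Hypothetical toy labels: group g1 is four times as common as g2.
Y_TRUE = ["g1"] * 8 + ["g2"] * 2
Y_PRED = ["g1"] * 8 + ["g1", "g2"]

print(overall_accuracy(Y_TRUE, Y_PRED))
print(per_group_recall(Y_TRUE, Y_PRED))
```

In this toy case the overall accuracy is high while only half of the smaller group is identified correctly, the same kind of gap the last-name-only figure points to.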

Applications

Two practical applications illustrate the model's utility:

  1. Political Contributions: Applied to a 2014 campaign contribution database, the models estimate that 89.5% of contributions were made by Non-Hispanic Whites, indicating a significant racial imbalance in financial political influence.
  2. News Coverage: Applied to a dataset of news articles from major media outlets, the models estimate that Non-Hispanic Whites are overrepresented both among authors (78%) and in mentions (73.5%), signaling potential biases in media diversity.
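
One way such shares could be computed from model output is to weight each contribution by the predicted probability of each group. The summary does not say how the paper aggregates predictions, so this is only a sketch of one plausible approach. The function name, the group labels, and the amounts are hypothetical.

```python
from collections import defaultdict


def expected_shares(rows):
    """Share of total dollars attributed to each group.

    Each row is (amount, probs), where probs maps a group to the model's
    predicted probability for that donor's name.
    """
    totals = defaultdict(float)
    for amount, probs in rows:
        for group, p in probs.items():
            totals[group] += amount * p
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no contributions to aggregate")
    return {group: t / grand_total for group, t in totals.items()}


# Hypothetical toy data: two donors with made-up predicted probabilities.
TOY_ROWS = [
    (100.0, {"g1": 0.9, "g2": 0.1}),
    (300.0, {"g1": 0.5, "g2": 0.5}),
]
print(expected_shares(TOY_ROWS))
```

Weighting by probability rather than assigning each donor to the single most likely group keeps uncertain predictions from being counted as certain. Assigning by most likely group is the simpler alternative.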

Discussion

The paper evaluates character-based models for race and ethnicity prediction and shows that they improve on traditional census-based methods. The LSTM models, alongside the KNN approaches, provide the improved accuracy needed in applications where race must be inferred from names. Where a full list of names with race data is unavailable, these models offer a way to cover less common names and to use the information carried by first names.

Further research could explore whether adding synthetic data to the training set improves generalizability. Continued progress in natural language processing may also allow more precise prediction of race and ethnicity from names.