Predicting Race and Ethnicity From the Sequence of Characters in a Name (1805.02109v2)

Published 5 May 2018 in stat.AP and stat.ML

Abstract: To answer questions about racial inequality and fairness, we often need a way to infer race and ethnicity from names. One way to infer race and ethnicity from names is by relying on the Census Bureau's list of popular last names. The list, however, suffers from at least three limitations: 1. it only contains last names, 2. it only includes popular last names, and 3. it is updated once every 10 years. To provide better generalization, and higher accuracy when first names are available, we model the relationship between characters in a name and race and ethnicity using various techniques. A model using Long Short-Term Memory works best with out-of-sample accuracy of .85. The best-performing last-name model achieves out-of-sample accuracy of .81. To illustrate the utility of the models, we apply them to campaign finance data to estimate the share of donations made by people of various racial groups, and to news data to estimate the coverage of various races and ethnicities in the news.

Summary

The paper introduces novel character-based sequence models, including LSTM and Transformers, which integrate first and last names to enhance race and ethnicity prediction.
The paper also evaluates KNN, random forests, and gradient boosted trees on normalized name representations to address the limitations of the Census Bureau's last-name list.
The paper demonstrates practical applications by estimating the racial and ethnic composition of political contributions and of news coverage.

Predicting Race and Ethnicity from Names: An Analysis of Character-Based Models

Racial inequality and fairness are significant concerns within various societal domains, ranging from healthcare to political contributions and news coverage. A crucial aspect in addressing these issues is the ability to infer race and ethnicity from names. Traditional methods, such as those utilizing the Census Bureau's list of popular last names, present limitations, including a focus on last names only, a bias towards popular names, and infrequent updates. This paper presents advanced methodologies for predicting race and ethnicity from both first and last names, leveraging character-based models to improve prediction accuracy.

The research draws on several datasets, chiefly the Florida Voting Registration data, which provides self-reported race information for approximately 15 million voters. The paper also uses Census Popular Last Name data and North Carolina Voter Registration data for modeling. It addresses the limitations of previous approaches by demonstrating improved generalization across racial groups and higher accuracy, particularly when first names are included.

Methodology

The paper explores multiple models, ranging from simple K-nearest neighbor (KNN) approaches to complex Long Short-Term Memory (LSTM) and Transformer models. Specifically:

KNN models: Utilize edit-distance and cosine distance metrics, helping predict race and ethnicity when dealing with names not covered in popular databases or when spelling errors may be present.
Random Forests & Gradient Boosted Trees: Employ a 'Bag of Characters' representation for names, providing a statistical basis for learning ethno-racial relationships.
LSTM: Introduces a sequence-based prediction mechanism, capturing the character-order dynamics in names.
Transformer models: Enhance sequence-based learning, focusing on positional encoding of characters.
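
The representations named in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the alphabet, the padding length, the function names, and the toy labels "A" and "B" are all assumptions made for the example.

```python
from collections import Counter
import string

# Assumed alphabet: lowercase letters plus space (hypothetical choice).
ALPHABET = string.ascii_lowercase + " "


def bag_of_characters(name):
    """Count how often each alphabet character occurs in the name."""
    counts = Counter(name.lower())
    return [counts.get(ch, 0) for ch in ALPHABET]


def char_indices(name, max_len=20):
    """Map characters to 1-based integer ids, padded with 0.

    Ordered ids like these are a common input format for sequence models
    such as an LSTM; max_len=20 is an assumed value.
    """
    ids = [ALPHABET.index(ch) + 1 for ch in name.lower() if ch in ALPHABET]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def knn_predict(name, labeled_names, k=3):
    """Return the majority label among the k names closest in edit distance."""
    nearest = sorted(labeled_names,
                     key=lambda pair: edit_distance(name, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data with placeholder labels; not taken from the paper.
toy = [("smith", "A"), ("smyth", "A"), ("smithe", "A"),
       ("garcia", "B"), ("garza", "B")]
prediction = knn_predict("smithh", toy, k=3)
```

The bag-of-characters vector discards character order, which is the information the sequence models are meant to recover; the edit-distance neighbor search shows why KNN can tolerate misspellings of names it has seen.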

Before modeling, names are converted to a normalized format and grouped by last name or by full name. Each distinct name is then treated as a single example labeled with its modal (most frequent) race, which simplifies the task to predicting the modal race for a name.
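
A rough sketch of this preprocessing step follows. The summary does not spell out the paper's normalization rules, so the ones used here (lowercasing, dropping non-letter characters, collapsing whitespace) are assumptions, as are the function names and the placeholder labels.

```python
from collections import Counter, defaultdict


def normalize(name):
    """Lowercase, keep only letters and spaces, and collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower()
                      if ch.isalpha() or ch.isspace())
    return " ".join(cleaned.split())


def modal_labels(records):
    """Group (name, label) records by normalized name and keep the
    most frequent label for each distinct name."""
    counts = defaultdict(Counter)
    for name, label in records:
        counts[normalize(name)][label] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}


# Placeholder records; the labels are illustrative only.
records = [("Lee", "A"), ("LEE ", "B"), ("lee", "A")]
labels = modal_labels(records)
```

Collapsing each name to one label shrinks the training set to one row per distinct name, at the cost of discarding how evenly a name is split across groups.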

Results

The LSTM models perform best, achieving an out-of-sample accuracy of 0.85 for the full-name model and 0.81 for the last-name model. Including first names improves predictions, most notably for Non-Hispanic Blacks, where accuracy rises to 74% from 21% with last names only. These models outperform the traditional and other machine learning approaches considered and remain robust across racial categories.
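
Overall accuracy and per-group figures can diverge sharply, as the numbers above suggest. The sketch below shows one way to compute both; it assumes the per-group figure is a recall-style measure (share of a group's members predicted correctly), which the summary does not confirm, and it uses made-up labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of all examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """For each true label, the share of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Illustrative labels only: a majority group "A" and a minority group "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]

overall = accuracy(y_true, y_pred)
by_group = per_class_recall(y_true, y_pred)
```

In this toy case the overall accuracy is high while half of group "B" is misclassified, which is why per-group results matter when classes are imbalanced.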

Applications

Two practical applications illustrate the model's utility:

Political Contributions: Analysis of a 2014 campaign contribution database reveals that 89.5% of contributions were made by Non-Hispanic Whites, indicating significant racial imbalances in financial political influence.
News Coverage: Evaluation of a dataset comprising news articles from major media outlets shows an overrepresentation of Non-Hispanic Whites, both among authors (78%) and in mentions (73.5%), signaling potential biases in media diversity.
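
One plausible way to turn name-level predictions into aggregate shares like those above is to weight each donation by the predicted probability of each group. This is an assumption for illustration; the summary does not describe the paper's exact aggregation, and the function name and inputs are hypothetical.

```python
def estimated_shares(amounts, probabilities):
    """Estimate each group's share of total dollars.

    amounts: donation amounts, one per donor.
    probabilities: one dict per donor mapping group label to the
        predicted probability that the donor belongs to that group.
    """
    totals = {}
    for amount, probs in zip(amounts, probabilities):
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + amount * p
    grand_total = sum(totals.values())
    return {label: value / grand_total for label, value in totals.items()}


# Made-up inputs with placeholder group labels.
amounts = [100.0, 300.0]
probabilities = [{"A": 1.0, "B": 0.0}, {"A": 0.5, "B": 0.5}]
shares = estimated_shares(amounts, probabilities)
```

An alternative is to assign each donor to the single most likely group before summing; probability weighting keeps the model's uncertainty in the aggregate instead of rounding it away.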

Discussion

The paper provides a comprehensive evaluation of character-based models for race and ethnicity prediction, offering methodological advances over traditional census-based methods. The LSTM models, alongside the KNN approaches, improve accuracy in applications that depend on name-based racial inference. Because no complete list of names is available, these models offer a way to generalize beyond popular names and to use the information carried by first names.

Further research could explore integrating synthetic data into training models to potentially enhance generalizability. Additionally, the ongoing development of AI presents opportunities for refining natural language processing tools for even greater precision in ethnic and racial prediction from names, contributing to fairer societal frameworks.