Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods
Authors:
- Jieyu Zhao
- Tianlu Wang
- Mark Yatskar
- Vicente Ordonez
- Kai-Wei Chang
Primary Institutions:
- University of California, Los Angeles
- University of Virginia
- Allen Institute for Artificial Intelligence
Abstract Summary:
The paper introduces WinoBias, a new benchmark for evaluating gender bias in coreference resolution systems. The benchmark consists of Winograd-schema-style sentences in which gendered pronouns refer to people in various occupations. Evaluations of a rule-based, a feature-rich, and a neural coreference system reveal substantial gender bias: pro-stereotypical pronoun-entity links are resolved more accurately than anti-stereotypical ones, by an average difference of 21.1 in F1 score. The paper proposes a data augmentation approach, combined with existing word-embedding debiasing techniques, that mitigates this bias without adversely affecting performance on standard coreference benchmarks.
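The bias measure can be illustrated with a short sketch (not the authors' evaluation code): score the same system on the pro- and anti-stereotypical WinoBias subsets and report the F1 gap. Here `evaluate_f1` and `load_winobias_subset` are hypothetical placeholders for a coreference scorer and a data loader.

```python
def bias_gap(system, evaluate_f1, load_winobias_subset):
    # Hypothetical loaders/scorers; stand-ins for whatever harness is used.
    pro = load_winobias_subset("pro_stereotyped")    # e.g., "he" linked to "physician"
    anti = load_winobias_subset("anti_stereotyped")  # e.g., "she" linked to "physician"
    f1_pro = evaluate_f1(system, pro)
    f1_anti = evaluate_f1(system, anti)
    # A large positive gap means pronouns are resolved more accurately when the
    # link matches occupational gender stereotypes (the paper reports 21.1 on average).
    return f1_pro - f1_anti
```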
Introduction:
Coreference resolution, a core component of many NLP applications, identifies mentions in a text that refer to the same entity. Existing systems, whether rule-based, feature-rich, or neural, may unintentionally encode and propagate societal stereotypes present in their training data. The WinoBias benchmark is constructed to evaluate this phenomenon directly: each sentence requires linking a gendered pronoun to an occupation that either matches or contradicts that occupation's stereotypical gender, with occupations and their gender statistics drawn from U.S. Department of Labor data.
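As an illustration of the construction (an invented example for this summary, not taken from the dataset's generation scripts), the same template yields a pro-stereotypical and an anti-stereotypical variant by swapping the pronoun:

```python
# A single template instantiated with the pronoun that matches the occupation's
# stereotypical gender (pro) and with the opposite pronoun (anti).
TEMPLATE = "The physician hired the secretary because {pron} was overwhelmed with paperwork."

# "physician" is male-stereotyped in U.S. labor statistics, and the pronoun
# refers to the physician, so "he" gives the pro-stereotypical reading.
pro_sentence = TEMPLATE.format(pron="he")
anti_sentence = TEMPLATE.format(pron="she")
```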
Key Contributions:
- WinoBias Benchmark:
- Contains sentences involving 40 different occupations.
- Tests systems' performance on pro-stereotypical vs. anti-stereotypical pronoun resolutions.
- Average differences in F1 score performance indicate significant gender bias.
- Evaluation of Existing Systems:
- Three representative coreference resolution systems evaluated:
- Stanford Deterministic Coreference System
- Berkeley Coreference Resolution System
- UW End-to-End Neural Coreference Resolution System
- All three systems exhibited considerable gender bias, with the rule-based system being the most biased.
- Debiasing Methods:
- Data augmentation: gender swapping within the training data to balance gender representation (see the first sketch after this list).
- Bias correction in supporting resources: debiasing pre-trained word embeddings and balancing the gender statistics used to assign gender to noun phrases (see the second sketch after this list).
- Both approaches are effective individually and more so in combination, eliminating the bias measured by WinoBias without significantly harming performance on standard coreference benchmarks.
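First, a minimal sketch of the gender-swapping augmentation, assuming a rule-based swap dictionary. The dictionary is a small illustrative subset; the paper's procedure additionally anonymizes named entities and resolves ambiguous forms, and capitalization handling is omitted here.

```python
SWAPS = {
    "he": "she", "she": "he",
    "him": "her", "his": "her",
    "her": "him",            # simplification: "her" can also map to "his" depending on context
    "man": "woman", "woman": "man",
    "mr.": "mrs.", "mrs.": "mr.",
}

def gender_swap(tokens):
    """Return a copy of a tokenized sentence with gendered words swapped."""
    return [SWAPS.get(tok.lower(), tok) for tok in tokens]

def augment(corpus):
    """Train on the union of the original and gender-swapped sentences."""
    return corpus + [gender_swap(sentence) for sentence in corpus]
```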
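Second, a minimal sketch of the embedding-debiasing idea, following the "neutralize" step of hard debiasing (Bolukbasi et al., 2016), which the paper applies to word2vec vectors. A single he/she pair approximates the gender direction here; the original method derives it via PCA over several definitional pairs.

```python
import numpy as np

def gender_direction(emb):
    """`emb` is assumed to map words to 1-D numpy vectors."""
    g = emb["he"] - emb["she"]
    return g / np.linalg.norm(g)

def neutralize(vec, g):
    """Remove the projection of `vec` onto the unit gender direction `g`."""
    return vec - np.dot(vec, g) * g
```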
Practical and Theoretical Implications:
Practical Implications:
- The proposed debiasing procedures can be applied systematically to improve the fairness of NLP models beyond coreference resolution.
- Strategies such as gender swapping during data augmentation can serve as blueprints for mitigating bias against underrepresented classes in other datasets.
Theoretical Implications:
- The findings highlight the need to systematically address biases inherent in widely used NLP datasets and supporting resources.
- The research strengthens theoretical understanding of how biases are encoded and propagated through machine learning models, especially in structured prediction tasks.
Future Developments:
Several speculative areas for future research include:
- Extending bias detection methods to other demographic attributes such as race, age, or social class.
- Developing more sophisticated debiasing algorithms integrating causality and fairness principles within structured prediction tasks.
- Exploring the impact of multilingual and culturally diverse datasets to understand and mitigate biases in a global context.
Overall, the paper offers a rigorous evaluation of gender bias in coreference resolution systems and presents effective debiasing techniques, contributing substantially to the robustness and fairness of NLP applications.