Gender Bias in Coreference Resolution
- Gender bias in coreference resolution is defined by systematic misalignment between gendered pronouns and occupations, revealing underlying societal stereotypes.
- Empirical benchmarks using datasets like WinoBias and Winogender demonstrate significant disparities in model accuracy between pro-stereotypical and anti-stereotypical cases.
- Mitigation strategies such as counterfactual data augmentation and embedding debiasing effectively reduce bias gaps while preserving model performance.
Gender bias in coreference resolution refers to systematic errors or disparities whereby NLP systems resolve co-referring expressions (e.g., pronouns, occupations) in ways that favor or disadvantage particular genders, typically mirroring historical or societal stereotypes. This bias profoundly affects not only overall model fairness and robustness but also the development of equitable downstream applications. Modern research has produced numerous diagnostic datasets, metrics, and mitigation techniques, converging on the finding that gender bias is pervasive across rule-based, statistical, neural, and LLM architectures.
1. Benchmarks and Metrics for Measuring Gender Bias in Coreference Resolution
Standardized evaluation of gender bias is enabled by specialized datasets and formally defined metrics. Early diagnostic corpora such as WinoBias and Winogender Schemas employ synthetically controlled, Winograd-style sentences that systematically vary pronoun gender and occupation to isolate pronoun-occupational associations. In WinoBias (Zhao et al., 2018), for example, each sentence is instantiated twice: once with a pro-stereotypical pronoun–occupation pairing (e.g., "The nurse ... she") and once with the anti-stereotypical reversal ("The nurse ... he"), matched for syntactic and semantic context.
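To make the pairing concrete, the following is a minimal sketch of how such pro-/anti-stereotypical minimal pairs can be instantiated from a template; the template wording, occupation list, and stereotype lookup are illustrative placeholders rather than the released WinoBias data.

```python
# Minimal sketch: instantiating pro- and anti-stereotypical minimal pairs
# from a WinoBias-style template. The template text and the stereotype
# lookup below are illustrative placeholders, not the released dataset.

TEMPLATE = "The {occupation} yelled at the {participant} because {pronoun} was upset."

# Hypothetical stereotype lookup (direction of the societal stereotype).
STEREOTYPED_PRONOUN = {"nurse": "she", "developer": "he"}
FLIP = {"she": "he", "he": "she"}

def make_pair(occupation, participant):
    """Return a (pro_stereotypical, anti_stereotypical) sentence pair."""
    pro_pronoun = STEREOTYPED_PRONOUN[occupation]
    pro = TEMPLATE.format(occupation=occupation, participant=participant,
                          pronoun=pro_pronoun)
    anti = TEMPLATE.format(occupation=occupation, participant=participant,
                           pronoun=FLIP[pro_pronoun])
    return pro, anti

if __name__ == "__main__":
    print(make_pair("nurse", "patient"))
    print(make_pair("developer", "designer"))
```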
Bias is typically quantified as the absolute difference in coreference F₁-score or accuracy between pro-stereotypical and anti-stereotypical conditions, $\Delta = |F_1^{\text{pro}} - F_1^{\text{anti}}|$; a perfectly impartial system yields $\Delta = 0$.
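As a worked illustration of this metric, the sketch below computes the gap $\Delta$ from a model's scores on the two splits; the `evaluate` callable and the toy accuracy function are hypothetical stand-ins for an actual coreference scorer.

```python
# Minimal sketch: computing the pro-/anti-stereotypical bias gap Delta from
# per-condition scores. `evaluate` is a hypothetical stand-in for whatever
# coreference scorer (accuracy or F1) is being used.

def bias_gap(model, pro_examples, anti_examples, evaluate):
    """Absolute difference in score between pro- and anti-stereotypical splits."""
    score_pro = evaluate(model, pro_examples)    # e.g., accuracy or F1 on the pro split
    score_anti = evaluate(model, anti_examples)  # same metric on the anti split
    return abs(score_pro - score_anti)

# Example with a toy accuracy function over (prediction, gold) pairs:
def accuracy(_model, examples):
    return sum(pred == gold for pred, gold in examples) / len(examples)

pro = [(1, 1), (1, 1), (0, 1)]   # model mostly right on pro-stereotypical items
anti = [(0, 1), (1, 1), (0, 1)]  # model struggles on anti-stereotypical items
print(bias_gap(None, pro, anti, accuracy))  # 0.333... ; 0.0 would mean impartial
```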
The Winogender corpus (Rudinger et al., 2018) applies similar controls to occupational domains but includes minimal-pair templates differing only by pronoun gender, supporting fine-grained bias attribution. GAP (Webster et al., 2018) introduces a large, naturally occurring, gender-balanced corpus of ambiguous pronoun references, specifically constructed to measure real-world and model-induced biases in more diverse contexts. However, subsequent analyses have demonstrated distributional imbalances in candidate count and pronoun–antecedent distance for gender subgroups, necessitating post-hoc sample weighting to isolate “model bias” from “dataset bias” (Kocijan et al., 2020).
Recent methodological advances introduce instance-level and counterfactual evaluations. For example, "Counter-GAP" (Xie et al., 2023) proposes minimally-distant quadruple pairs representing all permutations of candidate name order and pronoun gender, and defines bias as the difference in inconsistency rates across versus within gender permutations. Benchmarking on “latent” (second-order) biases—i.e., those persisting without overt gender cues—emerged with SoWinoBias (Dawkins, 2021), which probes whether coreference models exhibit stereotypical performance disparities even when explicit gender markers are absent.
With the advent of WinoPron (Gautam et al., 9 Sep 2024), case balance (nominative, accusative, possessive), template rigor, and neopronoun inclusion are enforced, and a continuous pronominal bias score is introduced. The field now employs metrics such as aggregate occupation bias (AOB), consistency, and intersectional confidence disparity (Khan et al., 9 Aug 2025) to audit model performance along increasingly fine social axes.
2. Empirical Evidence: Gender Bias in Coreference Systems
Substantial empirical work demonstrates persistent gender bias in both traditional and neural coreference models. In WinoBias and Winogender evaluations, off-the-shelf systems—rule-based [Lee et al., 2011], feature-rich [Durrett & Klein, 2013], and neural [Clark & Manning, 2016]—all exhibit significantly higher accuracy on pro-stereotypical than anti-stereotypical pronoun–occupation pairs. For instance, neural models display a large pro-stereotypical advantage on type-1 (semantically cued) sentences and a smaller one on type-2 (syntactically cued) sentences, while the rule-based system shows comparably large disparities (Zhao et al., 2018). Minimal-pair divergences in decision rates approach 13% for neural and 68% for rule-based systems (Rudinger et al., 2018).
On real-world datasets such as GAP, models trained on OntoNotes or WikiCoref consistently yield F₁-scores for feminine pronouns 5–25% lower than for masculine pronouns across all major approaches (Webster et al., 2018). When evaluated on the large-scale, naturally occurring BUG dataset (Levy et al., 2021), SpanBERT achieves F₁ = 65.1 but shows a masculine–feminine gap of 10.2 points and a stereotype-alignment (pro- vs. anti-stereotypical) gap of 6.0 points.
Contextualized embedding-based models (e.g., ELMo) further amplify gender bias. Training on corpora with a pronounced male pronoun skew (5.3M male vs. 1.6M female tokens) leads to geometric and functional embedding asymmetries, measurable via principal component analysis and classification accuracies (Zhao et al., 2019). Consequently, state-of-the-art coreference models can display up to 30 percentage-point disparities between pro- and anti-stereotypical splits.
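The geometric analysis alluded to above can be approximated as follows: estimate a gender direction from paired pronoun embeddings and measure how strongly other embeddings project onto it. This is a sketch under simplifying assumptions; the random arrays stand in for real contextualized embeddings (e.g., ELMo layers), and the exact procedure in the cited work may differ.

```python
import numpy as np

# Minimal sketch: estimate a "gender direction" from paired he/she contextual
# embeddings and measure how much another embedding leans along it. The
# embeddings here are random placeholders, not real model outputs.

rng = np.random.default_rng(0)
dim = 50
he_vecs = rng.normal(size=(100, dim))                         # "he" in context
she_vecs = he_vecs + rng.normal(0.1, 0.05, size=(100, dim))   # paired "she" contexts

# The principal direction of the paired differences approximates a gender axis.
diffs = he_vecs - she_vecs
diffs -= diffs.mean(axis=0)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
gender_dir = vt[0] / np.linalg.norm(vt[0])

def gender_projection(vec):
    """Signed component of an embedding along the estimated gender axis."""
    return float(np.dot(vec, gender_dir))

occupation_vec = rng.normal(size=dim)   # placeholder occupation embedding
print(gender_projection(occupation_vec))
```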
Recent studies on LLMs confirm that even in the absence of explicit gender triggers, both confidence and accuracy exhibit marked intersectional disparities, with models expressing greatest uncertainty or failure for doubly minoritized (e.g., trans-woman or anti-stereotypical) identities (Khan et al., 9 Aug 2025).
3. Sources and Mechanisms of Gender Bias
Gender bias in coreference emerges from both historical data imbalances and architectural or optimization dynamics:
- Training corpora such as OntoNotes exhibit male mention ratios of 80%, resulting in overrepresentation of masculine gender cues in model learning and statistics (Zhao et al., 2018, Rudinger et al., 2018).
- Pretrained embeddings encode gendered directions aligned with societal stereotypes, propagating bias to downstream coreference layers; even after debiasing, residual associations can be detected via WEAT and GIPE metrics (Dawkins, 2021) — a WEAT sketch follows this list.
- Model optimization exacerbates bias. As network loss decreases, aggregate occupation bias (AOB) increases monotonically; models “amplify” existing biases unless debiasing interventions are implemented (Lu et al., 2018).
- Inference heuristics, including gold cluster scoring and pointer mechanisms, implicitly learn gender–occupation shortcuts, especially when spurious but predictive in the training set.
- LLMs reveal bias not only in output selection but also in uncertainty (coreference confidence disparity): privileged or unmarked identities receive higher confidence and more determinate coreference assignments than disprivileged groups (Khan et al., 9 Aug 2025).
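For reference, the sketch below implements the standard WEAT effect size (in the Caliskan et al. formulation) mentioned in the embedding bullet above; the vectors are random placeholders for real word embeddings, and the precise variant used in the cited analyses may differ.

```python
import numpy as np

# Minimal sketch of the WEAT effect size: association of two target word sets
# (X, Y) with two attribute sets (A, B). Vectors are random placeholders.

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Cohen's-d-style effect size; values near 0 indicate little association."""
    sx = [assoc(x, A, B) for x in X]
    sy = [assoc(y, A, B) for y in Y]
    all_s = np.array(sx + sy)
    return (np.mean(sx) - np.mean(sy)) / np.std(all_s, ddof=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 50))   # e.g., male-stereotyped occupation vectors
Y = rng.normal(size=(8, 50))   # e.g., female-stereotyped occupation vectors
A = rng.normal(size=(8, 50))   # e.g., male attribute words ("he", "man", ...)
B = rng.normal(size=(8, 50))   # e.g., female attribute words ("she", "woman", ...)
print(weat_effect_size(X, Y, A, B))
```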
Empirical studies further show human annotators bring their own cognitive stereotypes, especially under time pressure (System 1), and that model biases often resemble human shortcut reasoning—yet, on synthetic data, models can be even more biased than humans (Lior et al., 2023).
4. Mitigation Strategies and Experimental Outcomes
A broad spectrum of debiasing techniques has been developed:
Counterfactual Data Augmentation (CDA):
CDA involves systematically doubling the training set via interventions that swap all gendered words in each instance while retaining the original labels. The method can be “naive” (dictionary-based swaps) or “grammatical” (POS-sensitive, with no proper-noun swaps) (Lu et al., 2018, Zhao et al., 2018). CDA consistently reduces AOB or F₁ bias gaps by 66–83% with negligible accuracy loss, and sometimes a slight gain. For example, Lee et al. models improve test F₁ from 67.20 to 67.40 while AOB drops from 3.00 to 1.03 (Lu et al., 2018).
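A minimal sketch of the naive variant, assuming a toy bidirectional gender-word dictionary (the real intervention lists are far larger, and the grammatical variant adds POS checks and proper-noun filtering):

```python
import re

# Minimal sketch of "naive" CDA: swap gendered words via a small bidirectional
# dictionary and keep the original labels. The word list is a tiny illustrative
# subset; note that naive swapping ignores case ambiguity (e.g., "her" -> "him"
# vs. "his"), which grammatical CDA resolves with POS information.

GENDER_PAIRS = [("he", "she"), ("him", "her"), ("man", "woman"), ("men", "women")]
SWAP = {a: b for a, b in GENDER_PAIRS}
SWAP.update({b: a for a, b in GENDER_PAIRS})

def swap_gender(sentence):
    """Return the gender-swapped counterfactual of a sentence."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(SWAP) + r")\b"
    return re.sub(pattern, repl, sentence, flags=re.IGNORECASE)

def augment(dataset):
    """Double the training set: each (sentence, label) pair keeps its label."""
    return dataset + [(swap_gender(s), y) for s, y in dataset]

print(swap_gender("The doctor said he would call the nurse."))
# -> "The doctor said she would call the nurse."
```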
Embedding Debiasing:
Techniques such as hard neutralization [Bolukbasi et al., 2016], GN-GloVe [Zhao et al., 2018], or iterative nullspace projection (INLP) target the removal of direct and indirect gender components from static embeddings. These approaches decrease but generally do not eliminate bias. Composition with CDA further improves bias removal (Zhao et al., 2018, Dawkins, 2021).
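As an illustration of hard neutralization, the sketch below projects the gender component out of a word vector and renormalizes it; the gender direction would typically come from definitional pairs such as "he" − "she" (here it is a random placeholder).

```python
import numpy as np

# Minimal sketch of hard neutralization: remove the component of a word vector
# that lies along a gender direction, then renormalize to unit length.

def neutralize(vec, gender_dir):
    """Project out the gender component and renormalize."""
    g = gender_dir / np.linalg.norm(gender_dir)
    debiased = vec - np.dot(vec, g) * g
    return debiased / np.linalg.norm(debiased)

rng = np.random.default_rng(2)
gender_dir = rng.normal(size=50)    # placeholder for the "he" - "she" direction
occupation = rng.normal(size=50)    # placeholder embedding, e.g., for "nurse"
debiased = neutralize(occupation, gender_dir)
print(np.dot(debiased, gender_dir / np.linalg.norm(gender_dir)))  # ~0.0
```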
Corpus Correction and Resource Regularization:
Editing corpus-derived gender-count lists and balancing gender in occupation statistics contribute to model neutrality (Rudinger et al., 2018). Template and instance balancing, as in WinoPron, are now required to avoid confounding grammatical case with gender (Gautam et al., 9 Sep 2024).
Model Architecture Adjustments and Adversarial Debiasing:
Incorporating explicit debiasing regularizers (e.g., L2 penalty for F₁ disparity across genders), adversarial objectives, or multi-task gender classifiers with gradient reversal, further aligns model predictions across gender categories (Cao et al., 2019). Relational GCNs leveraging syntactic structure, as in (Xu et al., 2019), reduce over-reliance on lexical gender cues.
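A minimal sketch of the gradient-reversal idea, assuming a generic span encoder in PyTorch; the module names, hidden sizes, and three-way gender label set are illustrative, not the architectures of the cited systems.

```python
import torch
from torch import nn
from torch.autograd import Function

# Minimal sketch of adversarial debiasing with gradient reversal: a gender
# classifier is trained on span representations, but its gradient is flipped
# before reaching the encoder, pushing the encoder toward gender-invariant
# features. All names and sizes are illustrative.

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None

class GenderAdversary(nn.Module):
    def __init__(self, hidden_dim=256, n_genders=3, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.clf = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_genders))

    def forward(self, span_repr):
        reversed_repr = GradReverse.apply(span_repr, self.lambd)
        return self.clf(reversed_repr)

# Usage sketch: total loss = coreference loss + adversary cross-entropy.
adversary = GenderAdversary()
span_repr = torch.randn(4, 256, requires_grad=True)   # placeholder encoder output
gender_labels = torch.tensor([0, 1, 2, 0])             # illustrative label set
adv_loss = nn.functional.cross_entropy(adversary(span_repr), gender_labels)
adv_loss.backward()   # gradients w.r.t. span_repr arrive reversed
```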
Intersectional and Distributional Mitigations:
Counterfactual augmentation of candidate names (not only pronouns or role nouns) via optimal male–female pairings enhances resistance to “name artifacts” and reduces quadruple-level inconsistency (Xie et al., 2023). For LLMs, anti-stereotypical or intersectionally labeled data can close confidence gaps, but may only temporarily patch memorization-driven invalidity (Khan et al., 9 Aug 2025).
Empirical results repeatedly indicate that the most robust mitigation arises from combining data augmentation, lexical/embedding debiasing, and explicit regularization, at minimal cost to in-domain accuracy on OntoNotes or CoNLL.
5. Beyond the Binary: Non-Binary, Neopronoun, and Intersectional Bias
Recent work expands bias audits and mitigations beyond binary gender. Gender-Inclusive Coref (Cao et al., 2019) and WinoPron (Gautam et al., 9 Sep 2024) incorporate non-binary pronouns (they/them), neopronouns (xe/xem/xyr), and explicit case balancing. Some models, including instruction-tuned FLAN-T5-XXL, generalize reasonably to unseen neopronouns, yet most supervised systems exhibit substantial performance and bias degradation: for example, F₁ on xe drops well below chance, and the gap to binary pronouns exceeds 20 percentage points.
Intersectional benchmarks such as WinoIdentity (Khan et al., 9 Aug 2025) reveal that bias is compounded for doubly disadvantaged groups (e.g., trans women, certain racial or socio-economic identities). Coreference confidence drops not only for marginalized markers but also for privileged ones, implicating memorization, not reasoning, as the underlying mechanism.
Studies such as (Bartl et al., 18 Feb 2025) apply psycholinguistic paradigms to test whether LLMs respect gender-inclusive antecedent cues. Even with explicit neutralization (singular they), LLMs in English and German preferentially generate masculine referents or coreferents, regardless of the inclusion strategy.
6. Best Practices, Recommendations, and Limitations
Meta-analyses prescribe several best practices:
- Report bias metrics (AOB, the pro-/anti-stereotypical gap $\Delta$, confidence disparity) on multiple diagnostic sets before and after mitigation; see the sketch after this list.
- Balance templates not only by gender, but by grammatical case and distractor configuration (Gautam et al., 9 Sep 2024).
- Employ large, diverse, in-the-wild diagnostic corpora (e.g., BUG, Counter-GAP) for fine-tuning and evaluation, rather than relying solely on synthetic data (Levy et al., 2021, Xie et al., 2023).
- Combine augmentation and embedding-level debiasing, but avoid naive post-training neutralization, especially with co-trained embeddings (Lu et al., 2018).
- Measure on subpopulations (e.g., by pronoun type, intersectional identity) and continuously monitor bias during training.
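A minimal sketch of the subgroup-reporting practice listed above: evaluate one model on several diagnostic splits and report each score together with its gap to a reference split. The split names, dummy scores, and the `audit`/`evaluate` helpers are hypothetical.

```python
# Minimal sketch of a subgroup bias audit: per-split scores plus the gap to a
# reference split. The `evaluate` callable is a placeholder for a real scorer.

def audit(model, splits, evaluate, reference="pro_stereotypical"):
    """Return per-split scores and their gap to the reference split."""
    scores = {name: evaluate(model, data) for name, data in splits.items()}
    ref = scores[reference]
    return {name: {"score": s, "gap_to_ref": s - ref} for name, s in scores.items()}

# Usage sketch: each "split" here is just a precomputed dummy score.
splits = {"pro_stereotypical": 0.82, "anti_stereotypical": 0.61,
          "neopronouns": 0.40, "feminine": 0.70, "masculine": 0.80}
report = audit(None, splits, evaluate=lambda model, score: score)
for name, row in report.items():
    print(f"{name:20s} score={row['score']:.2f} gap={row['gap_to_ref']:+.2f}")
```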
Limitations persist. Most research focuses on English; results may not generalize to languages with rich grammatical gender. Non-binary and neopronoun coverage remains incomplete outside the latest diagnostic benchmarks. Attention to dataset construction and explicit modeling of cognitive similarity between human and model biases is necessary for principled fairness advances (Lior et al., 2023).
7. Outlook and Open Challenges
Addressing gender bias in coreference resolution is an ongoing challenge requiring integration of data-centric, model-based, and evaluation-based solutions. While substantial progress has been made in attribute-level and binary gender scenarios, complex social dimensions such as intersectionality, non-binarity, and fairness-under-uncertainty demand further methodological innovation. Distinguishing between representational validity (model makes correct logical inferences) and value alignment (model makes fair decisions) remains essential for the safe deployment of coreference technology in socially critical applications (Khan et al., 9 Aug 2025). Continued collaboration with social scientists, expansion to multilingual and multimodal corpora, and the adoption of advanced debiasing and evaluation strategies are crucial for the next generation of equitable coreference systems.