WinoBias Benchmark: Measuring Gender Bias

Updated 23 November 2025
  • WinoBias is a benchmark that uses controlled Winograd-schema-style sentences with occupation nouns and gendered pronouns to diagnose gender bias in NLP models.
  • The evaluation employs the F1 differential between pro-stereotypical and anti-stereotypical cases to reveal significant bias gaps across different coreference systems.
  • Debiasing strategies—including gender-swapping data augmentation and embedding neutralization—have been shown to drastically reduce measurable bias while minimally affecting overall performance.

WinoBias is an adversarial benchmark for coreference resolution designed to diagnose and quantify gender bias in state-of-the-art systems. The benchmark presents Winograd-schema-style sentences featuring pairs of occupational nouns and gendered pronouns, thereby revealing systematic disparities in model performance on pro-stereotypical (gender-aligned) versus anti-stereotypical (gender-antialigned) cases. WinoBias has triggered a body of subsequent research analyzing bias in both static and contextual embeddings and evaluating the effectiveness of debiasing strategies in NLP pipelines (Zhao et al., 2018, Zhao et al., 2019, Dawkins, 2021).

1. Construction and Structure of the WinoBias Benchmark

WinoBias is based on a controlled generation of test cases using two Winograd-style templates instantiated with occupation nouns. The occupational lexicon comprises 40 professions selected from U.S. Department of Labor statistics, with explicit annotation of the percent female workforce for each occupation (see Table 1 in (Zhao et al., 2018)). Each occupation is marked as “female-stereotyped” if ≥50% female (e.g., “nurse”: 90%) and “male-stereotyped” otherwise (e.g., “carpenter”: 2%).

Two sentence templates provide complementary semantic and syntactic challenges:

  • Type 1 (semantic): “[Entity₁] [interacts with] [Entity₂], and [Pronoun] [circumstances].”
  • Type 2 (syntactic + semantic): “[Entity₁] [interacts with] [Entity₂] and then [interacts with] [Pronoun] for [circumstances].”

Every sentence contains two different occupations, and only one is coreferent with the pronoun. The gender of the pronoun should not inform the coreference resolution if the system is unbiased.

Sentence instantiation covers all unordered pairs of occupations ($1560$ pairs) and both pronoun genders, for a total of $2 \times 1560 = 3120$ test sentences; the published corpus includes 3,160 examples due to manual variants. Each example is labeled as pro-stereotypical if the occupation and pronoun gender follow the societal stereotype, and anti-stereotypical otherwise. The data are split evenly into dev and test sets ($1580$ sentences each), with balanced proportions of pro/anti-stereotypical and Type 1/Type 2 examples (Zhao et al., 2018).
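
The instantiation procedure can be sketched in a few lines of Python. The snippet below is only an illustration of the template-filling idea, using a hypothetical four-occupation lexicon and a simplified Type 1 template rather than the actual WinoBias templates and circumstance phrases:

```python
from itertools import permutations

# Illustrative subset of the occupation lexicon with % female workforce
# (the full WinoBias lexicon has 40 occupations from U.S. Department of Labor statistics).
OCCUPATIONS = {"nurse": 90, "secretary": 95, "carpenter": 2, "developer": 20}

# Simplified Type 1 (semantic) template; WinoBias also defines Type 2 templates.
TYPE1 = "The {occ1} talked to the {occ2} because {pron} needed help."

def stereotyped_gender(occupation):
    """Female-stereotyped if >= 50% female workforce, else male-stereotyped."""
    return "female" if OCCUPATIONS[occupation] >= 50 else "male"

def generate_type1():
    examples = []
    for occ1, occ2 in permutations(OCCUPATIONS, 2):        # ordered occupation pairs
        for pron, pron_gender in (("he", "male"), ("she", "female")):
            # In this simplified template the pronoun refers to occ1.
            label = "pro" if stereotyped_gender(occ1) == pron_gender else "anti"
            examples.append({
                "text": TYPE1.format(occ1=occ1, occ2=occ2, pron=pron),
                "antecedent": occ1,
                "stereotype": label,
            })
    return examples

print(len(generate_type1()))  # 4x3 ordered pairs x 2 pronouns = 24 illustrative sentences
```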

2. Bias Metrics and Evaluation Methodology

The primary bias metric is the F1 differential:

$\Delta F1 = F1_{\text{pro}} - F1_{\text{anti}}$

where $F1_{\text{pro}}$ is the model's F1 score on pro-stereotypical sentences and $F1_{\text{anti}}$ its score on anti-stereotypical sentences (Zhao et al., 2018). A system exhibiting no gender bias will have $\Delta F1 \approx 0$. Statistical significance of the difference between $F1_{\text{pro}}$ and $F1_{\text{anti}}$ is established via the approximate randomization test of Graham et al. (2014).
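
For concreteness, the gap can be computed from per-subset counts. The helper below is a minimal sketch with hypothetical names; in practice the per-subset F1 scores come from the standard coreference scorer:

```python
def f1(tp, fp, fn):
    """Standard F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bias_gap(pro_counts, anti_counts):
    """Delta F1: F1 on pro-stereotypical minus F1 on anti-stereotypical examples."""
    return f1(*pro_counts) - f1(*anti_counts)

# Illustrative counts only; an unbiased system would give a gap near zero.
print(round(100 * bias_gap((76, 23, 24), (37, 62, 63)), 1))  # ~39.2
```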

3. Empirical Findings: Benchmarking Modern Coreference Systems

WinoBias evaluation encompasses three representative systems:

  • Rule-based: Stanford Deterministic Coreference System (Raghunathan et al., 2010)
  • Feature-rich: Berkeley Coreference Resolution System (Durrett & Klein, 2013)
  • Neural end-to-end: Lee et al. (2017)

All were evaluated out-of-the-box. Results on the dev set reveal substantial bias:

| System | Type 1 Pro | Type 1 Anti | $\Delta F1$ | Type 2 Pro | Type 2 Anti | $\Delta F1$ |
|---|---|---|---|---|---|---|
| Rule-based | 76.7 | 37.5 | 39.2* | 50.5 | 29.2 | 21.3* |
| Feature-rich | 67.2 | 59.3 | 7.9* | 81.4 | 82.3 | 0.9 |
| Neural | 76.0 | 49.4 | 26.6* | 88.7 | 75.2 | 13.5* |

*Indicates $p < 0.05$ for the pro vs. anti difference. The average absolute bias gap $|\Delta F1|$ across systems is $21.1$ points. Type 1 (semantic) examples show larger bias gaps than Type 2 (Zhao et al., 2018).

Further research demonstrated that contextualized embeddings such as ELMo systematically encode and amplify gender bias when integrated into downstream coreference models, yielding even larger bias gaps (semantic-only subset: $\Delta F1$ = 29.6% for GloVe+ELMo compared to 26.6% for GloVe-only) (Zhao et al., 2019).

4. Debiasing Approaches

Three families of strategies have been proposed to mitigate the gender bias in coreference resolution that WinoBias exposes (Zhao et al., 2018, Zhao et al., 2019):

  1. Data Augmentation via Gender-Swapping:
    • For every OntoNotes 5.0 training document, generate a gender-swapped version by systematically replacing all gendered pronouns and kinship terms after named entities have been anonymized.
    • Retrain models on the union of the original and swapped data, ensuring balanced gender supervision. This augmentation drastically reduces bias gaps at negligible performance cost (a minimal sketch of the swapping step appears after this list).
  2. Resource Debiasing (Hard Debias for Static Embeddings):
    • Replace original GloVe embeddings with “Hard Debiased” vectors (see Bolukbasi et al. 2016). For each word, the gender subspace component is removed for gender-neutral terms and equalized for gender-definitional word pairs.
  3. External Gender List Balancing (Feature-Rich Systems):
    • Rebalance external pronoun–noun statistics so that each noun phrase is equally associated with male and female forms (Zhao et al., 2018).
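
A minimal sketch of the gender-swapping step in strategy 1, assuming a small hand-written swap dictionary; the actual augmentation anonymizes named entities first and uses a much larger list of gendered and kinship terms:

```python
import re

# Illustrative bidirectional swap list, not exhaustive. Note that "her" is
# ambiguous (him/his) and needs part-of-speech information in a full system.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "him": "her", "himself": "herself", "herself": "himself",
         "father": "mother", "mother": "father", "brother": "sister",
         "sister": "brother", "king": "queen", "queen": "king"}

def gender_swap(text):
    """Return a gender-swapped copy of `text`, preserving capitalization."""
    def repl(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

doc = "He asked his sister to call the developer because she was busy."
# Train on the union of original and swapped documents.
augmented = [doc, gender_swap(doc)]
print(augmented[1])  # "She asked her brother to call the developer because he was busy."
```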

For ELMo-based contextual embeddings, two mitigation approaches were studied (Zhao et al., 2019):

  • Train-time data augmentation (swap+retrain), as above.
  • Test-time embedding neutralization: For each input, average token embeddings from the sentence and its gender-swapped version, stripping the gendered component that flips under swapping. This notably reduces bias only on cases with strong syntactic cues.
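
A minimal sketch of the test-time neutralization idea, assuming a generic encode(tokens) function that returns per-token contextual vectors and a token-level swap helper (hypothetical names; in the original work this averaging is applied to ELMo representations inside the coreference model):

```python
import numpy as np

def neutralized_embeddings(tokens, encode, swap_tokens):
    """Average each token's contextual embedding with its embedding in the
    gender-swapped sentence; the gendered component that flips sign under
    swapping cancels out in the mean."""
    swapped = swap_tokens(tokens)                  # same length, gendered words replaced
    original_emb = np.asarray(encode(tokens))      # shape: (num_tokens, dim)
    swapped_emb = np.asarray(encode(swapped))      # shape: (num_tokens, dim)
    return (original_emb + swapped_emb) / 2.0
```

The coreference model then consumes the averaged embeddings at inference time, so no retraining is required.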

5. Results After Debiasing and Robustness

Debiasing protocols remove nearly all measurable gender bias on WinoBias while preserving core OntoNotes performance.

  • Neural System (Lee et al. 2017):
    • Pre-debias OntoNotes F1 = 67.7; post-debias = 66.3 ($\Delta$ = –1.4)
    • WinoBias Type 1: pro = 63.9, anti = 62.8, $|\Delta F1|$ = 1.1
    • WinoBias Type 2: pro = 81.3, anti = 83.4, $|\Delta F1|$ = 2.1
  • Feature-rich System (Durrett & Klein 2013):
    • Pre-debias OntoNotes F1 = 61.7; post-debias = 61.0 ($\Delta$ = –0.7)
    • WinoBias Type 1: pro = 62.9, anti = 58.3, $|\Delta F1|$ = 4.6
    • WinoBias Type 2: pro = 68.5, anti = 57.8, $|\Delta F1|$ = 10.7

Analysis of gender-swapped OntoNotes dev sets shows F1 drops of only $0.5$–$1.0$ points, confirming that robust systems do not over-rely on gendered statistics given strong local context cues (Zhao et al., 2018, Zhao et al., 2019). Data augmentation reduces the ELMo-based bias gap (semantic-only subset) from 29.6% to 1.0%, with a minor drop in OntoNotes F1 (71.0 from 72.7) (Zhao et al., 2019).

6. Extensions and Critiques: From WinoBias to SoWinoBias

Subsequent work has identified a limitation of WinoBias: it targets only first-order, explicit gender bias—bias that manifests clearly when the pronoun is gendered (“he”/“she”) and co-occurs with occupational stereotypes. Real-world coreference phenomena often lack explicit gender cues, implicating “second-order” or latent bias (Dawkins, 2021). The SoWinoBias test set addresses this gap by constructing sentences using two occupations and a gender-neutral pronoun (“they”), where stereotyped adjectives function as gendered cues (e.g., “beautiful” as female-coded).

This new paradigm enables evaluation of bias even when explicit gender words are absent. Empirical evidence from SoWinoBias reveals that while debiasing static embeddings (e.g., Hard-GloVe, GN-GloVe) combined with data augmentation can reduce explicit bias to near zero ($\Delta F1 \approx 2.1$), latent bias (as measured by SoWinoBias $\Delta$Acc) often remains substantial unless augmentations target both occupation–adjective composition and pronoun gender (Dawkins, 2021).
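
A purely hypothetical illustration of this construction (the actual SoWinoBias templates, adjective lists, and labeling scheme in Dawkins (2021) may differ):

```python
# Hypothetical occupations with their societal gender stereotype.
OCCUPATIONS = {"nurse": "female", "carpenter": "male"}

# Hypothetical adjectives coded by gender stereotype (e.g., "beautiful" as female-coded).
ADJECTIVES = {"beautiful": "female", "rugged": "male"}

TEMPLATE = "The {occ1} met the {occ2}, and people said they were very {adj}."

def sowinobias_like_examples():
    examples = []
    for occ1, g1 in OCCUPATIONS.items():
        for occ2, g2 in OCCUPATIONS.items():
            if occ1 == occ2:
                continue
            for adj, g_adj in ADJECTIVES.items():
                # The adjective's coded gender, not any pronoun, supplies the stereotype cue.
                label = "pro" if g_adj == g1 else "anti"   # assuming "they" refers to occ1
                examples.append({
                    "text": TEMPLATE.format(occ1=occ1, occ2=occ2, adj=adj),
                    "antecedent": occ1,
                    "stereotype": label,
                })
    return examples

for ex in sowinobias_like_examples()[:2]:
    print(ex["stereotype"], "-", ex["text"])
```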

7. Broader Implications and Future Directions

WinoBias occupies a central methodological role in the study of gender bias in NLP, functioning both as a diagnostic tool and as a stress test for model fairness. The cascading influence from static and contextualized embeddings to downstream performance is demonstrated by pronounced bias gaps, which can be alleviated via data-level interventions. However, the resilience of second-order bias—exposed by extensions like SoWinoBias—signals a persistent challenge: simple debiasing of embedding spaces and balanced supervision may not fully address deeper compositional and contextual biases.

A plausible implication is that bias mitigation must target not only static embeddings but also context-aware models and data representations, possibly via joint debiasing or broader fairness frameworks crossing multiple NLP tasks (Zhao et al., 2018, Zhao et al., 2019, Dawkins, 2021).
