Bias Benchmark for Question Answering (BBQ)
- BBQ is a template-driven benchmark that evaluates QA systems' susceptibility to stereotyping by simulating both ambiguous and disambiguated contexts.
- It employs multiple-choice questions across nine social dimensions, using paired templates to reveal how models rely on prior stereotypes versus explicit evidence.
- Empirical findings indicate that even high-performing models often amplify biases, prompting the use of counter-stereotype augmentation and bias-aware fine-tuning for mitigation.
The Bias Benchmark for Question Answering (BBQ) is a template-driven benchmark designed to systematically evaluate the susceptibility of question answering (QA) systems to reflect and amplify attested social stereotypes in their outputs. BBQ differentiates itself from earlier bias benchmarks by simulating pragmatic ambiguity and real-world information gaps, thereby revealing whether models hallucinate stereotype-consistent answers under uncertainty, and if they can override such priors when explicit disambiguating evidence is provided. Initially introduced for U.S. English and centered on nine social axes (with intersectional extensions), BBQ now underpins a wide array of cross-linguistic, open-ended, implicit, and modality-specific bias evaluation frameworks.
1. Design Principles and Architecture
BBQ is constructed as a hand-built multiple-choice QA dataset along nine top-level social dimensions relevant in U.S. English: age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socio-economic status, and sexual orientation, plus two intersectional axes (race × gender, race × SES) (Parrish et al., 2021). Each dimension is represented by 25–50 templates sourced from documented stereotypes (academic literature, media, Wikipedia). Templates specify two demographically distinct entity slots—target and non-target of the presumed stereotype—and give rise to four variants per template through the Cartesian product of:
- Negative vs. non-negative questions (e.g., “Who is bad at math?” vs. “Who is good at math?”),
- Ambiguous (under-informative) vs. disambiguated (adequately informative) contexts.
Ambiguous contexts omit any disambiguating fact, requiring the correct answer to be “Unknown.” Disambiguated contexts add an explicit fact that determines which entity has the queried attribute, allowing evaluation both when QA models should rely only on prior (ambiguity) and when they must utilize textual evidence (disambiguation).
2. Evaluation Protocols and Bias Metrics
BBQ operationalizes model evaluation under two distinct regimes:
- Ambiguous (under-informative) context: No entity in the context can be definitively linked to the queried attribute; the only correct answer is “Unknown.”
- Disambiguated (adequately informative) context: The context contains explicit evidence as to which entity satisfies the query.
Key quantitative metrics employed in BBQ are:
- Accuracy, measured separately for ambiguous and disambiguated contexts.
- Bias Score , with in disambiguated settings, where is the number of stereoytpe-aligned outputs and the number of non-“unknown” outputs.
- Ambiguous-context bias .
A positive denotes responses aligning with the stereotype; negative values indicate counter-bias. These metrics accurately decouple model accuracy from the magnitude and directionality of bias, enabling fine-grained attribution of errors (Parrish et al., 2021).
3. Empirical Findings and Bias Patterns
Empirical evaluation of various QA models—including UnifiedQA 11B, RoBERTa, and DeBERTaV3—exposes systematic tendencies:
- In ambiguous contexts, large models often fail to select “Unknown,” preferring stereotype-targeted entities, with bias scores up to in UnifiedQA. Smaller models exhibit the same bias direction, if weaker.
- In disambiguated contexts, while accuracy can reach 90%, models maintain higher accuracy where the ground truth aligns with stereotypes (difference up to 3.4 percentage points overall, and exceeding 5 points for gender-targeted items), indicating persistent stereotype amplification even when explicit evidence contradicts bias (Parrish et al., 2021).
- Breakdown by bias axis shows highest ambiguous bias for physical appearance (especially obesity), gender, and SES; lower but still positive bias for race/ethnicity and sexual orientation.
Critically, larger and more capable models sometimes amplify bias, challenging the assumption that overall accuracy is a proxy for fairness (Parrish et al., 2021).
4. Extensions: Open-ended, Implicit, and Multimodal BBQ
BBQ’s architecture has been adapted to a variety of downstream and cross-linguistic settings:
- Open-BBQ adds fill-in-the-blank and short-answer formats to capture open-ended bias in generative LLMs. Bias scores in open settings often exceed those in multiple-choice formats; chain-of-thought and few-shot debiasing reduce average bias from 0.10–0.40 to ≈0 (Liu et al., 9 Dec 2024).
- ImplicitBBQ recasts explicit cues as implicit, relying on indirect demographic signals (names, occupations, clothing) to evaluate bias under more realistic conditions. GPT-4o shows up to 7% accuracy drop in ambiguous implicit prompts (e.g., sexual orientation, race/ethnicity), revealing latent biases undetected by explicit-only benchmarks (Wagh et al., 7 Dec 2025).
- VoiceBBQ transforms text inputs into 16 controlled audio variants (gender × accent), enabling analysis of both content-driven and acoustic bias in spoken LLMs. Standard architectures, such as LLaMA-Omni, display pronounced changes (up to 5% in bias score) depending on speaker voice and accent, distinctly from their textual backbone’s bias (Choi et al., 25 Sep 2025).
- Minimal Contextual Augmentation demonstrates that minor, adversarial scenario rewrites result in marked increases (double or more) in decisive, stereotype-driven responses, especially for less-studied bias axes such as age and disability (Miandoab et al., 27 Oct 2025).
5. Multilingual and Culturally Adapted Variants
The BBQ paradigm has motivated a suite of culturally attuned and cross-lingual benchmarks, each with distinct validation and adaptation protocols:
| Benchmark | Language(s) / Region | Categories / Size | Key Adaptations |
|---|---|---|---|
| KoBBQ | Korean | 12, 76K samples | SIMPLY-TRANSFERRED/ TARGET-MODIFIED/ NEW/REMOVED; 4 new axes; national surveys, cultural entity mapping (Jin et al., 2023) |
| JBBQ | Japanese | 5, ~51K samples | Machine translation, manual paraphrase, Japanese-specific templates (Yanaka et al., 4 Jun 2024) |
| MBBQ | English, Dutch, Spanish, Turkish | 6, 10K samples | Cross-lingual stereotypes, parallel control set (no group cues), significance tests for QA skill (Neplenbroek et al., 11 Jun 2024) |
| BharatBBQ | 8 Indian languages | 16, ~393K samples | Cultural template vetting, intersectionality, back-translation validation, stereotype- and QA-specific metrics (Tomar et al., 9 Aug 2025) |
| PakBBQ | English, Urdu | 8, 17K samples | Four-way template adaptation, region/language-specific biases, dual framing (positive/negative questions) (Hashmat et al., 13 Aug 2025) |
| GG-BBQ | German | Gender (2 subsets) | Manual correction for grammatical gender, group vs. names, match to German culture (Satheesh et al., 22 Jul 2025) |
Analysis consistently finds that translation alone is insufficient—cultural adaptation, template vetting by local annotators, and inclusion of new or modified axes are critical for validity. Models display stronger bias in morphologically complex or low-resource languages (e.g., Urdu, Bengali, Turkish), even with matched templates (Neplenbroek et al., 11 Jun 2024, Hashmat et al., 13 Aug 2025, Jin et al., 2023).
6. Mitigation Strategies and Methodological Critiques
Mitigation strategies evaluated with BBQ and derivatives include:
- Data augmentation with counter-stereotype scenarios, particularly in ambiguous settings, to train models away from stereotype “gap filling.”
- Prompt engineering, including instructions to select “unknown” when evidence is lacking, warnings about social biases, and explicit chain-of-thought reasoning. Automated pipelines achieve ~99.9% label extraction accuracy and near-zero bias scores post-debiasing in open-ended tasks (Liu et al., 9 Dec 2024, Yanaka et al., 4 Jun 2024).
- Bias-aware fine-tuning that incorporates both balanced class distributions and manually curated adversarial examples.
- Culturally-aware auditing, requiring benchmarks not merely translated but genuinely adapted and validated by local experts and communities.
Several critiques of BBQ’s metrics and protocol have emerged:
- The coupled nature of and can mask bias when models simply abstain (“unknown”) in disambiguated settings, or when prompt-based mitigation produces evasive strategies at the cost of accuracy (Yang et al., 12 Mar 2025).
- Bias scores can be artificially reduced if models overproduce “unknown” rather than engage in contextual reasoning.
- The prevalence of stereotype-driven answers rises dramatically under adversarial scenario rewrites, even when semantic content is preserved, illustrating brittleness of alignment to known benchmark templates (Miandoab et al., 27 Oct 2025).
Recommendations include reporting per-choice distributions, explicitly penalizing incorrect abstentions, and developing robustness metrics against adversarial and cross-linguistic perturbations.
7. Impact, Current Limitations, and Recommendations
BBQ and its extensions have established the empirical foundations for robustness-oriented, culturally grounded bias diagnostics in QA systems. Their design informs regulatory and deployment guidelines, mitigates the risk of stereotype amplification, and serves as the canonical testbed for both model evaluation and debiasing research.
Current limitations include coverage gaps in certain bias axes for some languages, challenges with implicit attribute induction, and distortions introduced by imperfect translation or cultural adaptation (Wagh et al., 7 Dec 2025, Neplenbroek et al., 11 Jun 2024, Jin et al., 2023). Further, mitigation strategies based solely on prompting show limited efficacy against implicit, modal, and adversarially-injected bias, necessitating iterative dataset augmentation, metric refinement, and human-in-the-loop validation.
Future work prioritizes the expansion of template diversity (dialogues, multi-turn exchanges), incorporation of complex intersectionality, evaluation under realistic task settings (free text, multimodal input), and the creation of benchmarks capturing both explicit and implicit biases across all global cultural contexts. BBQ remains an evolving framework for detecting, quantifying, and ultimately remediating social bias in next-generation QA and LLMs (Parrish et al., 2021, Liu et al., 9 Dec 2024, Wagh et al., 7 Dec 2025, Jin et al., 2023, Tomar et al., 9 Aug 2025, Miandoab et al., 27 Oct 2025, Yang et al., 12 Mar 2025).