BBQ Dataset: Benchmark for QA Social Bias

Updated 5 October 2025

BBQ dataset is a large-scale resource that quantifies social biases in QA systems through hand-crafted templates and controlled context variations.
It targets nine protected dimensions and intersectional categories by comparing ambiguous and disambiguated contexts to isolate bias effects.
Empirical results indicate models often default to stereotype-aligned answers, guiding effective interventions for debiasing and fairness.

The BBQ dataset (Bias Benchmark for Question Answering) is a large-scale, hand-built evaluation resource designed to systematically quantify and analyze the manifestation of social biases in natural language question answering (QA) systems. Developed to reveal and stress-test the reproduction of attested harmful stereotypes by state-of-the-art QA models, BBQ provides a controlled experimental framework with a focus on multiple social dimensions relevant for English-speaking US contexts and has since inspired numerous culturally and linguistically adapted variants.

1. Dataset Construction and Scope

BBQ comprises 58,492 unique multiple-choice QA instances generated from 325 hand-crafted templates, each engineered to instantiate known, attested social stereotypes. Every template is associated with a particular bias and is designed to generate an average of approximately 175 distinct instances by varying entity labels, context ordering, and phrasing. The dataset targets nine protected social dimensions—age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socio-economic status, and sexual orientation—and further introduces two intersectional categories such as Race-by-Gender and Race-by-Socioeconomic Status to permit analysis of compounding bias effects.

Each instance consists of three main components: a context, a question, and three answer candidates—the bias "target," the "non-target," and an “unknown” option (drawn from a pool of semantically equivalent “uncertain” choices). For each template, contexts are instantiated in two forms:

Ambiguous context: Essential information is omitted, meaning that, by design, the only correct answer is “unknown.” This setup is intended to probe the model’s default behavior under under-informative conditions.
Disambiguated context: An additional sentence or cue is included to unambiguously indicate the correct answer, making bias-aligned reasoning unnecessary for high accuracy.

Templates are balanced by systematically permuting the order of targets and non-targets to mitigate ordering artifacts.

2. Bias Dimensions and Design Principles

BBQ’s template development is guided by attested stereotypes documented in the literature, with careful reference checking for each social dimension. Categories include:

Dimension	Examples of Stereotype Probes
Age	“Who is bad with technology?” (young vs. old)
Disability Status	“Who is a better driver?” (with disability vs. without)
Gender Identity	“Who is a nurse?” (female vs. male)
Nationality	“Who is better at math?” (American vs. immigrant)
Physical Appearance	“Who is more likely to be fit?” (overweight vs. thin)
Race/Ethnicity	“Who is more likely to commit a crime?” (Black vs. White)
Religion	“Who is a terrorist?” (Muslim vs. Christian)
Socio-economic Status	“Who is likely to be homeless?” (rich vs. poor)
Sexual Orientation	“Who is married?” (straight vs. gay)
Intersectional	e.g., Race-by-Gender: Black woman vs. Black man vs. White woman etc.

Ambiguous and disambiguated contexts, along with positive (stereotype-invoking) and negative (stereotype-neutralizing) question forms, enable the isolation of specific failure modes and facilitate measurement of both overt and latent stereotype reliance.

3. Evaluation Methodology and Bias Metrics

BBQ evaluations are structured around two principal QA settings:

Ambiguous context (default-to-bias): Accuracy is measured as the proportion of “unknown” answers; any non-unknown response may signal bias. A bias score quantifies the relative frequency of responses aligning with the attested stereotype when the information is insufficient for a justifiable selection. For ambiguous contexts:

$s_{Amb} = (1 - accuracy) \cdot s_{Dis}$

Disambiguated context (override-correctness): Here, the correct response is unambiguously provided. The key metric is whether the inherent model bias can override explicit information. The bias score for disambiguated contexts is:

%%%%1%%%%

Where $n_{biased\_ans}$ is the number of outputs consistent with the stereotype, and $n_{non-unknown\_outputs}$ is the count of model outputs that are not “unknown.”

Experiments involve both zero-shot and fine-tuned models, including UnifiedQA, RoBERTa, and DeBERTaV3. Negative and non-negative question variants allow further control for potential confounds related to frequency or question artifact effects.

4. Empirical Findings and Model Behavior

Experimental results confirm that many models revert to biased answers when context is ambiguous, rather than expressing uncertainty (“unknown”). Under disambiguated conditions, models generally exhibit higher overall accuracy, but reveal a clear effect: accuracy is sometimes up to 3.4 percentage points higher when the correct answer coincides with the stereotype, and for gender-related items, the gap can exceed 5 percentage points. This provides quantitative evidence that model predictions are not independent of social bias even when explicit context is provided.

BBQ’s fine-grained metrics, including bias scores across all contexts and subcategories, permit detailed cross-model and cross-category comparisons. This property makes the benchmark an essential tool for measuring progress in fairness evaluation and for identifying persistent failure cases by social dimension.

5. Influence on Model Development and Culturally-Adapted Extensions

The benchmark’s structure supports detailed “failure mode” analysis, supporting interventions such as:

Incorporation of mechanisms to recognize uncertainty (selecting “unknown”) when appropriate,
Prompt and architecture modifications (e.g., negative/positive question balancing, tuning for bias robustness),
Fine-tuning and pre-training data selection for bias reduction.

BBQ has directly inspired numerous adapted datasets and analysis suites, including KoBBQ for Korea (Jin et al., 2023), JBBQ for Japanese (Yanaka et al., 4 Jun 2024), MBBQ for multilingual comparison (Neplenbroek et al., 11 Jun 2024), GG-BBQ for German (Satheesh et al., 22 Jul 2025), BharatBBQ for Indian sociolinguistic contexts (Tomar et al., 9 Aug 2025), PakBBQ for Pakistan (Hashmat et al., 13 Aug 2025), and VoiceBBQ for spoken language analysis (Choi et al., 25 Sep 2025), each extending the template-driven methodology to local stereotype targets, linguistic particulars, and additional or merged bias dimensions. Adaptations include systematic template translation, cultural relabeling, and manual expert audit.

6. Broader Applications, Limitations, and Impact

BBQ serves as a diagnostic tool for both model development (internal QA/fairness tuning) and deployment (pre-launch audit of model outputs for sensitive use cases such as employment, legal decision-making, or public-facing assistants). Its analytic framework isolates context-induced bias, uncovers cross-category and intersectional vulnerabilities, and evaluates the effectiveness of interventions such as prompt engineering, debiasing methods, and self-diagnosis mechanisms (Yang et al., 12 Mar 2025). By exposing models’ susceptibility to both explicit and subtle social stereotypical reasoning, BBQ grounds empirical research on mitigation strategies, from ensemble calibration to chain-of-thought debiasing (Wu et al., 21 Feb 2025).

Recognized limitations include the focus on US-English centric stereotypes, limited coverage of intersectionality, and the necessity for ongoing adaptation to emerging social and linguistic realities. Nonetheless, the BBQ methodology provides a rigorous, extensible foundation for the comparative evaluation of social bias in QA across a rapidly expanding spectrum of language technologies and cultural settings (Parrish et al., 2021).