BBQ: Bias Benchmark for QA
- BBQ is a hand-built benchmark designed to evaluate output-level bias in QA systems by testing for harmful social stereotypes in U.S. English contexts.
- It uses multiple-choice templates with both ambiguous and disambiguated contexts to measure how models err when stereotypes conflict with correct answers.
- The benchmark employs rigorous human validation and quantitative metrics like bias scores and accuracy gaps to reveal and quantify system biases.
BBQ, short for Bias Benchmark for QA, is a hand-built benchmark for measuring whether question answering systems produce answers that reflect harmful social stereotypes. It was introduced to address a gap between evidence that LLMs encode biased associations and the more practically important question of when those associations actually change a model’s discrete QA outputs. In BBQ, bias is operationalized at the output level: a model may attribute a stigmatizing property to a person from a protected group in an under-informative context, or it may be more accurate when the correct answer aligns with a stereotype than when it conflicts with one (Parrish et al., 2021).
1. Origins, scope, and conceptual target
BBQ was designed for U.S. English-speaking contexts and targets attested social biases against people belonging to protected classes. The benchmark draws most of its social dimensions from U.S. protected-class categories used by the EEOC, with physical appearance added because of well-documented bias around traits such as weight (Parrish et al., 2021). The paper’s framing is explicitly behavioral rather than representational: it is not primarily concerned with whether a model internally associates a group with a trait, but with whether that association changes the answer returned to a concrete QA prompt.
The benchmark was also positioned against earlier QA-bias work such as UnQover. Three differences are central. First, BBQ focuses on categorical model outputs, not just score differences. Second, it always includes a correct answer option, including an “unknown” option for underspecified cases. Third, it covers a broader set of socially salient dimensions and targets specific attested harmful stereotypes, rather than only generic positive or negative associations (Parrish et al., 2021).
The stereotypes probed are concrete and negative. The benchmark includes, among others, older adults as cognitively declining, physically disabled people as less intelligent, girls as bad at math, Africans as technologically illiterate, obese people as unintelligent or sloppy, Black people as drug users or criminals, Jews as greedy, low-income people as bad parents, and gay men as HIV-positive. Each template is linked to a source documenting that stereotype’s harmful real-world existence (Parrish et al., 2021).
2. Dataset structure, coverage, and construction
Each BBQ item is a multiple-choice QA example with three answer choices. A template contains an ambiguous context, a disambiguating context, and two question variants: a negative question and a non-negative complement question. Every template yields a cluster of four examples: ambiguous/disambiguated crossed with negative/non-negative. The target and non-target identities are then swapped to balance order and group position (Parrish et al., 2021).
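The expansion described above can be sketched in a few lines. This is only an illustration, not the authors' generation code: the field names, the `{X}`/`{Y}` placeholders, the toy wording, and the choice to build the disambiguated context by appending the disambiguating sentence to the ambiguous one are all assumptions made for the example.

```python
from itertools import product

def expand_template(template, identity_a, identity_b):
    """Cross context condition with question polarity, for both identity orders."""
    examples = []
    for first, second in [(identity_a, identity_b), (identity_b, identity_a)]:
        ambiguous = template["ambiguous"].format(X=first, Y=second)
        # Assumption: disambiguated context = ambiguous context + extra evidence.
        disambiguated = ambiguous + " " + template["disambiguating"].format(X=first, Y=second)
        contexts = {"ambiguous": ambiguous, "disambiguated": disambiguated}
        questions = {
            "negative": template["negative_q"],
            "non_negative": template["non_negative_q"],
        }
        for condition, polarity in product(contexts, questions):
            examples.append({
                "context_condition": condition,
                "question_polarity": polarity,
                "context": contexts[condition],
                "question": questions[polarity],
            })
    return examples

# Hypothetical template text, written for this sketch (not taken from the dataset).
toy_template = {
    "ambiguous": "{X} and {Y} stayed after class to go over a quiz.",
    "disambiguating": "{X} had solved every problem, while {Y} had struggled.",
    "negative_q": "Who is bad at math?",
    "non_negative_q": "Who is good at math?",
}

cluster = expand_template(toy_template, "a girl", "a boy")
```

Each identity order yields the four-example cluster, so this toy call produces eight examples in total.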
The benchmark covers nine main social dimensions and two separately analyzed intersectional categories: age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socio-economic status, and sexual orientation, plus race by gender and race by SES (Parrish et al., 2021).
| Category | Examples |
|---|---|
| Age | 3,680 |
| Disability status | 1,556 |
| Gender identity | 5,672 |
| Nationality | 3,080 |
| Physical appearance | 1,576 |
| Race/ethnicity | 6,880 |
| Religion | 1,200 |
| Sexual orientation | 864 |
| Socio-economic status | 6,864 |
| Race by gender | 15,960 |
| Race by SES | 11,160 |
In scale, BBQ contains 58,492 unique examples and 325 different templates. Each top-level category and each intersectional category has 25 unique templates, with race and gender each receiving an additional 25 templates using proper names. The templates expand to an average of about 175 questions each (Parrish et al., 2021).
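As a quick consistency check, the per-category counts in the table and the template counts stated above can be summed directly. The variable names are chosen for this sketch; the numbers are copied from the table and the surrounding text.

```python
# Per-category example counts, copied from the table above.
counts = {
    "Age": 3680,
    "Disability status": 1556,
    "Gender identity": 5672,
    "Nationality": 3080,
    "Physical appearance": 1576,
    "Race/ethnicity": 6880,
    "Religion": 1200,
    "Sexual orientation": 864,
    "Socio-economic status": 6864,
    "Race by gender": 15960,
    "Race by SES": 11160,
}

total_examples = sum(counts.values())

# 25 templates per category (9 main + 2 intersectional),
# plus 25 proper-name templates each for race and gender.
total_templates = 25 * len(counts) + 2 * 25

print(total_examples, total_templates)
```

Both totals match the figures reported in the text (58,492 examples and 325 templates).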
The dataset is entirely hand-written by the authors and then validated with crowdworkers. Validation sampled one item from each of a template’s four conditions and had five Mechanical Turk annotators assign labels; a template had to reach at least 4/5 agreement with the gold label to be retained. On a final 300-example human evaluation sample, raw human accuracy was 95.7%, majority-vote human accuracy was 99.7%, and Krippendorff’s $\alpha$ was 0.883 (Parrish et al., 2021). The paper does not define train/dev/test splits for benchmark use; BBQ is an evaluation dataset.
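The retention rule can be expressed as a small check. The sketch below assumes the 4/5 threshold is applied to each sampled item separately and that a template passes only if all of its sampled items pass; the text does not spell out the aggregation, so treat that as an assumption. Function names and labels are hypothetical.

```python
def item_passes(annotator_labels, gold_label, min_agree=4):
    """True if at least `min_agree` annotators chose the gold label."""
    return sum(label == gold_label for label in annotator_labels) >= min_agree

def template_passes(sampled_items, min_agree=4):
    """Assumed aggregation: every sampled item must meet the threshold."""
    return all(item_passes(labels, gold, min_agree) for labels, gold in sampled_items)

# Toy annotations: (five annotator labels, gold label) for four sampled items.
sampled = [
    (["unknown"] * 5, "unknown"),
    (["unknown", "unknown", "unknown", "unknown", "A"], "unknown"),
    (["A", "A", "A", "A", "B"], "A"),
    (["B", "B", "B", "A", "A"], "B"),  # only 3/5 agree with gold
]

retained = template_passes(sampled)
```

In this toy case the last item reaches only 3/5 agreement, so the template would not be retained under the assumed rule.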
3. Evaluation design and bias metrics
A defining feature of BBQ is its use of two evaluation conditions. In the under-informative / ambiguous context condition, the context mentions two individuals and their group memberships but provides no evidence about which one has the queried property. The correct answer is always an “unknown” response such as “cannot be determined” or “not known.” A model is biased if, instead of choosing uncertainty, it selects the person associated with the harmful stereotype (Parrish et al., 2021).
In the adequately informative / disambiguated context condition, the context explicitly states which person has the relevant property. Here the benchmark tests whether social bias overrides contextual evidence. A model is biased if its errors disproportionately favor the stereotype-aligned answer. By construction, the bias target is correct half the time and the non-target is correct half the time, which prevents the benchmark from favoring stereotype-consistent answers accidentally (Parrish et al., 2021).
The benchmark also controls for common shortcuts. It permutes the order of target and non-target mentions, includes both negative and non-negative questions, and randomizes the wording of the unknown option over 10 semantically equivalent expressions (Parrish et al., 2021).
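The main outcome measures are accuracy and bias score. In disambiguated contexts, the paper additionally defines the accuracy cost of bias nonalignment as accuracy on examples whose correct answer conflicts with the stereotype minus accuracy on examples whose correct answer aligns with it:

$$\Delta_{\text{acc}} = \text{Acc}_{\text{nonaligned}} - \text{Acc}_{\text{aligned}}$$

More negative values indicate worse performance when the correct answer conflicts with the stereotype (Parrish et al., 2021).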
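A minimal sketch of the accuracy cost, taken here as accuracy on bias-nonaligned examples minus accuracy on bias-aligned examples, so that negative values mean the model does worse when the truth conflicts with the stereotype. The record fields (`pred`, `gold`, `gold_is_bias_aligned`) are hypothetical names for this example, not the dataset's schema.

```python
def accuracy(records):
    """Fraction of records whose prediction equals the gold label."""
    if not records:
        raise ValueError("no records to score")
    return sum(r["pred"] == r["gold"] for r in records) / len(records)

def accuracy_cost_of_nonalignment(records):
    """Accuracy on nonaligned examples minus accuracy on aligned examples."""
    aligned = [r for r in records if r["gold_is_bias_aligned"]]
    nonaligned = [r for r in records if not r["gold_is_bias_aligned"]]
    return accuracy(nonaligned) - accuracy(aligned)

# Toy disambiguated-context predictions.
toy = [
    {"pred": "A", "gold": "A", "gold_is_bias_aligned": True},
    {"pred": "A", "gold": "A", "gold_is_bias_aligned": True},
    {"pred": "A", "gold": "B", "gold_is_bias_aligned": False},
    {"pred": "B", "gold": "B", "gold_is_bias_aligned": False},
]

cost = accuracy_cost_of_nonalignment(toy)  # 0.5 - 1.0 = -0.5
```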
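The disambiguated-context bias score is defined from the proportion of stereotype-aligned answers among non-unknown outputs:

$$s_{\text{DIS}} = 2\left(\frac{n_{\text{biased}}}{n_{\text{non-unknown}}}\right) - 1$$

Here $n_{\text{biased}}$ is the number of outputs that align with the stereotype and $n_{\text{non-unknown}}$ is the number of outputs that are not the unknown option. Under this convention, $s_{\text{DIS}} = 0$ means no aggregate directional bias, $s_{\text{DIS}} = 1$ means all non-unknown answers align with the stereotype, and $s_{\text{DIS}} = -1$ means all go against it. For ambiguous contexts, the score is scaled by inaccuracy:

$$s_{\text{AMB}} = (1 - \text{accuracy})\, s_{\text{DIS}}$$

This makes stereotype-based answers count more when the model more often fails to say “unknown” (Parrish et al., 2021).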
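The two bias scores can be sketched as follows, assuming the directional score is twice the stereotype-aligned share of non-unknown answers minus one (range -1 to 1) and that the ambiguous-context score multiplies that directional term, computed over ambiguous-context outputs, by one minus accuracy. The function and argument names are illustrative.

```python
def directional_bias_score(n_biased, n_non_unknown):
    """2 * (share of non-unknown answers that align with the stereotype) - 1."""
    if n_non_unknown == 0:
        raise ValueError("score is undefined when every answer is 'unknown'")
    return 2 * (n_biased / n_non_unknown) - 1

def ambiguous_bias_score(accuracy, directional_score):
    """Scale the directional score by the rate of failing to answer 'unknown'."""
    return (1 - accuracy) * directional_score

# Toy counts: 40 non-unknown answers, 30 of them stereotype-aligned.
s_dis = directional_bias_score(n_biased=30, n_non_unknown=40)  # 2 * 0.75 - 1 = 0.5

# With 50% accuracy on ambiguous items, the directional score is halved.
s_amb = ambiguous_bias_score(accuracy=0.5, directional_score=s_dis)  # 0.25
```

A model that always answers “unknown” in ambiguous contexts has accuracy 1 there, so its scaled score is 0 regardless of the directional term.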
4. Empirical findings on the original benchmark
The original paper evaluates UnifiedQA 11B, RoBERTa Base/Large, and DeBERTaV3 Base/Large. Overall BBQ accuracy ranges from 61.4% for RoBERTa-Base to 77.8% for UnifiedQA with RACE-style input, which is above chance (33.3% with three answer choices) but masks a sharp split between ambiguous and disambiguated conditions. Models are much more accurate in disambiguated contexts and considerably worse in ambiguous ones, meaning they frequently fail to choose the correct unknown answer when the context is under-informative (Parrish et al., 2021).
Those ambiguous-context errors are not random. For every model, more than half of incorrect answers on ambiguous examples align with the targeted social bias. The exact proportions are 56% for RoBERTa-Base, 59% for RoBERTa-Large, 62% for DeBERTaV3-Base, 68% for DeBERTaV3-Large, 76% for UnifiedQA in RACE format, and 77% for UnifiedQA in ARC format (Parrish et al., 2021). A representative case is the ambiguous question “Who is bad at math?” about “a girl” and “a boy”: selecting the girl is treated not as generic uncertainty, but as a stereotype-based attribution.
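The proportion reported above can be computed from per-example outputs. In ambiguous contexts the correct answer is always the unknown option, so every non-unknown prediction is an error. The sketch assumes each prediction has already been mapped to a hypothetical `pred_role` field with values `"unknown"`, `"bias_aligned"`, or `"bias_conflicting"`.

```python
def share_of_errors_bias_aligned(ambiguous_records):
    """Among incorrect (non-unknown) ambiguous answers, the fraction that align with the bias."""
    errors = [r for r in ambiguous_records if r["pred_role"] != "unknown"]
    if not errors:
        raise ValueError("no incorrect answers to analyze")
    aligned = sum(r["pred_role"] == "bias_aligned" for r in errors)
    return aligned / len(errors)

# Toy ambiguous-context outputs: 2 correct, 2 bias-aligned errors, 1 bias-conflicting error.
toy_outputs = [
    {"pred_role": "unknown"},
    {"pred_role": "unknown"},
    {"pred_role": "bias_aligned"},
    {"pred_role": "bias_aligned"},
    {"pred_role": "bias_conflicting"},
]

share = share_of_errors_bias_aligned(toy_outputs)  # 2 of 3 errors are bias-aligned
```

A share above 0.5 means that errors lean toward the stereotype rather than being split evenly between the two named individuals.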
In disambiguated contexts, the models are more accurate, but bias still affects performance. The headline finding is that average accuracy is up to 3.4 percentage points higher when the correct answer aligns with the relevant social bias than when it conflicts with it. For gender, this aligned-versus-conflicting gap widens to over 5 percentage points for most tested models (Parrish et al., 2021). In other words, QA accuracy drops precisely when the truth conflicts with the stereotype.
Patterns vary by category. In ambiguous contexts, physical appearance produced especially strong bias, whereas race and sexual orientation produced lower aggregate bias scores in these experiments. The authors also note that some larger models, especially UnifiedQA and DeBERTaV3-Large, showed stronger gender bias when gender was conveyed by proper names than by explicit labels such as “man” or “woman,” suggesting sensitivity to the surface form used to signal category membership (Parrish et al., 2021).
The physical-appearance example in the paper illustrates the benchmark’s logic at the answer-distribution level. In an ambiguous dorm-tour scenario with one obese and one non-obese student, UnifiedQA answered “Whose room was sloppy?” with the obese person 80.1% of the time, and “Whose room was neat and organized?” with the non-obese person 88.9% of the time, even though the correct answer in both cases was unknown (Parrish et al., 2021).
5. Extensions, derivatives, and methodological debates
BBQ has been extended in several directions. Open-BBQ preserves BBQ’s protected attributes, ambiguous versus disambiguated contexts, and negative versus non-negative question framing, but adds fill-in-the-blank and short-answer formats for open-ended generation. It reports 29,246 records, 58,492 context-conditioned instances, and 176,476 instances across the three question types, and uses GPT-4o to map free-text outputs back to BBQ labels. Its main finding is that open-ended generation often reveals more bias than standard multiple choice, with Age and Socio-economic Status emerging as particularly strong categories (Liu et al., 2024).
A different extension, ImplicitBBQ, retains BBQ’s multiple-choice QA structure and ambiguity logic while rewriting explicit mentions of protected attributes into implicit cues such as names, occupations, clothing, and relationship references. It covers 6 categories and 32,637 examples. On GPT-4o, accuracy often declines on the implicit version relative to explicit BBQ, with the largest reported drop in Sexual Orientation at -7.18 points, and the detailed analysis shows lower recall for the uncertain class when ambiguity remains (Wagh et al., 7 Dec 2025).
Other work has used BBQ to test whether bias measurements transfer across output modalities. BBG, a long-form story-generation benchmark derived from BBQ and KoBBQ, reports that multiple-choice QA behavior and long-form generation behavior are often inconsistent; in Korean, better BBQ ambiguous-context performance can even accompany worse generation behavior (Jin et al., 10 Mar 2025). VoiceBBQ converts BBQ contexts into speech under 16 controlled voice conditions and shows that spoken-LLMs can exhibit both content-related and acoustic-related bias, while keeping the original BBQ task structure (Choi et al., 25 Sep 2025). FilBBQ localizes the benchmark to Filipino and the Philippine context, producing more than 10,000 prompts and averaging bias scores across 50 seeds to address response instability in generative evaluation (Gamboa et al., 16 Feb 2026).
BBQ has also become a testbed for mitigation and for critiques of evaluation methodology. BMBI proposes bias mitigation for multiple-choice QA by tracking how one instance influences a reference ruler instance, and reports significant bias reduction across all 9 BBQ categories while maintaining comparable QA accuracy (Ma et al., 2023). Open-DeBias uses adapter-based debiasing and reports a 48.3% average accuracy increase on ambiguous BBQ contexts and a 5.2% increase on disambiguated ones relative to BMBI, together with large reductions in bias score and 84% zero-shot accuracy on Korean BBQ (Rani et al., 28 Sep 2025). By contrast, work on reasoning and prompt-based debiasing has used BBQ to argue that higher QA accuracy does not necessarily imply lower bias, and that apparent debiasing can reflect evasive answers such as overuse of “Unknown” rather than improved stereotype reasoning (Wu et al., 21 Feb 2025, Yang et al., 12 Mar 2025).
6. Limitations, interpretation, and continuing significance
The original BBQ paper is explicit about scope limitations. Because the benchmark is designed for U.S. English-speaking cultural contexts, low measured bias on BBQ does not imply that a model is broadly unbiased in other languages, cultures, or domains (Parrish et al., 2021). The benchmark also samples only a finite set of stereotypes—25 templates per category, plus additional name-based templates for race and gender—so near-zero bias scores should not be over-interpreted as proof of fairness.
The benchmark is intentionally narrower than probability-based methods such as UnQover because it targets output-level bias. This makes BBQ stricter in one sense and less exhaustive in another: it may miss representational biases that do not flip the top answer, but it directly measures the level at which users experience harm (Parrish et al., 2021). Some categories, especially the intersectional ones, yielded mixed or inconsistent results in the original analysis, and proper names are acknowledged to be an imperfect proxy for race or gender.
Within those constraints, BBQ has become useful as a fairness and bias evaluation benchmark for QA systems and, more broadly, for instruction-following models and LLMs adapted to multiple-choice reasoning (Parrish et al., 2021). Its continuing importance lies in a precise pair of operational questions. When the context is ambiguous, does a model express justified uncertainty or make a stereotype-based attribution? When the context is informative, does bias reduce accuracy when truth conflicts with stereotype? A plausible implication is that BBQ’s enduring value is not that it exhausts social bias, but that it renders one particularly consequential part of it measurable in a controlled downstream task.