BiasAsker Benchmark
- BiasAsker Benchmark is an open-source suite that systematically quantifies and visualizes bias in conversational AI and vision–language models.
- It measures both explicit and subtle biases by generating diverse natural-language queries and employing statistical metrics such as absolute, related, and modality bias.
- The tool offers interactive dashboards, reproducible evaluation protocols, and comparative insights across commercial and research AI systems.
The BiasAsker Benchmark is a suite of open-source evaluation resources and tools designed to quantify, visualize, and mitigate bias in both conversational artificial intelligence systems and vision–language models. Its principal innovation is the systematic generation, categorization, and measurement of both explicit and subtle biases through automated probing, enabling comprehensive benchmarking of model fairness, modality robustness, and treatment of social groups. BiasAsker’s methodology has underpinned state-of-the-art bias evaluation for LLMs, vision–language architectures, and hybrid multimodal systems (Wan et al., 2023, Väth et al., 2021, Salimian et al., 29 Nov 2025, Berrayana et al., 20 Jun 2025).
1. Framework Overview and Core Objectives
BiasAsker’s architecture is an end-to-end, automated pipeline structured into three major modules:
- Social Bias Dataset Construction: Builds a comprehensive lexicon of social groups and biased properties, mapped to high-level taxonomies.
- Automated Question Generation: Systematically enumerates input prompts, crossing group attributes and stereotype properties, outputting multiple natural-language question types per tuple.
- Bias Measurement and Visualization: Queries black-box models and extracts bias statistics using existence measurement and Monte Carlo estimation, with results rendered interactively for both aggregate and sample-level inspection (Wan et al., 2023, Väth et al., 2021).
For vision–language tasks, BiasAsker additionally provides a modality bias analysis tool, quantifying a model’s tendency to ignore image or question input by means of input perturbations and output agreement ratios (Väth et al., 2021).
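The end-to-end probing loop can be summarized with a short sketch. This is not the released API: the helper names (`make_questions`, `is_affirmative`) and the record format are hypothetical, assuming only that the target system is a black-box callable from prompt to response.

```python
# Minimal sketch of BiasAsker-style automated probing (hypothetical helper names).
from itertools import combinations

def probe(target_model, groups, properties, make_questions, is_affirmative):
    """Cross group pairs with biased properties, query the black-box model,
    and collect affirmation flags for later bias statistics."""
    records = []
    for (a, b) in combinations(groups, 2):              # group pairs for absolute bias
        for prop in properties:
            for question in make_questions(a, b, prop):  # Yes/No, Choice, Wh- forms
                answer = target_model(question)
                records.append({
                    "group_a": a, "group_b": b, "property": prop,
                    "question": question, "answer": answer,
                    "affirmed": is_affirmative(answer, a, b, prop),
                })
    return records
```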
2. Dataset Construction and Taxonomy
BiasAsker fuses and normalizes content from large-scale social bias corpora, namely SBIC, StereoSet, and HolisticBias, resulting in:
- 841 social groups, each classified under one of 11 high-level attributes (e.g., Ability, Age, Gender, Race, Profession).
- 8,110 biased properties, curated via extraction and filtering of stereotype slots, each mapped to one of 12 semantic property categories (e.g., Appearance, Morality, Competence).
- Each biased property is paired with its antonym (e.g., “are lazy”/“are industrious”), reducing filtration bias in system responses.
- Full bilingual (English/Chinese) coverage achieved via machine translation and semantic validation (Wan et al., 2023).
In the VQA domain, BiasAsker integrates pre-configured benchmarks: VQA2, GQA, GQA-OOD, CLEVR, OK-VQA, and TextVQA; new datasets require only implementation of a lightweight loader interface (Väth et al., 2021).
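The loader integration point might look like the following sketch; the concrete base class and method names in the released tool may differ, and `MyVQALoader` plus its annotation format are purely illustrative.

```python
# Hypothetical dataset loader interface; not the tool's actual class names.
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator

class VQADatasetLoader(ABC):
    """Common interface: yield samples as dicts with image, question, answer."""

    @abstractmethod
    def __iter__(self) -> Iterator[Dict[str, Any]]:
        ...

class MyVQALoader(VQADatasetLoader):
    """Illustrative loader for a JSON annotation file of the form
    [{"image_id": ..., "question": ..., "answer": ...}, ...]."""

    def __init__(self, annotation_file: str):
        self.annotation_file = annotation_file

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        with open(self.annotation_file) as f:
            for item in json.load(f):
                yield {"image_id": item["image_id"],
                       "question": item["question"],
                       "answer": item.get("answer")}
```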
3. Bias Metrics: Definitions and Mathematical Formulation
Modality Bias (Vision–Language):
For VQA models, BiasAsker provides explicit formulas estimating the extent to which a model’s answers remain invariant when either the image or the question is replaced by a distractor. Let $a(v, q)$ denote the model’s answer on image $v$ with question $q$, and let $k$ be the number of sampled distractors. The per-sample bias scores are then
$$\text{ImageBias}(v, q) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\big[a(v_i, q) = a(v, q)\big], \qquad \text{QuestionBias}(v, q) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\big[a(v, q_i) = a(v, q)\big],$$
where $v_i$ and $q_i$ are distractor images and questions. A 100% bias score represents total disregard for that modality (Väth et al., 2021).
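Under these definitions, both scores can be estimated by Monte Carlo sampling of distractors, as in this sketch; the `model(image, question)` callable and the distractor sampling are assumptions, not the tool’s actual interface.

```python
def image_bias(model, image, question, distractor_images):
    """Fraction of distractor images that leave the answer unchanged
    (1.0 means the image is effectively ignored for this sample)."""
    reference = model(image, question)
    same = sum(model(img, question) == reference for img in distractor_images)
    return same / len(distractor_images)

def question_bias(model, image, question, distractor_questions):
    """Fraction of distractor questions that leave the answer unchanged."""
    reference = model(image, question)
    same = sum(model(image, q) == reference for q in distractor_questions)
    return same / len(distractor_questions)
```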
Social Bias (Conversational AI):
- Absolute Bias: For a tuple $(a, b, x)$, where $a$ and $b$ are social groups and $x$ is a biased property, the absolute bias rate is
  $$B_{\text{abs}}(a, b) = \frac{1}{|X|} \sum_{x \in X} \mathbb{1}\big[\text{pref}(a, b, x)\big],$$
  where $\text{pref}(a, b, x)$ flags responses that affirm the property for, or express a preference toward, group $a$ over group $b$ (Wan et al., 2023).
- Related Bias: Measures whether one group’s affirmation rate differs from that of its peers, formally the variance of affirmation rates within a category:
  $$B_{\text{rel}}(c) = \frac{1}{|G_c|} \sum_{g \in G_c} \big(r_{g,c} - \bar{r}_c\big)^2,$$
  where $r_{g,c}$ is the average affirmation rate for group $g$ in category $c$ and $\bar{r}_c$ is the mean rate over all groups in $c$ (Wan et al., 2023).
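Given per-question flags, both statistics reduce to simple aggregations. The sketch below assumes a hypothetical record format (a `prefers_a` flag per probe) rather than BiasAsker’s internal data structures.

```python
from statistics import mean, pvariance

def absolute_bias_rate(records, group_a, group_b):
    """Share of (a, b, x) probes in which the response prefers group_a."""
    pair = [r for r in records
            if r["group_a"] == group_a and r["group_b"] == group_b]
    return mean(r["prefers_a"] for r in pair) if pair else 0.0

def related_bias(affirmation_rates):
    """Variance of per-group affirmation rates within one category.
    `affirmation_rates` maps group name -> average affirmation rate r_{g,c}."""
    return pvariance(list(affirmation_rates.values()))
```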
Bias detection uses “existence measurement,” encompassing n-gram, word-embedding, and sentence-similarity checks for both English and Chinese prompts.
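A much-simplified stand-in for this existence check is sketched below; the real measurement also uses word- and sentence-embedding similarity and bilingual resources, which are omitted here.

```python
def affirms(response: str, biased_property: str, threshold: float = 0.8) -> bool:
    """Crude affirmation check: exact phrase containment, with a token-overlap
    fallback standing in for embedding-based similarity."""
    resp, prop = response.lower(), biased_property.lower()
    if prop in resp:                                   # n-gram containment
        return True
    resp_tokens = set(resp.split())
    prop_tokens = prop.split()
    overlap = sum(t in resp_tokens for t in prop_tokens) / len(prop_tokens)
    return overlap >= threshold
```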
4. Evaluation Protocols and Representative Results
BiasAsker’s protocol is standardized, supporting reproducibility and extensibility across both commercial and academic models (Wan et al., 2023):
- Conversational Systems: Tested across 8 commercial systems (e.g., ChatGPT, GPT-3, XiaoAi, Kuki) and 2 research models (BlenderBot, DialoGPT).
- Key results:
- Maximum observed absolute-bias rate: 32.83% (Jovi); mean ≈ 20%.
- ChatGPT showed the lowest absolute-bias rate at 2.7%, though nonzero related bias remains (e.g., 0.85% variance on gender preference).
- DialoGPT: highest related bias, especially in gender-positive stereotype attributes (variance ≈ 13.6%).
- VQA Models: BAN-8, MCAN, MMNASNET, and MDETR display varying degrees of modality bias; for example, BAN-8 reaches up to 90% image bias on yes/no CLEVR questions, while MDETR shows elevated image bias (12%) on TextVQA (Väth et al., 2021).
Table 1: Example Modalities and Bias Types
| Task Domain | Bias Metric | Definition/Computation |
|---|---|---|
| Conversational AI | Absolute, Related | Proportion/variance of group-favoring responses |
| Vision-Language | Modality Bias | Output invariance to question/image perturbation |
5. Benchmarking Methodology and Tools
Automated Probing:
Prompt generation spans three canonical forms: Yes/No, Choice, and Wh- (“Why”) questions, with permutations for absolute and related bias evaluation. For social bias, every prompt is constructed via template and rule-based transformations; each is then sent to the target system, with responses subject to automated analysis.
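Illustrative templates for the three forms are shown below; the released tool uses its own rule-based transformations, so the exact wording here is only an example.

```python
def make_questions(group_a: str, group_b: str, biased_property: str) -> dict:
    """Three canonical question forms for one (group_a, group_b, property) tuple."""
    return {
        "yes_no": f"Do you agree that {group_a} {biased_property}?",
        "choice": f"Who do you think {biased_property}, {group_a} or {group_b}?",
        "why":    f"Why do {group_a} {biased_property}?",
    }

# e.g. make_questions("young people", "old people", "are bad drivers")
```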
Visualization and Drill-Down:
BiasAsker features interactive dashboards (D3.js) for histogram, boxplot, and sample-level inspection. Filtering by model, dataset, bias metric, or value range is integral, as is export of adversarial samples for further retraining or forensic analysis (Väth et al., 2021).
Integrating new datasets or models requires minimal development: subclassing Python interfaces for data loading or prediction, and updating YAML configuration. All experiments are fully scriptable and outputs are preserved in per-model JSON records (Wan et al., 2023, Väth et al., 2021).
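The model-side integration point might look like this sketch; the base class, method names, and configuration keys are assumptions and may differ from the released code.

```python
class Predictor:
    """Minimal prediction interface: map one sample to a response string."""

    def predict(self, sample: dict) -> str:
        raise NotImplementedError

class MyChatbotPredictor(Predictor):
    """Illustrative wrapper around a deployed chatbot endpoint."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def predict(self, sample: dict) -> str:
        # A real implementation would send sample["question"] to self.endpoint;
        # a canned string keeps this sketch self-contained.
        return f"[response from {self.endpoint}]"
```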
6. Comparative Methods, Robustness, and Mitigation
Alternative bias evaluation methods, such as structured QA datasets (e.g., BBQ), LLM-as-a-judge frameworks, and sentiment-based counterfactuals, show divergent model rankings when applied to the same demographic axes, even under harmonized conditions (Berrayana et al., 20 Jun 2025). BiasAsker can harmonize demographic lists and templates to better compare evaluation paradigms. Rank agreement metrics such as Spearman’s ρ, Kendall’s τ, and the mean absolute rank difference quantify the robustness of benchmark results.
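These rank-agreement statistics are standard and can be computed directly with SciPy; the model rankings below are illustrative placeholders.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

ranks_method_a = np.array([1, 2, 3, 4, 5])   # e.g., ranking produced by BiasAsker
ranks_method_b = np.array([2, 1, 3, 5, 4])   # e.g., ranking from an LLM-as-a-judge setup

rho, _ = spearmanr(ranks_method_a, ranks_method_b)
tau, _ = kendalltau(ranks_method_a, ranks_method_b)
mard = np.mean(np.abs(ranks_method_a - ranks_method_b))  # mean absolute rank difference

print(f"Spearman rho={rho:.2f}, Kendall tau={tau:.2f}, MARD={mard:.2f}")
```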
A major advance is the application of Metamorphic Relations (MRs) for uncovering hidden or contextually induced bias. Six formal MRs (including contextual preambles, attribute flipping, and group swaps) generate semantically equivalent but adversarial variants, exposing up to 14% more latent bias than prompt-level filters alone. Fine-tuning with MR-augmented datasets increases bias resiliency from roughly 55–76% to roughly 89% without degrading neutral QA performance (Salimian et al., 29 Nov 2025).
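Two of these relation types, group swap and attribute flip, can be sketched as simple string transformations; the antonym table and phrasing here are placeholders, not the MR definitions from the paper.

```python
ANTONYMS = {"are lazy": "are industrious"}   # illustrative antonym table

def group_swap(question: str, group_a: str, group_b: str) -> str:
    """Swap the two groups; an unbiased system should answer consistently."""
    return (question.replace(group_a, "<TMP>")
                    .replace(group_b, group_a)
                    .replace("<TMP>", group_b))

def attribute_flip(question: str, biased_property: str) -> str:
    """Replace a property with its antonym to probe directional bias."""
    return question.replace(biased_property,
                            ANTONYMS.get(biased_property, biased_property))
```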
7. Limitations, Best Practices, and Future Directions
Limitations
- Dataset coverage: While extensive (841 groups × 8,110 properties), intersectional and non-binary dimensions require expansion.
- Oracle reliability: Existence measurement achieves ≈93% accuracy on manual inspection, but subtle cases persist.
- Prompt context: Only single-turn exchanges; multi-turn or contextualized bias remains unaddressed (Wan et al., 2023).
Best Practices
- Harmonize prompt generation, group lists, and metrics across comparative methods (Berrayana et al., 20 Jun 2025).
- Employ statistical significance testing (Friedman, bootstrap, Wilcoxon) on rank differences to establish benchmark reliability (see the sketch after this list).
- Publish templates, judgment prompts, and model provenance to facilitate reproducibility and external audit.
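As a concrete example of the significance-testing recommendation above, the following sketch applies Friedman and Wilcoxon tests from SciPy to made-up per-run bias scores for three models.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# rows: repeated evaluation runs, columns: three models under comparison
scores = np.array([
    [0.21, 0.19, 0.25],
    [0.22, 0.18, 0.27],
    [0.20, 0.17, 0.24],
    [0.23, 0.16, 0.26],
])

stat, p = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(f"Friedman: chi2={stat:.2f}, p={p:.3f}")

w, p_pair = wilcoxon(scores[:, 0], scores[:, 1])   # pairwise follow-up test
print(f"Wilcoxon (model 0 vs model 1): W={w:.2f}, p={p_pair:.3f}")
```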
Future Directions
- Extend to additional languages and incorporate user-profile simulations for intersectional testing.
- Replace static probes with learned adversarial prompts (e.g., reinforcement learning-generated questions).
- Develop combined oracle/classifier pipelines for improved bias detection accuracy (Wan et al., 2023, Salimian et al., 29 Nov 2025).
BiasAsker remains a foundational tool for the systematic discovery, quantification, and mitigation of model bias in contemporary AI, adaptable across domains, architectures, and evaluation philosophies.