AIR-Bench: Multi-Domain AI Benchmark Suite

Updated 18 May 2026

AIR-Bench is a suite of automated, multi-domain benchmarks that evaluate AI performance in information retrieval, audio-language comprehension, and regulatory safety.
It leverages LLM-generated synthetic data and rigorous quality control methods to replace human annotation while incorporating diverse tasks and metrics such as nDCG and Recall.
The benchmarks provide extensible, regulation-aligned evaluation tools that support dynamic updates and robust comparisons for both research and practical AI applications.

AIR-Bench is a designation shared by several prominent benchmarks and evaluation frameworks in artificial intelligence, primarily in the domains of information retrieval (IR), large audio-language models (LALMs), AI safety, and regulatory compliance assessment. Multiple distinct initiatives, each published as “AIR-Bench” or a close variant, have addressed benchmarking for heterogeneous IR, generative comprehension in audio-language interaction, regulation-driven AI safety, and compliance with AI regulation. This survey details the three benchmarks that carry the “AIR-Bench” name: AIR-Bench (Automated Heterogeneous Information Retrieval Benchmark) (Chen et al., 2024); AIR-Bench (Audio Instruction Benchmark) (Yang et al., 2024); and AIR-Bench 2024 (Regulation-Aligned Safety Benchmark) (Zeng et al., 2024). For context, it also covers the related AIReg-Bench (Marino et al., 1 Oct 2025) and AIRS-Bench (AI Research Science Benchmark) (Lupidi et al., 6 Feb 2026). Each is technically and methodologically distinct.

1. Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench-IR) (Chen et al., 2024)

AIR-Bench provides a fully automated, domain- and language-diverse, and continually extensible suite for evaluating IR models. Its central innovation is replacing costly, time-consuming human query and relevance annotation pipelines with high-fidelity synthetic data, generated and curated entirely by LLMs.

Core Features and Coverage

Automated Test Set Generation: All queries, positive (“relevant”) documents, and hard negatives are generated by prompting a state-of-the-art LLM (OpenAI GPT-4) in a zero-shot setting. There is no human crafting or labeling of queries or qrels.
Heterogeneous Task and Domain Structure: AIR-Bench covers two IR paradigms: ad hoc question answering (QA) and chunked retrieval from long documents (Long-Doc, prominent in RAG settings). These are applied across nine domains (news, finance, healthcare, law, web, Wikipedia, ArXiv, books, science) and thirteen languages (including English, Chinese, Arabic, French, German, Hindi, Japanese, and Russian).
Dynamic Growth: The suite is versioned in periodic "snapshots" (24.04, 24.05, ...), with each release able to incorporate new domains, tasks, or languages. All benchmarks, scripts, and metadata are publicly maintained at https://github.com/AIR-Bench/AIR-Bench.

Data Generation Pipeline

The data-generation pipeline comprises three stages:

Corpora Preparation: Domain- and language-specific corpora are curated, filtered for quality, and, for Long-Doc tasks, split into chunks with overlapping tokens.
Candidate Generation:
- For each instance, the LLM conditions on a sampled positive document, generates a persona/scenario, produces a diverse set of query phrasings, minimizes lexical overlap, and synthesizes multiple “hard negative” documents that match query surface structure but are non-relevant.
Quality Control:
- Synthetic query-document pairs are filtered using an LLM as a relevance labeler (4-level relevance scale), and pairs judged non-relevant are removed.
- Candidate documents are retrieved and re-ranked using strong embedding and supervised reranker models; final relevance labels are subject to majority voting and additional LLM arbitration to correct false positives and negatives.
- After this filtering, queries are split into dev/test as per standard IR conventions.
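Two of the steps above can be sketched in a few lines: splitting a long document into token-overlapping chunks, and combining several relevance judgments by majority vote. This is a minimal illustration, not the benchmark's actual code. The function names, the chunk size and overlap, and the rule that an undecided vote returns `None` (to be passed on to arbitration) are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Optional, Sequence


def chunk_tokens(tokens: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Split a token sequence into chunks of `size` that share `overlap` tokens.

    Hypothetical helper: the source only says chunks are token-overlapping.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


def majority_label(votes: Sequence[int]) -> Optional[int]:
    """Return the label backed by a strict majority of voters, else None.

    A None result stands for "no agreement"; in this sketch such cases
    would be handed to a further arbitration step (an assumption).
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


if __name__ == "__main__":
    print(chunk_tokens("a b c d e f g h i j".split(), size=4, overlap=2))
    print(majority_label([1, 1, 0]))  # two of three voters agree
    print(majority_label([1, 0]))     # tie, so no majority
```

Returning `None` on a tie keeps the disagreement visible instead of silently picking one label, which fits a pipeline where a separate arbiter has the final say.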

Evaluation Metrics

AIR-Bench employs standard IR metrics:

Precision@$k$:

$\mathrm{P}@k = \frac{1}{k}\sum_{j=1}^k \mathbf{1}[d_j \text{ is relevant}]$

where $d_j$ is the document at rank $j$.

Recall@$k$:

$\mathrm{R}@k = \frac{\sum_{j=1}^k \mathbf{1}[d_j \text{ is relevant}]}{\#\text{relevant documents in corpus}}$

Discounted Cumulative Gain (DCG@$k$):

$\mathrm{DCG}@k(q) = \sum_{j=1}^k \frac{2^{\mathrm{rel}_j} - 1}{\log_2(j + 1)}$

where $\mathrm{rel}_j \in \{0,1,2,3\}$ is the graded relevance of the document at rank $j$.

Normalized DCG:

$\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}$

where $\mathrm{IDCG}@k$ is the DCG@$k$ of the ideal ranking, in which documents are sorted by decreasing relevance.

AIR-Bench Metrics: QA tasks use nDCG@10; Long-Doc tasks use Recall@10.
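The four formulas above translate directly into code. The sketch below is illustrative only and is not taken from the AIR-Bench toolkit. It assumes each ranked list is given as a list of relevance grades in rank order, and that a document counts as "relevant" for Precision and Recall when its grade is above zero; that threshold is an assumption of this example.

```python
import math
from typing import Optional, Sequence


def precision_at_k(rels: Sequence[int], k: int) -> float:
    """P@k: share of the top-k results whose grade is above zero."""
    return sum(1 for r in rels[:k] if r > 0) / k


def recall_at_k(rels: Sequence[int], k: int, n_relevant: int) -> float:
    """R@k: relevant results in the top k over all relevant documents."""
    if n_relevant == 0:
        return 0.0
    return sum(1 for r in rels[:k] if r > 0) / n_relevant


def dcg_at_k(rels: Sequence[int], k: int) -> float:
    """DCG@k with gain 2^rel - 1 and a log2(rank + 1) discount."""
    return sum((2 ** r - 1) / math.log2(j + 1)
               for j, r in enumerate(rels[:k], start=1))


def ndcg_at_k(rels: Sequence[int], k: int,
              all_rels: Optional[Sequence[int]] = None) -> float:
    """nDCG@k: DCG@k divided by the DCG@k of the ideal ordering.

    `all_rels` holds the grades of every judged document for the query;
    if omitted, the ideal ranking is built from `rels` alone.
    """
    pool = all_rels if all_rels is not None else rels
    idcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = [3, 0, 2, 0, 1]  # hypothetical grades for one query
    print(precision_at_k(ranked, 5))
    print(recall_at_k(ranked, 5, n_relevant=4))
    print(ndcg_at_k(ranked, 5))
```

Passing `all_rels` matters when some relevant documents were not retrieved at all: without it, the ideal ranking is built only from the retrieved list and nDCG is overstated.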

Alignment with Human-Labeled Data

An experimental comparison on MS MARCO (G-MSMARCO, the synthetic version, vs. R-MSMARCO, the real one) found strong rank-order concordance between model performance on real and synthetic data when full QC is applied ($\rho = 0.8204$, $p = 3 \times 10^{-5}$). This suggests that LLM-generated IR test sets, when subject to robust QC, rank models in much the same way as traditional human-annotated ones.
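To make the rank-order comparison concrete, the sketch below computes a rank correlation between per-model scores on a real and a synthetic test set. It assumes that $\rho$ denotes Spearman's rank correlation, which the text implies but does not state, and the model scores are invented purely for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores of five retrieval models (not from the paper).
    real_scores = [0.30, 0.35, 0.41, 0.44, 0.52]
    synthetic_scores = [0.28, 0.40, 0.38, 0.47, 0.55]
    print(round(spearman_rho(real_scores, synthetic_scores), 4))  # 0.9
```

In the toy data only two adjacent models swap places, so the correlation stays high; a benchmark that reordered many models would drive the value toward zero.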

Extensibility and Usage

To extend AIR-Bench, practitioners add new corpora, edit the manifest, and re-run the standardized pipeline; prompt templates are language-agnostic. Public resources provide SDKs and leaderboards (retrieval-only, reranking, hybrid), and test qrels remain hidden for leaderboard integrity.

2. AIR-Bench: Generative Comprehension Benchmark for Audio-LLMs (Yang et al., 2024)

AIR-Bench addresses evaluation in the LALM paradigm, where models process and reason about audio (speech, sound, music) in generative interaction settings.

Motivation and Distinctiveness

Broader Scope: Traditional benchmarks focus on ASR or classification; AIR-Bench assesses both basic perceptual/classificatory capabilities and extended generative comprehension and instruction-following.
Compositional Coverage: AIR-Bench includes a Foundation Benchmark (19 tasks, 19k single-choice questions) and a Chat Benchmark (2k open-ended, generative Q&A items).

Benchmark Construction

Foundation Benchmark

Tasks: 19 (speech, sound, music), each with ≈1k single-choice questions sourced and derived from existing labeled datasets (e.g., LibriSpeech, Common Voice, IEMOCAP, NSynth).
Question Generation: GPT-4 is used to produce 50 phrasing variants per task. Distractors are inherited, sampled, or LLM-generated, and the answer options are shuffled to reduce position bias.
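The option-shuffling step can be sketched as follows. This is a hypothetical illustration rather than the benchmark's own tooling: the function name, the lettering scheme, and the sample question are assumptions. The point it shows is that the gold letter has to be derived after shuffling, so the answer key stays correct wherever the true answer lands.

```python
import random
from typing import List, Tuple

LETTERS = "ABCDEFGH"  # supports up to eight options in this sketch


def build_single_choice(question: str, answer: str, distractors: List[str],
                        rng: random.Random) -> Tuple[str, str]:
    """Return (prompt text, gold letter) with options in random order."""
    options = [answer] + list(distractors)
    if len(options) > len(LETTERS):
        raise ValueError("too many options for the available letters")
    rng.shuffle(options)  # randomize positions to reduce position bias
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    gold = LETTERS[options.index(answer)]  # recorded after the shuffle
    return "\n".join(lines), gold


if __name__ == "__main__":
    prompt, gold = build_single_choice(
        "Which instrument is playing?",      # hypothetical question
        "violin",
        ["piano", "flute", "drums"],
        random.Random(0),                    # seeded for reproducibility
    )
    print(prompt)
    print("gold:", gold)
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible across runs, which helps when a fixed test set has to be regenerated.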

Chat Benchmark

Content: 2k open-ended Q&A items spanning speech, sound, music, and mixed audio (synthetic combinations built with loudness control and temporal dislocation).
Process: GPT-4 synthesizes scenario-specific questions and “gold” reference answers, filtered and manually audited.

Evaluation Framework

Foundation: Binary scoring by GPT-4 against answer key.
Chat: Scalar scores (1–10), evaluating hypotheses against several criteria; bias mitigation via swapped-reference scoring.
Mathematical Definitions:

Chat score, averaged over the two swapped-order judgments $s_1$ and $s_2$:

$\mathrm{Score}(\hat{y}, y) = \frac{1}{2}(s_1 + s_2), \quad s_i \in [1,10]$

Foundation accuracy:

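$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]$

where $N$ is the number of single-choice questions, $\hat{y}_i$ is the option selected by the model, and $y_i$ is the answer key.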
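Both scoring rules in this section are simple enough to sketch directly. The code below is illustrative only. It assumes the judge's two scores are already available as numbers, and it uses exact string match against the answer key as a stand-in for the GPT-4 binary judgment described above.

```python
from typing import Sequence


def chat_score(s1: float, s2: float) -> float:
    """Average of the two judge scores from the swapped-order passes."""
    for s in (s1, s2):
        if not 1 <= s <= 10:
            raise ValueError("judge scores must lie in [1, 10]")
    return 0.5 * (s1 + s2)


def foundation_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of single-choice questions answered correctly.

    Exact match replaces the LLM judge here (an assumption of this sketch).
    """
    if not gold or len(predicted) != len(gold):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    print(chat_score(7, 8))                                  # 7.5
    print(foundation_accuracy(list("ABCD"), list("ABDD")))   # 0.75
```

Averaging the two passes means a judge that systematically favors whichever answer appears first affects both candidates equally.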

Results and Consistency

Top Audio-LLMs: Qwen-Audio-Turbo (57.8% accuracy Foundation; 6.34/10 Chat), Qwen-Audio-Chat, SALMONN, PandaGPT.
Chat vs. Foundation: Whisper + GPT-4 pipeline achieves 7.54/10 on speech Q&A, setting a practical upper bound for current system capabilities.
LLM Evaluator Reliability: >98% agreement on Foundation tasks with human raters.

Main Insights

Major model deficiencies include low accuracy on fine-grained music and pitch tasks, difficulty following required output formats, and vulnerability to mixed-audio contexts (speech overlapping with music or sound). Future LALMs should emphasize diverse instruction tuning, improved cross-modal fusion, and automated evaluator prompt engineering.

3. AIR-Bench 2024: Regulation-Driven AI Safety Benchmark (Zeng et al., 2024)

AIR-Bench 2024 is a regulation- and policy-aligned AI safety evaluation framework, constructed from a bottom-up analysis of statutory and company regulatory language.

Taxonomy and Dataset Construction

Four-Tier Taxonomy: Based on eight major government regulations and sixteen leading company policies, AIR 2024 develops a safety taxonomy with 4 Level-1 domains, 16 Level-2 categories, 45 Level-3 categories, and 314 granular Level-4 risk categories.
Prompt Construction: For each Level-4 risk, 5–10 “base prompts” are generated via LLMs (GPT-4-Turbo) and rigorously edited for clarity and context diversity, then mutated (dialects, endorsement) and manually vetted. The final dataset spans 5,694 prompts (~18 per category).

Evaluation Protocol

Category-Specific Autograders: Each prompt/response pair is scored by a judge LLM, prompted with in-context examples, as 0 (harmful), 0.5 (ambiguous), or 1 (refusal).
Refusal Rate:

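$\mathrm{RefusalRate} = \frac{1}{N}\sum_{i=1}^{N} s_i, \quad s_i \in \{0, 0.5, 1\}$

where $N$ is the number of evaluated prompts and $s_i$ is the autograder score for prompt $i$.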

Human Grader Validation: Autograders yield Cohen’s κ=0.86 with human raters.
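The scoring scheme above can be sketched as a mean over autograder scores, computed both overall and per risk category. This is a minimal illustration under the assumption that refusal rate is the average of the 0 / 0.5 / 1 scores; the category names and records are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

VALID_SCORES = {0.0, 0.5, 1.0}  # harmful, ambiguous, refusal


def refusal_rate(scores: Iterable[float]) -> float:
    """Mean autograder score over a set of prompts."""
    values: List[float] = list(scores)
    if not values:
        raise ValueError("no scores given")
    if any(s not in VALID_SCORES for s in values):
        raise ValueError("scores must be 0, 0.5, or 1")
    return sum(values) / len(values)


def refusal_rate_by_category(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (category, score) records and compute one rate per category."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for category, score in records:
        grouped[category].append(score)
    return {category: refusal_rate(vals) for category, vals in grouped.items()}


if __name__ == "__main__":
    # Hypothetical graded responses: (risk category, autograder score).
    records = [
        ("hate speech", 1.0), ("hate speech", 0.0),
        ("regulated advice", 1.0), ("regulated advice", 0.5),
    ]
    print(refusal_rate(score for _, score in records))  # 0.625
    print(refusal_rate_by_category(records))
```

Reporting per-category rates alongside the aggregate is what exposes cases where a high overall number hides a weak fine-grained category.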

Key Findings

No model achieves uniform refusal. Best-in-class models (Claude 3, Gemini 1.5 Pro) average roughly 85–89% refusal rates but only about 71% on the EU AI Act “high-risk” set, with markedly lower refusal for categories such as regulated advice and automated eligibility.
Internal category analysis reveals substantial variation: for example, “hate toward Gender” elicits harmful outputs from 40–70% of models, despite high aggregate “Hate Speech” refusal.
AIR-Bench 2024 directly exposes the gap between empirical safety benchmarks and formal regulatory compliance.

Implications and Future Work

Regulators can use AIR-Bench 2024 as an audit tool; developers and researchers can identify and prioritize granular alignment deficiencies.
The taxonomy and prompt sets require ongoing updates as regulations and policies evolve.
Expansion to multimodal testing and continuous monitoring (“red-teaming drift”) is an urgent frontier.

4. Related Benchmarks: Regulatory Compliance (AIReg-Bench) and Research Agents (AIRS-Bench)

Although not titled “AIR-Bench,” two related resources are noteworthy for completeness:

AIReg-Bench (Marino et al., 1 Oct 2025): A compliance assessment benchmark for LLMs, focusing on granular documentation-based judgement of conformity with the EU AI Act’s requirements, validated by legal experts and spanning five core Articles. Frontier models are compared to expert Likert ratings; Gemini 2.5 Pro achieves κ_w = 0.863, with bias and failure mode analyses detailed.
AIRS-Bench (Lupidi et al., 6 Feb 2026): Focuses on end-to-end evaluation of AI research agents over the full scientific research lifecycle, with a task suite culled from contemporary SOTA publications and using agentic scaffolds for code, analysis, and refinement.

5. Comparative Table of Distinct AIR-Bench Resources

Benchmark	Domain	Benchmarking Focus
AIR-Bench (IR)	Information Retrieval	Automated, heterogeneous, dynamic IR
AIR-Bench (Audio)	Audio-LLMs	Generative perception/comprehension
AIR-Bench 2024	AI Safety	Regulation- and policy-aligned safety
AIReg-Bench	Regulation Compliance	EU AI Act conformity, expert-labeled

Each initiative is independently developed, yet shares a focus on automating, diversifying, or aligning evaluation to match the evolving requirements of modern AI model development and governance.

6. Significance and Role in the Research Community

AIR-Bench benchmarks collectively represent a methodological turn in AI evaluation:

Prioritizing automation (via LLMs) over costly expert annotation;
Embracing domain, task, and language heterogeneity;
Aligning evaluation axes with external, real-world standards—be those regulatory, policy, or user-interaction driven;
Open-sourcing resources, data, and leaderboards to drive continuous improvement and reproducibility.

Available resources and rigorous protocol documentation (Chen et al., 2024, Yang et al., 2024, Zeng et al., 2024), along with empirical validation against human judgments, render AIR-Bench derivatives foundational in benchmarking for information retrieval, audio-language comprehension, and regulatory-aligned AI safety.


References (5)

AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark (2024)

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension (2024)

AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies (2024)

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents (2026)

AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance (2025)
