
CRAG: A Comprehensive RAG Benchmark

Updated 30 March 2026
  • Comprehensive RAG Benchmark (CRAG) is a large-scale, multi-domain evaluation suite designed to assess retrieval, augmentation, and generation stages for reducing hallucinations in QA systems.
  • It employs rigorous methodologies including multi-stage retrieval, query diversification, and dynamic evidence fusion to simulate real-world scenarios with long-tail entities and temporal dynamism.
  • CRAG has set a new industry and academic standard, driving advances in multimodal QA and offering robust metrics that balance factual accuracy against hallucination risks.

The Comprehensive Retrieval-Augmented Generation Benchmark (CRAG) is a large-scale, multi-domain, dynamic evaluation suite designed to stress-test Retrieval-Augmented Generation (RAG) systems, with an emphasis on reducing hallucinations and increasing factual reliability in both unimodal and multi-modal (notably vision-text) question answering (QA). Conceived in response to the gaps left by static, Wikipedia-centric, and limited-scale QA benchmarks, CRAG simulates real-world scenarios involving long-tail entities, high temporal dynamism, and complex information synthesis, and has quickly become an industry and academic standard for rigorous RAG evaluation (Yang et al., 2024, Zhang et al., 29 Jul 2025, Wang et al., 30 Oct 2025).

1. Benchmark Definition, Motivation, and Historical Context

CRAG was designed to systematically evaluate three interdependent RAG pipeline stages: retrieval (selection of relevant evidence), augmentation (integration of evidence with the query), and generation (formulation of a grounded, non-hallucinatory answer). Prior benchmarks, such as Natural Questions or TriviaQA, are limited by static facts, lack of external noise, and insufficient coverage of temporal or entity diversity. In contrast, CRAG introduces:

  • Cross-domain question diversity: finance, sports, music, movies, open/encyclopedic, and in CRAG-MM (the multi-modal/multi-turn extension): domains like shopping, science, fashion, and more.
  • Explicit temporal stratification: queries marked as real-time, fast-changing, slow-changing, or static, tied to the dynamism of real-world facts.
  • Long-tail stress: balanced entity sampling (head, torso, tail) and support for rare and short-lived facts through synthetic and user-derived queries.
  • Support for both text-based and visual QA: notably in CRAG-MM, which targets vision-LLMs interacting with egocentric images and wearable-device scenarios (Zhang et al., 29 Jul 2025, Wang et al., 30 Oct 2025).

This ambitious scope led to CRAG forming the backbone of the KDD Cup 2024 and 2025 challenges, fostering robust, reproducible, and community-driven benchmarking.

2. Dataset Composition and Task Design

CRAG and its multi-modal variant CRAG-MM are constructed as follows:

             CRAG                                          CRAG-MM
Domains      Finance, Sports, Music, Movie, Open           13 domains incl. Shopping, Food, Vehicles, etc.
Items        4,409 single-turn QA pairs                    6,500 image-QA pairs + 2,000 multi-turn dialogues
Modalities   Text: web pages + KG APIs                     Vision + text: images, image-KG, web, dialogue
Annotation   Eight question types: fact, comparison, etc.  Six question types: recognition, reasoning, multi-hop, etc.

Task Formats:

  • CRAG: three main tasks—retrieval summarization (5 web pages), KG+web augmentation (5 web + KG APIs), end-to-end RAG (50 pages + KG).
  • CRAG-MM: single-source augmentation (image-KG only), multi-source augmentation (image-KG + web), multi-turn conversational QA with up to 5 turns (Zhang et al., 29 Jul 2025, Wang et al., 30 Oct 2025).

Questions are crafted to require: single and multi-hop reasoning, fact lookup, aggregation, comparison, set enumeration, post-processing, and false-premise detection. Egocentric images (CRAG-MM) target long-tail and real-world problems (e.g., occlusion, low-light).

CRAG provides large retrieval corpora: up to 50 HTML pages per query and a mock KG of 2.6 million entities in CRAG; 68,000 KG entries and 800,000 web documents in CRAG-MM.

3. Evaluation Protocols and Metrics

CRAG employs evaluation metrics that penalize hallucination, reward truthful and complete answers, and provide detailed breakdowns:

Evaluation aspect                    Metric / definition
Truthfulness (final score)           Mean per-example score: +1 (perfect), +0.5 (acceptable), 0 ("I don't know"), −1 (incorrect/hallucinated) (Wang et al., 30 Oct 2025, Yang et al., 2024)
Accuracy / Hallucination / Missing   Accuracy = (# perfect + # acceptable)/N; Hallucination = (# incorrect)/N; Missing = (# missing)/N
Multi-turn protocol (CRAG-MM)        A dialogue terminates after 2 consecutive errors; remaining turns score 0; final score = average per-turn score
Additional IR/QG metrics             Precision@k, Recall@k, F1, MRR, MAP; answer length capped at 50–75 tokens (Wang et al., 30 Oct 2025, Zhang et al., 29 Jul 2025)
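The per-example and multi-turn scoring rules above can be sketched in a few lines (a sketch only: the label names `perfect`/`acceptable`/`missing`/`incorrect` are illustrative, and treating only `incorrect` as a dialogue-ending error is an assumption):

```python
def example_score(label: str) -> float:
    """Map a per-example judgment to the CRAG truthfulness score."""
    return {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}[label]

def aggregate(labels: list[str]) -> dict:
    """Aggregate per-example labels into the benchmark's headline metrics."""
    n = len(labels)
    return {
        "truthfulness": sum(example_score(l) for l in labels) / n,
        "accuracy": sum(l in ("perfect", "acceptable") for l in labels) / n,
        "hallucination": labels.count("incorrect") / n,
        "missing": labels.count("missing") / n,
    }

def multi_turn_score(turn_labels: list[str], stop_after: int = 2) -> float:
    """CRAG-MM multi-turn rule: stop after `stop_after` consecutive errors;
    unreached turns contribute 0 to the per-turn average."""
    total, consecutive = 0.0, 0
    for label in turn_labels:
        if consecutive >= stop_after:
            break  # dialogue terminated; remaining turns score 0
        total += example_score(label)
        consecutive = consecutive + 1 if label == "incorrect" else 0
    return total / len(turn_labels)
```

Note how the averaging denominator stays at the full dialogue length, so early termination directly depresses the score.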

State-of-the-art solutions on CRAG and CRAG-MM consistently demonstrate significant gains over LLM-only or naïve RAG. However, even industry-grade systems show persistent failure modes, with hallucination rates often exceeding 15–25% and deficits on long-tail, dynamic, and complex question types.

4. RAG Pipeline Methodologies and Key Frameworks

Top-ranked CRAG/CRAG-MM solutions follow a multi-step RAG protocol, often involving:

  • Retrieval: Embedding-based nearest-neighbor search (CLIP, ViT, Sentence-T5, BAAI/bge-m3), web search API access, and structured KG traversal.
  • Query Diversification: Multiple query rewrites, generated conditioned on the image, question, and dialogue history, to cover paraphrases and synonyms (Zhang et al., 29 Jul 2025).
  • Reranking: Listwise reranking using vision-LLMs or cross-encoder rerankers, with MAD-based dynamic thresholding to select the final evidence pool.
  • Hallucination Control: Refusal-tuned data (“I don’t know” examples), post-hoc verification stages (chain-of-verification, dual-path self-consistency, GPT-4o mini relabeling) (Chen et al., 27 Jul 2025, Zhang et al., 29 Jul 2025).
  • Data Augmentation: Paraphrase and hallucination-filtered answer re-generation to increase training signal, especially in low-resource settings (Zhang et al., 29 Jul 2025).
  • Unified Multi-task Fine-Tuning: LoRA-based parameter-efficient model adaptation for simultaneous retrieval query generation, reranking, and answer synthesis.

Prominent methods include dynamic routing based on query type and domain, multimodal contextual fusion, robust selection of references under noisy retrieval conditions, and explicit multi-stage verification to minimize hallucination at the expense of coverage (Chen et al., 27 Jul 2025, Ouyang et al., 2024).

5. Experimental Results and Insights

Key results on CRAG and CRAG-MM indicate:

  • Baseline LLM performance: ≤34% accuracy; vanilla RAG: ≤44% accuracy; SOTA RAG/industry models: ≤63% "perfect," with hallucination rates 17–25% (Yang et al., 2024, Wang et al., 30 Oct 2025).
  • Winning systems show that aggressive hallucination suppression (via “No Answer”/refusal) substantially increases net scores, as hallucination is penalized more harshly than omission (–1 vs 0).
  • Error analysis reveals the largest performance drops for finance and sports (high-dynamism), tail entities (–10 accuracy points), and set, aggregation, false-premise, and post-processing queries.
  • Ablations show each pipeline module (retrieval reranking, entity/time extraction, chain-of-thought prompting, post-generation verification) yields additive improvements in truthfulness and hallucination rates (Zhang et al., 29 Jul 2025, Chen et al., 27 Jul 2025, Ouyang et al., 2024).
  • In CRAG-MM, multi-turn dialogue settings expose robustness gaps: vanilla RAG achieves only 32–43% truthfulness, while best-in-class solutions reach ~45% (Wang et al., 30 Oct 2025).
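The refusal arithmetic behind the second bullet is simple expected value: under the ±1 rubric, answering with probability p of being correct scores p·(+1) + (1−p)·(−1) = 2p − 1 in expectation, while refusing scores 0, so a calibrated system should refuse whenever p < 0.5. A minimal sketch:

```python
def should_answer(p_correct: float) -> bool:
    """Answer only if the expected score beats refusing.

    Expected score of answering: p*(+1) + (1-p)*(-1) = 2p - 1.
    Refusing ("I don't know") scores 0, so answer iff p > 0.5.
    """
    return 2 * p_correct - 1 > 0
```

In practice p is unobserved, which is why winning systems approximate this rule with refusal-tuned training data and post-hoc verification rather than an explicit confidence estimate.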

6. Technical and Practical Takeaways

  • Retrieval Quality: Dense-hybrid retrieval combined with reranking and retrieval query rewriting consistently yields the largest gains, especially for complex and multi-source tasks.
  • Hallucination Mitigation: Refusal data, robust verification, and reranking are key. Overuse of refusal examples may result in an overly conservative system—refusal/data ratio tuning is necessary (Zhang et al., 29 Jul 2025).
  • Multi-source Fusion: For queries needing both structured and unstructured data, distinct evidence fusion and context window strategies are required.
  • Resource Efficiency: Competitive performance has been demonstrated with sub-10B parameter models using 4-bit LoRA; key advantage for real-world deployment (Chen et al., 2024).
  • Latency and Throughput: Top-performing solutions use pre-indexed retrieval, lightweight MLP classifiers for routing, and high-throughput vLLM for inference, balancing quality and operational constraints (Zhang et al., 29 Jul 2025, Chen et al., 27 Jul 2025).
  • Best Practices: Modular system design, query-conditioned chunking, dynamic adaptation to question entity/timeliness, systematic ablations, and orchestration of multi-stage verification.
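The "dynamic adaptation to question entity/timeliness" above can be made concrete with a routing table from the benchmark's dynamism classes to retrieval strategies (the strategy names and the mapping are illustrative assumptions, standing in for the learned routing classifiers the text mentions):

```python
# Illustrative routing table: dynamism class -> retrieval strategy.
ROUTES = {
    "real-time":     "kg_api_first",   # fresh facts: hit structured KG APIs first
    "fast-changing": "web_then_kg",    # recent events: web search, KG fallback
    "slow-changing": "hybrid_dense",   # dense retrieval fused with KG evidence
    "static":        "cached_index",   # a pre-built index suffices
}

def route(dynamism: str) -> str:
    """Pick a retrieval strategy for a query's predicted dynamism class."""
    return ROUTES.get(dynamism, "hybrid_dense")  # safe default for unknown classes
```

A production router would replace the table lookup with the lightweight MLP classifier mentioned above, but the control flow is the same.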

7. Impact, Limitations, and Future Directions

CRAG and CRAG-MM have rapidly catalyzed progress, attracting thousands of competition participants, facilitating reproducible and cross-system evaluation, and providing a substrate for new methodologies in multimodal and conversational RAG. The benchmarks remain in active maintenance, with plans to expand into multilingual and more diverse modalities and to include robust real-time and incremental retrieval scenarios (Yang et al., 2024, Wang et al., 30 Oct 2025). Limitations remain in human evaluation scalability, continued hallucination in high-noise or dynamic settings, and automation of API endpoint selection and fusion.

This comprehensive structure, evaluation methodology, and community-driven design have made the Comprehensive RAG Benchmark (CRAG) and its multi-modal extensions pivotal resources for advancing retrieval-augmented QA with high factual fidelity in both academic and production settings (Yang et al., 2024, Zhang et al., 29 Jul 2025, Wang et al., 30 Oct 2025, Chen et al., 2024, Chen et al., 27 Jul 2025, Ouyang et al., 2024).
