SIGIR 2025 LiveRAG Challenge

Updated 24 July 2025
  • The SIGIR 2025 LiveRAG Challenge is an international competition that benchmarks retrieval-augmented generation systems using a fixed web corpus and standardized LLMs.
  • Participants engineered modular RAG pipelines combining document retrieval, query rewriting, and evidence-grounded answer generation under strict live constraints.
  • Innovations such as hybrid retrieval, knowledge-aware reranking, and synthetic benchmark generation drove improved correctness and faithfulness metrics among top-performing teams.

The SIGIR 2025 LiveRAG Challenge was a large-scale, international competition aimed at advancing the research and practical deployment of Retrieval-Augmented Generation (RAG) for open-domain question answering. With participation from both academia and industry, the challenge established a rigorous, standardized environment for in-depth evaluation and benchmarking of end-to-end RAG pipelines, focusing on the crucial interplay between retrieval quality, prompt engineering, and evidence-grounded answer generation (Carmel et al., 7 Jul 2025).

1. Goals and Structure of the Challenge

The LiveRAG Challenge sought to facilitate controlled comparisons of RAG system components—namely, document retrieval, query rewriting, context curation, and prompt strategy—operating within a unified framework (Carmel et al., 7 Jul 2025). Participants were tasked with building RAG-based question answering systems constrained to use a fixed 15M-document web corpus (FineWeb-10BT) and a common open-source generative LLM (Falcon3-10B-Instruct). This standardization allowed head-to-head evaluations of retrieval and augmentation innovations independent of model size or training data.

The timeline spanned March–May 2025, culminating in a Live Challenge Day during which 40 selected teams (from 70 initial applicants across 27 countries) answered 500 unseen, synthetically generated questions within a strict two-hour window, leaving an average budget of roughly 14 seconds per answer. Submissions included the generated answer, supporting evidence passages, and the prompt fed to the LLM (Carmel et al., 7 Jul 2025). The event provided free access to AWS, Pinecone, and Falcon3 compute resources, lowering the barrier to entry and ensuring reproducibility.
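
For illustration, each answered question therefore required bundling the generated answer, the supporting passages, and the exact prompt. A minimal sketch of such a per-question record is shown below; the field names are assumptions for illustration, not the official submission schema.

```python
# Minimal sketch of a per-question submission record; field names are
# illustrative assumptions, not the official LiveRAG submission schema.
submission_record = {
    "question_id": "q-0042",  # hypothetical identifier
    "answer": "Answer text generated by Falcon3-10B-Instruct.",
    "evidence": [
        {"doc_id": "fineweb-doc-123", "passage": "Supporting passage text ..."},
        {"doc_id": "fineweb-doc-456", "passage": "Another supporting passage ..."},
    ],
    "prompt": "Final prompt string fed to the LLM, including the retrieved context.",
}
```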

2. Evaluation Process and Metrics

A two-stage evaluation process was implemented to ensure fairness and depth (Carmel et al., 7 Jul 2025). First, an automated LLM-judge approach scored answers for:

  • Correctness: The degree to which the response covered key/“vital” information, combining coverage of “Direct” and “Useful” claims (from reference answers) via Natural Language Inference (NLI)-based entailment.
  • Faithfulness: Extent of grounding—whether all claims in the answer could be mapped to explicit supporting evidence among the retrieved documents.

Formally, coverage and faithfulness scores were computed using NLI functions over decomposed, atomic claims:

$$\text{Cov}(a, r) = \alpha \left(\frac{\sum_{c \in D_r} NLI(a, c)}{|D_r|}\right) + (1-\alpha) \left(\frac{\sum_{c \in U_r} NLI(a, c)}{|U_r|}\right)$$

where $D_r$ and $U_r$ are the "Direct" and "Useful" claim sets, respectively, and $\alpha = 0.7$. Faithfulness is given by:

$$F(a, R) = \frac{\sum_{c \in C_a} \max_{r \in R} NLI(c, r)}{|C_a|}$$

with $C_a$ the set of answer claims and $R$ the evidence set.
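
For concreteness, both judge scores can be computed from pairwise NLI entailment judgments as in the sketch below. The `nli(x, y)` scorer is an assumed black box returning a value in [0, 1] and follows the argument order used in the formulas above; this is an illustrative reconstruction, not the official LiveRAG judging code.

```python
def coverage(answer, direct_claims, useful_claims, nli, alpha=0.7):
    """Cov(a, r): weighted entailment of 'Direct' and 'Useful' reference claims by the answer."""
    direct = sum(nli(answer, c) for c in direct_claims) / len(direct_claims) if direct_claims else 0.0
    useful = sum(nli(answer, c) for c in useful_claims) / len(useful_claims) if useful_claims else 0.0
    return alpha * direct + (1 - alpha) * useful


def faithfulness(answer_claims, evidence, nli):
    """F(a, R): each answer claim is credited with its best-supported evidence passage."""
    if not answer_claims:
        return 0.0
    return sum(max((nli(c, r) for r in evidence), default=0.0) for c in answer_claims) / len(answer_claims)
```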

High-performing teams advanced to manual review, where domain-expert annotators scored Coverage, Relatedness, and Quality on a [0–2] scale. The ranking correlated tightly with automated metrics, validating the evaluation framework.

3. Datasets and Benchmark Generation

Question diversity was central to the challenge. The 500-item test set for each session was automatically generated using the DataMorgana tool—a two-stage, LLM-driven synthetic Q&A benchmark generator (Filice et al., 22 Jan 2025). DataMorgana enabled structured configuration of user categories (e.g., “novice”, “expert”) and question types (factoid, non-factoid, multi-hop, search query–style, etc.), as well as control over their probability distributions. The system ensured high lexical, syntactic, and semantic diversity through probabilistic sampling and rigorous filtering. Diversity was validated using metrics such as N-Gram Diversity (NDG), Self-Repetition Scores (SRS), PoS-compression, and Homogenization Scores, with DataMorgana outperforming alternative methods on all axes (Filice et al., 22 Jan 2025).
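
To illustrate this configuration-driven generation, the sketch below samples a question specification from user-category and question-type distributions before any generation prompt is built. The category names, probabilities, and function are illustrative assumptions, not DataMorgana's actual schema or API.

```python
import random

# Illustrative category distributions; names and probabilities are examples,
# not DataMorgana's actual configuration.
USER_CATEGORIES = {"novice": 0.5, "expert": 0.5}
QUESTION_TYPES = {"factoid": 0.4, "non-factoid": 0.3, "multi-hop": 0.2, "search-query-style": 0.1}


def sample_question_spec(rng=random):
    """Sample one question specification; an LLM generation prompt is then composed from it."""
    user = rng.choices(list(USER_CATEGORIES), weights=list(USER_CATEGORIES.values()))[0]
    qtype = rng.choices(list(QUESTION_TYPES), weights=list(QUESTION_TYPES.values()))[0]
    return {"user_category": user, "question_type": qtype}


# e.g. {'user_category': 'expert', 'question_type': 'multi-hop'}
spec = sample_question_spec()
```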

This synthetic diversity was critical for simulating real-world “traffic”—with test cases spanning both single-hop and multi-hop reasoning, audience-specific language, and multiple domains—providing a rich and robust platform to stress-test and compare RAG systems.

4. System Architectures and Algorithmic Strategies

The challenge fostered a spectrum of RAG architectures, with leading submissions exhibiting substantial methodological diversity and innovation. Common architectural themes included:

  • Hybrid Retrieval: Most systems combined sparse (BM25 via OpenSearch) and dense (Pinecone/E5 or other embedding models) retrieval, often fusing top-$k$ results using strategies like Reciprocal Rank Fusion (RRF) or normalized score summation (Bakagianni et al., 18 Jun 2025, Fensore et al., 27 Jun 2025); a minimal RRF sketch follows this list.
  • Query Rewriting and Decomposition: Teams employed targeted query rewriting modules (often using their own LLMs) to correct typos, clarify or decompose multi-intent queries (Martinez et al., 20 Jun 2025, Dong et al., 26 Jun 2025), and, in some cases, to generate “hypothetical answers” used as additional queries during retrieval (Ran et al., 17 Jun 2025). PreQRAG, for example, first classified each query as single- or multi-document, then performed task-adapted rewrites and decomposition (Martinez et al., 20 Jun 2025).
  • Reranking and Evidence Curation: Many pipelines applied neural (e.g., BGE) or cross-encoder rerankers to select context passages (Cofala et al., 17 Jun 2025, Bakagianni et al., 18 Jun 2025, Fensore et al., 27 Jun 2025). Some, as in Marikarp’s winning entry, introduced “knowledge-aware reranking” that decomposed queries into knowledge elements, then iteratively summarized and used these to refine evidence selection (Zhou, 25 Jun 2025).
  • Answer Generation and Prompt Design: All systems converged on Falcon3-10B-Instruct for final answer generation, but with nuanced differences: prompt variants included chain-of-thought instructions (Dong et al., 26 Jun 2025), rationale-based denoising (Cofala et al., 17 Jun 2025), order manipulation (e.g., presenting high-ranked documents closest to the query), and answer refusal logic (“I don't know” if unsupported) (Ran et al., 17 Jun 2025).
  • Clustering and Filtering: Some approaches, such as TopClustRAG, applied K-Means clustering on retrieved passages for semantic diversity, feeding cluster representatives to the LLM and aggregating multiple intermediate responses (Bakagianni et al., 18 Jun 2025). Nugget-based pipelines extracted minimal, claim-level facts to maximize factuality and reduce redundancy (Łajewska et al., 27 Jun 2025).
  • Multi-Agent and Graph-based Reasoning: mRAG introduced a modular, multi-agent pipeline with specialized roles (planning, search, reasoning, validation, generation), coordinated by reward-guided self-training (Salemi et al., 12 Jun 2025). GeAR adapted graph-based retrieval, aligning passages to Wikidata triples to enable multi-hop and relationship-aware synthesis over millions of passages (Shen et al., 23 Jul 2025).
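
As a concrete illustration of the fusion step referenced in the hybrid-retrieval item above, the following is a minimal Reciprocal Rank Fusion sketch. It is a generic implementation under the standard RRF formulation (with the conventional constant k = 60), not any particular team's code, and the document IDs in the usage example are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    `rankings` is a list of ranked lists (best first), e.g. one list from BM25
    and one from a dense retriever; k=60 is the conventional smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Usage with hypothetical document IDs: documents ranked highly by both
# retrievers (d1, d3) rise to the top of the fused list.
bm25_hits = ["d3", "d1", "d7", "d2"]
dense_hits = ["d1", "d5", "d3", "d9"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```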

A summary table of innovative strategies is provided below.

| System | Core Innovations | Notable Metrics |
| --- | --- | --- |
| Marikarp | Knowledge-aware diverse reranking | Correctness: 1.23, Faithfulness: 0.66 (Zhou, 25 Jun 2025) |
| PreQRAG | Query classification and rewriting | Relevance: 0.884 (Martinez et al., 20 Jun 2025) |
| RAGtifier | Explicit rationale-based denoising, inverted context | Correctness: 1.13, Faithfulness: 0.55 (Cofala et al., 17 Jun 2025) |
| RMIT–ADM+S (GRAG) | Hypothetical answer augmentation, ANOVA selection | Relevance: 1.199, Faithfulness: 0.477 (Ran et al., 17 Jun 2025) |
| TopClustRAG | Clustering-based passage aggregation | Faithfulness: 0.460, Correctness: 0.685 (Bakagianni et al., 18 Jun 2025) |
| DoTA-RAG | Query rewriting, dynamic sub-index routing | Correctness: 0.929 (live) (Ruangtanusak et al., 14 Jun 2025) |
| Omni-RAG | LLM-powered deep query understanding, chain-of-thought | Rank 2, Session 1 (Dong et al., 26 Jun 2025) |

5. Insights, Outcomes, and Key Results

All submissions outperformed a “no-retrieval” baseline, underscoring the substantial impact of retrieval augmentation (Carmel et al., 7 Jul 2025). The top-ranking systems (e.g., Marikarp, UDInfo, RAGtifier) distinguished themselves through innovations in knowledge decomposition, reranking, and targeted prompt design, achieving correctness scores above 1.1 and faithfulness near 0.7 (Zhou, 25 Jun 2025, Martinez et al., 20 Jun 2025, Cofala et al., 17 Jun 2025).

Methodologies emerging from the LiveRAG Challenge have set important precedents for RAG system evaluation and real-world deployment (Wang et al., 27 Oct 2024, Cai et al., 21 May 2024). Key observations include:

  • Pipeline Modularity: Modular architectures—in which query processing, retrieval, reranking, and generation are developed and tuned as separate stages—facilitate ablation, error analysis, and targeted improvements (Łajewska et al., 27 Jun 2025).
  • Synthetic Benchmarks as Catalysts: Tools like DataMorgana provide systematic, configurable diversity in test cases, bridging the gap between narrow, in-domain evaluation and real-world application (Filice et al., 22 Jan 2025).
  • Robustness to Query Variety: Systems which effectively handle noisy, complex, or multi-intent queries through LLM-powered decomposition show consistent gains in both correctness and faithfulness (Dong et al., 26 Jun 2025).
  • Efficiency–Quality Tradeoffs: Deep, neural reranking (e.g., RankLLaMA, cross-encoders) and multi-agent strategies improve accuracy but challenge latency constraints; careful resource allocation and parallelization are essential under live evaluation demands (Fensore et al., 27 Jun 2025, Salemi et al., 12 Jun 2025), as illustrated by the sketch after this list.
  • Richer Evidence Structuring: Clustering and information nugget approaches point toward future evidence selection strategies that scale with corpus size and increase factual coverage without inflating prompt lengths or redundancy (Bakagianni et al., 18 Jun 2025, Łajewska et al., 27 Jun 2025).
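
To make the latency constraint concrete, the sketch below runs per-question pipelines concurrently within the two-hour live window; `answer_question` and the worker count are placeholders, not any team's implementation.

```python
import concurrent.futures

QUESTION_IDS = [f"q-{i:03d}" for i in range(500)]  # 500 live questions
TOTAL_BUDGET_SECONDS = 2 * 60 * 60                 # two-hour window, ~14.4 s per question on average


def answer_question(question_id: str) -> str:
    """Placeholder for a full retrieve-rerank-generate pipeline call."""
    return f"answer for {question_id}"


# Running several pipelines in parallel keeps the end-to-end run inside the
# live window even when an individual question exceeds the average budget.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    answers = dict(zip(QUESTION_IDS, pool.map(answer_question, QUESTION_IDS)))
```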

A plausible implication is that these methodological advances and open-source tools will inform both academic evaluation standards and industrial deployment in domains requiring scalable, reliable, and fact-grounded question answering.

6. Impact and Legacy

The SIGIR 2025 LiveRAG Challenge has catalyzed significant advances in retrieval-augmented generation, establishing best practices for evidence-driven answer generation, modular system design, and principled evaluation (Carmel et al., 7 Jul 2025). Its rigorous benchmarking, open resources, and emphasis on real-time, evidence-grounded reasoning have produced reproducible, extensible frameworks for future research. The event’s impact is evidenced by widespread adoption of DataMorgana for benchmark generation, consensus on hybrid and reranking strategies, and the adoption of explicit NLI-based quality metrics. Subsequent SIGIR and IR community challenges are likely to build upon this foundation, extending it to multimodal, cross-lingual, and dynamic knowledge base scenarios as highlighted in parallel RAG workshops and ongoing research (Wang et al., 27 Oct 2024).