BERGEN: RAG Benchmarking Library
- The paper introduces BERGEN as an open-source framework that standardizes and automates benchmarking for retrieval-augmented generation pipelines.
- BERGEN’s modular design enables interchangeable components like retrievers, rerankers, and LLMs, facilitating rapid experimentation via configurable YAML setups.
- Empirical findings obtained with BERGEN show that integrated reranking improves both retrieval and generation metrics, and extend the analysis to multilingual RAG systems.
BERGEN (BEnchmark on Retrieval-augmented GENeration) is an open-source Python library designed to standardize, automate, and facilitate comprehensive benchmarking and reproducible experimentation in retrieval-augmented generation (RAG). Addressing the complexity of contemporary RAG pipelines—which encompass a diverse ecosystem of retrievers, rerankers, LLMs, datasets, and evaluation metrics—BERGEN provides an integrated, modular framework to enable rigorous analysis of end-to-end RAG performance across multiple dimensions (Rau et al., 2024).
1. System Architecture and Pipeline Structure
BERGEN’s architecture is organized around four distinct, configurable stages: collection management, first-stage retrieval, optional reranking, and answer generation. Each stage is implemented as an interchangeable module, supporting rapid experimentation and ablation; shared prompting templates and an evaluation engine complete the pipeline.
- Collection Manager: Orchestrates the downloading, preprocessing, and indexing of corpora, notably incorporating KILT Wikipedia (24.8 million passages of 100 words with title context) and multilingual variants. Datasets are managed in Hugging Face Arrow format, supporting memory- and disk-backed access.
- First-Stage Retriever: Supports a spectrum of sparse (BM25 via Pyserini, SPLADE_v2/v3) and dense retrieval architectures (CoCondenser, BGE, RepLLaMA, GTE, E5, RetroMAE, etc.). At inference, for a query $q$, the top-$K$ passages are shortlisted by retrieval score.
- Reranker: Optionally applies a cross-encoder reranker (MiniLM, DeBERTa-v3, BGE-M3) to the top-$K$ retrieved candidates, rescoring them and keeping a smaller subset for generation.
- Generative LLM Interface: Provides a model-agnostic interface for generation using Hugging Face, quantized, local, or custom LLMs. All hyperparameters and generation controls are YAML-configurable.
- Prompting Templates: Standardized templates govern system and user prompts, with explicit support for closed-book (no evidence) and multilingual prompting.
- Evaluation Engine: Implements a multi-metric assessment suite, covering surface-level (Exact Match, F1, ROUGE), semantic (BEM, LLMeval, GPT-4 judge), and retrieval metrics (Recall@$k$, MRR).
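The staged structure above can be illustrated with a minimal sketch. This is not BERGEN's API: the names `Passage`, `overlap_score`, `top_k`, `run_pipeline`, and `stub_generator` are hypothetical, the scorer is a toy lexical overlap standing in for a real retriever or cross-encoder, and the generator is a stub in place of an LLM. It only shows how an optional reranking stage and a closed-book setting slot into a retrieve, rerank, generate flow.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str


Scorer = Callable[[str, Passage], float]
Generator = Callable[[str, List[Passage]], str]


def overlap_score(query: str, passage: Passage) -> float:
    """Toy scorer (assumption): fraction of query terms found in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.text.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)


def top_k(query: str, passages: List[Passage], scorer: Scorer, k: int) -> List[Passage]:
    """Rank passages by score (highest first) and keep the first k."""
    ranked = sorted(passages, key=lambda p: scorer(query, p), reverse=True)
    return ranked[:k]


def run_pipeline(
    query: str,
    collection: List[Passage],
    generator: Generator,
    retriever: Scorer,
    reranker: Optional[Scorer] = None,
    retrieve_k: int = 3,
    keep_k: int = 2,
) -> str:
    """First-stage retrieval, optional reranking, then generation."""
    candidates = top_k(query, collection, retriever, retrieve_k)
    if reranker is not None:
        candidates = top_k(query, candidates, reranker, keep_k)
    return generator(query, candidates)


def stub_generator(query: str, contexts: List[Passage]) -> str:
    """Stand-in for an LLM: reports which passages it was given."""
    return ",".join(p.doc_id for p in contexts) or "closed-book"


COLLECTION = [
    Passage("d1", "Paris is the capital of France"),
    Passage("d2", "Berlin is the capital of Germany"),
    Passage("d3", "The Seine flows through Paris"),
]

if __name__ == "__main__":
    query = "capital of France"
    print(run_pipeline(query, COLLECTION, stub_generator, overlap_score, retrieve_k=2))
    print(run_pipeline(query, COLLECTION, stub_generator, overlap_score,
                       reranker=overlap_score, retrieve_k=2, keep_k=1))
    print(run_pipeline(query, COLLECTION, stub_generator, overlap_score, retrieve_k=0))
```

Because every stage is passed in as a callable, swapping a retriever or dropping the reranker is a one-argument change, which is the kind of ablation the modular design is meant to make cheap.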
2. Configuration, Modularity, and Extensibility
Experiments in BERGEN are defined through hierarchical Hydra YAML configurations, encapsulating every parameter—dataset, corpus, retriever, reranker, generation model, fine-tuning regime, and metric set. This mechanism enables full reproducibility and unambiguous experiment specification.
Adding new components is streamlined:
- To register a new LLM, users provide a minimal YAML file, specifying location, quantization, and other parameters.
- For retrievers or rerankers, the modular architecture allows rapid integration of additional backends and models.
- Multilingual workflows are natively supported via dataset and prompt configurations.
Typical configuration options include:
| Parameter | Example Values | Notes |
|---|---|---|
| dataset | kilt_nq, eli5, mkqa | Datasets from Hugging Face or KILT |
| collection | kilt_wikipedia, wikimedia/wikipedia | English or multilingual Wikipedia indexes |
| retriever | bm25, splade-v3, retromae, bge-base-en-v1.5 | Sparse and dense retrieval models |
| reranker | minilm6, deberta-v3, bge-reranker-v2-m3 | Cross-encoder rerankers |
| generator | Llama2-7B-chat, SOLAR-10.7B, Mixtral-8x7B | Any Hugging Face LLM or compatible local model |
| train | none, ft, lora | Zero-shot, full fine-tuning, or QLoRA |
| metrics | [exact_match, f1, rouge_l, llmeval] | Combination of surface, semantic, and retrieval metrics |
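To make the idea of a fully specified experiment concrete, the following sketch mimics how a set of defaults can be combined with `key=value` overrides. It is written in plain Python rather than with Hydra itself, and `DEFAULTS` and `apply_overrides` are hypothetical names; the parameter names and example values are taken from the table above, while the choice of defaults and the use of `null` to disable a stage are assumptions.

```python
from copy import deepcopy
from typing import Any, Dict, List

# Illustrative defaults (assumption); keys and values mirror the options table.
DEFAULTS: Dict[str, Any] = {
    "dataset": "kilt_nq",
    "collection": "kilt_wikipedia",
    "retriever": "bm25",
    "reranker": None,  # None means the optional reranking stage is skipped
    "generator": "Llama2-7B-chat",
    "train": "none",
    "metrics": ["exact_match", "f1", "rouge_l", "llmeval"],
}


def apply_overrides(base: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of `base` with 'key=value' overrides applied.

    Unknown keys are rejected so that a typo cannot silently produce
    a different experiment than the one intended.
    """
    cfg = deepcopy(base)
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        if key not in cfg:
            raise KeyError(f"unknown configuration key: {key}")
        cfg[key] = None if value == "null" else value
    return cfg


if __name__ == "__main__":
    cfg = apply_overrides(DEFAULTS, ["retriever=splade-v3", "reranker=deberta-v3"])
    for key, value in cfg.items():
        print(f"{key}: {value}")
```

Returning a fresh copy keeps the defaults immutable, so the resolved dictionary can be stored next to the results as the complete record of a run.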
3. Formal Task Modeling and Evaluation Metrics
Within BERGEN, the RAG problem is formalized as follows:
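- Given a query $q$, a retrieval system builds an index over the collection and retrieves the contexts relevant to $q$.
- An LLM generates a response conditioned on $q$ and the retrieved contexts, under a fixed system prompt template.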
Scoring comprises:
- Retrieval: Recall@$k$, mean reciprocal rank (MRR), computed over passage-level relevance.
- Generation: Exact Match, token-level Precision/Recall/F1, ROUGE.
- Semantic Matching: BEM (BERT matching, a learned answer-equivalence metric; an implementation is provided, with no new formula introduced).
- LLM-based Judging (LLMeval): Zero-shot scoring using, e.g., SOLAR-10.7B, prompted as semantic judges; binary judgments are averaged across samples.
Metrics are implemented as first-class modules; experimenters can select, combine, or extend them via configuration.
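The surface and retrieval metrics listed above have standard textbook definitions, sketched below. These are not BERGEN's implementations: the function names are hypothetical, the normalization (lowercasing and stripping ASCII punctuation) is an assumption, and Recall@$k$ is computed here as the fraction of relevant passages found in the top $k$, which is only one of several common conventions. The last helper shows how binary judge verdicts would be averaged into an LLMeval-style score.

```python
import string
from collections import Counter
from typing import List, Sequence, Set

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> List[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant passages that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_score(values: Sequence[float]) -> float:
    """Average per-sample scores, e.g. binary judge verdicts for LLMeval."""
    return sum(values) / len(values) if values else 0.0
```

MRR over a dataset is then `mean_score` applied to the per-query `reciprocal_rank` values, and the same averaging applies to any of the per-sample generation metrics.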
4. Benchmarking: Datasets, Component Studies, and Findings
BERGEN’s extensive benchmark suite encompasses over 500 RAG configurations across 10 English question-answering datasets and two large-scale multilingual datasets:
- Datasets: NaturalQuestions, TriviaQA, HotpotQA, ASQA, PopQA, SCIQ, WikiQA, Wizard-of-Wikipedia, ELI5, TruthfulQA (English); MKQA (26 languages), XOR-TyDi QA (7 languages).
- Collections: Standardized KILT Wikipedia, Wikimedia, and multilingual corpora.
Principal empirical results include:
- Retrieval quality (Recall@$k$) is highly predictive of end-to-end RAG success, as shown by its Kendall's $\tau$ rank correlation with LLMeval.
- Cross-encoder reranking (e.g. DeBERTa-v3) increases recall and semantic answer quality: for BM25, R@5 improves from 0.53 to 0.71 with reranking; corresponding LLMeval improvements are 0.1–0.2 points.
- State-of-the-art sparse (SPLADE-v3) and hybrid retrievers with reranking (DeBERTa-v3) perform best in aggregate (e.g., R@5 ≈ 0.83 on NQ).
- Retrieval augments LLMs of all scales: Llama2-7B with retrieval can surpass closed-book Llama2-70B; gains from retrieval are not strictly proportional to LLM parameter count.
- Fine-tuning (via QLoRA) yields significant gains, especially for resource-constrained models (e.g., +0.41 LLMeval on NQ for TinyLlama-1.1B).
- Oracle context (using only gold passages) sets an upper bound for LLMeval (0.82), with zero-shot models at 0.65.
- Certain tasks (ELI5, WoW, TruthfulQA, SCIQ) show minimal or negative benefit from retrieval, highlighting the influence of task design and dataset-document alignment.
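The first finding in the list rests on a Kendall rank correlation between retrieval quality and LLMeval across configurations. The sketch below shows how such a coefficient is computed; it implements the tau-a variant (no tie correction) in plain Python, `kendall_tau` is a hypothetical helper name, and the numbers in the usage example are invented toy values, not results from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    A pair of configurations is concordant when both measurements order
    them the same way, and discordant when the orders disagree. Tied
    pairs count toward the denominator only.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


if __name__ == "__main__":
    # Toy values for four hypothetical configurations (not paper results).
    recall = [0.2, 0.4, 0.6, 0.8]
    llm_eval = [0.30, 0.50, 0.45, 0.70]
    print(round(kendall_tau(recall, llm_eval), 3))
```

A value near 1 means that ranking configurations by retrieval recall gives almost the same order as ranking them by the end-to-end judge score.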
For multilingual RAG:
- BGE-M3 reranker enables robust cross-lingual retrieval.
- Generative LLM language is controlled via prompt translation and explicit generation instructions.
- Retrieval in the origin/source language outperforms target-language retrieval in some tasks (e.g., MKQA).
5. Best Practices and Experimental Guidelines
BERGEN distills a set of empirically derived recommendations to support rigorous and reproducible RAG research:
- Employ both sparse and dense retrievers; avoid reliance on BM25 alone.
- Always include a reranking stage prior to generation, leveraging cross-encoder architectures.
- Supplement classical metrics (EM, F1) with semantic judgment metrics (LLMeval, GPT-4 judge); these better capture the quality of modern RAG outputs, especially in zero-shot settings.
- Select datasets judiciously; not all benchmarks are suited for RAG (e.g., dialogue-centric or open-ended, non-document QA).
- Fix the collection snapshot and preprocessing pipeline; reproducibility depends critically on data versioning and deterministic document chunking.
- Control prompt templates across runs and explicitly set language outputs for multilingual experiments.
- Share full configurations (including seeds, architectures, and hyperparameters) to ensure full transparency and reproducibility.
6. Usage Patterns, Command-Line Workflow, and Code Integration
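BERGEN exposes all core functionality via command-line and scriptable APIs. Illustrative workflows include:
- Zero-shot evaluation of a chosen retriever, reranker, and generator on a dataset.
- Fine-tuning the generator with QLoRA.
- Adding a new LLM by supplying a YAML configuration file.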
This modularity enables rapid prototyping, component benchmarking, and head-to-head comparison across a diverse array of RAG configurations (Rau et al., 2024).
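A head-to-head comparison of this kind amounts to sweeping a grid of component choices. The sketch below generates one Hydra-style command line per combination; the `key=value` override form follows Hydra's general convention, but the entry-point name `bergen.py` and the helper `build_commands` are assumptions made for illustration, and the component names are taken from the configuration table earlier in the article. The commands are only built as strings, not executed.

```python
import shlex
from itertools import product
from typing import Dict, List


def build_commands(script: str, grid: Dict[str, List[str]]) -> List[str]:
    """Return one shell command per combination of values in `grid`."""
    keys = list(grid)
    commands = []
    for values in product(*(grid[key] for key in keys)):
        overrides = [f"{key}={value}" for key, value in zip(keys, values)]
        # shlex.join quotes any argument that is not shell-safe.
        commands.append(shlex.join(["python3", script, *overrides]))
    return commands


if __name__ == "__main__":
    grid = {
        "retriever": ["bm25", "splade-v3"],
        "reranker": ["minilm6", "deberta-v3"],
        "dataset": ["kilt_nq"],
    }
    # "bergen.py" is an assumed script name, used here only as a placeholder.
    for command in build_commands("bergen.py", grid):
        print(command)
```

Generating the commands from a single grid keeps every run's overrides explicit, so the exact set of compared configurations can be logged alongside the results.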
7. Position within the RAG Benchmarking Ecosystem
Compared to alternative toolkits, BERGEN is characterized by end-to-end, black-box RAG benchmarking, with emphasis on standardized experiment management and broad support for retrievers, rerankers, and generators (Mao et al., 2024). However, BERGEN does not dissect or diagnose errors at the component (phase) level; nor does it implement fine-grained failure mode taxonomies or pipeline-internal attributions as offered by XRAG. BERGEN is designed for large-scale, surface-to-semantic metric-based RAG evaluation rather than component-level debugging or diagnostic engineering. Its design is informed by best-practice guidelines from more granular benchmarking frameworks and recent ablation studies (Do et al., 8 Feb 2026).
By providing a metrics-rich, highly configurable, and reproducible benchmarking suite, BERGEN serves as a central infrastructure for empirical study and optimization of RAG systems in both academic and industrial research.