ReproRAG: Benchmarking RAG Reproducibility
- ReproRAG is a framework that rigorously defines and quantifies reproducibility in vector-based retrieval systems using metrics like L2 distance and Kendall’s Tau.
- It employs a modular architecture with configurable Python modules, FAISS integration, and distributed coordination to diagnose non-determinism in embeddings, indexing, and hardware execution.
- The framework helps researchers balance trade-offs between embedding fidelity and computational efficiency by providing actionable diagnostics and transparent benchmarks.
Retrieval-Augmented Generation (RAG) systems have become foundational in modern generative AI workflows targeting knowledge-intensive domains, scientific research, and technical applications. Despite their promise in integrating structured retrieval from large vector databases with LLMs, the reproducibility of their results remains a persistent challenge, primarily due to non-determinism in retrieval pipelines, embedding variability, and distributed execution artifacts. The ReproRAG framework—introduced as a systematic, open-source benchmarking solution—aims to rigorously quantify and characterize reproducibility in vector-based retrieval systems underpinning RAG applications (Wang et al., 23 Sep 2025). Its design provides researchers and engineers with standardized tools to measure, diagnose, and ultimately improve the reliability of retrieval stages in RAG workloads.
1. System Architecture and Scope
ReproRAG comprises a modular architecture that enables comprehensive evaluation and benchmarking of reproducibility across all layers of the RAG pipeline. The framework operates via configurable Python modules that can orchestrate local and distributed retrieval workflows. Central components include:
- ExperimentConfig: Specification of test parameters—embedding model selection, vector precision, algorithmic backend (e.g., Flat, IVF, HNSW, LSH indices), number of documents, and top-k retrieval threshold (see the configuration sketch after this list).
- Core Retrieval Engine: Built on FAISS, supporting both exact and approximate nearest neighbor algorithms, with options for hardware acceleration on CPU and GPU.
- Distributed Coordination Module: MPI-based design for evaluation across multiple nodes, controlling data sharding strategies (e.g., hash-based, range-based, random with fixed seed) and enforcing deterministic aggregation barriers.
- ReproducibilityMetrics: Computation and reporting library for reproducibility measures at both the embedding and retrieval stages.
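For illustration, here is a minimal sketch of what such a configuration and its mapping onto FAISS backends might look like. The field and function names are assumptions for exposition, not the framework's actual API:

```python
from dataclasses import dataclass

import faiss  # exact and approximate index backends


@dataclass
class ExperimentConfig:
    """Illustrative test specification; field names are assumptions, not the framework's API."""
    embedding_model: str = "BAAI/bge-small-en-v1.5"
    vector_precision: str = "fp32"   # fp32 | fp16 | bf16 | tf32
    index_backend: str = "Flat"      # Flat | IVF | HNSW | LSH
    num_documents: int = 100_000
    top_k: int = 10
    seed: int = 42                   # fixed seed for reproducible index construction


def build_index(cfg: ExperimentConfig, dim: int) -> faiss.Index:
    """Map the configured backend onto a FAISS index (hypothetical helper)."""
    if cfg.index_backend == "Flat":
        return faiss.IndexFlatL2(dim)                    # exact k-NN search
    if cfg.index_backend == "HNSW":
        return faiss.IndexHNSWFlat(dim, 32)              # probabilistic graph index
    if cfg.index_backend == "IVF":
        quantizer = faiss.IndexFlatL2(dim)
        return faiss.IndexIVFFlat(quantizer, dim, 1024)  # requires train() before add()
    if cfg.index_backend == "LSH":
        return faiss.IndexLSH(dim, 2 * dim)              # binary hashing
    raise ValueError(f"Unknown backend: {cfg.index_backend}")
```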
The framework handles both single-run (identical environment) and multi-run/distributed (varying seeds, hardware, algorithm) scenarios, enabling users to probe the effects of environmental and pipeline changes on retrieval consistency.
2. Sources of Uncertainty in RAG Pipelines
ReproRAG systematically investigates and exposes several key sources of non-determinism:
- Embedding Model Variability: Differences among transformer-based embedding models (e.g., BGE, E5, Qwen) directly impact retrieval results. Cross-model comparisons reveal low overlap coefficients (≈0.43–0.54) and decreased rank stability (Kendall's Tau ≈0.32–0.38), underscoring that choice of encoder is the dominant contributor to irreproducibility.
- Numerical Precision: Vector representations may be computed in FP32, FP16, BF16, or TF32. Precision drift—measured via L2 distance between vectors—remains measurable but small (L2 ≈ 5.74e-04 between FP32 and FP16), introducing minor retrieval variability.
- Indexing Algorithms: Approximate nearest neighbor methods (IVF, HNSW, LSH) introduce randomness in index construction and querying. HNSW, for example, uses probabilistic graph building, which, unless controlled by a fixed seed, can yield result displacement. Still, with appropriate determinism flags, retrieval can achieve perfect stability.
- Hardware Effects: CPU/GPU scheduling and floating-point computation may cause small deviations; yet, with controlled execution (fixed seeds, stable hardware), reproducibility remains intact.
- Distributed Execution: Multi-node setups may suffer non-determinism from network latency, data sharding, race conditions, and aggregation order. Proper protocols—deterministic sharding and barrier synchronization—allow bit-for-bit reproducibility even in distributed retrieval.
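To make the deterministic-sharding idea concrete, the following is a minimal sketch using mpi4py with hash-based sharding, a barrier before aggregation, and tie-breaking by document ID. The helper names and result format are illustrative assumptions, not the framework's coordination module:

```python
import hashlib

from mpi4py import MPI  # assumes an MPI runtime, e.g. `mpirun -n 4 python shard_demo.py`

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()


def shard_of(doc_id: str, n_shards: int) -> int:
    """Hash-based sharding: stable across runs and independent of insertion order."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


# Hypothetical corpus of (doc_id, text) pairs; each rank keeps only its shard.
corpus = [(f"doc-{i}", f"text {i}") for i in range(1000)]
local_docs = [d for d in corpus if shard_of(d[0], world_size) == rank]

# ... each rank would search its local index and produce (doc_id, score) candidates ...
local_hits = [(doc_id, 1.0) for doc_id, _ in local_docs[:5]]  # placeholder results

comm.Barrier()                                 # deterministic aggregation barrier
all_hits = comm.gather(local_hits, root=0)     # gather order is fixed by rank
if rank == 0:
    # Sort by (score, doc_id) so ties break identically on every run.
    merged = sorted((h for part in all_hits for h in part),
                    key=lambda x: (-x[1], x[0]))[:10]
    print(merged)
```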
3. Reproducibility Metrics
Assessment is performed using a suite of well-defined metrics:
| Metric | Stage | Formula / Description |
|---|---|---|
| L2 Distance | Embedding | $\|\mathbf{a} - \mathbf{b}\|_2$ between embeddings of the same input across runs |
| Cosine Similarity | Embedding | $\dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$ between embeddings of the same input across runs |
| Exact Match Rate | Retrieval | Fraction of runs with identical k-NN document sets |
| Jaccard Similarity | Retrieval | $\dfrac{|A \cap B|}{|A \cup B|}$ over retrieved document sets $A$, $B$ |
| Kendall's Tau | Retrieval | $\tau = \dfrac{C - D}{n(n-1)/2}$ (rank correlation) |
| Rank-Biased Overlap | Retrieval | Weighted overlap for top ranks, accentuating high-ranked differences |
These metrics elucidate both embedding-level fidelity and retrieval output stability, allowing fine-grained investigation of reproducibility breakdowns.
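For illustration, here is a simplified reimplementation of a few retrieval-level metrics over two top-k runs. This is not the ReproducibilityMetrics library itself, and the Kendall's Tau variant below is normalized over documents common to both runs:

```python
from itertools import combinations


def exact_match(run_a: list[str], run_b: list[str]) -> bool:
    """Exact match: identical retrieved document sets (order ignored)."""
    return set(run_a) == set(run_b)


def jaccard(run_a: list[str], run_b: list[str]) -> float:
    """Intersection-over-union of the two retrieved document sets."""
    a, b = set(run_a), set(run_b)
    return len(a & b) / len(a | b)


def kendall_tau(run_a: list[str], run_b: list[str]) -> float:
    """Rank correlation over documents retrieved in both runs (simplified variant)."""
    common = [d for d in run_a if d in run_b]        # run_a order gives the reference ranking
    rank_b = {d: i for i, d in enumerate(run_b)}
    concordant = discordant = 0
    for x, y in combinations(common, 2):             # x precedes y in run_a
        if rank_b[x] < rank_b[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return 1.0 if pairs == 0 else (concordant - discordant) / pairs


run1 = ["d3", "d1", "d7", "d2", "d9"]
run2 = ["d3", "d7", "d1", "d2", "d5"]
print(exact_match(run1, run2), jaccard(run1, run2), kendall_tau(run1, run2))
```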
4. Empirical Evaluation and Insights
Large-scale empirical analysis highlights several critical findings:
- Embedding model choice strongly governs reproducibility. Different models for the same query can yield substantially different document sets even for identical index and precision setups.
- Insertions into the index (simulating time-based corpus evolution) displace earlier top-k results. Documents that persist in the top-k maintain perfect rank stability (Kendall's Tau = 1.0), but overall overlap decreases (Jaccard or Overlap Coefficient ≈ 0.80).
- Numerical precision-induced drift remains minor, but potentially accumulates in sensitive downstream analyses.
- Rigorous control of randomness (deterministic flags, fixed seeds) can eliminate most algorithmic and hardware-induced variability, with measurements confirming exact match rate, Jaccard similarity, and Kendall's Tau at 1.0 for controlled runs (a sketch of such controls follows this list).
- Distributed retrieval, when synchronizing appropriately, achieves perfect reproducibility even in multi-node environments.
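As a concrete illustration of such controls, the following is a minimal sketch that pins the usual seed and determinism switches in NumPy, PyTorch, and FAISS. It is an assumed recipe for exposition, not the framework's own configuration:

```python
import os
import random

import numpy as np
import torch
import faiss


def enable_determinism(seed: int = 42) -> None:
    """Pin the common sources of run-to-run randomness (illustrative, not exhaustive)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)          # no-op when CUDA is unavailable
    # Ask PyTorch to fail loudly if a non-deterministic kernel would be used.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Single-threaded FAISS avoids thread-order effects in parallel reductions.
    faiss.omp_set_num_threads(1)


enable_determinism(42)
```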
5. Practical Applications and Design Trade-Offs
ReproRAG enables several practical workflows in research and engineering contexts:
- Deployment Validation: Benchmark RAG pipelines post-deployment to ensure retrieval reliability before use in critical scientific workflows.
- Design Decision Support: Quantify trade-offs between embedding fidelity and computational efficiency when selecting model or precision settings.
- Benchmarking and Debugging: Isolate pipeline stages causing unexpected variance, improving regression testing and continuous integration in large-scale AI workflows.
- Transparent Benchmarking: Allow third parties to assess reproducibility characteristics across diverse hardware, index types, and workload settings, promoting transparency in AI reliability research.
A plausible implication is that maximizing reproducibility may require trade-offs with performance, especially in dynamic or resource-constrained settings. The hierarchy of uncertainty—embedding > index > precision > hardware > distribution—provides guidance for prioritizing engineering efforts.
6. Mathematical Formalism
Representative formulas from the framework underscore the rigor of its approach:
- Jaccard Similarity: $J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$
- Cosine Similarity: $\cos(\mathbf{a}, \mathbf{b}) = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$
- Kendall's Tau Rank Correlation: $\tau = \dfrac{C - D}{\tfrac{1}{2} n (n-1)}$, where $C$ is the number of concordant pairs and $D$ the number of discordant pairs among the $\binom{n}{2}$ pairings.
- L2 Distance between vectors: $d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$
These metrics form the foundation for reproducibility benchmarking in high-throughput, vector-based retrieval research.
7. Significance for the Field
The ReproRAG framework establishes reproducibility evaluation as a first-class concern for RAG systems deployed in scientific AI workflows. By providing actionable diagnostics, transparent metrics, and robust benchmarking tools, it assists practitioners and researchers in making informed design choices, debugging complex pipelines, and fostering the development of reproducibility-aware AI infrastructure. The framework’s hierarchy of uncertainties and quantitative trade-off analysis lay the groundwork for systematic improvements in the reliability of vector-based retrieval—a requirement for trustworthy, reproducible computational science.