RAG-GS: Golden Sets for RAG Evaluation
- rag-gs (MIT) is an open-source pipeline that constructs reproducible Top-K evidence sets for RAG evaluation using iterative refinement with Plackett–Luce optimization.
- It introduces pool-restricted oracle ceilings (PROC) and %PROC metrics to diagnose retrieval versus ordering bottlenecks within a cost-latency-quality framework.
- By integrating hybrid dense-sparse retrieval and active set refinement, the system minimizes ranking variance and ensures transparent, audit-friendly benchmarking.
rag-gs (MIT) is an open-source, version-controlled pipeline for constructing high-fidelity, globally consistent “golden sets” of evidence passages for Retrieval-Augmented Generation (RAG) system evaluation. Developed to resolve core deficiencies in conventional ranking-based IR metrics and single-shot LLM ranking methods, rag-gs produces a reproducible set of Top-K passages per query, robustly graded by an LLM and refined through iterative Plackett–Luce listwise optimization. The pipeline emits not only the final golden sets, but also pool-restricted oracle ceilings (PROC) and realized percentage of PROC (%PROC), facilitating diagnosis of system bottlenecks within a cost-latency-quality (CLQ) framework.
1. Motivation and Problem Formulation
In standard RAG evaluation, the goal is to identify, per query, a compact subset (“golden set”) of passages containing decisive evidence, graded by utility on a 1–5 scale. Existing practices exhibit several critical flaws:
- Classical rank-based IR metrics (nDCG, MAP, MRR) are misaligned with RAG: LLMs consume sets, not browsed lists, rendering position discounts and prevalence-blind aggregation ineffective.
- Absence of standardized, reproducible protocols for building and auditing golden sets impedes comparability and diagnostic precision.
- Single-shot LLM ranking over hundreds of candidates exhibits 3–5% run-to-run Top-K instability and provides neither uncertainty estimates nor contradiction detection.
- There is no mechanism for end-to-end benchmarking that reflects production trade-offs or exposes retrieval versus ordering bottlenecks.
rag-gs (MIT), as detailed in "Practical RAG Evaluation" (Dallaire, 12 Nov 2025), directly addresses these deficiencies with:
- Active set construction and refinement yielding reproducible Top-K sets (default K=20).
- Evaluable ceilings (PROC, %PROC) for explicit bottleneck analysis.
- Rigorous uncertainty quantification and contradiction control via iterative, confidence-aware refinement.
2. Pipeline Structure and Workflow
rag-gs comprises six audited, version-controlled stages:
| Stage | Operation | Artifacts/Detail |
|---|---|---|
| S1 Embed | Query rewriting, normalization; dense and sparse features | Embeddings, BM25 vectors |
| S2 Retrieve | Dense-cosine (flat or HNSW-F32) + BM25 top-N retrieval | Candidate pools (dense/sparse) |
| S3 Merge | Reciprocal Rank Fusion (RRF; α=60) | Fused pool |
| S4 Score (Judge) | LLM (e.g. GPT-5) batch judgment, utility grade 1–5 | Grades g(d) for each candidate |
| S5 Prune | Utility-based bucketed trimming to max pool size K | Pruned subset |
| S6 Rank (Listwise Refinement) | Iterative Plackett–Luce total-order refinement | Final ranked Top-K; lock graph; convergence |
Each stage is fully reproducible via committed manifests, configuration, and logging, supporting strict auditability.
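For concreteness, here is a minimal sketch of the S3 merge step (Reciprocal Rank Fusion over the dense and sparse candidate pools). The function name is illustrative, and the constant, written α=60 above, plays the role of the usual RRF smoothing constant; this is not the released rag-gs code.

```python
def rrf_merge(dense_ranking, sparse_ranking, alpha=60):
    """Reciprocal Rank Fusion of two candidate rankings (S3 Merge).

    dense_ranking, sparse_ranking : lists of doc ids, best first.
    Each document accumulates sum(1 / (alpha + rank)) over the lists it appears in.
    """
    fused = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (alpha + rank)
    # Return the fused pool ordered by descending RRF score.
    return sorted(fused, key=fused.get, reverse=True)
```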
3. Listwise Refinement via Plackett–Luce Optimization
rag-gs’s refinement leverages an active, iterative optimizer grounded in the Plackett–Luce model. Given a score s_i for each passage i, batches of uncertain items are sampled and submitted to the LLM judge, which returns a total order π over the batch.
For each suffix k of the judged order:
- Plackett–Luce probability of the suffix winner: P_k = exp(s_{π(k)}) / Σ_{j=k}^{m} exp(s_{π(j)}); the batch likelihood is P(π | s) = ∏_k P_k.
- Scores updated per step by gradient ascent on the log-likelihood: s_i ← s_i + η ∂ log P(π | s) / ∂ s_i, with gradient clipping.
- Fisher information accumulated per item, I_i ← I_i + Σ_k p_{k,i}(1 − p_{k,i}), yielding confidence bounds LCB(i) = s_i − z/√I_i and UCB(i) = s_i + z/√I_i.
- Acyclic lock-graph maintained: lock i ≻ j if LCB(i) > UCB(j) or after sufficient repeated confirmations, enforcing global consistency.
- Global order extracted by topological sorting of the lock graph; ties broken by score s_i.
Batch size B, learning rate η (with decay), gradient clipping, confidence multiplier z, and the stabilization threshold control the refinement process. Refinement terminates when the Top-K is unchanged for a set number of consecutive rounds or an iteration cap is reached.
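A minimal sketch of a single refinement step under the formulation above is given below. The function names, the fixed step size, and the numerically stable softmax are implementation assumptions, not the released rag-gs code.

```python
import numpy as np

def pl_refine_step(scores, fisher, order, eta=0.1, clip=1.0):
    """One Plackett-Luce gradient/Fisher update from a judged batch order.

    scores : dict item -> current score s_i
    fisher : dict item -> accumulated Fisher information I_i
    order  : list of items, best first, as returned by the LLM judge
    """
    s = np.array([scores[i] for i in order], dtype=float)
    grad = np.zeros_like(s)
    info = np.zeros_like(s)
    m = len(order)
    for k in range(m):                      # suffix starting at position k
        w = np.exp(s[k:] - s[k:].max())     # numerically stable softmax over the suffix
        p = w / w.sum()
        grad[k] += 1.0                      # observed winner of this suffix
        grad[k:] -= p                       # minus the model's expectation
        info[k:] += p * (1.0 - p)           # diagonal observed Fisher information
    grad = np.clip(grad, -clip, clip)       # gradient clipping
    for idx, item in enumerate(order):
        scores[item] += eta * grad[idx]     # s_i <- s_i + eta * d log P / d s_i
        fisher[item] += info[idx]           # I_i accumulates across rounds
    return scores, fisher

def lock_candidates(scores, fisher, z=2.0):
    """Pairs (i, j) whose confidence intervals separate: LCB(i) > UCB(j)."""
    lcb = {i: scores[i] - z / np.sqrt(fisher[i] + 1e-9) for i in scores}
    ucb = {i: scores[i] + z / np.sqrt(fisher[i] + 1e-9) for i in scores}
    return [(i, j) for i in scores for j in scores if i != j and lcb[i] > ucb[j]]
```

In the full pipeline, a locked pair is added to the lock graph only if acyclicity is preserved, and the global order is read off by topological sort with score-based tie-breaking.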
4. Empirical Performance and Comparative Analysis
Empirical evaluation demonstrates pronounced gains over single-shot LLM ranking:
- Single-shot LLM (“rank these 40 docs”) yields 3–5 % run-to-run Top-K variance and fails to report uncertainty or resolve cyclic preferences.
- rag-gs substantially lowers pairwise ranking variance, enforces global acyclicity, and yields stable Top-K sets after typically 20–40 batch iterations, at a total LLM cost comparable to naïve ranking.
- Human evaluation on 20 queries achieved perfect Top-20 agreement with a domain expert oracle, with negligible borderline ties.
- When golden sets produced by rag-gs are used for downstream RA-nWG@K and %PROC metric calculation, results become strictly reproducible: e.g., a Hybrid RRF-100 Rerank-2.5 Top-50 configuration realized 85.2% of its pool-restricted oracle ceiling on RA-nWG@10 (85.2 %PROC).
5. Diagnosing Retrieval vs. Ordering Headroom
rag-gs explicitly disentangles retrieval and ordering bottlenecks via PROC and %PROC:
- PROC (pool-restricted oracle ceiling): the maximum attainable metric (e.g., RA-nWG@K) from the candidate pool, assuming ideal ordering.
- %PROC: ratio of realized metric to PROC, quantifying ordering efficiency within the current candidate pool.
- Low %PROC signals insufficient candidate pool (poor initial retrieval, suboptimal embeddings, absent query rewrite).
- High %PROC but suboptimal realized RA-nWG indicates reranker or pre-processing issues (e.g., near-duplicate suppression required).
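A minimal sketch of the PROC/%PROC bookkeeping described above, assuming a generic graded set metric in place of the paper's exact RA-nWG@K definition (the default metric here is only a placeholder):

```python
def proc_and_pct_proc(pool_grades, realized_grades, k=10, metric=None):
    """PROC = best attainable metric from the candidate pool under ideal ordering;
    %PROC = realized metric / PROC.

    pool_grades     : utility grades (1-5) of every candidate in the pool
    realized_grades : grades of the Top-K actually returned, in rank order
    metric          : set/list metric; defaults to a mean-grade placeholder,
                      NOT the paper's RA-nWG@K definition.
    """
    metric = metric or (lambda grades: sum(grades) / max(len(grades), 1))
    oracle_top_k = sorted(pool_grades, reverse=True)[:k]   # ideal ordering of the pool
    proc = metric(oracle_top_k)
    realized = metric(realized_grades[:k])
    return proc, (realized / proc if proc > 0 else 0.0)
```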
6. Practical Guidance and CLQ-Optimized Deployment
Prescribed best practices for rag-gs production integration:
- Commit golden sets and lock-graphs once for a corpus, maintaining audit trails across system variants.
- Adopt small batch sizes and a low stabilization threshold to minimize LLM calls (typical per-query cost around $0.01).
- Dynamically expose and tune hyperparameters (batch size, learning rate, gradient clipping, confidence multiplier, stabilization threshold) in the pipeline config to accommodate evolving LLM variance profiles; a hypothetical config sketch follows this list.
- Rely on golden-size Top-K to compute rarity-aware set metrics (RA-nWG@K), %PROC for bottleneck diagnosis, and N-Recall@K for high-grade evidence coverage.
- Consistently pair rag-gs outputs (manifests, PROC/%PROC) with cost-latency-quality frontiers, enabling budget-aware and fully reproducible evaluation and benchmarking.
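For illustration, a hypothetical configuration surfacing these hyperparameters might look as follows. Key names and numeric values are placeholders, not rag-gs defaults; only K=20, the 1–5 grade scale, GPT-5 as judge, and the RRF constant of 60 come from the text above.

```python
# Hypothetical pipeline configuration; keys and values are illustrative placeholders.
RAG_GS_CONFIG = {
    "retrieval": {"dense_top_n": 100, "sparse_top_n": 100, "rrf_alpha": 60},
    "judge": {"model": "gpt-5", "grade_scale": (1, 5)},
    "refinement": {
        "top_k": 20,                # size of the golden set
        "batch_size": 8,            # small batches keep per-query LLM cost low
        "learning_rate": 0.1,       # decayed across rounds
        "grad_clip": 1.0,
        "confidence_z": 2.0,        # multiplier for LCB/UCB lock decisions
        "stabilization_rounds": 3,  # stop when Top-K is unchanged this many rounds
        "max_rounds": 40,
    },
}
```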
7. Architectural and Methodological Features
rag-gs integrates hybrid dense + sparse retrieval modalities (embeddings, BM25), supports HNSW-ANN and quantization, and is compatible with multi-stage ranking (cross-encoder reranking). It is distributed under a permissive MIT license and emphasizes minimal resource consumption, real-time convergence, and transparent version control of all artifacts. The framework provides targeted diagnostics for proper-name identity signal and conversational noise sensitivity, facilitating robust ablation studies and bias audits.
In summary, rag-gs (MIT) serves as a lightweight, mathematically principled solution for constructing golden evidence sets in RAG, vastly reducing ranking variance, resolving contradictions, and equipping practitioners with traceable, auditable, and production-grade metrics for corpus-level decision-making.