RAG-GS: Golden Sets for RAG Evaluation
- rag-gs (MIT) is an open-source pipeline that constructs reproducible Top-K evidence sets for RAG evaluation using iterative refinement with Plackett–Luce optimization.
- It introduces pool-restricted oracle ceilings (PROC) and %PROC metrics to diagnose retrieval versus ordering bottlenecks within a cost-latency-quality framework.
- By integrating hybrid dense-sparse retrieval and active set refinement, the system minimizes ranking variance and ensures transparent, audit-friendly benchmarking.
rag-gs (MIT) is an open-source, version-controlled pipeline for constructing high-fidelity, globally consistent “golden sets” of evidence passages for Retrieval-Augmented Generation (RAG) system evaluation. Developed to resolve core deficiencies in conventional ranking-based IR metrics and single-shot LLM ranking methods, rag-gs produces a reproducible set of Top-K passages per query, robustly graded by an LLM and refined through iterative Plackett–Luce listwise optimization. The pipeline emits not only the final golden sets, but also pool-restricted oracle ceilings (PROC) and realized percentage of PROC (%PROC), facilitating diagnosis of system bottlenecks within a cost-latency-quality (CLQ) framework.
1. Motivation and Problem Formulation
In standard RAG evaluation, the goal is to identify, per query, a compact subset (“golden set”) of passages containing decisive evidence, graded by utility on a 1–5 scale. Existing practices exhibit several critical flaws:
- Classical rank-based IR metrics (nDCG, MAP, MRR) are misaligned with RAG: LLMs consume sets, not browsed lists, rendering position discounts and prevalence-blind aggregation ineffective.
- Absence of standardized, reproducible protocols for building and auditing golden sets impedes comparability and diagnostic precision.
- Single-shot LLM ranking over hundreds of candidates exhibits 3–5% run-to-run Top-K instability and provides neither uncertainty estimates nor contradiction detection.
- There is no mechanism for end-to-end benchmarking that reflects production trade-offs or exposes retrieval versus ordering bottlenecks.
rag-gs (MIT), as detailed in "Practical RAG Evaluation" (Dallaire, 12 Nov 2025), directly addresses these deficiencies with:
- Active set construction and refinement yielding reproducible Top-K sets (default K=20).
- Evaluable ceilings (PROC, %PROC) for explicit bottleneck analysis.
- Rigorous uncertainty quantification and contradiction control via iterative, confidence-aware refinement.
2. Pipeline Structure and Workflow
rag-gs comprises six audited, version-controlled stages:
| Stage | Operation | Artifacts/Detail |
|---|---|---|
| S1 Embed | Query rewriting, normalization; dense and sparse features | Embeddings, BM25 vectors |
| S2 Retrieve | Dense-cosine (flat or HNSW-F32) + BM25 top-N retrieval | Candidate pools (dense/sparse) |
| S3 Merge | Reciprocal Rank Fusion (RRF; α=60) | Fused pool |
| S4 Score (Judge) | LLM (e.g. GPT-5) batch judgment, utility grade 1–5 | Grades g(d) for each candidate |
| S5 Prune | Utility-based bucketed trimming to max pool size K | Pruned subset |
| S6 Rank (Listwise Refinement) | Iterative Plackett–Luce total-order refinement | Final ranked Top-K; lock graph; convergence |
Each stage is fully reproducible via committed manifests, configuration, and logging, supporting strict auditability.
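For concreteness, here is a minimal sketch of the S3 merge step (Reciprocal Rank Fusion over the dense and sparse candidate pools). The function name is illustrative, and the constant, written α=60 above, plays the role of the usual RRF smoothing constant; this is not the released rag-gs code.

```python
def rrf_merge(dense_ranking, sparse_ranking, alpha=60):
    """Reciprocal Rank Fusion of two candidate rankings (S3 Merge).

    dense_ranking, sparse_ranking : lists of doc ids, best first.
    Each document accumulates sum(1 / (alpha + rank)) over the lists it appears in.
    """
    fused = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (alpha + rank)
    # Return the fused pool ordered by descending RRF score.
    return sorted(fused, key=fused.get, reverse=True)
```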
3. Listwise Refinement via Plackett–Luce Optimization
rag-gs’s refinement leverages an active, iterative optimizer grounded in the Plackett–Luce model. Given a score s_i for each passage i, batches of uncertain items are sampled and submitted to the LLM judge, which returns a total order π over the batch.
For each suffix k of the judged order:
- Plackett–Luce probability of the suffix winner: P_k = exp(s_{π(k)}) / Σ_{j=k}^{m} exp(s_{π(j)}); the batch likelihood is P(π | s) = ∏_k P_k.
- Scores updated per step by gradient ascent on the log-likelihood: s_i ← s_i + η ∂ log P(π | s) / ∂ s_i, with gradient clipping.
- Fisher information accumulated per item, I_i ← I_i + Σ_k p_{k,i}(1 − p_{k,i}), yielding confidence bounds LCB(i) = s_i − z/√I_i and UCB(i) = s_i + z/√I_i.
- Acyclic lock-graph maintained: lock i ≻ j if LCB(i) > UCB(j) or after sufficient repeated confirmations, enforcing global consistency.
- Global order extracted by topological sorting of the lock graph; ties broken by score s_i.
Batch size B, learning rate η (with decay), gradient clipping, confidence multiplier z, and the stabilization threshold control the refinement process. Refinement terminates when the Top-K is unchanged for a set number of consecutive rounds or an iteration cap is reached.
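A minimal sketch of a single refinement step under the formulation above is given below. The function names, the fixed step size, and the numerically stable softmax are implementation assumptions, not the released rag-gs code.

```python
import numpy as np

def pl_refine_step(scores, fisher, order, eta=0.1, clip=1.0):
    """One Plackett-Luce gradient/Fisher update from a judged batch order.

    scores : dict item -> current score s_i
    fisher : dict item -> accumulated Fisher information I_i
    order  : list of items, best first, as returned by the LLM judge
    """
    s = np.array([scores[i] for i in order], dtype=float)
    grad = np.zeros_like(s)
    info = np.zeros_like(s)
    m = len(order)
    for k in range(m):                      # suffix starting at position k
        w = np.exp(s[k:] - s[k:].max())     # numerically stable softmax over the suffix
        p = w / w.sum()
        grad[k] += 1.0                      # observed winner of this suffix
        grad[k:] -= p                       # minus the model's expectation
        info[k:] += p * (1.0 - p)           # diagonal observed Fisher information
    grad = np.clip(grad, -clip, clip)       # gradient clipping
    for idx, item in enumerate(order):
        scores[item] += eta * grad[idx]     # s_i <- s_i + eta * d log P / d s_i
        fisher[item] += info[idx]           # I_i accumulates across rounds
    return scores, fisher

def lock_candidates(scores, fisher, z=2.0):
    """Pairs (i, j) whose confidence intervals separate: LCB(i) > UCB(j)."""
    lcb = {i: scores[i] - z / np.sqrt(fisher[i] + 1e-9) for i in scores}
    ucb = {i: scores[i] + z / np.sqrt(fisher[i] + 1e-9) for i in scores}
    return [(i, j) for i in scores for j in scores if i != j and lcb[i] > ucb[j]]
```

In the full pipeline, a locked pair is added to the lock graph only if acyclicity is preserved, and the global order is read off by topological sort with score-based tie-breaking.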
4. Empirical Performance and Comparative Analysis
Empirical evaluation demonstrates pronounced gains over single-shot LLM ranking:
- Single-shot LLM (“rank these 40 docs”) yields 3–5 % run-to-run Top-K variance and fails to report uncertainty or resolve cyclic preferences.
- rag-gs substantially lowers pairwise ranking variance, enforces global acyclicity, and yields stable Top-K sets after typically 20–40 batch iterations, at a total LLM cost comparable to naïve ranking.
- Human evaluation on 20 queries achieved perfect Top-20 agreement with a domain expert oracle, with negligible borderline ties.
- When golden sets produced by rag-gs are used for downstream RA-nWG@K and %PROC metric calculation, results become strictly reproducible: e.g., a Hybrid RRF-100 Rerank-2.5 Top-50 configuration realized 85.2% of its pool-restricted oracle ceiling on RA-nWG@10 (85.2 %PROC).
5. Diagnosing Retrieval vs. Ordering Headroom
rag-gs explicitly disentangles retrieval and ordering bottlenecks via PROC and %PROC:
- PROC (pool-restricted oracle ceiling): the maximum attainable metric (e.g., RA-nWG@K) from the candidate pool, assuming ideal ordering.
- %PROC: ratio of realized metric to PROC, quantifying ordering efficiency within the current candidate pool.
- Low %PROC signals insufficient candidate pool (poor initial retrieval, suboptimal embeddings, absent query rewrite).
- High %PROC but suboptimal realized RA-nWG indicates reranker or pre-processing issues (e.g., near-duplicate suppression required).
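A minimal sketch of the PROC/%PROC bookkeeping described above, assuming a generic graded set metric in place of the paper's exact RA-nWG@K definition (the default metric here is only a placeholder):

```python
def proc_and_pct_proc(pool_grades, realized_grades, k=10, metric=None):
    """PROC = best attainable metric from the candidate pool under ideal ordering;
    %PROC = realized metric / PROC.

    pool_grades     : utility grades (1-5) of every candidate in the pool
    realized_grades : grades of the Top-K actually returned, in rank order
    metric          : set/list metric; defaults to a mean-grade placeholder,
                      NOT the paper's RA-nWG@K definition.
    """
    metric = metric or (lambda grades: sum(grades) / max(len(grades), 1))
    oracle_top_k = sorted(pool_grades, reverse=True)[:k]   # ideal ordering of the pool
    proc = metric(oracle_top_k)
    realized = metric(realized_grades[:k])
    return proc, (realized / proc if proc > 0 else 0.0)
```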
6. Practical Guidance and CLQ-Optimized Deployment
Prescribed best practices for rag-gs production integration:
- Commit golden sets and lock-graphs once for a corpus, maintaining audit trails across system variants.
- Adopt small batch sizes and a low stabilization threshold to minimize LLM calls (typical per-query cost around $0.01).
- Dynamically expose and tune hyperparameters (batch size, learning rate, gradient clipping, confidence multiplier, stabilization threshold) in the pipeline config to accommodate evolving LLM variance profiles; a hypothetical config sketch follows this list.
- Rely on golden-size Top-K to compute rarity-aware set metrics (RA-nWG@K), %PROC for bottleneck diagnosis, and N-Recall@K for high-grade evidence coverage.
- Consistently pair rag-gs outputs (manifests, PROC/%PROC) with cost-latency-quality frontiers, enabling budget-aware and fully reproducible evaluation and benchmarking.
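For illustration, a hypothetical configuration surfacing these hyperparameters might look as follows. Key names and numeric values are placeholders, not rag-gs defaults; only K=20, the 1–5 grade scale, GPT-5 as judge, and the RRF constant of 60 come from the text above.

```python
# Hypothetical pipeline configuration; keys and values are illustrative placeholders.
RAG_GS_CONFIG = {
    "retrieval": {"dense_top_n": 100, "sparse_top_n": 100, "rrf_alpha": 60},
    "judge": {"model": "gpt-5", "grade_scale": (1, 5)},
    "refinement": {
        "top_k": 20,                # size of the golden set
        "batch_size": 8,            # small batches keep per-query LLM cost low
        "learning_rate": 0.1,       # decayed across rounds
        "grad_clip": 1.0,
        "confidence_z": 2.0,        # multiplier for LCB/UCB lock decisions
        "stabilization_rounds": 3,  # stop when Top-K is unchanged this many rounds
        "max_rounds": 40,
    },
}
```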
7. Architectural and Methodological Features
rag-gs integrates hybrid dense + sparse retrieval modalities (embeddings, BM25), supports HNSW-ANN and quantization, and is compatible with multi-stage ranking (cross-encoder reranking). It is distributed under a permissive MIT license and emphasizes minimal resource consumption, real-time convergence, and transparent version control of all artifacts. The framework provides targeted diagnostics for proper-name identity signal and conversational noise sensitivity, facilitating robust ablation studies and bias audits.
In summary, rag-gs (MIT) serves as a lightweight, mathematically principled solution for constructing golden evidence sets in RAG, vastly reducing ranking variance, resolving contradictions, and equipping practitioners with traceable, auditable, and production-grade metrics for corpus-level decision-making.