
RAG-GS: Golden Sets for RAG Evaluation

Updated 13 November 2025
  • rag-gs (MIT) is an open-source pipeline that constructs reproducible Top-K evidence sets for RAG evaluation using iterative refinement with Plackett–Luce optimization.
  • It introduces pool-restricted oracle ceilings (PROC) and %PROC metrics to diagnose retrieval versus ordering bottlenecks within a cost-latency-quality framework.
  • By integrating hybrid dense-sparse retrieval and active set refinement, the system minimizes ranking variance and ensures transparent, audit-friendly benchmarking.

rag-gs (MIT) is an open-source, version-controlled pipeline for constructing high-fidelity, globally consistent “golden sets” of evidence passages for Retrieval-Augmented Generation (RAG) system evaluation. Developed to resolve core deficiencies in conventional ranking-based IR metrics and single-shot LLM ranking methods, rag-gs produces a reproducible set of Top-K passages per query, robustly graded by an LLM and refined through iterative Plackett–Luce listwise optimization. The pipeline emits not only the final golden sets, but also pool-restricted oracle ceilings (PROC) and realized percentage of PROC (%PROC), facilitating diagnosis of system bottlenecks within a cost-latency-quality (CLQ) framework.

1. Motivation and Problem Formulation

In standard RAG evaluation, the goal is to identify, per query, a compact subset (“golden set”) of passages containing decisive evidence, graded by utility on a 1–5 scale. Existing practices exhibit several critical flaws:

  • Classical rank-based IR metrics (nDCG, MAP, MRR) are misaligned with RAG: LLMs consume sets, not browsed lists, rendering position discounts and prevalence-blind aggregation ineffective.
  • Absence of standardized, reproducible protocols for building and auditing golden sets impedes comparability and diagnostic precision.
  • Single-shot LLM ranking over hundreds of candidates exhibits 3–5 % run-to-run Top-K instability and provides neither uncertainty estimates nor contradiction detection.
  • There is no mechanism for end-to-end benchmarking that reflects production trade-offs or exposes retrieval versus ordering bottlenecks.

rag-gs (MIT), as detailed in "Practical RAG Evaluation" (Dallaire, 12 Nov 2025), directly addresses these deficiencies with:

  1. Active set construction and refinement yielding reproducible Top-K sets (default K=20).
  2. Evaluable ceilings (PROC, %PROC) for explicit bottleneck analysis.
  3. Rigorous uncertainty quantification and contradiction control via iterative, confidence-aware refinement.

2. Pipeline Structure and Workflow

rag-gs comprises six audited, version-controlled stages:

| Stage | Operation | Detail | Artifacts |
|-------|-----------|--------|-----------|
| S1 | Embed | Query rewriting, normalization; dense and sparse features | Embeddings, BM25 vectors |
| S2 | Retrieve | Dense-cosine (flat or HNSW-F32) + BM25 top-N retrieval | Candidate pools (dense/sparse) |
| S3 | Merge | Reciprocal Rank Fusion (RRF; α=60) | Fused pool |
| S4 | Score (Judge) | LLM (e.g., GPT-5) batch judgment, utility grades 1–5 | Grades g(d) for each candidate |
| S5 | Prune | Utility-based bucketed trimming to max pool size K | Pruned subset |
| S6 | Rank (Listwise Refinement) | Iterative Plackett–Luce total-order refinement | Final ranked Top-K; lock graph; convergence |

Each stage is fully reproducible via committed manifests, configuration, and logging, supporting strict auditability.
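
The merge stage (S3) combines the dense and sparse candidate lists with Reciprocal Rank Fusion. The following is a minimal illustrative sketch, not the rag-gs source code; the function and variable names are hypothetical, and the constant 60 corresponds to the α=60 noted in the table above.

```python
def reciprocal_rank_fusion(dense_ranking, sparse_ranking, alpha=60):
    """Fuse two ranked lists of passage IDs with Reciprocal Rank Fusion.

    Each document receives a score of 1 / (alpha + rank) from every list
    it appears in (ranks are 1-based); scores are summed across lists.
    Hypothetical sketch -- not the rag-gs implementation.
    """
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (alpha + rank)
    # Highest fused score first; this becomes the fused candidate pool (S3).
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense (cosine) ranking with a BM25 ranking.
fused_pool = reciprocal_rank_fusion(["d3", "d1", "d7"], ["d1", "d9", "d3"])
```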

3. Listwise Refinement via Plackett–Luce Optimization

rag-gs’s refinement leverages an active, iterative optimizer grounded in the Plackett–Luce model. Given a score $s_i$ for each passage $i$, batches of $m = 5$ uncertain items are sampled and submitted to the LLM judge, which returns a total order $\pi$ over the batch.

For each suffix $S_k = \{\pi_k, \dots, \pi_m\}$:

  • Plackett–Luce probability: $p_j = \exp(s_j) / \sum_{u \in S_k} \exp(s_u)$.
  • Scores updated per step:
    • $\Delta s_{\pi_k} \mathrel{+}= \eta\,(1 - p_{\pi_k})$
    • $\Delta s_{j \neq \pi_k} \mathrel{-}= \eta\, p_j$
  • Fisher information accumulated, $I_j \mathrel{+}= p_j(1 - p_j)$, for confidence bounds.
  • Acyclic lock graph maintained: lock $w \to \ell$ if $\mathrm{LCB}(w) > \mathrm{UCB}(\ell)$ or after sufficient confirmations, enforcing global consistency.
  • Global order extracted by topological sorting; ties broken by $s_i$.

Batch size $m$, learning rate $\eta$ (with decay), clipping, confidence multiplier $z$, and stabilization threshold $T$ control the refinement process. Refinement terminates when the Top-K is unchanged for $T$ rounds or an iteration cap is reached.
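
A compact sketch of one refinement round under this model is given below. It is an illustrative reimplementation assuming only the update rules stated above; the batch-selection, lock-graph, and convergence machinery of rag-gs itself is omitted, and all names are hypothetical.

```python
import math

def plackett_luce_round(scores, fisher, judge_order, eta=0.1):
    """Apply one Plackett-Luce listwise update for a single judged batch.

    scores, fisher : dict passage_id -> float (current score, Fisher info)
    judge_order    : the m batch items, best-to-worst, as returned by the judge (pi)
    Hypothetical sketch of the update rules above, not the rag-gs source.
    """
    for k in range(len(judge_order) - 1):
        suffix = judge_order[k:]                      # S_k = {pi_k, ..., pi_m}
        total = sum(math.exp(scores[d]) for d in suffix)
        p = {d: math.exp(scores[d]) / total for d in suffix}
        winner = judge_order[k]
        for d in suffix:
            fisher[d] += p[d] * (1.0 - p[d])          # I_j += p_j (1 - p_j)
            if d == winner:
                scores[d] += eta * (1.0 - p[d])       # Δs_{pi_k} += η (1 - p_{pi_k})
            else:
                scores[d] -= eta * p[d]               # Δs_j -= η p_j

# Toy usage: five candidates, one judged batch (m = 5).
scores = {d: 0.0 for d in "abcde"}
fisher = {d: 0.0 for d in "abcde"}
plackett_luce_round(scores, fisher, ["c", "a", "e", "b", "d"])
```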

4. Empirical Performance and Comparative Analysis

Empirical evaluation demonstrates pronounced gains over single-shot LLM ranking:

  • Single-shot LLM (“rank these 40 docs”) yields 3–5 % run-to-run Top-K variance and fails to report uncertainty or resolve cyclic preferences.
  • rag-gs lowers pairwise ranking variance at $O(1/\sqrt{\text{exposures}})$, enforcing global acyclicity and yielding stable Top-K sets after typically 20–40 batch iterations, with total LLM cost $\ll$ that of naïve full-list ranking.
  • Human evaluation on 20 queries achieved perfect Top-20 agreement with a domain expert oracle, with negligible borderline ties.
  • When golden sets produced by rag-gs are used for downstream RA-nWG@K and %PROC metric calculation, results become strictly reproducible: e.g., Hybrid RRF-100 → Rerank-2.5 → Top-50 yielded PROC(RA-nWG@10) $= 1.000$ and realized RA-nWG@10 $= 0.852$, i.e. 85.2 %PROC (worked out below).
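
Since %PROC is simply the realized metric expressed as a fraction of its pool-restricted ceiling (defined in the next section), the reported figure follows directly:

```latex
\%\mathrm{PROC} = \frac{\text{realized RA-nWG@10}}{\mathrm{PROC}(\text{RA-nWG@10})} = \frac{0.852}{1.000} = 85.2\%
```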

5. Diagnosing Retrieval vs. Ordering Headroom

rag-gs explicitly disentangles retrieval and ordering bottlenecks via PROC and %PROC (a brief computational sketch follows this list):

  • PROC (pool-restricted oracle ceiling): the maximum attainable metric (e.g., RA-nWG@K) from the candidate pool, assuming ideal ordering.
  • %PROC: ratio of realized metric to PROC, quantifying ordering efficiency within the current candidate pool.
  • Low %PROC signals insufficient candidate pool (poor initial retrieval, suboptimal embeddings, absent query rewrite).
  • High %PROC but suboptimal realized RA-nWG indicates reranker or pre-processing issues (e.g., near-duplicate suppression required).
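
A minimal sketch of how these two quantities might be computed for a graded candidate pool is shown below. It is an assumption-laden illustration: the gain function is a simple grade-weighted placeholder rather than the paper's rarity-aware RA-nWG@K, and all helper names are hypothetical.

```python
def set_gain(passages, grades, k, weight=lambda g: max(g - 1, 0)):
    """Placeholder set-level gain over the top-k passages (NOT the RA-nWG@K
    definition from the paper; the grade weighting here is a stand-in)."""
    return sum(weight(grades[d]) for d in passages[:k])

def proc_and_pct_proc(candidate_pool, system_ranking, grades, k=10):
    """PROC: best achievable gain given an ideal ordering of the pool.
    %PROC: realized gain of the system ranking relative to that ceiling."""
    oracle = sorted(candidate_pool, key=lambda d: grades[d], reverse=True)
    proc = set_gain(oracle, grades, k)
    realized = set_gain(system_ranking, grades, k)
    return proc, (realized / proc if proc > 0 else 0.0)

# Toy example: a 4-passage pool graded 1-5, reranked by some system.
grades = {"d1": 5, "d2": 2, "d3": 4, "d4": 1}
proc, pct = proc_and_pct_proc(["d1", "d2", "d3", "d4"],
                              ["d3", "d2", "d1", "d4"], grades, k=2)
```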

6. Practical Guidance and CLQ-Optimized Deployment

Prescribed best practices for rag-gs production integration:

  • Commit golden sets and lock-graphs once for a corpus, maintaining audit trails across system variants.
  • Adopt small batch sizes ($m = 5$) and a low stabilization threshold ($T = 3$) to minimize LLM calls (typical per-query cost $\ll \$0.01$).
  • Dynamically expose and tune hyperparameters ($\eta$, clipping, $z$, $T$) in the pipeline config to accommodate evolving LLM variance profiles (a hypothetical configuration sketch follows this list).
  • Rely on golden-size Top-K to compute rarity-aware set metrics (RA-nWG@K), %PROC for bottleneck diagnosis, and N-Recall$_{4+}$@K for high-grade evidence coverage.
  • Consistently pair rag-gs outputs (manifests, PROC/%PROC) with cost-latency-quality frontiers, enabling budget-aware and fully reproducible evaluation and benchmarking.
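
The following is one hypothetical way such a configuration could be surfaced; the field names and default values are illustrative assumptions drawn from the parameters discussed above, not the rag-gs schema (only K=20, m=5, and T=3 are defaults stated in the text).

```python
from dataclasses import dataclass

@dataclass
class RefinementConfig:
    """Hypothetical refinement hyperparameters (names/defaults are illustrative)."""
    top_k: int = 20                 # size of the golden set per query
    batch_size: int = 5             # m: items per judged batch
    eta: float = 0.1                # learning rate (decayed over rounds); assumed value
    eta_decay: float = 0.95         # multiplicative decay per round; assumed value
    clip: float = 1.0               # per-step score-update clipping; assumed value
    z: float = 1.96                 # confidence multiplier for LCB/UCB locks; assumed value
    stabilization_rounds: int = 3   # T: stop when Top-K unchanged this many rounds
    max_rounds: int = 100           # hard iteration cap; assumed value

config = RefinementConfig()  # committed alongside manifests for auditability
```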

7. Architectural and Methodological Features

rag-gs integrates hybrid dense + sparse retrieval modalities (embeddings, BM25), supports HNSW-ANN and quantization, and is compatible with multi-stage ranking (cross-encoder reranking). It is distributed under a permissive MIT license and emphasizes minimal resource consumption, real-time convergence, and transparent version control of all artifacts. The framework provides targeted diagnostics for proper-name identity signal and conversational-noise sensitivity, facilitating robust ablation studies and bias audits.

In summary, rag-gs (MIT) serves as a lightweight, mathematically principled solution for constructing golden evidence sets in RAG, vastly reducing ranking variance, resolving contradictions, and equipping practitioners with traceable, auditable, and production-grade metrics for corpus-level decision-making.

References

  1. Dallaire, "Practical RAG Evaluation," 12 November 2025.