
RAGLens: Hallucination & RAG Evaluation

Updated 10 December 2025
  • RAGLens is a suite of methodologies that detect hallucinations in retrieval-augmented generation systems by analyzing LLM hidden states with sparse autoencoders.
  • It employs rarity-aware, set-based metrics to evaluate RAG pipeline quality, balancing cost, latency, and the decisiveness of retrieved evidence.
  • Its interpretable GAM structure provides instance-level rationales for flagged hallucinations, enabling targeted post-hoc mitigation.

RAGLens is a suite of methodologies and tools for both the detection of hallucinations in the outputs of retrieval-augmented generation (RAG) systems and the practical, reproducible evaluation of RAG pipeline quality and resource efficiency. RAGLens spans two axes: a lightweight, interpretable hallucination detector leveraging sparse autoencoders for mechanistic feature analysis within LLM hidden states (Xiong et al., 9 Dec 2025), and a rarity-aware, set-based metric and diagnostics framework for RAG pipeline evaluation and auditing under cost/latency/quality constraints (Dallaire, 12 Nov 2025).

1. Motivation and Faithfulness Challenges in RAG

RAG architectures condition LLM outputs on retrieved evidence passages to improve factual grounding. Despite this, unfaithful behaviors—collectively termed hallucinations—persist. Hallucinations manifest in three principal forms: direct contradictions of source evidence, unsupported details such as fabricated dates or entities, and illegitimate extrapolation beyond provided context.

Prior hallucination detection methods have several limitations: supervised detectors demand large annotated corpora and are therefore data-expensive; LLM-as-judge approaches incur high inference cost, are sensitive to prompt design, and correlate only ambiguously with the model's latent trace. Internal probes of raw hidden states or attention signals are hindered by polysemanticity and a low signal-to-noise ratio (Xiong et al., 9 Dec 2025).

On the evaluation side, IR metrics such as nDCG, MAP, and MRR neglect the intrinsic set-based consumption pattern in RAG and do not accommodate passage prevalence, positional irrelevance, or the decisive evidence criterion. This creates a pressing need for per-query-normalized, rarity-aware metrics and headroom estimators that reflect real operator trade-offs (Dallaire, 12 Nov 2025).

2. Hallucination Detection via Sparse Autoencoders

RAGLens employs recent mechanistic interpretability advances: sparse autoencoders (SAEs) trained on LLM hidden-state vectors $X \in \mathbb{R}^d$ can extract “monosemantic” features—each feature corresponds to a consistent, interpretable function such as a factual pattern or entity type.

Model Components

  • Sparse Autoencoder Structure: The SAE consists of an encoder $E:\mathbb{R}^d \rightarrow \mathbb{R}^K$ and decoder $D:\mathbb{R}^K \rightarrow \mathbb{R}^d$, minimizing a reconstruction loss with an activation sparsity penalty

$$\mathcal{L}_{\mathrm{rec}} = \|X - \hat{X}\|_{2}^{2}, \qquad \mathcal{L}_{\mathrm{sparse}} = \beta \sum_{i=1}^{K} \mathrm{KL}(\rho \,\|\, \hat\rho_i)$$

where $\hat\rho_i$ is the mean activation of feature $i$ and $\rho$ is the target sparsity.
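
A minimal PyTorch sketch of this objective is given below for concreteness; the layer width, feature count, sigmoid feature activations, target sparsity, and penalty weight are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Sketch of an SAE over LLM hidden states with a KL sparsity penalty."""
    def __init__(self, d_model: int = 4096, n_features: int = 16384,
                 rho: float = 0.05, beta: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # E: R^d -> R^K
        self.decoder = nn.Linear(n_features, d_model)   # D: R^K -> R^d
        self.rho, self.beta = rho, beta                 # target sparsity, penalty weight

    def forward(self, x: torch.Tensor):
        z = torch.sigmoid(self.encoder(x))              # feature activations in (0, 1)
        return z, self.decoder(z)

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        z, x_hat = self(x)
        rec = F.mse_loss(x_hat, x)                      # reconstruction term ||X - X_hat||^2
        rho_hat = z.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # mean activation of each feature
        kl = (self.rho * torch.log(self.rho / rho_hat)
              + (1 - self.rho) * torch.log((1 - self.rho) / (1 - rho_hat)))
        return rec + self.beta * kl.sum()               # L_rec + beta * sum_i KL(rho || rho_hat_i)
```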

  • Token-Level Feature Encoding and Pooling: For generated output tokens $y_{1:T}$, hidden states $h_t = P^{(L)}(y_{1:t}, q, C)$ are encoded to sparse vectors $z_t = E(h_t)$. Channel-wise max pooling yields an instance vector $Z_k = \max_{1 \le t \le T} z_{t,k}$.
  • Feature Selection and Additive Modeling: Not all features are informative for hallucination. Selection is performed by estimating mutual information (MI) with ground-truth labels, discretizing $Z_k$ into 50 quantiles, and retaining the $K'$ features with highest $I(Z_k; \ell)$. A generalized additive model (GAM) is fit:

$$g\left(\mathbb{E}[\ell \mid z_S]\right) = \beta_0 + \sum_{j=1}^{K'} f_j(z_{s_j})$$

where $f_j$ are univariate shape functions, $g$ is the logit link, and $z_S$ is the MI-selected feature subvector.
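
The following sketch illustrates the pooling, MI-based selection, and GAM-fitting steps; the array shapes, 50-quantile binning, $K' = 32$, and the pygam dependency are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from pygam import LogisticGAM

def pool_instance(z_tokens: np.ndarray) -> np.ndarray:
    """Channel-wise max pooling over token-level SAE codes: (T, K) -> (K,)."""
    return z_tokens.max(axis=0)

def select_by_mi(Z: np.ndarray, labels: np.ndarray, k_prime: int = 32,
                 n_bins: int = 50) -> np.ndarray:
    """Rank features by I(Z_k; label) after quantile discretization; keep the top K'."""
    mi = np.empty(Z.shape[1])
    for k in range(Z.shape[1]):
        edges = np.unique(np.quantile(Z[:, k], np.linspace(0, 1, n_bins + 1)[1:-1]))
        mi[k] = mutual_info_score(labels, np.digitize(Z[:, k], edges))
    return np.argsort(mi)[::-1][:k_prime]

# Z: (n_instances, K) pooled SAE features; y: binary hallucination labels.
# selected = select_by_mi(Z, y)
# gam = LogisticGAM().fit(Z[:, selected], y)   # pygam fits one univariate spline per feature
# scores = gam.predict_proba(Z[:, selected])
```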

Inference and Thresholding

RAGLens detection follows:

  1. Hidden state extraction for each output token;
  2. SAE encoding and pooling to feature vector;
  3. Restricting to the $K'$ selected features;
  4. GAM scoring;
  5. Thresholding on the score $s > \tau$ (default $\tau = 0.5$).

ROC-derived cutoffs may be additionally applied for task-specific operating points.
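
A compact end-to-end scoring sketch following these steps; it reuses the hypothetical `sae`, `selected`, and `gam` objects from the snippets above and assumes a Hugging Face-style model interface and an illustrative mid-layer index.

```python
import numpy as np
import torch

@torch.no_grad()
def score_answer(model, tokenizer, prompt: str, answer: str,
                 sae, selected: np.ndarray, gam,
                 layer: int = 16, tau: float = 0.5):
    """Return (hallucination probability, flag) for one generated answer."""
    enc = tokenizer(prompt + answer, return_tensors="pt")
    out = model(**enc, output_hidden_states=True)
    n_ans = len(tokenizer(answer, add_special_tokens=False)["input_ids"])
    h = out.hidden_states[layer][0, -n_ans:]               # (T, d) states of answer tokens
    z, _ = sae(h)                                          # (T, K) sparse feature codes
    Z = z.max(dim=0).values.cpu().numpy()                  # channel-wise max pooling
    p = float(gam.predict_proba(Z[selected][None, :])[0])  # GAM score in [0, 1]
    return p, p > tau                                      # flag if score exceeds tau
```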

3. Interpretability and Rationale Generation

The RAGLens GAM’s additive structure enables both local and global interpretability:

  • Instance-level: For any example, the score decomposes into feature-wise contributions $f_j(z_{s_j})$, and the maximally contributing token can be reverse-traced, yielding token-level rationale spans for flagged hallucinations.
  • Model-level: SAE features typically map to coherent semantic concepts (e.g., “unsupported numeric/time specifics–high risk”), and the dependence of hallucination likelihood on feature activation is visualizable via learned shape functions.

This interpretability supports downstream actions, such as targeted post-hoc mitigation and causal interventions at the level of specific features or activations (Xiong et al., 9 Dec 2025).
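
As a concrete illustration of this decomposition, the sketch below recovers per-feature logit contributions via pygam's partial dependence (assumed available with this signature) and traces each selected feature back to the answer token whose activation survived max pooling; it builds on the hypothetical objects from the earlier snippets.

```python
import numpy as np

def explain_instance(gam, Z_sel: np.ndarray, z_tokens: np.ndarray,
                     selected: np.ndarray, answer_tokens, top_m: int = 3):
    """Return the top-m (feature id, contribution, token) rationales for one answer."""
    contribs = np.array([
        float(gam.partial_dependence(term=j, X=Z_sel[None, :])[0])  # f_j(z_{s_j}) on the logit scale
        for j in range(len(selected))
    ])
    order = np.argsort(contribs)[::-1][:top_m]        # features pushing hardest toward "hallucination"
    rationales = []
    for j in order:
        k = int(selected[j])                          # original SAE feature index
        t_star = int(z_tokens[:, k].argmax())         # token that produced the max-pooled activation
        rationales.append((k, float(contribs[j]), answer_tokens[t_star]))
    return rationales
```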

4. Experimental Results and Empirical Insights

Detection Performance

RAGLens outperforms previous methods on multiple evaluation sets:

Dataset/Model          Prior Best AUC    RAGLens AUC    Δ (AUC)
---------------------  ----------------  -------------  -------
RAGTruth/Llama2-7B     0.7458            0.8413         +0.10
Dolly/Llama2-7B        0.7949            0.8764         +0.08

Consistent improvements extend to Llama2-13B, Llama3, and Qwen architectures.

Ablation Findings

  • Layer choice: Mid-layers of LLMs yield highest detection accuracy.
  • Feature extraction: Pre-activation SAE features superior to post-activation.
  • Feature count: MI-based selection provides graceful degradation with lower $K'$, whereas random selection collapses quickly.
  • Predictor: GAM outperforms logistic regression, XGBoost, and MLP despite its additive restrictions.

This suggests strong alignment between monosemantic SAE features and hallucination signals concentrated in specific network layers.

5. Production-Oriented RAG Evaluation with RAGLens

A complementary axis of RAGLens is a reproducible, auditable framework for RAG pipeline evaluation (Dallaire, 12 Nov 2025). Key components include:

Rarity-Aware Set-Based Metrics

  • RA-nWG@K: Per-query normalized set gain metric emphasizing rare and decisive evidence, mitigating over-incentivization of abundant but low-utility passages.

$$\text{RA-nWG@K} = \begin{cases} \frac{G_{\mathrm{obs}}(K)}{G_{\mathrm{ideal}}(K)} & \text{if } G_{\mathrm{ideal}}(K) > 0 \\ \text{NA} & \text{otherwise} \end{cases}$$

Rarity-aware weights $w_g$ penalize missed scarce items more than common ones, capped to prevent overweighting.

  • Operational Headroom (PROC and %PROC): PROC@K quantifies the oracle-attainable gain within the retrieval pool versus the ground truth; %PROC benchmarks ordering performance within the candidate set, separating retrieval headroom from ordering inefficiency. A sketch of both metrics follows this list.
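
The sketch below gives one plausible instantiation of these set-based metrics. The inverse-prevalence weighting, the cap, and the per-item gain definition are assumptions made for illustration; the paper's exact formulas for $G_{\mathrm{obs}}$ and $G_{\mathrm{ideal}}$ are not reproduced here.

```python
import numpy as np

def rarity_weights(prevalence: np.ndarray, w_max: float = 10.0) -> np.ndarray:
    """Inverse-prevalence weights w_g for gold passages, capped to avoid overweighting."""
    return np.minimum(1.0 / np.maximum(prevalence, 1e-9), w_max)

def ra_nwg_at_k(retrieved_ids, gold, k):
    """RA-nWG@K: weighted gain of the retrieved top-K set over the ideal top-K gain.

    `gold` maps relevant passage id -> rarity weight w_g; returns None (NA) when
    the ideal gain for the query is zero.
    """
    g_ideal = sum(sorted(gold.values(), reverse=True)[:k])
    if g_ideal <= 0:
        return None
    g_obs = sum(gold.get(pid, 0.0) for pid in retrieved_ids[:k])
    return g_obs / g_ideal

def pct_proc_at_k(ranked_ids, pool_ids, gold, k):
    """%PROC: observed top-K gain over the best gain attainable from the candidate pool,
    isolating ordering quality from retrieval headroom."""
    oracle = sum(sorted((gold.get(pid, 0.0) for pid in pool_ids), reverse=True)[:k])
    if oracle <= 0:
        return None
    obs = sum(gold.get(pid, 0.0) for pid in ranked_ids[:k])
    return obs / oracle
```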

Cost–Latency–Quality (CLQ) Analysis

CLQ organizes design decisions along axes of computational expenditure, latency (including embed, retrieval, and rerank durations), and retrieval/generation quality. Systematic Pareto optimization and efficiency tie-breakers are prescribed for operator trade-off tuning.
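
A small sketch of how such a sweep might be organized; the axis names, units, and the quality-per-cost tie-breaker are illustrative choices rather than the framework's prescribed procedure.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    cost: float        # e.g. dollars per 1k queries
    latency_ms: float  # end-to-end embed + retrieve + rerank latency
    quality: float     # e.g. mean RA-nWG@K on a golden set

def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on at least one."""
    no_worse = a.cost <= b.cost and a.latency_ms <= b.latency_ms and a.quality >= b.quality
    better = a.cost < b.cost or a.latency_ms < b.latency_ms or a.quality > b.quality
    return no_worse and better

def pareto_front(configs):
    front = [c for c in configs if not any(dominates(o, c) for o in configs)]
    # Efficiency tie-breaker: prefer higher quality per unit cost among non-dominated configs.
    return sorted(front, key=lambda c: c.quality / max(c.cost, 1e-9), reverse=True)
```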

Golden-Set Construction (rag-gs Pipeline)

The rag-gs pipeline comprises six stages (embedding, retrieval, merging, LLM grading, pruning, and iterative Plackett–Luce refinement with uncertainty-aware locks) and yields stable, reproducible golden sets minimally influenced by LLM-judge variance.
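
To make the refinement stage concrete, the sketch below implements a standard Plackett–Luce aggregation of judge rankings using minorization-maximization updates (Hunter, 2004); the uncertainty-aware locks and the other rag-gs stages are omitted, and the convergence settings are illustrative.

```python
import numpy as np

def plackett_luce_mm(rankings, n_items: int, n_iters: int = 200, tol: float = 1e-8):
    """Fit Plackett-Luce worths from judge rankings (each ranking lists item ids, best first)."""
    gamma = np.ones(n_items)
    # w[i] = number of stages, across all rankings, at which item i is the chosen winner.
    w = np.zeros(n_items)
    for r in rankings:
        for item in r[:-1]:
            w[item] += 1
    for _ in range(n_iters):
        denom = np.zeros(n_items)
        for r in rankings:
            remaining = gamma[r].sum()
            for t in range(len(r) - 1):               # stages where a choice is actually made
                denom[r[t:]] += 1.0 / remaining       # every item still in the pool at stage t
                remaining -= gamma[r[t]]              # the winner leaves the pool
        new_gamma = np.where(denom > 0, w / np.maximum(denom, 1e-12), gamma)
        new_gamma /= new_gamma.sum()                  # normalize for identifiability
        if np.abs(new_gamma - gamma).max() < tol:
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma  # higher worth -> higher consensus position in the golden set

# Example: three judges rank four candidate passages (ids 0..3), best first.
# worths = plackett_luce_mm([[0, 2, 1, 3], [0, 1, 2, 3], [2, 0, 3, 1]], n_items=4)
```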

Diagnostics and Benchmarking

Proper-name identity and conversational-noise margins (Δ-metrics) are prescribed for early identification of failure modes in retrieval and reranking, prior to full CLQ sweeps. Benchmarks indicate synergy between hybrid retrieval and reranking, operational recommendations for ANN versus quantization trade-offs, and concrete thresholds for maintaining SLA-compliant latency.

6. Applications, Extensions, and Future Directions

  • Plug-and-play Hallucination Detection: RAGLens supports application to any SAE-enabled LLM without retraining, enabling lightweight deployment in post-processing pipelines.
  • Post-Hoc Mitigation: Instance- and token-level explanations furnished by RAGLens can be re-used to guide the model toward higher factuality.
  • Causal Manipulation: Direct edits to SAE activations demonstrate potential for active steering toward faithful behavior.
  • Broader Evaluation: RAGLens metrics and diagnostics enable practitioners to reproducibly audit, compare, and optimize RAG stack choices across budget, latency, and utility dimensions while making retrieval and ordering headroom explicit.
  • Potential Extensions: Integration of SAE-based feature tracing into real-time generation, expansion to other failure axes such as bias, and adoption of improved sparsity-regularized algorithms for finer interpretability are immediate open lines for research (Xiong et al., 9 Dec 2025).

A plausible implication is that RAGLens, by connecting interpretability-driven detection with operationally grounded evaluation, establishes a framework for trustworthy and cost-effective RAG system deployment spanning both research and production settings.
