Can SAE features capture RAG-specific hallucination dynamics?

Determine whether sparse autoencoder (SAE) features learned from large language model hidden states can capture the interactions between retrieved evidence and generated content in retrieval-augmented generation, and thereby reflect the dynamics that give rise to RAG-specific hallucinations.

Background

Retrieval-Augmented Generation (RAG) aims to ground LLM outputs in retrieved passages, yet models frequently produce unfaithful content that contradicts or extends beyond the sources. While sparse autoencoders (SAEs) have recently been shown to expose semantically meaningful features within LLM hidden states, prior work has focused primarily on generic hallucination signals that do not account for conditioning on retrieved context.

The authors note that RAG hallucinations involve a complex interplay between retrieved evidence and generated text, raising the question of whether SAE-derived features can represent these specific dynamics. They frame this uncertainty explicitly before presenting RAGLens, their SAE-based detector, which they claim addresses this challenge empirically.
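To make the setup concrete, here is a minimal, hypothetical sketch (not the RAGLens implementation) of how SAE features might be read out from LLM hidden states over the retrieved-evidence span versus the generated-answer span of a RAG prompt. The model name, probed layer, dictionary size, and SAE weights below are placeholder assumptions; in practice the SAE would be pretrained on the model's hidden states.

```python
# Sketch only: compare SAE feature activations over evidence vs. answer tokens.
# The SAE parameters are random placeholders standing in for a trained SAE.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # assumption: any causal LM with accessible hidden states
LAYER = 6                    # assumption: residual-stream layer to probe
D_MODEL, D_SAE = 768, 4096   # assumption: model width and SAE dictionary size

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Placeholder SAE encoder parameters (a trained SAE would supply these).
W_enc = torch.randn(D_MODEL, D_SAE) * 0.02
b_enc = torch.zeros(D_SAE)
b_dec = torch.zeros(D_MODEL)

def sae_features(hidden: torch.Tensor) -> torch.Tensor:
    """Standard SAE encoder: f = ReLU((h - b_dec) W_enc + b_enc)."""
    return torch.relu((hidden - b_dec) @ W_enc + b_enc)

# Toy RAG-style input: retrieved evidence followed by a generated answer.
evidence = "Context: The Eiffel Tower was completed in 1889 in Paris."
answer = " Answer: The Eiffel Tower was completed in 1889."
n_evidence = tokenizer(evidence, return_tensors="pt")["input_ids"].shape[1]
inputs = tokenizer(evidence + answer, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER][0]

feats = sae_features(hidden)                     # (seq_len, d_sae)
evidence_feats = feats[:n_evidence].mean(dim=0)  # pooled over evidence tokens
answer_feats = feats[n_evidence:].mean(dim=0)    # pooled over answer tokens

# Features active in the answer but quiet over the evidence are candidate
# signals of content that extends beyond the retrieved sources.
gap = answer_feats - evidence_feats
top = torch.topk(gap, k=10).indices
print("Candidate 'beyond-evidence' SAE features:", top.tolist())
```

Whether such evidence-versus-generation activation patterns actually track RAG-specific hallucinations, rather than generic uncertainty or topical drift, is precisely the open question the paper poses.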

References

While recent work has explored the use of SAEs to detect signals associated with generic LLM hallucinations (Ferrando et al., 2025; Suresh et al., 2025; Abdaljalil et al., 2025; Tillman & Mossing, 2025; Xin et al., 2025), hallucinations in RAG settings pose unique challenges due to the complex interplay between retrieved evidence and generated content. It remains unclear whether SAE features can effectively capture these dynamics.

Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders (2512.08892 - Xiong et al., 9 Dec 2025) in Section 1 (Introduction)