Can SAE features capture RAG-specific hallucination dynamics?
Determine whether sparse autoencoder (SAE) features learned from the hidden states of large language models can capture the interactions between retrieved evidence and generated content in retrieval-augmented generation (RAG), and thereby reflect the dynamics that give rise to RAG-specific hallucinations.
References
While recent work has explored the use of SAEs to detect signals associated with generic LLM hallucinations (Ferrando et al., 2025; Suresh et al., 2025; Abdaljalil et al., 2025; Tillman & Mossing, 2025; Xin et al., 2025), hallucinations in RAG settings pose unique challenges due to the complex interplay between retrieved evidence and generated content. It remains unclear whether SAE features can effectively capture these dynamics.