Predicting LLM Failures from Internal Generation Dynamics

Determine whether large language models (LLMs) can anticipate their own failures by examining their internal generation dynamics, specifically the evolving hidden states and attention-routing patterns produced during inference, in order to enable intrinsic self-verification without external judges or multi-sample consistency checks.

Background

LLMs often produce fluent but incorrect outputs and struggle to recognize their own mistakes. Most existing correctness-estimation methods rely on signals external to the model’s internal dynamics, such as external judges, multi-sample agreement, or text-based confidence statements; these signals either add substantial inference compute or correlate only weakly with true correctness.
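To make the compute cost of such external signals concrete, the sketch below implements a generic multi-sample agreement baseline of the kind critiqued above, using the Hugging Face transformers API. The model choice, sampling settings, and exact-match answer comparison are illustrative assumptions, not details from the paper.

```python
# Illustrative multi-sample agreement baseline (self-consistency style).
# MODEL_NAME and the string-match agreement rule are placeholders.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def agreement_confidence(prompt: str, k: int = 8) -> float:
    """Sample k completions and return the majority-answer frequency.

    Requires k full generations per query: the extra inference compute
    that intrinsic self-verification aims to avoid."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.8,
            max_new_tokens=64,
            num_return_sequences=k,
            pad_token_id=tokenizer.eos_token_id,
        )
    prompt_len = inputs["input_ids"].shape[1]
    answers = [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
        for seq in outputs
    ]
    _, count = Counter(answers).most_common(1)[0]
    return count / k  # high agreement serves as a proxy for correctness

print(agreement_confidence("Q: What is 17 * 24? A:"))
```

Note that the cost scales linearly with k, and agreement among samples can be high even when all samples share the same systematic error, which is one reason such signals correlate imperfectly with correctness.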

This work proposes Gnosis, a lightweight self-awareness mechanism that decodes reliability cues from internal hidden states and attention patterns. The authors frame the open question of whether correctness can be read directly from the intrinsic generation process, rather than from external supervision, as the motivation for investigating trajectory-level internal signals.
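As a rough illustration of what decoding reliability cues from internal dynamics could look like, the sketch below extracts two trajectory-level signals (step-to-step hidden-state drift and attention entropy) and feeds them to a small probe. Both the feature choices and the probe are assumptions made for this example; they are not the Gnosis architecture itself, whose details are given in the paper.

```python
# A minimal sketch of predicting reliability from internal generation
# dynamics, assuming access to per-step hidden states and attentions
# via Hugging Face transformers. Features and probe are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def trajectory_features(text: str) -> torch.Tensor:
    """Summarise a generation trajectory with two scalar signals:
    mean drift of the last-layer hidden state between adjacent
    positions, and mean entropy of last-layer attention rows."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs,
                    output_hidden_states=True,
                    output_attentions=True)
    h = out.hidden_states[-1][0]             # (seq_len, d_model)
    drift = (h[1:] - h[:-1]).norm(dim=-1).mean()
    attn = out.attentions[-1][0]             # (heads, seq_len, seq_len)
    entropy = -(attn * (attn + 1e-9).log()).sum(-1).mean()
    return torch.stack([drift, entropy])

# A lightweight probe mapping trajectory features to a correctness
# estimate; in practice it would be trained on labeled
# (trajectory, correct/incorrect) pairs.
probe = torch.nn.Sequential(torch.nn.Linear(2, 1), torch.nn.Sigmoid())
p_correct = probe(trajectory_features("Q: What is 17 * 24? A: 408"))
print(float(p_correct))
```

The appeal of this family of approaches is that the features come from a single forward pass the model already performs, so the only added cost is the small probe, in contrast to the k-fold generation cost of the agreement baseline above.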

References

A fundamental open question is whether LLMs can anticipate their own failures by examining the internal dynamics that govern their generation process.

Ghasemabadi et al., "Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits," arXiv:2512.20578, 23 Dec 2025, Section 1 (Introduction).