Safe use of confession signals for training interventions
Determine whether there is any safe way to use signals derived from ConfessionReport outputs for training interventions, rather than for monitoring alone, without undermining confession honesty or incentivizing obfuscation. Specifically, ascertain whether the confession signal can be incorporated into training of the primary task policy while preserving the integrity of that signal.
References
It remains an open question, which we leave for future work, whether there is any safe way to use the signal from confessions for training (and not just monitoring) interventions.
— Training LLMs for Honesty via Confessions
(2512.08093 - Joglekar et al., 8 Dec 2025) in Section 9.2 (Comparison to chain-of-thought monitoring)