Safe use of confession signals for training interventions

Determine whether any safe method exists for using signals derived from ConfessionReport outputs to train models for interventions beyond monitoring, without undermining confession honesty or incentivizing obfuscation. Specifically, ascertain whether the confession signal can be incorporated into training of the primary task policy while preserving the signal's integrity.

Background

The paper proposes training LLMs to produce confessions—structured self-assessments of compliance with instructions and policies—whose rewards are kept separate from rewards for the original task to avoid penalizing honesty. Confessions are intended primarily for monitoring and diagnostics rather than directly shaping the model’s behavior on the task.
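The reward separation described above can be illustrated with a toy sketch. This is not the paper's implementation; the function names, the `<confession>` marker token, and the per-token credit-assignment scheme are all hypothetical, chosen only to show how a confession channel can be rewarded without its score ever flowing into the task channel.

```python
# Hypothetical sketch: per-token credit assignment that keeps the
# confession's honesty reward isolated from the task reward, so a
# confession that admits a violation cannot lower the task reward.

def split_rewards(task_score: float, honesty_score: float) -> dict:
    """Return the two reward channels separately; they are never summed,
    so honesty in the confession is not traded off against the task."""
    return {"task": task_score, "confession": honesty_score}

def per_token_advantages(tokens: list, rewards: dict) -> list:
    """Credit tokens before the (hypothetical) <confession> marker with
    the task reward only, and confession tokens with the honesty reward
    only, mimicking separate training signals for the two segments."""
    advantages = []
    in_confession = False
    for tok in tokens:
        if tok == "<confession>":
            in_confession = True
        advantages.append(rewards["confession"] if in_confession
                          else rewards["task"])
    return advantages
```

For example, a rollout whose task segment scored 1.0 but whose confession honestly reported a policy violation (honesty reward 0.5) would assign 1.0 to every task token and 0.5 to every confession token; the admission never reduces the task-segment credit, which is the "safe space" property the authors want to preserve.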

In comparing confessions to chain-of-thought monitoring, the authors emphasize that confessions should be a "safe space" not used to reinforce the original rollout, and they explicitly flag uncertainty about whether confession signals can be safely used for training. Establishing safe ways to leverage confession signals for training would materially expand their utility beyond monitoring.

References

It remains an open question, which we leave for future work, whether there is any safe way to use the signal from confessions for training (and not just monitoring) interventions.

Training LLMs for Honesty via Confessions (2512.08093 - Joglekar et al., 8 Dec 2025), Section 9.2 (Comparison to chain-of-thought monitoring)