Realistic constraints on manipulation of honest calibration episodes
Determine how far an untrusted language model agent could realistically manipulate the “honest” episodes used for calibration so as to raise monitor suspiciousness scores without triggering shutdown, and build those limits into monitor calibration and control evaluations so that the resulting degradation cannot be exploited.
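The following toy sketch illustrates why such manipulation could matter. It is not the method of the cited paper. It assumes, purely for illustration, that the monitor threshold is set at a fixed false positive rate on honest-episode suspiciousness scores, that scores are Gaussian, and that the agent can raise each honest score by a bounded amount. The names `calibrate_threshold`, `inflate`, `shift`, and `cap` are hypothetical.

```python
import random


def calibrate_threshold(honest_scores, fpr=0.01):
    """Threshold that flags roughly `fpr` of the honest episodes (assumed calibration rule)."""
    ranked = sorted(honest_scores)
    # Index of the (1 - fpr) empirical quantile, clamped to the last element.
    idx = min(len(ranked) - 1, int((1.0 - fpr) * len(ranked)))
    return ranked[idx]


def catch_rate(attack_scores, threshold):
    """Fraction of attack episodes scoring strictly above the threshold."""
    return sum(s > threshold for s in attack_scores) / len(attack_scores)


def inflate(honest_scores, shift, cap):
    """Hypothetical manipulation: raise each honest score by `shift`, but never
    push a score past `cap`, a stand-in for the level at which the manipulation
    itself would be noticed. No score is ever lowered."""
    return [max(s, min(s + shift, cap)) for s in honest_scores]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic scores; the distributions are assumptions, not measured data.
    honest = [rng.gauss(2.0, 1.0) for _ in range(2000)]
    attack = [rng.gauss(5.0, 1.0) for _ in range(2000)]

    clean_t = calibrate_threshold(honest)
    for shift in (0.0, 0.5, 1.0, 2.0):
        manip_t = calibrate_threshold(inflate(honest, shift, cap=6.0))
        print(f"shift={shift:.1f}  threshold={manip_t:.2f}  "
              f"catch_rate={catch_rate(attack, manip_t):.3f}  "
              f"(clean threshold={clean_t:.2f})")
```

In this toy model the parameters `shift` and `cap` play the role of the "realistic constraints" the problem asks for: the larger the manipulation permitted, the higher the calibrated threshold and the lower the fraction of attacks caught, so the conclusions of a control evaluation depend directly on which values are considered plausible.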
References
We show blue team results are worse if we allow the red team to modify honest episodes in Appendix E (although it is difficult to know what degree of modification is realistic).
— Async Control: Stress-testing Asynchronous Control Measures for LLM Agents
(2512.13526 - Stickland et al., 15 Dec 2025) in Section 6, Discussion (Reasons why we may overestimate blue team performance)