Realistic constraints on manipulation of honest calibration episodes

Determine realistic constraints on how far an untrusted language model agent could manipulate “honest” calibration episodes to raise monitor suspiciousness scores without triggering shutdown, and incorporate those constraints into monitor calibration and control evaluations so that the monitor cannot be exploitably degraded.

Background

Monitors are calibrated to a fixed false positive rate using honest episodes: the suspicion threshold is chosen so that only that fraction of honest episodes is flagged. The authors show in Appendix E that allowing the red team to influence this calibration data can degrade monitor performance by several percentage points, revealing a potential vulnerability.
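The mechanism can be illustrated with a minimal sketch. It is not the paper's procedure: the Gaussian score distributions, the `manipulate` helper, and the `shutdown_score` cap (a stand-in for "without triggering shutdown") are all hypothetical assumptions chosen for illustration. The sketch calibrates a threshold on honest scores, then recalibrates after a fraction of honest episodes has been nudged upward, and compares how many attack episodes each threshold catches.

```python
import random


def calibrate_threshold(honest_scores, target_fpr):
    """Pick a threshold so at most target_fpr of honest scores lie strictly above it."""
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(honest_scores)
    n = len(ranked)
    allowed = int(target_fpr * n + 1e-9)  # honest episodes we tolerate flagging
    return ranked[n - allowed - 1]


def flag_rate(scores, threshold):
    """Fraction of episodes whose suspicion score is strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def manipulate(honest_scores, fraction, shift, shutdown_score):
    """Hypothetical red-team move: raise the first `fraction` of honest scores by
    `shift` (assumed non-negative), never pushing a score past `shutdown_score`."""
    k = int(fraction * len(honest_scores))
    bumped = [min(s + shift, max(s, shutdown_score)) for s in honest_scores[:k]]
    return bumped + list(honest_scores[k:])


# Synthetic scores; these distributions are assumptions, not data from the paper.
rng = random.Random(0)
honest = [rng.gauss(2.0, 1.0) for _ in range(2000)]
attack = [rng.gauss(5.0, 1.5) for _ in range(2000)]

target_fpr = 0.01
clean_thr = calibrate_threshold(honest, target_fpr)

# Calibration data after the hypothetical manipulation.
tainted = manipulate(honest, fraction=0.2, shift=1.5, shutdown_score=6.0)
manip_thr = calibrate_threshold(tainted, target_fpr)

print("clean threshold:      ", round(clean_thr, 3))
print("manipulated threshold:", round(manip_thr, 3))
print("attack catch rate (clean):      ", flag_rate(attack, clean_thr))
print("attack catch rate (manipulated):", flag_rate(attack, manip_thr))
```

Because the manipulation only raises scores, the recalibrated threshold can never be lower than the clean one, so the catch rate on attacks can only stay the same or drop. How large `fraction` and `shift` could realistically be is exactly what this problem leaves open.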

However, the authors state that it is difficult to know what degree of such manipulation is realistic in practice. Realistic bounds on this manipulation are needed to design robust calibration procedures and fair, defensible control evaluations.

References

We show blue team results are worse if we allow the red team to modify honest episodes in Appendix E (although it is difficult to know what degree of modification is realistic).

— Async Control: Stress-testing Asynchronous Control Measures for LLM Agents (2512.13526 - Stickland et al., 15 Dec 2025) in Section 6, Discussion (Reasons why we may overestimate blue team performance)
