Independence of sabotage attempts across episodes

Determine whether sabotage attempts by language-model-based agents are statistically independent across episodes in deployments that use asynchronous monitoring. If they are not, characterize the temporal or strategic correlations that would invalidate the independence assumptions used in safety models.

Background

The safety extrapolation in Stickland et al. (2025) relies on simplifying assumptions, including that attacks are independent across episodes. The authors warn that if this assumption is violated, for example because agents coordinate attacks or build on prior actions, the resulting estimates become much less useful.
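To make the stakes concrete, here is a toy calculation. It is an illustrative assumption and not the model used in the paper. Suppose a monitor detects each attack with probability p_detect. If detections are independent across attacks, the chance that every one of n_attacks evades detection shrinks geometrically. If instead the attacks share one strategy, so the monitor either catches all of them or none, that chance does not shrink at all. The function names and the "fully correlated" case are hypothetical and chosen only to bracket the two extremes.

```python
def p_all_evade_independent(p_detect: float, n_attacks: int) -> float:
    """Probability that all attacks evade the monitor, assuming each
    detection is an independent event with probability p_detect."""
    return (1.0 - p_detect) ** n_attacks


def p_all_evade_fully_correlated(p_detect: float, n_attacks: int) -> float:
    """Same quantity under a hypothetical worst case: every attack reuses
    one strategy, so the monitor catches all of them or none of them."""
    if n_attacks == 0:
        return 1.0
    return 1.0 - p_detect


# With p_detect = 0.5 and 3 attacks:
independent = p_all_evade_independent(0.5, 3)        # 0.125
correlated = p_all_evade_fully_correlated(0.5, 3)    # 0.5
```

Real deployments presumably sit somewhere between these extremes, which is why the degree of dependence has to be measured rather than assumed.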

They explicitly flag the independence of attacks as an uncertainty that will remain in real deployments. Resolving this is crucial for validating the deployment risk model and for designing monitoring protocols that remain effective under potential coordination or stateful strategies.
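One possible measurement, sketched here under stated assumptions, is a permutation test for temporal clustering. It assumes we have a per-episode log of 0/1 flags marking whether an attack was attempted. That log format is hypothetical, as are the function names. The test compares the observed lag-1 autocorrelation against its distribution under random reorderings of the same flags. It can only detect temporal dependence in when attacks occur. It says nothing about strategic coordination, and a slowly drifting attack rate would also trigger it.

```python
import random
from typing import Sequence


def lag1_autocorrelation(flags: Sequence[int]) -> float:
    """Lag-1 autocorrelation of a sequence of 0/1 attack flags."""
    n = len(flags)
    mean = sum(flags) / n
    var = sum((v - mean) ** 2 for v in flags)
    if var == 0:
        return 0.0  # constant sequence: no variation to correlate
    cov = sum((flags[i] - mean) * (flags[i + 1] - mean) for i in range(n - 1))
    return cov / var


def permutation_p_value(flags: Sequence[int], n_perm: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation p-value for the null hypothesis that episode
    order carries no information (flags are exchangeable)."""
    rng = random.Random(seed)
    observed = abs(lag1_autocorrelation(flags))
    shuffled = list(flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(lag1_autocorrelation(shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic example: attacks arrive in two bursts of ten episodes each.
bursty = [0] * 40 + [1] * 10 + [0] * 40 + [1] * 10
p_value = permutation_p_value(bursty)
```

A small p-value on real logs would be evidence against independence, while a large one would not establish independence, since dependence can take forms this statistic does not capture.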

References

But some uncertainties will remain, such as how many attacks are needed to cause harm, or whether attacks are independent (although we may be able to get some idea by careful threat modelling and additional measurements).

— Async Control: Stress-testing Asynchronous Control Measures for LLM Agents (2512.13526 - Stickland et al., 15 Dec 2025) in Appendix C, Deployment Simulation Details (Table 4 context)
