Number of successful attacks required for catastrophic harm in deployment

Determine how many successful sabotage side tasks are required to cause catastrophic harm when language-model-based software engineering agents are deployed under asynchronous monitoring, for a specific system architecture and threat model. Establishing this number is needed to make safety estimates for real deployments accurate and defensible.

Background

The paper’s deployment model and safety simulations show that the probability of safe operation depends strongly on how many successful attacks are required to cause harm. In simulations, safety improves markedly as this parameter increases from one to three, but the authors note that the true value in a real deployment would not be directly known.
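
A toy calculation can illustrate why this parameter matters so much. The sketch below is not the paper's simulation. It assumes a deliberately simplified model in which attacks occur one after another, each attack is independently caught with a fixed probability (which ends the deployment safely), and harm occurs only if the required number of attacks all succeed before any is caught. The function name `safety_probability` and both of its parameters are hypothetical names introduced for illustration.

```python
def safety_probability(catch_prob: float, attacks_required: int) -> float:
    """Probability that a deployment stays safe under a toy model.

    Assumptions (illustrative only, not taken from the paper):
      - each attack is caught independently with probability `catch_prob`;
      - the first caught attack ends the deployment safely;
      - harm occurs only if `attacks_required` attacks in a row go uncaught.
    """
    if not 0.0 <= catch_prob <= 1.0:
        raise ValueError("catch_prob must lie in [0, 1]")
    if attacks_required < 1:
        raise ValueError("attacks_required must be at least 1")
    # Harm needs every one of the required attacks to evade detection.
    harm_prob = (1.0 - catch_prob) ** attacks_required
    return 1.0 - harm_prob


# Compare safety as the number of required attacks grows from one to three.
for k in (1, 2, 3):
    print(f"attacks_required={k}: safety={safety_probability(0.5, k):.3f}")
```

Under these assumptions, safety rises with every additional attack the adversary needs, so an estimate that fixes the wrong value for this parameter can substantially overstate or understate risk. Correlated attacks, which the paper also flags as uncertain, would break the independence assumption used here.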

In Appendix C, the authors emphasize that some deployment parameters will remain uncertain even at deployment time. Among these, they specifically highlight the number of attacks required to cause harm as a key unknown that materially affects safety estimates. Establishing this quantity is necessary for translating monitor performance into meaningful operational risk bounds.

References

But some uncertainties will remain, such as how many attacks are needed to cause harm, or whether attacks are independent (although we may be able to get some idea by careful threat modelling and additional measurements).

— Async Control: Stress-testing Asynchronous Control Measures for LLM Agents (2512.13526 - Stickland et al., 15 Dec 2025) in Appendix C, Deployment Simulation Details (Table 4 context)
