Evaluations that sufficiently cover model vulnerabilities
Determine methods and criteria to assess whether an evaluation procedure has identified most vulnerabilities of a given AI system, especially for capabilities that could enable harmful misuse or make systems difficult to oversee or control.
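One illustrative family of methods for this problem, not drawn from the source, borrows capture-recapture estimation from ecology: if two independent red-team passes over the same system find overlapping sets of vulnerabilities, the size of the overlap lets us estimate how many vulnerabilities remain undiscovered. The sketch below uses Chapman's bias-corrected Lincoln-Petersen estimator; the vulnerability names and the assumption of independent, equal-probability discovery are purely hypothetical.

```python
def chapman_estimate(n1: int, n2: int, overlap: int) -> float:
    """Chapman's bias-corrected capture-recapture estimate of the total
    number of items, given two independent samples of sizes n1 and n2
    that share `overlap` items. Assumes each item is equally likely to
    be found by either pass -- a strong assumption for vulnerabilities."""
    return (n1 + 1) * (n2 + 1) / (overlap + 1) - 1

# Hypothetical example: two independent red-team passes over one model.
pass_a = {"prompt_injection", "jailbreak_roleplay", "data_exfil", "tool_misuse"}
pass_b = {"jailbreak_roleplay", "data_exfil", "reward_hacking"}

n1, n2 = len(pass_a), len(pass_b)
overlap = len(pass_a & pass_b)

total_est = chapman_estimate(n1, n2, overlap)   # estimated total vulnerabilities
found = len(pass_a | pass_b)                    # distinct vulnerabilities found so far
coverage = found / total_est                    # estimated fraction identified
print(f"found {found}, estimated total {total_est:.1f}, coverage {coverage:.0%}")
```

A low estimated coverage would signal that the evaluation procedure is far from exhaustive; in practice, vulnerability discovery probabilities are neither uniform nor independent, so such estimates give at best a rough lower bound on what remains.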
References
Determining whether an evaluation procedure has identified most, if not all, of the vulnerabilities of a system is an open problem.
— Open Problems in Technical AI Governance
(arXiv:2407.14981, Reuel et al., 20 Jul 2024), Section 3.3.1, Reliable Evaluations