Calibrating evaluation rigor for high-stakes AI applications
Determine principled criteria and methods for setting the required level of rigor and confidence in evaluations of AI models used in high-stakes decision-making systems, so that evaluation strength scales with use-case risk.
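One way to make "evaluation strength scales with risk" concrete is to tie each risk tier to a target confidence level and precision for the measured metric, then derive the evaluation-set size needed to meet that target. The sketch below is purely illustrative (not a method from the cited paper): the risk tiers and their parameters are hypothetical, and it uses a Hoeffding bound to size an accuracy evaluation.

```python
import math

def required_eval_samples(half_width: float, confidence: float) -> int:
    """Samples needed so an empirical accuracy estimate lies within
    +/- half_width of the true accuracy with the given confidence,
    via the Hoeffding bound: n >= ln(2/alpha) / (2 * half_width^2)."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(2.0 / alpha) / (2.0 * half_width ** 2))

# Hypothetical risk tiers: higher-stakes use cases demand tighter,
# more confident estimates, and therefore larger evaluation sets.
RISK_TIERS = {
    "low":    {"half_width": 0.05, "confidence": 0.90},
    "medium": {"half_width": 0.02, "confidence": 0.95},
    "high":   {"half_width": 0.01, "confidence": 0.99},
}

for tier, params in RISK_TIERS.items():
    print(f"{tier:>6}: n >= {required_eval_samples(**params)}")
```

The Hoeffding bound is deliberately conservative; in practice one might use tighter binomial (e.g., Clopper-Pearson) intervals, but the qualitative point stands: required evaluation effort grows rapidly as the tolerated error shrinks, which is one candidate lever for calibrating rigor to risk.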
References
In addition, evaluations for decision-making systems in high-stakes settings will likely demand a higher level of confidence than other applications, but it is unclear how to determine the required level of rigor based on use case.
— Open Problems in Technical AI Governance
(2407.14981 - Reuel et al., 20 Jul 2024) in Section 3.3.1 “Reliable Evaluations”