
Required rigor for high-stakes AI evaluations

Determine how to set the appropriate level of rigor for evaluations of decision-making AI systems in high-stakes settings, relative to other applications, so that confidence levels match the potential consequences of deployment.


Background

Different deployment contexts demand different levels of evaluation confidence. High-stakes uses (e.g., health, finance, critical infrastructure) arguably require stronger assurance than consumer applications, but current practice lacks principled guidance on how to tailor rigor to context.

Clarifying how to set and justify rigor thresholds by use case would support risk-proportional governance and compliance efforts.

References

In addition, evaluations for decision-making systems in high-stakes settings will likely demand a higher level of confidence than other applications, but it is unclear how to determine the required level of rigor based on use case.

Open Problems in Technical AI Governance (2407.14981 - Reuel et al., 20 Jul 2024) in Section 3.3.1 Reliable Evaluations