Risk Thresholds for Frontier AI
The paper "Risk thresholds for frontier AI" by Leonie Koessler, Jonas Schuett, and Markus Anderljung from the Centre for the Governance of AI examines how risk thresholds can be used to manage the development and deployment of frontier AI systems, i.e. highly capable general-purpose models. Because these systems pose growing risks to public safety and security, there is increasing scrutiny of how those risks can be systematically managed.
Summary of Concepts
The authors distinguish between three types of thresholds:
- Compute Thresholds: Defined by the computational resources used to train a model. Compute is easy to measure but only a coarse proxy for risk, so compute thresholds are best used as an initial filter rather than as an accurate measure of risk.
- Capability Thresholds: Defined by evaluations of a model's abilities, which can indicate potential risks. Capability thresholds are a better proxy for risk than compute thresholds, but still an imperfect one.
- Risk Thresholds: These thresholds directly estimate potential risk but are challenging to measure accurately at present. They represent a more principled approach to decision-making.
Application of Risk Thresholds
The paper advocates using risk thresholds to inform high-stakes AI development and deployment decisions, either directly by evaluating the risk posed by a decision or indirectly by setting capability thresholds that then inform decisions.
Direct Application:
- In direct applications, companies make specific decisions by comparing risk estimates to predefined risk thresholds. If the estimated risk surpasses the defined threshold, additional safety measures are required before proceeding.
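The direct application amounts to a simple decision rule: compare a risk estimate to a predefined threshold and gate the decision on the result. A minimal sketch, with hypothetical function names and invented numbers (the paper does not prescribe specific values):

```python
# Illustrative only: the threshold value and risk estimates below are
# invented for the example, not taken from the paper.

def may_proceed(estimated_risk: float, risk_threshold: float) -> bool:
    """Return True if the estimated risk is at or below the acceptable threshold."""
    return estimated_risk <= risk_threshold

# Hypothetical threshold: at most 1e-4 expected fatalities per deployment.
threshold = 1e-4
print(may_proceed(estimated_risk=5e-5, risk_threshold=threshold))  # True: may proceed
print(may_proceed(estimated_risk=2e-4, risk_threshold=threshold))  # False: more safety measures first
```

In practice the comparison is far harder than this sketch suggests, since producing a credible `estimated_risk` is precisely the difficulty the paper highlights.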
Indirect Application:
- Indirectly, risk thresholds help set capability thresholds through "risk models" that map pathways from risk factors to harm. These models aim to identify capabilities where risks might exceed acceptable thresholds and mandate necessary safety measures to mitigate those risks.
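The indirect application can be pictured as inverting a risk model: given a mapping from capability level to estimated risk, find the lowest capability at which modeled risk crosses the acceptable threshold, and set the capability threshold there. A toy sketch, with an invented linear risk model (the paper does not specify any such functional form):

```python
# Illustrative only: the risk model and all numbers are invented.

def capability_threshold(risk_model, risk_threshold, capability_levels):
    """Return the lowest capability level whose modeled risk exceeds the
    acceptable risk threshold, i.e. where extra safety measures would kick in."""
    for level in sorted(capability_levels):
        if risk_model(level) > risk_threshold:
            return level
    return None  # no level in the examined range crosses the threshold

# Toy risk model: modeled risk grows linearly with capability score.
toy_risk_model = lambda level: 1e-6 * level

print(capability_threshold(toy_risk_model,
                           risk_threshold=1e-4,
                           capability_levels=range(0, 1000, 100)))  # 200
```

A real risk model would instead trace pathways from capabilities through risk factors to harm, as the paper describes, rather than apply a single formula.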
Arguments For and Against Risk Thresholds
Favorable Points:
- Alignment with Societal Concerns: Risk thresholds focus directly on societal harms, helping ensure that the risks companies accept reflect levels society deems acceptable.
- Consistency: Risk thresholds standardize safety resource allocation across different risk types using uniform metrics.
- Actionable Estimates: They can ensure that the outputs of risk estimation processes are used in the decision-making processes rather than being ignored.
- Motivated Reasoning Reduction: By setting thresholds in advance, companies are less likely to rationalize unacceptable risks after the fact.
- Future-Proofing: Risk thresholds reduce the need to prematurely lock in specific safety measures, promoting flexible and adaptable safety practices.
Concerns:
- Estimation Challenges: Accurate risk estimation is extremely difficult due to the novelty and potential impact of AI technologies.
- General-Purpose Nature of AI: AI's dual-use nature makes it hard to foresee all possible consequences.
- Incentive for Low Estimates: Thresholds might drive organizations to manipulate risk estimates downward to meet acceptable thresholds.
- Defining Acceptable Risk Levels: Establishing universally acceptable risk levels, especially in the absence of historical data, involves complex, normatively challenging decisions.
Defining Risk Thresholds
The paper proposes a structured framework for defining risk thresholds, considering the type and scope of risk, and handling key normative trade-offs involving the balance of harms and benefits, mitigation costs, and uncertainties.
Type of Risk:
- This involves specifying which risks the threshold will apply to, such as fatalities or economic damages, and distinguishing them further by their domain or modality. Temporal and territorial scopes must also be clarified.
Level of Risk:
- Acceptable risk levels can be determined through various approaches: analyzing revealed preferences, emulating other industries, or conducting systematic cost-benefit analyses. Normative judgments involved in these decisions include weighing the comparative value of different harms and benefits and accounting for mitigation costs and uncertainty.
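A systematic cost-benefit analysis of the kind mentioned above typically rests on aggregating expected harm across risk scenarios (probability times severity) and comparing the total to a candidate threshold. A minimal sketch with invented scenario numbers, purely to make the arithmetic concrete:

```python
# Illustrative only: scenarios, probabilities, and severities are invented.
scenarios = [
    {"probability": 1e-4, "expected_fatalities": 100},     # frequent-ish, moderate harm
    {"probability": 1e-6, "expected_fatalities": 10_000},  # rare, catastrophic harm
]

# Expected harm = sum over scenarios of probability * severity.
expected_harm = sum(s["probability"] * s["expected_fatalities"] for s in scenarios)
print(expected_harm)  # 0.01 + 0.01 = 0.02 expected fatalities
```

Note how the rare catastrophic scenario contributes as much expected harm as the likelier moderate one; the normative judgments the paper discusses (e.g. whether to weight catastrophic outcomes more heavily than this risk-neutral sum does) sit on top of such calculations.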
Conclusion and Implications
The authors conclude that while risk thresholds represent an ideal tool for AI regulation, their application should presently be limited to informing decisions rather than determining them due to the challenges in precise risk estimation. They advocate further research to improve risk estimation methodologies, develop comprehensive risk models, and gather empirical data on risk scenarios. The practical recommendation is for AI companies to start employing risk thresholds to refine capability thresholds and for regulators to step in as the primary definers of risk thresholds once methodologies improve.
In summary, the paper outlines a structured approach to using risk thresholds in frontier AI regulation, stressing the need for both immediate application in an advisory capacity and long-term improvements in risk estimation techniques to support more definitive regulatory mechanisms.