Risk Thresholds for Frontier AI
The paper "Risk thresholds for frontier AI" by Leonie Koessler, Jonas Schuett, and Markus Anderljung from the Centre for the Governance of AI examines how risk thresholds can be used to manage the development and deployment of frontier AI systems, i.e. highly capable general-purpose models. Because these systems pose growing risks to public safety and security, there is increasing scrutiny of how those risks can be systematically managed.
Summary of Concepts
The authors distinguish between three types of thresholds:
- Compute Thresholds: Defined by the computational resources used to train a model. Compute is easy to measure but only a coarse proxy for risk, so compute thresholds are best used as an initial filter rather than as an accurate measure of risk.
- Capability Thresholds: Defined by evaluations of a model's abilities, which can indicate potential risks. Capability thresholds are a better proxy for risk than compute thresholds, but still an imperfect one.
- Risk Thresholds: These thresholds directly estimate potential risk but are challenging to measure accurately at present. They represent a more principled approach to decision-making.
Application of Risk Thresholds
The paper advocates using risk thresholds to inform high-stakes AI development and deployment decisions, either directly by evaluating the risk posed by a decision or indirectly by setting capability thresholds that then inform decisions.
Direct Application:
- In direct applications, companies make specific decisions by comparing risk estimates to predefined risk thresholds. If the estimated risk surpasses the defined threshold, additional safety measures are required before proceeding.
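The direct application amounts to a simple decision rule: compare a risk estimate to a predefined threshold and gate the decision on the result. A minimal sketch, with hypothetical function names and invented numbers (the paper does not prescribe specific values):

```python
# Illustrative only: the threshold value and risk estimates below are
# invented for the example, not taken from the paper.

def may_proceed(estimated_risk: float, risk_threshold: float) -> bool:
    """Return True if the estimated risk is at or below the acceptable threshold."""
    return estimated_risk <= risk_threshold

# Hypothetical threshold: at most 1e-4 expected fatalities per deployment.
threshold = 1e-4
print(may_proceed(estimated_risk=5e-5, risk_threshold=threshold))  # True: may proceed
print(may_proceed(estimated_risk=2e-4, risk_threshold=threshold))  # False: more safety measures first
```

In practice the comparison is far harder than this sketch suggests, since producing a credible `estimated_risk` is precisely the difficulty the paper highlights.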
Indirect Application:
- Indirectly, risk thresholds help set capability thresholds through "risk models" that map pathways from risk factors to harm. These models aim to identify capabilities where risks might exceed acceptable thresholds and mandate necessary safety measures to mitigate those risks.
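The indirect application can be pictured as inverting a risk model: given a mapping from capability level to estimated risk, find the lowest capability at which modeled risk crosses the acceptable threshold, and set the capability threshold there. A toy sketch, with an invented linear risk model (the paper does not specify any such functional form):

```python
# Illustrative only: the risk model and all numbers are invented.

def capability_threshold(risk_model, risk_threshold, capability_levels):
    """Return the lowest capability level whose modeled risk exceeds the
    acceptable risk threshold, i.e. where extra safety measures would kick in."""
    for level in sorted(capability_levels):
        if risk_model(level) > risk_threshold:
            return level
    return None  # no level in the examined range crosses the threshold

# Toy risk model: modeled risk grows linearly with capability score.
toy_risk_model = lambda level: 1e-6 * level

print(capability_threshold(toy_risk_model,
                           risk_threshold=1e-4,
                           capability_levels=range(0, 1000, 100)))  # 200
```

A real risk model would instead trace pathways from capabilities through risk factors to harm, as the paper describes, rather than apply a single formula.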
Arguments For and Against Risk Thresholds
Favorable Points:
- Alignment with Societal Concerns: Risk thresholds focus directly on societal harms, helping ensure that the risks companies accept reflect levels society deems acceptable.
- Consistency: Risk thresholds standardize safety resource allocation across different risk types using uniform metrics.
- Actionable Estimates: They can ensure that the outputs of risk estimation processes are used in the decision-making processes rather than being ignored.
- Motivated Reasoning Reduction: By setting thresholds in advance, companies are less likely to rationalize unacceptable risks after the fact.
- Future-Proofing: Risk thresholds reduce the need to prematurely lock in specific safety measures, promoting flexible and adaptable safety practices.
Concerns:
- Estimation Challenges: Accurate risk estimation is extremely difficult due to the novelty and potential impact of AI technologies.
- General-Purpose Nature of AI: AI's dual-use nature makes it hard to foresee all possible consequences.
- Incentive for Low Estimates: Thresholds might drive organizations to manipulate risk estimates downward to meet acceptable thresholds.
- Defining Acceptable Risk Levels: Establishing universally acceptable risk levels, especially in the absence of historical data, involves complex, normatively challenging decisions.
Defining Risk Thresholds
The paper proposes a structured framework for defining risk thresholds, considering the type and scope of risk, and handling key normative trade-offs involving the balance of harms and benefits, mitigation costs, and uncertainties.
Type of Risk:
- This involves specifying which risks the threshold will apply to, such as fatalities or economic damages, and distinguishing them further by their domain or modality. Temporal and territorial scopes must also be clarified.
Level of Risk:
- Acceptable risk levels can be determined through various approaches: analyzing revealed preferences, emulating other industries, or conducting systematic cost-benefit analyses. Normative judgments involved in these decisions include weighing the comparative value of different harms and benefits and accounting for mitigation costs and uncertainty.
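A systematic cost-benefit analysis of the kind mentioned above typically rests on aggregating expected harm across risk scenarios (probability times severity) and comparing the total to a candidate threshold. A minimal sketch with invented scenario numbers, purely to make the arithmetic concrete:

```python
# Illustrative only: scenarios, probabilities, and severities are invented.
scenarios = [
    {"probability": 1e-4, "expected_fatalities": 100},     # frequent-ish, moderate harm
    {"probability": 1e-6, "expected_fatalities": 10_000},  # rare, catastrophic harm
]

# Expected harm = sum over scenarios of probability * severity.
expected_harm = sum(s["probability"] * s["expected_fatalities"] for s in scenarios)
print(expected_harm)  # 0.01 + 0.01 = 0.02 expected fatalities
```

Note how the rare catastrophic scenario contributes as much expected harm as the likelier moderate one; the normative judgments the paper discusses (e.g. whether to weight catastrophic outcomes more heavily than this risk-neutral sum does) sit on top of such calculations.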
Conclusion and Implications
The authors conclude that while risk thresholds represent an ideal tool for AI regulation, their application should presently be limited to informing decisions rather than determining them due to the challenges in precise risk estimation. They advocate further research to improve risk estimation methodologies, develop comprehensive risk models, and gather empirical data on risk scenarios. The practical recommendation is for AI companies to start employing risk thresholds to refine capability thresholds and for regulators to step in as the primary definers of risk thresholds once methodologies improve.
In summary, the paper outlines a structured approach to using risk thresholds in frontier AI regulation, stressing the need for both immediate application in an advisory capacity and long-term improvements in risk estimation techniques to support more definitive regulatory mechanisms.