Model evaluation for extreme risks (2305.15324v2)

Published 24 May 2023 in cs.AI

Abstract: Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.

Model Evaluation for Extreme Risks

The paper "Model evaluation for extreme risks" addresses the growing need for a systematic approach to assessing and managing the potential dangers posed by general-purpose AI systems. As capabilities of AI models advance, so do the unforeseen and possibly harmful applications. The authors argue that effective model evaluation focused on recognizing extreme risks is essential for responsible AI development and deployment. This process involves evaluating AI models for dangerous capabilities and alignment propensities to manage extreme risks proactively.

The authors present a framework that positions model evaluation as a pivotal element of AI governance. They identify two main categories of evaluation: dangerous capability evaluations, which detect whether a model could cause extreme harm, and alignment evaluations, which assess a model's propensity to apply its capabilities for harm. These evaluations serve as tools to inform stakeholders about potential AI risks and to support decisions about model training, deployment, and security practices.
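The paper describes these two evaluation families in prose and does not define a software interface, but the distinction can be made concrete with a minimal sketch. Everything below (the `EvalResult` structure, the example evaluation names, and the flagging rule) is a hypothetical illustration, not an API from the paper; it encodes one relationship the authors emphasize, namely that the most acute risk arises when a dangerous capability is paired with a propensity to use it harmfully (misuse by human actors is a separate pathway not modeled here).

```python
from dataclasses import dataclass

# Hypothetical illustration only: the paper prescribes no code interface.

@dataclass
class EvalResult:
    name: str         # e.g. "cyber-offense", "long-horizon planning"
    family: str       # "dangerous_capability" or "alignment"
    concerning: bool  # True if the evaluation surfaced a concerning result


def extreme_risk_flagged(results: list[EvalResult]) -> bool:
    """Flag a model when it both *could* cause extreme harm (capability)
    and shows a propensity to apply that capability harmfully (alignment)."""
    has_dangerous_capability = any(
        r.concerning for r in results if r.family == "dangerous_capability"
    )
    has_misaligned_propensity = any(
        r.concerning for r in results if r.family == "alignment"
    )
    return has_dangerous_capability and has_misaligned_propensity


# Example with made-up results: capability without misaligned propensity.
results = [
    EvalResult("cyber-offense", "dangerous_capability", concerning=True),
    EvalResult("persuasion", "dangerous_capability", concerning=False),
    EvalResult("deceptive-alignment probe", "alignment", concerning=False),
]
print(extreme_risk_flagged(results))  # False
```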

The paper highlights several examples of potentially dangerous capabilities, including cyber-offensive skills, deception, persuasion, political strategy, weapons acquisition, and long-horizon planning. Evaluations targeting these areas help developers build a risk profile for a model, particularly its capacity for extreme harm and its propensity to act with harmful intent.

A key component of the framework is integrating model evaluations into governance processes across the AI lifecycle: making responsible training and deployment decisions, maintaining transparency with external stakeholders, and applying robust security practices. The aim is to establish comprehensive evaluation processes that support a more controlled and better-informed expansion of AI capabilities.
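As a purely illustrative sketch of how evaluation results could feed such governance gates, the decision logic below maps outcomes at a lifecycle stage to a response. The gate names, thresholds, and return strings are assumptions for illustration, not the paper's prescribed process; "structured access" refers to the deployment approach cited in the paper's references.

```python
# Hypothetical sketch of lifecycle gating informed by evaluation results.
# Decisions shown are illustrative assumptions, not the paper's procedure.

def governance_gate(stage: str,
                    dangerous_capability_found: bool,
                    misaligned_propensity_found: bool) -> str:
    """Map evaluation outcomes at a given lifecycle stage to an illustrative decision."""
    if stage == "training" and dangerous_capability_found:
        return "pause scaling; strengthen security and report to stakeholders"
    if stage == "deployment":
        if dangerous_capability_found and misaligned_propensity_found:
            return "do not deploy; escalate for independent audit"
        if dangerous_capability_found:
            return "deploy only with structured access and monitoring"
    return "proceed under standard review"


print(governance_gate("deployment", True, False))
# -> "deploy only with structured access and monitoring"
```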

On the practical side, the authors discuss the need for both internal and external evaluation processes: internal assessments by developers, research access for outside experts, and model audits by independent parties. They recommend that AI developers craft careful internal policies and actively support external research and auditing as part of a meticulous approach to risk assessment.

While model evaluation is proposed as an essential tool for AI governance, the authors acknowledge several limitations: the difficulty of interpreting how a model will interact with external environments and tools, latent capabilities that evaluations fail to surface, and the risk of deceptive alignment, in which a model intentionally masks concerning behaviors. There is also a danger of over-relying on evaluation tools, which could breed complacency about the safety of deployed systems.

Despite these challenges, the paper's recommendations stress the importance of ongoing research to refine evaluation techniques. The authors call for a concerted effort to address current limitations, build evaluation portfolios that cover a wide spectrum of potential risks, and develop alignment evaluation strategies that assess model behavior comprehensively.

In conclusion, the paper underscores the need for a robust, evaluation-centered approach to handling extreme risks from general-purpose AI systems. It calls on researchers, AI developers, and policymakers to collaborate in establishing rigorous evaluation mechanisms and in building an ecosystem that safeguards against the unintended consequences of AI advances. The implication is that responsible practices adopted today can mitigate the extreme risks that future AI models may pose.

Authors (21)
  1. Toby Shevlane (7 papers)
  2. Sebastian Farquhar (31 papers)
  3. Ben Garfinkel (12 papers)
  4. Mary Phuong (10 papers)
  5. Jess Whittlestone (9 papers)
  6. Jade Leung (5 papers)
  7. Daniel Kokotajlo (4 papers)
  8. Nahema Marchal (11 papers)
  9. Markus Anderljung (29 papers)
  10. Noam Kolt (12 papers)
  11. Lewis Ho (9 papers)
  12. Divya Siddarth (13 papers)
  13. Shahar Avin (10 papers)
  14. Will Hawkins (13 papers)
  15. Been Kim (54 papers)
  16. Iason Gabriel (27 papers)
  17. Vijay Bolina (5 papers)
  18. Jack Clark (28 papers)
  19. Yoshua Bengio (601 papers)
  20. Paul Christiano (26 papers)
Citations (120)