Model evaluation for extreme risks (2305.15324v2)
Abstract: Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
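To make the abstract's framing concrete, here is a minimal, purely illustrative sketch (not from the paper) of how a developer might combine results from dangerous capability evaluations and alignment evaluations into a coarse deployment decision. All class names, evaluation names, scores, and thresholds below are hypothetical assumptions for illustration only.

```python
# Hypothetical sketch: combining dangerous-capability and alignment evaluation
# results into a go/no-go recommendation. Names and thresholds are illustrative.
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str          # e.g. "offensive-cyber", "manipulation" (illustrative)
    score: float       # measured capability or misalignment propensity, 0..1
    threshold: float   # level above which the result is treated as a red flag

    @property
    def flagged(self) -> bool:
        return self.score >= self.threshold


def deployment_decision(capability_evals: list[EvalResult],
                        alignment_evals: list[EvalResult]) -> str:
    """Return a coarse recommendation based on which evaluations are flagged."""
    dangerous = [e for e in capability_evals if e.flagged]
    misaligned = [e for e in alignment_evals if e.flagged]
    if dangerous and misaligned:
        return "do-not-deploy: dangerous capabilities plus misalignment signals"
    if dangerous:
        return "restrict: dangerous capabilities present; tighten security and access"
    return "proceed-with-monitoring"


# Example usage with made-up scores.
caps = [EvalResult("offensive-cyber", 0.72, 0.5),
        EvalResult("manipulation", 0.31, 0.5)]
align = [EvalResult("deception-under-pressure", 0.12, 0.4)]
print(deployment_decision(caps, align))
```

This is only one possible way to operationalize the idea that evaluation results should inform decisions about training, deployment, and security; the paper itself does not prescribe a specific decision rule.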
Authors: Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano