Coordinated pausing: An evaluation-based coordination scheme for frontier AI developers (2310.00374v1)
Abstract: As AI models are scaled up, new capabilities can emerge unintentionally and unpredictably, some of which might be dangerous. In response, dangerous capabilities evaluations have emerged as a new risk assessment tool. But what should frontier AI developers do if sufficiently dangerous capabilities are in fact discovered? This paper focuses on one possible response: coordinated pausing. It proposes an evaluation-based coordination scheme that consists of five main steps: (1) Frontier AI models are evaluated for dangerous capabilities. (2) Each time a model fails a set of evaluations, its developer pauses certain research and development activities. (3) Other developers are notified whenever a model with dangerous capabilities has been discovered, and they pause related research and development activities as well. (4) The discovered capabilities are analyzed and adequate safety precautions are put in place. (5) Developers only resume their paused activities once certain safety thresholds are reached. The paper also discusses four concrete versions of that scheme. In the first version, pausing is entirely voluntary and relies on public pressure on developers. In the second version, participating developers collectively agree to pause under certain conditions. In the third version, a single auditor evaluates the models of multiple developers, who agree to pause if any model fails a set of evaluations. In the fourth version, developers are legally required to run evaluations and to pause if dangerous capabilities are discovered. Finally, the paper discusses the desirability and feasibility of the proposed coordination scheme. It concludes that coordinated pausing is a promising mechanism for tackling emerging risks from frontier AI models, but that a number of practical and legal obstacles need to be overcome, especially the risk of violating antitrust law.
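Read procedurally, the five-step scheme amounts to a simple coordination protocol: evaluate, pause, notify, analyze, and conditionally resume. The Python sketch below is purely illustrative and not part of the paper; the class names, the boolean evaluation outcome, and the safety-threshold check are hypothetical placeholders standing in for the actual dangerous-capability evaluations and resumption criteria such a scheme would rely on.

```python
# Hypothetical sketch of the five-step coordinated-pausing scheme.
# All names, signals, and thresholds are placeholder assumptions, not from the paper.
from dataclasses import dataclass, field


@dataclass
class Developer:
    name: str
    paused: bool = False


@dataclass
class CoordinationScheme:
    developers: list[Developer] = field(default_factory=list)

    def evaluate(self, model_id: str, dangerous_capability_found: bool, owner: Developer) -> None:
        """Step 1: run dangerous-capability evaluations on a frontier model."""
        if dangerous_capability_found:
            self.pause(owner)                    # Step 2: the developer pauses certain R&D
            self.notify_others(owner, model_id)  # Step 3: other developers are notified and pause

    def pause(self, developer: Developer) -> None:
        developer.paused = True

    def notify_others(self, failing_owner: Developer, model_id: str) -> None:
        for dev in self.developers:
            if dev is not failing_owner:
                print(f"{dev.name}: notified that {model_id} failed evaluations; pausing related R&D")
                dev.paused = True

    def resume_if_safe(self, developer: Developer, safety_threshold_met: bool) -> None:
        """Steps 4-5: after analysis and precautions, resume only once a safety threshold is reached."""
        if developer.paused and safety_threshold_met:
            developer.paused = False


# Example usage with two hypothetical developers:
scheme = CoordinationScheme(developers=[Developer("LabA"), Developer("LabB")])
lab_a = scheme.developers[0]
scheme.evaluate("model-x", dangerous_capability_found=True, owner=lab_a)
scheme.resume_if_safe(lab_a, safety_threshold_met=False)  # stays paused until the threshold is met
```

In practice, the evaluation outcome and the resumption condition would be determined by the dangerous-capability evaluations and safety analyses the paper describes, not by simple boolean flags.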
Authors: Jide Alaga, Jonas Schuett