Quantifying stability of non-power-seeking in artificial agents (2401.03529v1)

Published 7 Jan 2024 in cs.AI

Abstract: We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment--we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking--an agent which seeks power is not safe. In particular, we focus on a crucial type of power-seeking: resisting shutdown. We model agents as policies for Markov decision processes, and show (in two cases of interest) that not resisting shutdown is "stable": if an MDP has certain policies which don't avoid shutdown, the corresponding policies for a similar MDP also don't avoid shutdown. We also show that there are natural cases where safety is not stable--arbitrarily small perturbations may result in policies which never shut down. In our first case of interest--near-optimal policies--we use a bisimulation metric on MDPs to prove that small perturbations won't make the agent take longer to shut down. Our second case of interest is policies for MDPs satisfying certain constraints which hold for various models (including LLMs). Here, we demonstrate a quantitative bound on how fast the probability of not shutting down can increase: by defining a metric on MDPs; proving that the probability of not shutting down, as a function on MDPs, is lower semicontinuous; and bounding how quickly this function decreases.
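
The abstract treats the probability of not shutting down as a function on a space of MDPs and asks how it behaves under small perturbations. As a rough illustration of that idea (not the paper's construction), the Python sketch below fixes a policy in a 3-state MDP with an absorbing shutdown state, so the dynamics reduce to a Markov chain, and compares the probability of avoiding shutdown before and after a small perturbation of the transition kernel. The particular chain, the horizon values, the perturbation size `eps`, and the helper `prob_not_shutdown` are all illustrative assumptions, not taken from the paper.

```python
# Minimal sketch, assuming a fixed policy so that "MDP + policy" reduces to a
# Markov chain. State 2 is an absorbing shutdown state; we estimate how the
# probability of NOT shutting down within a horizon changes when the
# transition kernel is perturbed slightly.
import numpy as np

def prob_not_shutdown(P: np.ndarray, start: int, shutdown: int, horizon: int) -> float:
    """Probability of avoiding `shutdown` for `horizon` steps under transition
    matrix P (the policy is already baked into P)."""
    dist = np.zeros(P.shape[0])
    dist[start] = 1.0
    for _ in range(horizon):
        dist = dist @ P  # propagate the state distribution one step
    # Shutdown is absorbing, so the mass at `shutdown` after `horizon` steps
    # is exactly the probability of having shut down by then.
    return 1.0 - dist[shutdown]

# Base chain: from state 0 the agent tends to advance to state 1, then shut down.
P = np.array([
    [0.1, 0.9, 0.0],   # state 0: usually advance to state 1
    [0.0, 0.2, 0.8],   # state 1: usually shut down
    [0.0, 0.0, 1.0],   # state 2: shutdown (absorbing)
])

# Perturbed chain: a small amount eps of shutdown mass is redirected to state 1.
eps = 0.05
P_eps = P.copy()
P_eps[1] = [0.0, 0.2 + eps, 0.8 - eps]

for T in (5, 20, 50):
    p0 = prob_not_shutdown(P, start=0, shutdown=2, horizon=T)
    p1 = prob_not_shutdown(P_eps, start=0, shutdown=2, horizon=T)
    print(f"T={T:3d}  base={p0:.4f}  perturbed={p1:.4f}  gap={p1 - p0:.4f}")
```

With the policy held fixed as above, the gap in not-shutting-down probability shrinks with the perturbation size, in the spirit of the quantitative bound the abstract describes. The abstract's negative result (arbitrarily small perturbations yielding policies that never shut down) reads as a statement about which policies correspond to the perturbed MDP, not about a fixed policy's kernel, so a fixed-policy sketch like this one deliberately does not capture it.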

