Scheming AIs: Will AIs fake alignment during training in order to get power? (2311.08379v3)
Abstract: This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment"). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is roughly 25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming -- and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations towards such a goal as an easy way of improving performance. What's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.