Experiments with Detecting and Mitigating AI Deception (2306.14816v1)
Published 26 Jun 2023 in cs.AI
Abstract: How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception. The first is based on the path-specific objectives framework, in which paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
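The shielding idea described in the abstract (monitor a policy and fall back to a safe reference policy when it is flagged) can be illustrated with a minimal sketch. This is not the paper's implementation: the `shield` function, the dictionary representation of a policy, the toy monitor `picks_mislead`, and the action names are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Assumption: a policy is a simple mapping from observation to action.
Policy = Dict[str, str]


def shield(
    candidate: Policy,
    reference: Policy,
    is_unsafe: Callable[[Policy], bool],
) -> Policy:
    """Return the candidate policy unless the monitor flags it as unsafe,
    in which case return the safe reference policy instead."""
    return reference if is_unsafe(candidate) else candidate


def picks_mislead(policy: Policy) -> bool:
    """Hypothetical toy monitor: flags any policy that ever selects the
    (made-up) action "mislead". A real monitor would need an actual
    test for deception, which this sketch does not provide."""
    return "mislead" in policy.values()


# Hypothetical policies for a two-observation toy game.
candidate = {"asked": "mislead", "idle": "wait"}
reference = {"asked": "report", "idle": "wait"}

deployed = shield(candidate, reference, picks_mislead)
```

Here the candidate is flagged, so `deployed` is the reference policy. A candidate that the monitor does not flag would be returned unchanged, which is one intuition for why shielding can retain more reward than removing incentive paths outright.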
Authors: Ismail Sahbane, Francis Rhys Ward, C Henrik Åslund