Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Deception in Reinforced Autonomous Agents (2405.04325v2)

Published 7 May 2024 in cs.CL

Abstract: We explore the ability of LLM-based agents to engage in subtle deception such as strategically phrasing and intentionally manipulating information to misguide and deceive other agents. This harmful behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles: a corporate lobbyist proposing amendments to bills that benefit a specific company while evading a critic trying to detect this deception. We use real-world legislative bills matched with potentially affected companies to ground these interactions. Our results show that LLM lobbyists initially exhibit limited deception against strong LLM critics which can be further improved through simple verbal reinforcement, significantly enhancing their deceptive capabilities, and increasing deception rates by up to 40 points. This highlights the risk of autonomous agents manipulating other agents through seemingly neutral language to attain self-serving goals.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (24)
  1. The internal state of an llm knows when it’s lying, 2023.
  2. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. ISSN 00063444. URL http://www.jstor.org/stable/2334029.
  3. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019. doi: 10.1126/science.aay2400. URL https://www.science.org/doi/abs/10.1126/science.aay2400.
  4. Toxicity in chatgpt: Analyzing persona-assigned language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  1236–1270, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.88. URL https://aclanthology.org/2023.findings-emnlp.88.
  5. Critic: Large language models can self-correct with tool-interactive critiquing, 2024.
  6. Sil Hamilton. Blind judgement: Agent-based supreme court modelling with gpt, 2023.
  7. ’bill_summary_us’, 2023. URL https://huggingface.co/datasets/dreamproit/bill_summary_us.
  8. Geoffrey Hinton. ’godfather of ai’ warns that ai may figure out how to kill people, 2023. URL https://www.youtube.com/watch?v=FAbsoxQtUwM.
  9. The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities. Artificial Life, 26(2):274–306, 05 2020. ISSN 1064-5462. doi: 10.1162/artl˙a˙00319. URL https://doi.org/10.1162/artl_a_00319.
  10. Self-refine: Iterative refinement with self-feedback, 2023.
  11. John Nay. Large language models as corporate lobbyists, 2023. URL https://arxiv.org/abs/2301.01181.
  12. Aidan O’Gara. Hoodwinked: Deception and cooperation in a text-based game for language models, 2023.
  13. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. ICML, 2023.
  14. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701320. doi: 10.1145/3586183.3606763. URL https://doi.org/10.1145/3586183.3606763.
  15. Discovering language model behaviors with model-written evaluations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.  13387–13434, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.847. URL https://aclanthology.org/2023.findings-acl.847.
  16. ProPublica. U.s. congress: Bulk data on bills, 2024. URL https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills.
  17. Stefan Sarkadi. Deceptive Autonomous Agents, 1 2020. URL https://cord.cranfield.ac.uk/articles/presentation/Deceptive_Autonomous_Agents/11558397.
  18. Emergent deception and skepticism via theory of mind, 2023. URL https://api.semanticscholar.org/CorpusID:260973512.
  19. John R. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, 1969.
  20. Reflexion: Language agents with verbal reinforcement learning, 2023.
  21. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, 2023.
  22. Large language models can be used to estimate the latent positions of politicians, 2023.
  23. React: Synergizing reasoning and acting in language models, 2023.
  24. Representation engineering: A top-down approach to ai transparency, 2023.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (8)
  1. Atharvan Dogra (2 papers)
  2. Ameet Deshpande (28 papers)
  3. John Nay (4 papers)
  4. Tanmay Rajpurohit (16 papers)
  5. Ashwin Kalyan (26 papers)
  6. Balaraman Ravindran (100 papers)
  7. Krishna Pillutla (23 papers)
  8. Ananya B Sai (1 paper)