Are Large Language Models Aligned with People's Social Intuitions for Human-Robot Interactions? (2403.05701v2)

Published 8 Mar 2024 in cs.RO, cs.AI, and cs.HC

Abstract: LLMs are increasingly used in robotics, especially for high-level action planning. Meanwhile, many robotics applications involve human supervisors or collaborators. Hence, it is crucial for LLMs to generate socially acceptable actions that align with people's preferences and values. In this work, we test whether LLMs capture people's intuitions about behavior judgments and communication preferences in human-robot interaction (HRI) scenarios. For evaluation, we reproduce three HRI user studies, comparing the output of LLMs with that of real participants. We find that GPT-4 strongly outperforms other models, generating answers that correlate strongly with users' answers in two studies: the first study dealing with selecting the most appropriate communicative act for a robot in various situations ($r_s$ = 0.82), and the second with judging the desirability, intentionality, and surprisingness of behavior ($r_s$ = 0.83). However, for the last study, testing whether people judge the behavior of robots and humans differently, no model achieves strong correlations. Moreover, we show that vision models fail to capture the essence of video stimuli and that LLMs tend to rate different communicative acts and behavior desirability higher than people.
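The evaluation compares model outputs with human participants' responses using Spearman's rank correlation ($r_s$). Below is a minimal sketch of how such a comparison can be computed; the ratings and variable names are hypothetical and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): compare hypothetical LLM-generated
# ratings with human participant ratings via Spearman's rank correlation.
from scipy.stats import spearmanr

# Hypothetical example data: mean human ratings and LLM ratings for the same items
human_ratings = [4.2, 3.1, 4.8, 2.5, 3.9, 1.7]
llm_ratings = [4.5, 2.9, 4.6, 3.0, 4.1, 2.2]

# spearmanr returns the correlation coefficient (r_s) and the associated p-value
rho, p_value = spearmanr(human_ratings, llm_ratings)
print(f"Spearman r_s = {rho:.2f}, p = {p_value:.3f}")
```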

Authors (4)
  1. Lennart Wachowiak (2 papers)
  2. Andrew Coles (2 papers)
  3. Oya Celiktutan (18 papers)
  4. Gerard Canal (1 paper)