Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task (2404.08424v2)

Published 12 Apr 2024 in cs.RO, cs.AI, and cs.HC

Abstract: Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using LLMs to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot.
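As an illustration of the pipeline described in the abstract, the sketch below (Python, not the authors' code) shows one way that non-verbal cues, the environment state, and the user's utterance could be serialized into a single prompt for an LLM to infer the intended object and category. The names ObservedCues, build_intention_prompt, and predict_intention are hypothetical, and the OpenAI-style chat client and model name are assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' implementation): serialize multimodal
# cues into a text prompt and query a chat-style LLM for the user's
# intended object and category. All names are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class ObservedCues:
    """Snapshot of perception outputs at one decision point."""
    gesture: str                      # e.g. "pointing at the banana"
    body_pose: str                    # e.g. "leaning toward the table"
    facial_expression: str            # e.g. "smiling"
    utterance: str                    # transcribed speech, may be empty
    objects_on_table: list = field(default_factory=list)


def build_intention_prompt(cues: ObservedCues) -> str:
    """Combine verbal and non-verbal cues into a single prompt string."""
    return (
        "You are assisting a robot in a collaborative object "
        "categorization task.\n"
        f"Objects on the table: {', '.join(cues.objects_on_table)}\n"
        f"User gesture: {cues.gesture}\n"
        f"User body pose: {cues.body_pose}\n"
        f"User facial expression: {cues.facial_expression}\n"
        f"User said: \"{cues.utterance}\"\n"
        "Which object does the user most likely want the robot to handle, "
        "and into which category should it be sorted? "
        "Answer with the object name and category only."
    )


def predict_intention(llm_client, cues: ObservedCues) -> str:
    """Query an OpenAI-compatible chat endpoint (model name is a placeholder)."""
    response = llm_client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper evaluates five different LLMs
        messages=[{"role": "user", "content": build_intention_prompt(cues)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```

In practice, the cue strings would come from upstream perception modules (gesture, pose, and face recognition plus speech transcription), and the LLM's answer would be parsed and passed to the robot's manipulation stack.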

Authors (3)
  1. Hassan Ali (24 papers)
  2. Philipp Allgeuer (33 papers)
  3. Stefan Wermter (157 papers)
Citations (1)