Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task (2404.08424v2)
Abstract: Human intention-based systems enable robots to perceive and interpret user actions, interact with humans, and adapt to their behavior proactively. Intention prediction is therefore pivotal for natural interaction with social robots in human-designed environments. In this paper, we examine the use of Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates non-verbal user cues, such as hand gestures, body poses, and facial expressions, with environment states and verbal user cues to predict user intentions within a hierarchical architecture. Our evaluation of five LLMs shows their potential to reason about verbal and non-verbal user cues, leveraging their context understanding and real-world knowledge to support intention prediction while collaborating with a social robot on a task.
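To make the idea of prompt-based cue fusion concrete, the following minimal Python sketch (not the paper's implementation; the names `SceneState`, `build_intention_prompt`, and `query_llm` are hypothetical placeholders) shows how non-verbal cues, environment state, and a transcribed utterance could be serialized into a single prompt that an LLM answers with a predicted user intention.

```python
# Minimal sketch (illustrative only): fusing multimodal cues into an LLM prompt
# for intention prediction in a collaborative object categorization task.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SceneState:
    """Perceived cues; field names and values are illustrative placeholders."""
    objects_on_table: List[str] = field(default_factory=list)
    hand_gesture: str = "none"          # e.g. "pointing_at_apple"
    body_pose: str = "neutral"          # e.g. "leaning_forward"
    facial_expression: str = "neutral"  # e.g. "smiling"
    utterance: str = ""                 # transcribed user speech


def build_intention_prompt(state: SceneState) -> str:
    """Combine verbal and non-verbal cues with the environment state into one prompt."""
    return (
        "You assist a robot that sorts objects together with a human.\n"
        f"Objects on the table: {', '.join(state.objects_on_table) or 'none'}.\n"
        f"User gesture: {state.hand_gesture}; pose: {state.body_pose}; "
        f"expression: {state.facial_expression}.\n"
        f"User said: \"{state.utterance}\"\n"
        "Which object does the user intend the robot to pick next, and into "
        "which category should it be placed? Answer as: object=<name>, category=<bin>."
    )


def query_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend (API-based or local model)."""
    raise NotImplementedError


if __name__ == "__main__":
    state = SceneState(
        objects_on_table=["apple", "orange", "soda can"],
        hand_gesture="pointing_at_apple",
        utterance="That one goes with the fruit, please.",
    )
    print(build_intention_prompt(state))
```

In such a setup, the structured answer (object and category) can then be parsed and handed to the robot's manipulation pipeline; the sketch deliberately leaves the LLM backend abstract.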
Authors: Hassan Ali, Philipp Allgeuer, Stefan Wermter