Beyond Text: Utilizing Vocal Cues to Improve Decision Making in LLMs for Robot Navigation Tasks (2402.03494v3)
Abstract: While LLMs excel at processing text in human conversations, they struggle with the nuances of verbal instructions in scenarios like social navigation, where ambiguity and uncertainty can erode trust in robotic and other AI systems. We can address this shortcoming by moving beyond text and additionally focusing on the paralinguistic features of these audio responses. These features are the aspects of spoken communication that do not involve the literal wording (lexical content) but convey meaning and nuance through how something is said. We present Beyond Text: an approach that improves LLM decision-making by integrating audio transcription with a subset of these features, namely those that focus on affect and are more relevant in human-robot conversations. This approach not only achieves a 70.26% winning rate, outperforming existing LLMs by 22.16% to 48.30% (gemini-1.5-pro and gpt-3.5 respectively), but also enhances robustness against token manipulation adversarial attacks, as shown by a 22.44% smaller decrease in winning rate than the text-only LLM. Beyond Text marks an advancement in social robot navigation and broader human-robot interaction, seamlessly integrating text-based guidance with human-audio-informed LLMs.