SGGNet$^2$: Speech-Scene Graph Grounding Network for Speech-guided Navigation (2307.07468v2)
Abstract: Spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately grounding spoken utterances poses a significant challenge due to the acoustic variability in speakers' voices and environmental noise. In this work, we propose a novel speech-scene graph grounding network (SGGNet$^2$) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. To incorporate this acoustic similarity, we extend our previous grounding model, the scene-graph-based grounding network (SGGNet), with the ASR model from NVIDIA NeMo. We accomplish this by feeding the latent vector of speech pronunciations into the BERT-based grounding network within SGGNet. We evaluate the effectiveness of using latent vectors of speech commands for grounding through qualitative and quantitative studies. We also demonstrate the capability of SGGNet$^2$ in a speech-based navigation task using a real quadruped robot, RBQ-3, from Rainbow Robotics.
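The abstract describes feeding the ASR model's latent speech representation into a BERT-based grounding network. The PyTorch sketch below illustrates one way such a speech-text fusion might look; it is not the authors' implementation. The class name, the mean-pooling of ASR frames, the 512-dimensional ASR latents, and the 32 candidate scene-graph nodes are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): frame-level latents from an
# ASR encoder are pooled, projected, and concatenated with a BERT sentence embedding
# before scoring candidate scene-graph nodes.
import torch
import torch.nn as nn
from transformers import BertModel


class SpeechTextGrounder(nn.Module):
    def __init__(self, asr_dim=512, num_nodes=32):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.asr_proj = nn.Linear(asr_dim, hidden)       # map ASR latents to BERT width
        self.scorer = nn.Linear(2 * hidden, num_nodes)   # stand-in for the grounding head

    def forward(self, input_ids, attention_mask, asr_latent):
        # Text pathway: pooled representation of the (possibly misrecognized) transcript.
        text = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        # Speech pathway: mean-pool the ASR encoder's frame-level latent vectors.
        speech = self.asr_proj(asr_latent.mean(dim=1))
        # Fuse both modalities and score candidate goal nodes.
        return self.scorer(torch.cat([text, speech], dim=-1))


if __name__ == "__main__":
    model = SpeechTextGrounder()
    ids = torch.randint(0, model.bert.config.vocab_size, (1, 8))   # dummy token ids
    mask = torch.ones_like(ids)
    latents = torch.randn(1, 50, 512)                              # dummy ASR encoder output
    print(model(ids, mask, latents).shape)                         # torch.Size([1, 32])
```

In the paper's setting, the `asr_latent` tensor would presumably come from a pretrained speech encoder (e.g., an NVIDIA NeMo Conformer model), and the simple linear scorer would be replaced by SGGNet's scene-graph attention head that ranks nodes of the navigation scene graph.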