CognitiveDog: Large Multimodal Model Based System to Translate Vision and Language into Action of Quadruped Robot (2401.09388v1)
Abstract: This paper introduces CognitiveDog, a pioneering development of a quadruped robot with a Large Multimodal Model (LMM) that can not only communicate with humans verbally but also physically interact with the environment through object manipulation. The system was realized on a Unitree Go1 robot dog equipped with a custom gripper and demonstrated autonomous decision-making, independently determining the most appropriate actions and object interactions to fulfill user-defined tasks. These tasks do not necessarily include direct instructions, requiring the robot to comprehend and execute them from natural language input and environmental cues. The paper details the system design, dataset characteristics, and software architecture. Key to this development is the robot's proficiency in navigating space using Visual-SLAM, manipulating and transporting objects, and providing natural language commentary during task execution. Experimental results highlight the robot's task comprehension and adaptability, underscoring its potential in real-world applications. The dataset used to fine-tune the robot dog's behavior generation model is available at: huggingface.co/datasets/ArtemLykov/CognitiveDog_dataset
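The released fine-tuning dataset can be inspected directly from the link above. Below is a minimal sketch using the standard Hugging Face `datasets` library; the dataset ID is taken from the URL in the abstract, while the split names and record fields are assumptions to be checked against the actual release.

```python
# Minimal sketch: inspect the CognitiveDog fine-tuning dataset.
# The dataset ID comes from the link in the abstract; split names and
# field names are NOT documented here and may differ in the release.
from datasets import load_dataset

ds = load_dataset("ArtemLykov/CognitiveDog_dataset")

# Print the available splits, their sizes, and the schema of the first
# record so the actual field names can be verified before further use.
for split_name, split in ds.items():
    print(f"split: {split_name}, examples: {len(split)}")
    print(split[0])
    break
```

Printing one record first is a deliberate choice: it reveals how task instructions, environmental observations, and generated robot behaviors are paired in the data without assuming a particular schema.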