QUAR-VLA: Vision-Language-Action Model for Quadruped Robots (2312.14457v6)
Abstract: A key manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control compartmentalize perception, planning, and decision-making; this simplifies system design but limits the synergy between information streams and hinders seamless autonomous reasoning, decision-making, and action execution. To address these limitations, this paper introduces a novel paradigm, Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA), which tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making and thereby elevating the overall intelligence of the robot. Within this framework, a central challenge is aligning fine-grained instructions with visual perception, i.e., ensuring that the robot accurately interprets detailed instructions and acts in harmony with its visual observations. We therefore propose the QUAdruped Robotic Transformer (QUART), a family of VLA models that take visual information and instructions from diverse modalities as input and generate executable actions for real-world robots, and present the QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset comprising navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation (4,000 trials) shows that our approach yields performant robotic policies and endows QUART with a range of emergent capabilities.
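The abstract describes QUART as emitting executable actions from visual and language inputs. A common mechanism in such VLA models (e.g., RT-2, which the references also cover) is to discretize each continuous action dimension into a fixed number of bins so the transformer can predict actions as tokens. The sketch below illustrates that discretization step only; the bin count (256) and the use of per-dimension bounds are assumptions for illustration, not the paper's confirmed implementation.

```python
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer token in [0, n_bins - 1].

    `low`/`high` are per-dimension bounds; values outside are clipped first.
    """
    action = np.clip(action, low, high)
    norm = (action - low) / (high - low)            # scale each dim to [0, 1]
    # Scale to bin indices; the min() guards the edge case norm == 1.0.
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

def undiscretize_action(tokens, low, high, n_bins=256):
    """Recover a continuous action from tokens using bin centers."""
    norm = (tokens + 0.5) / n_bins
    return low + norm * (high - low)
```

A round trip through `discretize_action` and `undiscretize_action` loses at most half a bin width per dimension, which is the usual trade-off for making actions predictable as a token sequence.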