MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models (2401.14502v1)
Abstract: Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial-resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal-resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in 3 domains (coarse, precise, and dynamic manipulation tasks), we show that our approach significantly improves (2x on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
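The core idea of combining a large, slow network over global observations with a small, fast network over local feedback can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact architecture: the class name `MultiResolutionPolicy`, the dimensions, and the linear layers standing in for the pretrained vision-language model and the small non-pretrained model are all assumptions for illustration.

```python
import numpy as np

class MultiResolutionPolicy:
    """Illustrative sketch of a multi-resolution controller: a large 'slow'
    pathway produces global features at low frequency, while a small 'fast'
    pathway processes local feedback at every control step. Random linear
    maps stand in for the real networks."""

    def __init__(self, slow_period=10, feat_dim=8, act_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.slow_period = slow_period                 # refresh slow features every N steps
        self.W_slow = rng.normal(size=(16, feat_dim))  # stand-in for pretrained VLM
        self.W_fast = rng.normal(size=(6, feat_dim))   # stand-in for small fast model
        self.W_act = rng.normal(size=(2 * feat_dim, act_dim))
        self._cached_slow = np.zeros(feat_dim)
        self._step = 0

    def act(self, global_obs, local_obs):
        # Recompute expensive global features only at low frequency; reuse the cache otherwise.
        if self._step % self.slow_period == 0:
            self._cached_slow = np.tanh(global_obs @ self.W_slow)
        # The small fast pathway runs at every control step (high frequency).
        fast = np.tanh(local_obs @ self.W_fast)
        self._step += 1
        return np.concatenate([self._cached_slow, fast]) @ self.W_act

policy = MultiResolutionPolicy()
action = policy.act(np.ones(16), np.ones(6))
print(action.shape)  # (4,)
```

Because the slow pathway's output is cached between refreshes, the per-step cost is dominated by the small fast network, which is what makes high-frequency real-time control feasible despite the large pretrained backbone.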