MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models (2401.14502v1)

Published 25 Jan 2024 in cs.RO, cs.CV, and cs.LG

Abstract: Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial-resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal-resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models that adapt to high-frequency local feedback. Through extensive experiments in three domains (coarse, precise, and dynamic manipulation tasks), we show that our approach significantly improves (2X on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
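
The core idea described in the abstract is a two-rate policy: a large pretrained vision-language model produces global scene features at low frequency, while a small, fast network consumes high-frequency local feedback (e.g., a wrist camera or force-torque signal) and is conditioned on the cached global features to emit actions at control rate. Below is a minimal PyTorch sketch of this pattern; the module sizes, the 10:1 update ratio, and the FiLM-style conditioning are illustrative assumptions, not the authors' MResT implementation.

```python
# Minimal sketch of a two-rate, multi-resolution control policy.
# Illustrates the idea in the abstract; all module names, sizes, and the
# FiLM-style fusion below are assumptions made for this example.
import torch
import torch.nn as nn


class SlowGlobalEncoder(nn.Module):
    """Stand-in for a large pretrained vision-language model run at low
    frequency on global camera images (language conditioning omitted)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Placeholder backbone; in practice this would be a frozen or
        # lightly adapted pretrained model.
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 512),
            nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    @torch.no_grad()  # treated as frozen at deployment time
    def forward(self, global_image: torch.Tensor) -> torch.Tensor:
        return self.backbone(global_image)


class FastLocalPolicy(nn.Module):
    """Small non-pretrained network run at high frequency on local
    observations, conditioned on cached global features via a FiLM-style
    (scale-and-shift) modulation."""

    def __init__(self, local_dim: int = 32, feat_dim: int = 256,
                 action_dim: int = 7):
        super().__init__()
        self.local_enc = nn.Sequential(nn.Linear(local_dim, 128), nn.ReLU())
        self.film = nn.Linear(feat_dim, 2 * 128)  # predicts scale and shift
        self.head = nn.Linear(128, action_dim)

    def forward(self, local_obs: torch.Tensor,
                global_feat: torch.Tensor) -> torch.Tensor:
        h = self.local_enc(local_obs)
        scale, shift = self.film(global_feat).chunk(2, dim=-1)
        return self.head(h * (1.0 + scale) + shift)


# Two-rate control loop: refresh the expensive global features only
# every K fast steps, but emit an action on every fast step.
slow, fast = SlowGlobalEncoder(), FastLocalPolicy()
K = 10  # fast steps per slow update (assumed ratio)
global_feat = None
for t in range(100):
    if t % K == 0:
        global_image = torch.randn(1, 3, 64, 64)  # stand-in global view
        global_feat = slow(global_image)
    local_obs = torch.randn(1, 32)                # stand-in local feedback
    action = fast(local_obs, global_feat)         # action at control rate
```

Decoupling the rates this way lets the slow model amortize its cost over many control steps while the fast network handles reactivity, which is the abstract's stated motivation for mixing networks of varying capacities.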

