
Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models (2211.11736v3)

Published 21 Nov 2022 in cs.RO, cs.AI, and cs.LG

Abstract: In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that was either collected with specific tasks in mind or expensively re-labelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like CLIP or ViLD have been applied to robotics for learning representations and scene descriptors. Can these pretrained models serve as automatic labelers for robot data, effectively importing Internet-scale knowledge into existing datasets to make them useful even for tasks that are not reflected in their ground truth annotations? To accomplish this, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data and then train language-conditioned policies on the augmented datasets. This method enables cheaper acquisition of useful language descriptions compared to expensive human labels, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain where 96.5% of the 80,000 demonstrations do not contain crowd-sourced language annotations. DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
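To make the relabeling idea concrete, below is a minimal sketch of CLIP-based instruction augmentation in the spirit of DIAL: score a pool of candidate instructions against a frame from an unlabelled episode and keep the top-scoring ones as extra language labels. The function name `relabel_episode`, the candidate pool, the single-frame scoring, and the checkpoint choice are illustrative assumptions, not the paper's implementation; the paper also adapts CLIP to robot data using the small crowd-sourced annotated subset before scoring, which this sketch omits.

```python
# Sketch of CLIP-based instruction relabeling (assumptions noted in the text above).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP; DIAL reports adapting CLIP to robot data first.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def relabel_episode(frame: Image.Image,
                    candidate_instructions: list[str],
                    top_k: int = 3) -> list[str]:
    """Score candidate instructions against one episode frame; keep the top-k."""
    inputs = processor(text=candidate_instructions, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (1, num_candidates): image-text similarity.
        logits = model(**inputs).logits_per_image[0]
    top = logits.topk(min(top_k, len(candidate_instructions))).indices
    return [candidate_instructions[i] for i in top.tolist()]
```

The selected instructions would then be attached to the episode as additional language annotations, and the augmented (episode, instruction) pairs mixed with the originally annotated data to train a language-conditioned imitation policy.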

Authors (8)
  1. Ted Xiao (40 papers)
  2. Harris Chan (13 papers)
  3. Pierre Sermanet (37 papers)
  4. Ayzaan Wahid (21 papers)
  5. Anthony Brohan (8 papers)
  6. Karol Hausman (56 papers)
  7. Sergey Levine (531 papers)
  8. Jonathan Tompson (49 papers)
Citations (54)