RT-H: Action Hierarchies Using Language (2403.01823v2)

Published 4 Mar 2024 in cs.RO and cs.AI

Abstract: Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in multi-task datasets. However, as tasks become more semantically diverse (e.g., "pick coke can" and "pour cup"), sharing data between tasks becomes harder, so learning to map high-level tasks to actions requires much more demonstration data. To bridge tasks and actions, our insight is to teach the robot the language of actions, describing low-level motions with more fine-grained phrases like "move arm forward". Predicting these language motions as an intermediate step between tasks and actions forces the policy to learn the shared structure of low-level motions across seemingly disparate tasks. Furthermore, a policy that is conditioned on language motions can easily be corrected during execution through human-specified language motions. This enables a new paradigm for flexible policies that can learn from human intervention in language. Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages. We show that RT-H leverages this language-action hierarchy to learn policies that are more robust and flexible by effectively tapping into multi-task datasets. We show that these policies not only allow for responding to language interventions, but can also learn from such interventions and outperform methods that learn from teleoperated interventions. Our website and videos are found at https://rt-hierarchy.github.io.

Introducing RT-H: A Novel Framework for Robotic Policies Using Language-Conditioned Action Hierarchies

Leveraging Language for Robotic Task Learning

The ability to understand and perform a wide range of tasks with minimal supervision is a coveted goal in robotics. A promising approach is to teach robots to understand tasks through the lens of natural language. Recent advances instruct robots with high-level task descriptions, benefiting from the inherent structure language offers. However, as tasks grow more semantically diverse, they share less data with one another, so directly mapping task descriptions to actions becomes less effective and requires substantially more demonstration data.

RT-H: Bridging the Gap with Language Motions

To address this challenge, we introduce a novel framework, RT-H (Robot Transformer with Action Hierarchies), which enhances the robot's understanding of tasks by incorporating an intermediate layer of "language motions." These fine-grained phrases, like “move arm forward” or “close gripper,” serve as stepping stones between high-level tasks and the actual robot actions, facilitating a more robust learning process.

The RT-H framework operates through two main phases, illustrated by the code sketch that follows the list:

  1. Language Motion Prediction: The model predicts the next language motion based on the current visual observations and the high-level task description.
  2. Action Prediction: Conditioned on the visual context, the high-level task, and the inferred language motion, the model predicts the precise actions to execute.
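
The sketch below illustrates this two-phase inference loop in Python. It assumes a hypothetical `vla_model.query(image, prompt)` interface standing in for RT-H's vision-language-action backbone; the prompt wording, helper names, and action format are illustrative assumptions, not the paper's exact API.

```python
from typing import Optional

def rt_h_step(vla_model, image, task: str,
              correction: Optional[str] = None):
    """One control step: predict a language motion, then an action."""
    # Phase 1: language motion prediction, conditioned on the current
    # image and the high-level task. A human-supplied correction, if
    # given, replaces the model's prediction.
    if correction is not None:
        language_motion = correction  # e.g. "move arm forward"
    else:
        language_motion = vla_model.query(
            image=image,
            prompt=f"What motion should the robot do to {task}?",
        )

    # Phase 2: action prediction, conditioned on the image, the task,
    # AND the language motion inferred in phase 1.
    action = vla_model.query(
        image=image,
        prompt=f"What action should the robot do to {task}, "
               f"given the motion: {language_motion}?",
    )
    return language_motion, action
```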

This hierarchical, language-grounded approach not only improves performance on diverse tasks by exploiting the low-level motions shared across them, but also enables more intuitive human-robot interaction: a human can correct or guide the robot with language motions during execution, providing a pathway for rapid learning and adaptability (see the sketch below).
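
To make the intervention paradigm concrete, here is a hedged sketch of an episode loop in which a human can override the predicted language motion mid-execution. `get_camera_image`, `human_override`, and `robot.execute` are hypothetical helpers, and `rt_h_step` is the function sketched above.

```python
def run_episode(vla_model, robot, task: str, max_steps: int = 100):
    intervention_data = []  # corrected steps, usable for later fine-tuning
    for _ in range(max_steps):
        image = get_camera_image()
        motion, action = rt_h_step(vla_model, image, task)

        # A supervisor may replace the predicted language motion with one
        # of their own (e.g. "move arm to the right"); the action head
        # then executes the corrected motion instead.
        corrected = human_override(motion)  # returns None if no correction
        if corrected is not None:
            motion, action = rt_h_step(vla_model, image, task,
                                       correction=corrected)
            # Corrected steps become labeled (observation, motion, action)
            # tuples, so the policy can learn from language interventions
            # rather than from raw teleoperated actions.
            intervention_data.append((image, task, motion, action))

        robot.execute(action)
    return intervention_data
```

Because corrections are expressed in the same language-motion space the policy already predicts, each correction doubles as a training label, which is how RT-H can learn from interventions without additional teleoperation.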

Robust Experimental Validation

RT-H's efficacy is demonstrated through rigorous experimentation. On a composite dataset spanning multiple tasks, the framework substantially improves policy performance, outperforming the flat RT-2 model by a notable margin. RT-H also handles language motions flexibly and context-sensitively, responding to corrections and adapting its behavior to the task and scene. This adaptability extends to unseen tasks and conditions, where RT-H achieves promising success rates with minimal human intervention, highlighting its potential for generalization.

Future Directions: Beyond the Current State

Despite these achievements, RT-H's journey is far from complete. Future research directions include exploring varying levels of abstraction within the action hierarchy, extending the methodology to integrate multiple steps of action reasoning, and enhancing the model's ability to learn from human videos with actions described solely in language. Additionally, incorporating RT-H's compressed action space into reinforcement learning methods could pave the way for more efficient policy exploration and learning.

Conclusion

RT-H sets a new standard in robot learning, illustrating the profound impact of intertwining language with robotic action prediction. By fostering a deeper connection between language and actions, RT-H not only advances our ability to teach robots a diverse array of tasks but also enhances the intuitiveness and effectiveness of human-robot interaction. As research continues along this avenue, the future of robotic task learning looks increasingly promising.

Authors (9)
  1. Suneel Belkhale
  2. Tianli Ding
  3. Ted Xiao
  4. Pierre Sermanet
  5. Quan Vuong
  6. Jonathan Tompson
  7. Yevgen Chebotar
  8. Debidatta Dwibedi
  9. Dorsa Sadigh