
Yell At Your Robot: Improving On-the-Fly from Language Corrections (2403.12910v1)

Published 19 Mar 2024 in cs.RO, cs.AI, and cs.LG

Abstract: Hierarchical policies that combine language and low-level control have been shown to perform impressively long-horizon robotic tasks, by leveraging either zero-shot high-level planners like pretrained language and vision-language models (LLMs/VLMs) or models trained on annotated robotic demonstrations. However, for complex and dexterous skills, attaining high success rates on long-horizon tasks still represents a major challenge -- the longer the task is, the more likely it is that some stage will fail. Can humans help the robot to continuously improve its long-horizon task performance through intuitive and natural feedback? In this paper, we make the following observation: high-level policies that index into sufficiently rich and expressive low-level language-conditioned skills can be readily supervised with human feedback in the form of language corrections. We show that even fine-grained corrections, such as small movements ("move a bit to the left"), can be effectively incorporated into high-level policies, and that such corrections can be readily obtained from humans observing the robot and making occasional suggestions. This framework enables robots not only to rapidly adapt to real-time language feedback, but also to incorporate this feedback into an iterative training scheme that improves the high-level policy's ability to correct errors in both low-level execution and high-level decision-making purely from verbal feedback. Our evaluation on real hardware shows that this leads to significant performance improvement in long-horizon, dexterous manipulation tasks without the need for any additional teleoperation. Videos and code are available at https://yay-robot.github.io/.

Language-Driven Robot Learning and Adaptation: A New Framework for Improving Robotic Task Performance

Introduction

Robotics research has long pursued the capability for robots to perform complex tasks that involve multiple stages and precise maneuvers. Traditionally, the development of high-level policies for orchestrating such tasks has been hindered by the challenge of obtaining scalable, high-quality training data. In their recent contribution, Lucy Xiaoyang Shi et al. introduce a novel framework, Yell At Your Robot (YAY Robot), aimed at leveraging natural language as both a medium for human-robot interaction and a mechanism for learning. Their framework is particularly designed to improve robots' performance on long-horizon tasks through the incorporation of language corrections, enabling on-the-fly adaptation and continuous improvement based purely on verbal feedback.

Approach Overview

The paper proposes a hierarchical policy structure where a high-level policy generates language instructions interpreted and executed by a lower-level policy. This setup leverages the expressive power of natural language to bridge the gap between user expectations and robot actions. A key innovation of their approach is its capacity to harness verbal corrections from human observers to refine the robot's behavior in real-time and iteratively improve the high-level decision-making policy.
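The division of labor described above can be sketched in a few lines. This is a hypothetical illustration, not the released implementation: `HierarchicalController`, `toy_high_level`, and `toy_low_level` are invented names, and the two toy functions stand in for the learned networks.

```python
# Sketch of the hierarchical setup: a high-level policy emits a language
# instruction, a language-conditioned low-level policy turns it into motor
# commands, and a human correction, when present, overrides the high level.
from typing import Callable, Optional


class HierarchicalController:
    def __init__(
        self,
        high_level: Callable[[dict], str],       # obs -> language instruction
        low_level: Callable[[dict, str], list],  # (obs, instruction) -> action
    ):
        self.high_level = high_level
        self.low_level = low_level

    def step(self, obs: dict, correction: Optional[str] = None):
        # A verbal correction takes precedence over the learned planner.
        instruction = correction if correction is not None else self.high_level(obs)
        action = self.low_level(obs, instruction)
        return instruction, action


# Toy stand-ins for the learned policies.
def toy_high_level(obs: dict) -> str:
    return "pick up the bag"

def toy_low_level(obs: dict, instruction: str) -> list:
    # A real policy would decode joint targets from (image, instruction).
    return [0.0, 0.1] if "left" in instruction else [0.1, 0.0]

controller = HierarchicalController(toy_high_level, toy_low_level)
print(controller.step({}))                                       # autonomous step
print(controller.step({}, correction="move a bit to the left"))  # human override
```

The key property this structure gives the framework is that a correction is just another instruction: intervening requires no special machinery beyond swapping the high-level output.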

The efficacy of this framework is showcased in three bi-manual manipulation tasks: bag packing, trail mix preparation, and plate cleaning. These tasks are selected for their relevance to practical applications and their requirement for delicate manipulations and precise control.

Implementation Details

At the core of their system is a Language-Conditioned Behavior Cloning (LCBC) policy learning from a dataset annotated with verbal instructions. The high-level policy is responsible for generating these language instructions based on the robot's observations, while the low-level policy translates these instructions into actionable commands. Human-provided corrections directly intervene in the high-level policy's outputs, offering a straightforward path for real-time adjustments.
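The LCBC objective itself is ordinary behavior cloning with the instruction embedded alongside the observation. The sketch below is a minimal stand-in, not the paper's architecture: `embed_instruction` and `train_lcbc` are invented names, the "embedding" is a hash-seeded random vector, and the policy is linear where the paper uses learned vision and language encoders.

```python
# Minimal language-conditioned behavior cloning: regress demonstrated
# actions from (observation, instruction-embedding) pairs by SGD on MSE.
import zlib
import numpy as np

def embed_instruction(text: str, dim: int = 4) -> np.ndarray:
    """Toy deterministic embedding (stand-in for a language encoder)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.standard_normal(dim)

def train_lcbc(dataset, obs_dim, act_dim, lr=0.05, epochs=200):
    """Fit a linear policy a = W [obs; embed(instr)] by MSE regression."""
    emb_dim = 4
    W = np.zeros((act_dim, obs_dim + emb_dim))
    for _ in range(epochs):
        for obs, instr, action in dataset:
            x = np.concatenate([obs, embed_instruction(instr, emb_dim)])
            err = W @ x - action           # prediction error
            W -= lr * np.outer(err, x)     # gradient step on 0.5*||err||^2
    return W

def act(W, obs, instr):
    return W @ np.concatenate([obs, embed_instruction(instr)])

# Two instructions map the same observation to opposite motions.
obs = np.array([1.0, 0.0])
data = [
    (obs, "move left",  np.array([-1.0])),
    (obs, "move right", np.array([ 1.0])),
]
W = train_lcbc(data, obs_dim=2, act_dim=1)
print(act(W, obs, "move left"), act(W, obs, "move right"))  # near [-1.] and [1.]
```

The point of the toy example is that the same observation yields different actions purely because the conditioning instruction differs, which is what makes the high-level policy's language output an effective control knob.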

One of the noteworthy aspects of their implementation is its efficient data annotation, facilitated by a live-narration method in which operators speak the instructions while teleoperating the robot. This method not only increases the volume of obtainable data but also enriches the diversity of scenarios and corrections the robot can learn from.
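One plausible way to turn live narration into training labels is to let each timestamped utterance label every trajectory step up to the next utterance. The alignment below is an illustrative guess at that idea, with invented names; the paper transcribes speech automatically, whereas here the transcripts are given directly.

```python
# Align timestamped utterances with trajectory steps: each step inherits
# the most recent instruction spoken at or before its timestamp.
from bisect import bisect_right

def label_trajectory(step_times, utterances):
    """utterances: time-sorted list of (time, text). One label per step."""
    times = [t for t, _ in utterances]
    labels = []
    for t in step_times:
        i = bisect_right(times, t) - 1          # last utterance at or before t
        labels.append(utterances[i][1] if i >= 0 else None)
    return labels

# Steps at 1 Hz; the operator speaks at t=0 and t=3.
steps = [0.0, 1.0, 2.0, 3.0, 4.0]
speech = [(0.0, "open the ziploc bag"), (3.0, "insert the chip bag")]
print(label_trajectory(steps, speech))
# ['open the ziploc bag', 'open the ziploc bag', 'open the ziploc bag',
#  'insert the chip bag', 'insert the chip bag']
```

Labeling this way costs the operator nothing beyond talking, which is what makes the narration approach scale where post-hoc segmentation and annotation would not.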

Experimental Insights

The evaluation of YAY Robot on real-world tasks presented significant findings. With the inclusion of language corrections, task success rates saw improvements ranging from 15% to 50% across different task stages, underscoring the value of verbal feedback in enhancing robotic performance. Moreover, the iterative finetuning of the high-level policy with corrective feedback progressively reduced the necessity for human intervention.
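The iterative scheme can be caricatured as DAgger-style aggregation over language labels: corrections logged during deployment become supervision for the next round of high-level finetuning. Everything below is illustrative, assuming a trivial lookup-table "policy" and invented function names, to show the data flow rather than the paper's training procedure.

```python
# Corrections collected at deployment are aggregated into a buffer, then
# used to "finetune" the high-level policy (here, a plain dict).
def run_episode(policy, episode, buffer):
    """episode: list of (obs, correction_or_None). Logs correction labels."""
    interventions = 0
    for obs, correction in episode:
        if correction is not None:
            buffer.append((obs, correction))  # human label supersedes policy
            interventions += 1
    return interventions

def finetune(policy, buffer):
    """Adopt the corrected labels as the policy's new outputs."""
    for obs, instruction in buffer:
        policy[obs] = instruction
    return policy

policy = {"bag_open": "pick up the chips"}   # initially wrong for this stage
episode = [("bag_open", "move a bit to the left"), ("bag_open", None)]

buffer = []
n = run_episode(policy, episode, buffer)
policy = finetune(policy, buffer)
print(n, policy["bag_open"])  # 1 move a bit to the left
```

As the finetuned policy absorbs past corrections, the same mistakes recur less often, which is the mechanism behind the observed decline in human intervention over rounds.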

Comparative analysis against non-hierarchical imitation learning methods demonstrated the superiority of the hierarchical approach, particularly in handling complex tasks with multiple stages and potential points of failure.

Future Directions and Limitations

While the framework showcases promising results, the reliance on a sophisticated low-level policy capable of interpreting a wide range of language instructions underscores a notable limitation. Future research directions may include enhancing the flexibility and robustness of the low-level policy and exploring the integration of non-verbal communication forms such as gestures for richer human-robot interactions.

Final Thoughts

YAY Robot represents a significant step towards more interactive and adaptable robotic systems, where natural language serves as the bridge between human intuition and robotic action. Through innovative data annotation techniques and hierarchical policy design, this work paves the way for robots to not only perform complex tasks more effectively but also evolve through interaction with their human users.

Authors
  1. Lucy Xiaoyang Shi
  2. Zheyuan Hu
  3. Tony Z. Zhao
  4. Archit Sharma
  5. Karl Pertsch
  6. Jianlan Luo
  7. Sergey Levine
  8. Chelsea Finn