
RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches (2403.02709v1)

Published 5 Mar 2024 in cs.RO

Abstract: Natural language and images are commonly used as goal representations in goal-conditioned imitation learning (IL). However, natural language can be ambiguous and images can be over-specified. In this work, we propose hand-drawn sketches as a modality for goal specification in visual imitation learning. Sketches are easy for users to provide on the fly like language, but similar to images they can also help a downstream policy to be spatially aware and even go beyond images to disambiguate task-relevant from task-irrelevant objects. We present RT-Sketch, a goal-conditioned policy for manipulation that takes a hand-drawn sketch of the desired scene as input, and outputs actions. We train RT-Sketch on a dataset of paired trajectories and corresponding synthetically generated goal sketches. We evaluate this approach on six manipulation skills involving tabletop object rearrangements on an articulated countertop. Experimentally, we find that RT-Sketch performs on par with image- or language-conditioned agents in straightforward settings, while achieving greater robustness when language goals are ambiguous or visual distractors are present. Additionally, we show that RT-Sketch has the capacity to interpret and act upon sketches with varied levels of specificity, ranging from minimal line drawings to detailed, colored drawings. For supplementary material and videos, please refer to our website: http://rt-sketch.github.io.
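The goal-conditioned IL setup the abstract describes can be illustrated with a deliberately tiny linear stand-in: a policy maps a (current observation, goal sketch) pair to an action and is trained by behavioral cloning against demonstrated actions. This is only a toy sketch of the general recipe; the actual RT-Sketch policy is a large vision-based transformer trained on paired trajectories with synthetic goal sketches, so every shape, function, and dimension below is a hypothetical placeholder.

```python
import numpy as np

# Toy goal-conditioned behavioral cloning (NOT the RT-Sketch architecture).
# The policy conditions on a goal vector (standing in for a sketch embedding)
# by concatenating it with the observation; all dimensions are made up.
rng = np.random.default_rng(0)

OBS_DIM, GOAL_DIM, ACT_DIM = 16, 16, 4
W = rng.normal(scale=0.1, size=(ACT_DIM, OBS_DIM + GOAL_DIM))  # linear policy weights

def policy(obs, goal_sketch):
    """Predict an action from the observation concatenated with the goal."""
    x = np.concatenate([obs, goal_sketch])
    return W @ x

def bc_update(W, obs, goal_sketch, expert_action, lr=1e-2):
    """One behavioral-cloning SGD step on the MSE imitation loss."""
    x = np.concatenate([obs, goal_sketch])
    err = W @ x - expert_action       # prediction error vs. demonstrated action
    grad = np.outer(err, x)           # dL/dW for L = 0.5 * ||err||^2
    return W - lr * grad

# Train on synthetic (observation, goal, action) triples from a fixed
# linear "expert" that plays the role of the human demonstrator.
W_true = rng.normal(size=(ACT_DIM, OBS_DIM + GOAL_DIM))
for _ in range(2000):
    obs = rng.normal(size=OBS_DIM)
    goal = rng.normal(size=GOAL_DIM)
    expert = W_true @ np.concatenate([obs, goal])
    W = bc_update(W, obs, goal, expert)
```

Because the goal enters only through the conditioning vector, the same trained policy can in principle be steered to different outcomes at test time by swapping the goal input, which is the property that lets sketches, images, or language embeddings all serve as interchangeable goal modalities in this framework.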

References (47)
  1. Modular multitask reinforcement learning with policy sketches. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 166–175. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/andreas17a.html.
  2. Hindsight experience replay. In 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
  3. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
  4. Sketch-based robot programming. In 2010 25th International Conference of Image and Vision Computing New Zealand, pages 1–8. IEEE, 2010.
  5. Data quality in imitation learning. arXiv preprint arXiv:2306.02437, 2023.
  6. Doodle it yourself: Class incremental learning by drawing a few sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2293–2302, 2022.
  7. Sketch2Saliency: Learning to detect salient objects from human drawings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2733–2743, 2023.
  8. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023a.
  9. RT-1: Robotics Transformer for Real-World Control at Scale. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023b. doi: 10.15607/RSS.2023.XIX.025.
  10. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  11. Robust manipulation with spatial features. In CoRL 2022 Workshop on Pre-training Robot Learning, 2022.
  12. What can human sketches do for object detection? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15083–15094, 2023a.
  13. SceneTrilogy: On human scene-sketch and its complementarity with photo and text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10972–10983, 2023b.
  14. Can foundation models perform zero-shot task specification for robot manipulation? In Learning for Dynamics and Control Conference, pages 893–905. PMLR, 2022.
  15. No, to the right: Online language corrections for robotic manipulation via shared autonomy. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, pages 93–101, 2023.
  16. Mechanical search: Multi-step retrieval of a target object occluded by clutter. In 2019 International Conference on Robotics and Automation (ICRA), pages 1614–1621. IEEE, 2019.
  17. Goal-conditioned imitation learning. Advances in neural information processing systems, 32, 2019.
  18. Learning dense visual correspondences in simulation to smooth and fold real fabrics. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 11515–11522. IEEE, 2021.
  19. RT-Trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv preprint arXiv:2311.01977, 2023.
  20. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  21. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022.
  22. VIMA: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2022.
  23. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023.
  24. Picture that sketch: Photorealistic image generation from abstract sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6850–6861, 2023.
  25. Emergent communication in interactive sketch question answering. arXiv preprint arXiv:2310.15597, 2023.
  26. Photo-sketching: Inferring contour drawings from images. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1403–1412. IEEE, 2019.
  27. Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 1932.
  28. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020.
  29. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.
  30. kPAM: Keypoint affordances for category-level robotic manipulation. In The International Symposium of Robotics Research, pages 132–157. Springer, 2019.
  31. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  32. Sketching robot programs on the fly. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’23, page 584–593, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450399647. doi: 10.1145/3568162.3576991. URL https://doi.org/10.1145/3568162.3576991.
  33. Emergent graphical conventions in a visual communication game. Advances in Neural Information Processing Systems, 35:13119–13131, 2022.
  34. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  35. Goal-conditioned imitation learning using score-based diffusion policies. Robotics: Science and Systems (RSS), 2023.
  36. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  37. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
  38. TokenLearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34:12786–12797, 2021.
  39. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020.
  40. Irwin Sobel. An isotropic 3x3 image gradient operator. Presentation at Stanford A.I. Project 1968, 1968.
  41. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  42. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
  43. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  44. CLIPascene: Scene sketching with different types and levels of abstraction. arXiv preprint arXiv:2211.17256, 2022a.
  45. CLIPasso: Semantically-aware object sketching. ACM Transactions on Graphics (TOG), 41(4):1–11, 2022b.
  46. Form2Fit: Learning shape priors for generalizable assembly from disassembly. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9404–9410. IEEE, 2020.
  47. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
Authors (13)
  1. Priya Sundaresan (20 papers)
  2. Quan Vuong (41 papers)
  3. Jiayuan Gu (28 papers)
  4. Peng Xu (357 papers)
  5. Ted Xiao (40 papers)
  6. Sean Kirmani (18 papers)
  7. Tianhe Yu (36 papers)
  8. Michael Stark (7 papers)
  9. Ajinkya Jain (9 papers)
  10. Karol Hausman (56 papers)
  11. Dorsa Sadigh (162 papers)
  12. Jeannette Bohg (109 papers)
  13. Stefan Schaal (73 papers)
Citations (14)