
QueST: Self-Supervised Skill Abstractions for Learning Continuous Control (2407.15840v3)

Published 22 Jul 2024 in cs.RO

Abstract: Generalization capabilities, or rather a lack thereof, is one of the most important unsolved problems in the field of robot learning, and while several large scale efforts have set out to tackle this problem, unsolved it remains. In this paper, we hypothesize that learning temporal action abstractions using latent variable models (LVMs), which learn to map data to a compressed latent space and back, is a promising direction towards low-level skills that can readily be used for new tasks. Although several works have attempted to show this, they have generally been limited by architectures that do not faithfully capture shareable representations. To address this we present Quantized Skill Transformer (QueST), which learns a larger and more flexible latent encoding that is more capable of modeling the breadth of low-level skills necessary for a variety of tasks. To make use of this extra flexibility, QueST imparts causal inductive bias from the action sequence data into the latent space, leading to more semantically useful and transferable representations. We compare to state-of-the-art imitation learning and LVM baselines and see that QueST's architecture leads to strong performance on several multitask and few-shot learning benchmarks. Further results and videos are available at https://quest-model.github.io/
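The abstract describes latent variable models that map action data to a compressed latent space and back, with QueST using a quantized (discrete) encoding. A minimal sketch of the core quantization step (nearest-codebook lookup, in the style of VQ-VAE) might look like the following. The codebook size, dimensions, and function names here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a codebook of K latent "skill" vectors,
# each of dimension D. Real models learn these jointly with the
# encoder/decoder; here they are random for illustration.
K, D = 16, 4
codebook = rng.normal(size=(K, D))

def quantize(z):
    """Snap each continuous latent vector to its nearest codebook entry.

    z: (T, D) array of per-chunk latents from some encoder.
    Returns (codes, z_q): discrete skill-token indices and the
    quantized vectors the decoder would consume.
    """
    # Squared Euclidean distance from every latent to every code.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)           # (T,) discrete skill tokens
    return codes, codebook[codes]      # (T,), (T, D)

# A stand-in "encoded" action sequence of 8 chunks.
z = rng.normal(size=(8, D))
codes, z_q = quantize(z)
print(codes.shape, z_q.shape)
```

The discrete `codes` are what make the latent space token-like: a downstream autoregressive model (as the abstract's causal inductive bias suggests) can predict skill tokens the way a language model predicts words.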
