Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions (2309.10150v2)

Published 18 Sep 2023 in cs.RO, cs.AI, and cs.LG

Abstract: In this work, we present a scalable reinforcement learning method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups. We therefore refer to the method as Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as separate tokens, we can apply effective high-capacity sequence modeling techniques for Q-learning. We present several design decisions that enable good performance with offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large diverse real-world robotic manipulation task suite. The project's website and videos can be found at https://qtransformer.github.io

Authors (25)
  1. Yevgen Chebotar (28 papers)
  2. Quan Vuong (41 papers)
  3. Alex Irpan (23 papers)
  4. Karol Hausman (56 papers)
  5. Fei Xia (111 papers)
  6. Yao Lu (212 papers)
  7. Aviral Kumar (74 papers)
  8. Tianhe Yu (36 papers)
  9. Alexander Herzog (32 papers)
  10. Karl Pertsch (35 papers)
  11. Keerthana Gopalakrishnan (14 papers)
  12. Julian Ibarz (26 papers)
  13. Ofir Nachum (64 papers)
  14. Sumedh Sontakke (4 papers)
  15. Grecia Salazar (3 papers)
  16. Huong T Tran (1 paper)
  17. Jodilyn Peralta (4 papers)
  18. Clayton Tan (4 papers)
  19. Deeksha Manjunath (2 papers)
  20. Jaspiar Singht (1 paper)
Citations (64)

Summary

Q-Transformer: Autoregressive Q-Functions for Scalable Offline Reinforcement Learning

The paper introduces Q-Transformer, a method that uses Transformer architectures to represent Q-functions in reinforcement learning (RL). The work targets scalable offline RL for training multi-task policies from large datasets, combining Q-learning with high-capacity sequence modeling so that robotic systems can learn diverse manipulation tasks from both human demonstrations and autonomously collected data.

Methodology

Q-Transformer Approach:

  1. Autoregressive Modeling: Q-Transformer discretizes each action dimension and treats each discretized dimension as a separate token, so the Q-values of an action are predicted autoregressively, one dimension at a time. This framing lets high-capacity sequence models represent the Q-function and makes it practical to train complex, multi-task policies.
  2. Temporal Difference Backups: The Q-function is trained with offline temporal-difference backups, with the Transformer providing a scalable representation, so that per-dimension Q-learning is cast as a sequence-modeling problem.
  3. Conservative Q-Function Regularizer: Q-Transformer incorporates a conservative regularizer that controls distributional shift in offline RL. It penalizes out-of-distribution actions by pushing their Q-values toward the minimum attainable return, keeping value estimation robust.
  4. Monte Carlo and Multi-Step Returns: Training additionally uses Monte Carlo returns and n-step updates, which speed up value propagation and make learning more efficient. (A minimal sketch of these components follows this list.)
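To make the per-dimension TD backup, the conservative regularizer, and the Monte Carlo lower bound concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the `q_net(obs, prev_action_tokens)` interface, the batch field names, and the hyperparameters (`NUM_BINS`, `GAMMA`, `CONS_WEIGHT`) are illustrative assumptions.

```python
# Hedged sketch of a Q-Transformer-style training loss; not the authors' code.
import torch
import torch.nn.functional as F

NUM_BINS = 256      # bins per action dimension (assumption)
GAMMA = 0.98        # discount factor (assumption)
CONS_WEIGHT = 1.0   # weight on the conservative regularizer (assumption)


def q_transformer_loss(q_net, target_q_net, batch):
    """TD + conservative-regularization loss for tokenized actions.

    Assumes q_net(obs, prev_action_tokens) returns per-bin Q-values for
    every action dimension, teacher-forced on the dataset action tokens.

    Illustrative batch fields:
      obs, next_obs : observations at t and t+1 (n-step variants adjust the discount)
      action_tokens : (B, D) one discretized token per action dimension
      reward        : (B,) reward at time t
      mc_return     : (B,) Monte Carlo return from time t
    """
    tokens = batch["action_tokens"]                                # (B, D), long
    B, D = tokens.shape

    # Teacher-forced Q-values for every action dimension: (B, D, NUM_BINS).
    q_all = q_net(batch["obs"], tokens[:, :-1])

    with torch.no_grad():
        # Last dimension bootstraps across time: r + gamma * max_{a1'} Q(s', a1').
        # With no previous tokens, the target net returns (B, 1, NUM_BINS),
        # i.e. Q over the first dimension's bins at the next state.
        next_q_first = target_q_net(batch["next_obs"], tokens.new_zeros(B, 0))
        last_target = batch["reward"] + GAMMA * next_q_first[:, 0].max(-1).values

        # Intermediate dimensions bootstrap within the time step: the target
        # for dimension i is the max over bins of dimension i+1 (no reward,
        # no discount inside a single step).
        target_q_all = target_q_net(batch["obs"], tokens[:, :-1])  # (B, D, NUM_BINS)
        inter_targets = target_q_all[:, 1:].max(-1).values         # (B, D-1)

        targets = torch.cat([inter_targets, last_target.unsqueeze(1)], dim=1)
        # Monte Carlo lower bound: returns observed in the dataset are
        # achievable, so take the max of the TD target and the MC return.
        targets = torch.maximum(targets, batch["mc_return"].unsqueeze(1))

    # TD error only on the bins actually taken in the dataset.
    q_taken = q_all.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (B, D)
    td_loss = F.mse_loss(q_taken, targets)

    # Conservative regularizer: push Q-values of all *other* bins toward the
    # minimum attainable return (zero here, assuming rewards in [0, 1]).
    taken_mask = F.one_hot(tokens, NUM_BINS).bool()                # (B, D, NUM_BINS)
    cons_loss = (q_all[~taken_mask] ** 2).mean()

    return td_loss + CONS_WEIGHT * cons_loss
```

At inference time the same tokenized Q-function can be maximized greedily one action dimension at a time, which keeps action selection tractable even for high-dimensional action spaces.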

Experimental Evaluation

The framework was evaluated on a large-scale, real-world robotic manipulation benchmark covering multi-task challenges such as picking and placing objects. The dataset includes 38,000 successful demonstrations, supplemented by 20,000 episodes of failed, autonomously collected attempts that capture the robots' exploratory behavior.

Results: Q-Transformer outperformed prior imitation learning algorithms and alternative offline RL methods. The higher success rates underscore the architecture's ability to effectively combine and exploit information from both demonstration data and autonomously collected data.

Implications and Future Directions

Practical Implications:

  • Robust Policy Learning: Q-Transformer promises an efficient learning methodology that can train policies capable of outperforming human teleoperators, enhancing the proficiency of robotic systems.
  • Scalability Across Environments: Its efficacy in handling diverse datasets promotes adaptability across various real-world tasks and environments, emphasizing deployment potential in robotics.

Theoretical Implications:

  • Transforming RL with High-Capacity Models: The successful integration of transformers into RL invites broader discussion of their applicability in other RL paradigms.
  • Advanced Conservative Mechanisms: The algorithm’s approach to mitigating distributional shift presents a promising avenue for other RL methods aiming to achieve stability in offline settings.

Future Work:

  • Extended Real-World Applications: Further exploration is warranted in evaluating Q-Transformer’s ability to scale in high-dimensional control scenarios, such as humanoid robots, which may involve intricate action dynamics.
  • Online Adjustment and Finetuning: Investigating methods for online finetuning within the Q-Transformer framework could foster real-time policy improvements, accentuating autonomous capabilities.

The melding of sequence modeling with Q-learning marks a promising frontier for reinforcement learning, especially for synthesizing robotic policies from extensive offline datasets. The advances introduced by Q-Transformer open up substantive discussion of the architectural designs that could reshape RL frameworks in robotics and beyond.
