
A Survey on Transformers in Reinforcement Learning (2301.03044v3)

Published 8 Jan 2023 in cs.LG and cs.AI

Abstract: Transformer has been considered the dominating neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge of using Transformers has appeared in the domain of reinforcement learning (RL), but it is faced with unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. In this paper, we seek to systematically review motivations and progress on using Transformers in RL, provide a taxonomy on existing works, discuss each sub-field, and summarize future prospects.

A Survey on Transformers in Reinforcement Learning

Recent advances in AI have been marked by remarkable achievements across domains including NLP and Computer Vision (CV), largely driven by deep learning models such as Transformers. Having established dominance in supervised learning settings, these models are now being adapted to reinforcement learning (RL), where unique challenges and design choices arise from the nature of RL itself.

Overview

Reinforcement Learning provides a robust framework for sequential decision-making across numerous tasks. While the integration of deep neural networks has consistently enhanced learning-based control, these approaches often suffer from sample inefficiency when applied to real-world scenarios. A promising strategy for addressing this inefficiency is to introduce inductive biases into the architecture of DRL (Deep Reinforcement Learning) agents, akin to practices in supervised learning. Despite significant strides in DRL, however, architectural design remains far less explored than in supervised learning, where architectures such as CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) have been studied extensively.

The Transformer architecture, celebrated for modeling long-range dependencies and for its scalability, has galvanized the NLP and CV domains. Its application within RL is motivated by these successes, although adapting Transformers to the RL setting poses distinct challenges that differ considerably from those in supervised frameworks.

Developments in Transformer-based RL

The integration of Transformers within RL has evolved through varied methodologies aimed at leveraging their architectural strengths:

  1. Representation Learning: Initially, Transformers were employed for representation learning, encoding both local per-timestep sequences (e.g., sets of entities) and temporal sequences of observations. Self-attention supports relational reasoning in multi-entity and multi-agent settings, facilitating entity-based observations and improving policy learning (see the first sketch after this list).
  2. Model Learning: Transformers have also been used to build world models in model-based RL, where they capture history-based dynamics and thus handle partial observability more effectively. Transformer-based world models demonstrate superior data efficiency and significantly mitigate compounding prediction errors over extended rollouts.
  3. Sequential Decision-making: Viewing RL as a sequence modeling problem, Decision Transformer (DT) and Trajectory Transformer (TT) pioneered the treatment of RL tasks as conditional sequence modeling, a framework that sidesteps the inefficiencies of dynamic programming by directly modeling trajectories in offline settings (a minimal sketch of this idea follows the list).
  4. Generalist Agents: Beyond task-specific solutions, Transformers have the potential to generalize policies across tasks and domains. Efforts such as Multi-Game Decision Transformer (MGDT) and prompt-based models reflect the ambition to unify tasks via Transformers, leveraging large datasets across multiple environments to train agents capable of executing diverse tasks.
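To make the relational-reasoning use in item 1 concrete, the sketch below shows a minimal entity encoder that applies self-attention over a set of per-timestep entity observations and pools the result into a single vector for a downstream policy head. This is a hedged illustration in PyTorch; the class name, dimensions, and pooling choice are assumptions for exposition, not an implementation from any surveyed work.

```python
import torch
import torch.nn as nn


class EntityEncoder(nn.Module):
    """Self-attention over a set of entity observations (illustrative sketch)."""

    def __init__(self, entity_dim, embed_dim=64, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(entity_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, entities, padding_mask=None):
        # entities: (B, N, entity_dim); padding_mask: (B, N), True where padded.
        x = self.proj(entities)
        # Each entity attends to every other entity (relational reasoning).
        x, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        if padding_mask is not None:
            # Masked mean-pool so padded entities do not contribute.
            keep = (~padding_mask).unsqueeze(-1).float()
            pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        else:
            pooled = x.mean(dim=1)
        # A single vector the policy/value head can consume.
        return self.out(pooled)
```

Because attention is permutation-invariant over the entity set, the same encoder handles a variable number of entities or agents without architectural changes.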

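Likewise, the sequence-modeling view in item 3 can be sketched as a small return-conditioned Transformer in the spirit of Decision Transformer: trajectories are tokenized as interleaved (return-to-go, state, action) triples, and the action at each step is regressed from the state token under a causal mask. All module names, shapes, and hyperparameters below are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn


class ReturnConditionedTransformer(nn.Module):
    """Minimal Decision-Transformer-style model (illustrative sketch)."""

    def __init__(self, state_dim, act_dim, embed_dim=128, n_layers=3, n_heads=4, max_len=1000):
        super().__init__()
        # One token each for return-to-go, state, and action at every timestep.
        self.embed_rtg = nn.Linear(1, embed_dim)
        self.embed_state = nn.Linear(state_dim, embed_dim)
        self.embed_action = nn.Linear(act_dim, embed_dim)
        self.embed_timestep = nn.Embedding(max_len, embed_dim)
        layer = nn.TransformerEncoderLayer(
            embed_dim, n_heads, dim_feedforward=4 * embed_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(embed_dim, act_dim)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, T, 1), states: (B, T, state_dim),
        # actions: (B, T, act_dim), timesteps: (B, T) long
        B, T = states.shape[:2]
        t_emb = self.embed_timestep(timesteps)
        # Interleave tokens as (R_0, s_0, a_0, R_1, s_1, a_1, ...).
        tokens = torch.stack(
            [
                self.embed_rtg(rtg) + t_emb,
                self.embed_state(states) + t_emb,
                self.embed_action(actions) + t_emb,
            ],
            dim=2,
        ).reshape(B, 3 * T, -1)
        # Causal mask so each token attends only to earlier tokens.
        mask = torch.triu(
            torch.full((3 * T, 3 * T), float("-inf"), device=states.device), diagonal=1
        )
        h = self.encoder(tokens, mask=mask)
        # Predict a_t from the state token at position 3t + 1.
        return self.predict_action(h[:, 1::3])


# Training is plain supervised regression on offline trajectories, e.g.:
# loss = nn.functional.mse_loss(model(rtg, states, actions, timesteps), actions)
```

At evaluation time, the model is conditioned on a desired target return and rolled out autoregressively, with the return-to-go decremented by the rewards actually observed.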
Implications and Future Directions

Integrating Transformers into RL holds substantial promise, yet several challenges persist:

  • Bridging Offline and Online Learning: Real-world RL applications necessitate strategies that effectively blend offline and online paradigms, improving adaptability and reducing dependency on extensive expert data.
  • Optimizing Transformer Architecture for RL: Given the computational intensity and memory footprint of traditional Transformers, tailored architectures designed specifically for decision-making tasks would greatly enhance efficiency and scalability.
  • Theoretical Insights into Combined Learning Approaches: While combining RL with supervised strategies yields intriguing results, comprehensive studies on theoretical integration would bolster understanding and guide practical implementations.

Conclusion

The trajectory of Transformer-based RL shows profound potential and warrants ongoing exploration. The survey offers a meticulous taxonomy of developments, distills practical insights and academic perspectives, and outlines avenues for further innovation that could redefine RL strategies. The promise of Transformers in delivering robust, scalable, and generalizable models across RL tasks continues to spur research and development efforts aligned with broader advances in AI.

Authors (6)
  1. Wenzhe Li
  2. Hao Luo
  3. Zichuan Lin
  4. Chongjie Zhang
  5. Zongqing Lu
  6. Deheng Ye