Supervised Pretraining Can Learn In-Context Reinforcement Learning (2306.14892v1)
Abstract: Large transformer models trained on diverse datasets have shown a remarkable ability to learn in-context, achieving high few-shot performance on tasks they were not explicitly trained to solve. In this paper, we study the in-context learning capabilities of transformers in decision-making problems, i.e., reinforcement learning (RL) for bandits and Markov decision processes. To do so, we introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action given a query state and an in-context dataset of interactions, across a diverse set of tasks. This procedure, while simple, produces a model with several surprising capabilities. We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline, despite not being explicitly trained to do so. The model also generalizes beyond the pretraining distribution to new tasks and automatically adapts its decision-making strategies to unknown structure. Theoretically, we show DPT can be viewed as an efficient implementation of Bayesian posterior sampling, a provably sample-efficient RL algorithm. We further leverage this connection to provide guarantees on the regret of the in-context algorithm yielded by DPT, and prove that it can learn faster than algorithms used to generate the pretraining data. These results suggest a promising yet simple path towards instilling strong in-context decision-making abilities in transformers.
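To make the pretraining recipe in the abstract concrete, below is a minimal, self-contained PyTorch sketch for the bandit case (where the query state is trivial): tasks are Gaussian bandits drawn from a prior, the in-context dataset is produced by a uniform behavior policy, and a transformer is trained with cross-entropy to predict each task's optimal arm from that dataset. The architecture, pooling scheme, and hyperparameters here are illustrative assumptions, not the paper's exact configuration.

```python
# Toy DPT-style supervised pretraining on Gaussian bandits (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ARMS, CONTEXT_LEN, D_MODEL = 5, 20, 32

class DPTModel(nn.Module):
    """Maps an in-context dataset of (arm, reward) pairs to logits over
    which arm is optimal for the underlying task."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(NUM_ARMS + 1, D_MODEL)   # one-hot arm + scalar reward
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, NUM_ARMS)

    def forward(self, context):                  # context: (batch, T, NUM_ARMS + 1)
        h = self.encoder(self.embed(context))
        return self.head(h.mean(dim=1))          # pool over the dataset, predict optimal arm

def sample_pretraining_batch(batch_size):
    """Draw tasks from the prior, roll out a uniform behavior policy to form
    the in-context dataset, and label each task with its optimal arm."""
    means = torch.randn(batch_size, NUM_ARMS)                   # task ~ prior
    arms = torch.randint(NUM_ARMS, (batch_size, CONTEXT_LEN))   # behavior policy
    rewards = means.gather(1, arms) + 0.1 * torch.randn(batch_size, CONTEXT_LEN)
    context = torch.cat([F.one_hot(arms, NUM_ARMS).float(),
                         rewards.unsqueeze(-1)], dim=-1)
    return context, means.argmax(dim=1)                         # supervision label

model = DPTModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    context, optimal_arm = sample_pretraining_batch(64)
    loss = F.cross_entropy(model(context), optimal_arm)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At deployment, the abstract's posterior-sampling view corresponds to sampling each action from the model's predicted distribution over the optimal arm and appending the observed (action, reward) pair to the in-context dataset, so the predictive distribution sharpens as evidence accumulates, much like a Bayesian posterior.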
Authors: Jonathan N. Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, Ofir Nachum, Emma Brunskill