Zero-Shot Reinforcement Learning from Low Quality Data (2309.15178v3)
Abstract: Zero-shot reinforcement learning (RL) promises to provide agents that can perform any task in an environment after an offline, reward-free pre-training phase. Methods leveraging successor measures and successor features have shown strong performance in this setting, but require access to large heterogeneous datasets for pre-training, which cannot be expected for most real-world problems. Here, we explore how the performance of zero-shot RL methods degrades when trained on small homogeneous datasets, and propose fixes inspired by conservatism, a well-established feature of performant single-task offline RL algorithms. We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low-quality datasets, and perform no worse on high-quality datasets. Somewhat surprisingly, our proposals also outperform baselines that get to see the task during training. Our code is available via https://enjeeneer.io/projects/zero-shot-rl/ .
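The abstract only alludes to the conservatism being applied; as a minimal, hypothetical sketch of the general idea, the snippet below shows a CQL-style value penalty that pushes Q-values down on sampled (likely out-of-distribution) actions and up on dataset actions. This is an illustration of the flavour of regularizer involved, not the paper's released implementation, and the names (`conservative_penalty`, `q_fn`, `alpha`) are assumptions made for the example.

```python
import torch

def conservative_penalty(q_fn, obs, dataset_actions, num_random=10):
    """CQL-style regularizer: suppress Q-values on randomly sampled
    (likely out-of-distribution) actions, raise them on dataset actions."""
    batch, obs_dim = obs.shape
    act_dim = dataset_actions.shape[-1]
    # Uniform random actions in [-1, 1] stand in for out-of-distribution actions.
    rand_actions = torch.empty(batch, num_random, act_dim).uniform_(-1.0, 1.0)
    obs_rep = obs.unsqueeze(1).expand(-1, num_random, -1).reshape(-1, obs_dim)
    q_rand = q_fn(obs_rep, rand_actions.reshape(-1, act_dim)).reshape(batch, num_random)
    q_data = q_fn(obs, dataset_actions)
    # Soft maximum over sampled actions minus the value of actions seen in the data.
    return torch.logsumexp(q_rand, dim=1).mean() - q_data.mean()

# Hypothetical usage inside a critic update:
#   q_fn = lambda s, a: critic(torch.cat([s, a], dim=-1)).squeeze(-1)
#   loss = td_loss + alpha * conservative_penalty(q_fn, obs, actions)
```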