A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning (2312.07685v1)
Abstract: Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on balancing the RL objective against pessimism, or on how offline and online samples are utilized. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the slow performance improvement and the instability of online finetuning stem from inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank of Q-values provide a misleading signal for the policy update, rendering standard offline RL algorithms, such as CQL and TD3-BC, ineffective during online finetuning. Based on this observation, we address the problem of Q-value estimation with two techniques: (1) perturbed value updates and (2) an increased frequency of Q-value updates. The first technique smooths out biased Q-value estimates with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoCo and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues and consistently improves performance over state-of-the-art methods by up to 83.1%.
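The abstract describes the two ingredients only at a high level. Below is a minimal PyTorch sketch of one plausible reading: the "perturbed value update" is realized as noise injected into the target action before evaluating the target Q-value (an assumption on our part, in the spirit of target-policy smoothing), and the "increased frequency of Q-value updates" as running several critic updates per environment step. Network sizes, the noise scale, and the `num_q_updates` parameter are illustrative, not the exact SO2 hyperparameters.

```python
# Sketch of (1) a perturbed value update and (2) a raised Q-update frequency.
# Details such as noise_std, noise_clip, and num_q_updates are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


def perturbed_q_update(q, q_target, policy, batch, optimizer,
                       gamma=0.99, noise_std=0.2, noise_clip=0.5,
                       num_q_updates=10):
    """Run several Q updates per environment step using a perturbed target action."""
    obs, act, rew, next_obs, done = batch  # rew, done shaped (batch_size, 1)
    for _ in range(num_q_updates):  # (2) increased Q-update frequency
        with torch.no_grad():
            next_act = policy(next_obs)
            # (1) Perturb the target action to smooth sharp peaks in the Q estimate.
            noise = (torch.randn_like(next_act) * noise_std).clamp(-noise_clip, noise_clip)
            next_act = (next_act + noise).clamp(-1.0, 1.0)
            target = rew + gamma * (1.0 - done) * q_target(next_obs, next_act)
        loss = F.mse_loss(q(obs, act), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Polyak-average the target critic after each gradient step.
        with torch.no_grad():
            for p, p_t in zip(q.parameters(), q_target.parameters()):
                p_t.mul_(0.995).add_(0.005 * p)
```

Here `policy` is assumed to be a deterministic actor mapping observations to actions in [-1, 1]; raising `num_q_updates` above one is what the abstract refers to as accelerating learning to work off the bias inherited from pretraining.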
- An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 104–114. PMLR.
- Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in Neural Information Processing Systems, 35: 28955–28971.
- Uncertainty-based offline reinforcement learning with diversified Q-ensemble. Advances in neural information processing systems, 34: 7436–7447.
- Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning. In International Conference on Learning Representations.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- OpenAI Gym. arXiv preprint arXiv:1606.01540.
- Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.
- DI-engine Contributors. 2021. DI-engine: OpenDILab Decision Intelligence Engine. https://github.com/opendilab/DI-engine.
- D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
- A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34: 20132–20145.
- Off-policy deep reinforcement learning without exploration. In International conference on machine learning, 2052–2062. PMLR.
- Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning. In Conference on Robot Learning, 1025–1037. PMLR.
- Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, 1861–1870. PMLR.
- Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
- Learning attractor landscapes for learning motor primitives. Advances in neural information processing systems, 15.
- Learning from limited demonstrations. Advances in Neural Information Processing Systems, 26.
- Offline Reinforcement Learning with Implicit Q-Learning. In International Conference on Learning Representations.
- Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32.
- Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems, 33: 1179–1191.
- Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, 1702–1712. PMLR.
- Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning. In PRL Workshop at IJCAI.
- Ace: Cooperative multi-agent q-learning with bidirectional action-dependency. In Proceedings of the AAAI conference on artificial intelligence, volume 37, 8536–8544.
- Masked Pretraining for Multi-Agent Decision Making. arXiv preprint arXiv:2310.11846.
- Combining self-supervised learning and imitation for vision-based rope manipulation. In 2017 IEEE international conference on robotics and automation (ICRA), 2146–2153. IEEE.
- AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359.
- PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32: 8026–8037.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
- A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11: 3137–3181.
- MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033. IEEE.
- Efficient Reinforcement Learning for Autonomous Driving with Parameterized Skills and Priors. In Robotics: Science and Systems (RSS).
- Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.
- MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33: 14129–14142.
- Policy Expansion for Bridging Offline-to-Online Reinforcement Learning. In The Eleventh International Conference on Learning Representations.
- Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning.
- Online decision transformer. In international conference on machine learning, 27042–27059. PMLR.
- Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. In 2019 International Conference on Robotics and Automation (ICRA), 3651–3657. IEEE.
- Yinmin Zhang
- Jie Liu
- Chuming Li
- Yazhe Niu
- Yaodong Yang
- Yu Liu
- Wanli Ouyang