SAMG: Offline-to-Online Reinforcement Learning via State-Action-Conditional Offline Model Guidance (2410.18626v2)
Abstract: Offline-to-online (O2O) reinforcement learning (RL) pre-trains models on offline data and refines policies through online fine-tuning. However, existing O2O RL algorithms typically need to retain the entire offline dataset to mitigate the effects of out-of-distribution (OOD) data, which significantly limits their efficiency in exploiting online samples. To address this deficiency, we introduce a new paradigm for O2O RL called State-Action-Conditional Offline Model Guidance (SAMG). It freezes the pre-trained offline critic to provide a compact offline prior for each state-action sample, thus eliminating the need to retrain on offline data. The frozen offline critic is combined with the online target critic, weighted by a state-action-adaptive coefficient. This coefficient captures how offline-like each sample is at the state-action level and is updated adaptively during training. In practice, SAMG can be easily integrated with Q-function-based algorithms. Theoretical analysis shows good optimality and lower estimation error, and empirically SAMG outperforms state-of-the-art O2O RL algorithms on the D4RL benchmark.
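The core mechanism the abstract describes, combining a frozen offline critic with the online target critic via a state-action-adaptive coefficient, can be sketched as a simple convex blend of per-sample Q-values. This is a minimal illustration, not the paper's implementation: the function name, the batch values, and the way `alpha` is obtained are all hypothetical, and in SAMG the coefficient is itself learned and updated during training.

```python
def blended_target(q_offline, q_online_target, alpha):
    """Blend a frozen offline critic value with an online target critic value.

    `alpha` is a state-action-adaptive coefficient in [0, 1]: values near 1
    mean the (s, a) pair looks well covered by the offline data, so the
    frozen offline critic is trusted more; values near 0 defer to the
    online target critic.
    """
    alpha = min(max(alpha, 0.0), 1.0)  # keep the coefficient in [0, 1]
    return alpha * q_offline + (1.0 - alpha) * q_online_target

# Hypothetical per-sample values for three (s, a) pairs.
batch = [
    # (Q_offline, Q_online_target, alpha)
    (1.0, 0.0, 1.0),  # offline-like sample: rely on the frozen offline critic
    (2.0, 2.0, 0.5),  # mixed sample: average the two critics
    (3.0, 5.0, 0.0),  # online-like sample: rely on the online target critic
]
blended = [blended_target(qo, qt, a) for qo, qt, a in batch]
# blended == [1.0, 2.0, 5.0]
```

Because the blend only replaces the target value in the Bellman backup, this is how such guidance can slot into any Q-function-based algorithm without touching the rest of the update.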