
SAMG: Offline-to-Online Reinforcement Learning via State-Action-Conditional Offline Model Guidance (2410.18626v2)

Published 24 Oct 2024 in cs.LG and cs.AI

Abstract: Offline-to-online (O2O) reinforcement learning (RL) pre-trains models on offline data and refines policies through online fine-tuning. However, existing O2O RL algorithms typically require retaining the offline dataset during fine-tuning to mitigate the effects of out-of-distribution (OOD) data, which significantly limits their efficiency in exploiting online samples. To address this deficiency, we introduce a new paradigm for O2O RL called State-Action-Conditional Offline Model Guidance (SAMG). It freezes the pre-trained offline critic to provide a compact offline understanding of each state-action sample, thus eliminating the need for retraining on offline data. The frozen offline critic is combined with the online target critic, weighted by a state-action-adaptive coefficient. This coefficient aims to capture the offline degree of samples at the state-action level and is updated adaptively during training. In practice, SAMG can be easily integrated with Q-function-based algorithms. Theoretical analysis shows good optimality and lower estimation error. Empirically, SAMG outperforms state-of-the-art O2O RL algorithms on the D4RL benchmark.
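The core mechanism the abstract describes — blending a frozen offline critic with the online target critic via a state-action-adaptive coefficient — can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the authors' implementation; the function and parameter names (`samg_target`, `alpha`, `q_offline_next`, `q_online_next`) are assumptions.

```python
def samg_target(reward, discount, q_offline_next, q_online_next, alpha):
    """Hypothetical SAMG-style TD target.

    Blends the frozen offline critic's value with the online target
    critic's value. `alpha` in [0, 1] is the state-action-adaptive
    coefficient: large when the sample looks offline (in-distribution
    for the pre-training data), small for novel online samples.
    """
    blended_q = alpha * q_offline_next + (1.0 - alpha) * q_online_next
    return reward + discount * blended_q


# Example: an offline-like sample (alpha = 0.8) leans mostly on the
# frozen offline critic's estimate of the next state-action value.
target = samg_target(reward=1.0, discount=0.99,
                     q_offline_next=10.0, q_online_next=6.0, alpha=0.8)
```

Because the offline critic is frozen, this blended target needs no further gradient updates on (or replay of) the offline dataset, which is what lets SAMG drop the offline buffer during online fine-tuning.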
