Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization (2311.03351v4)

Published 6 Nov 2023 in cs.LG and cs.RO

Abstract: Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which utilizes an on-policy objective for both offline and online learning. Owing to the alignment of objectives in the two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing for arbitrary combinations of pretraining, fine-tuning, offline, and online learning. Specifically, in the offline phase, Uni-o4 leverages diverse ensemble policies to address the mismatch between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that by employing the method above, the fusion of these two paradigms can yield superior offline initialization as well as stable and rapid online fine-tuning capabilities. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline RL and offline-to-online fine-tuning. Our website: https://lei-kun.github.io/uni-o4/.
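The core mechanism the abstract describes, reusing a clipped on-policy surrogate in the offline phase with the importance ratio taken against an estimated behavior policy, and gating each improvement step with a simple OPE check, can be sketched briefly. The snippet below is a minimal, hypothetical illustration under stated assumptions (PyTorch; a single estimated behavior policy standing in for the paper's ensemble; precomputed offline advantages; `ope_value` as a placeholder for whatever off-policy value estimate is available), not the authors' implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Small diagonal-Gaussian policy; stands in for both the estimated
    behavior policy and the learned policy in this sketch."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs, act):
        dist = torch.distributions.Normal(self.net(obs), self.log_std.exp())
        return dist.log_prob(act).sum(-1)

def clipped_surrogate(policy, reference, obs, act, adv, eps=0.2):
    # Same clipped objective PPO uses online, but the ratio is taken w.r.t. the
    # reference policy (initially the estimated behavior policy), so the offline
    # update needs no extra conservatism term.
    ratio = torch.exp(policy.log_prob(obs, act) - reference.log_prob(obs, act).detach())
    return torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()

def offline_multi_step_improvement(policy, behavior, batch, ope_value, n_steps=4, lr=3e-4):
    """Repeatedly improve against the current reference policy on the fixed dataset,
    keeping a step only if the OPE estimate does not get worse (a stand-in for the
    paper's OPE-based safety check)."""
    obs, act, adv = batch                      # adv: precomputed offline advantages (assumption)
    best_value = ope_value(policy)
    for _ in range(n_steps):
        candidate = GaussianPolicy(obs.shape[-1], act.shape[-1])
        candidate.load_state_dict(policy.state_dict())
        opt = torch.optim.Adam(candidate.parameters(), lr=lr)
        for _ in range(50):                    # inner gradient steps on the offline batch
            opt.zero_grad()
            loss = -clipped_surrogate(candidate, behavior, obs, act, adv)
            loss.backward()
            opt.step()
        value = ope_value(candidate)
        if value >= best_value:                # accept: the improved policy becomes the
            policy, behavior, best_value = candidate, candidate, value  # next reference
        else:
            break                              # OPE suggests no further safe improvement
    return policy
```

Because the offline phase already optimizes the same clipped on-policy surrogate used during online training, the returned policy can, as the abstract argues, move into online fine-tuning without switching objectives or adding regularization.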

Authors (6)
  1. Kun Lei
  2. Zhengmao He
  3. Chenhao Lu
  4. Kaizhe Hu
  5. Yang Gao
  6. Huazhe Xu
Citations (11)
