Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL (2405.18520v1)

Published 28 May 2024 in cs.LG and cs.AI

Abstract: Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks, by leveraging previously collected data for policy learning. However, most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer, limiting sample efficiency and policy performance. In this work, we discover that concurrently training an offline RL policy based on the shared online replay buffer can sometimes outperform the original online learning policy, though the occurrence of such performance gains remains uncertain. This motivates a new possibility of harnessing the emergent outperforming offline optimal policy to improve online policy learning. Based on this insight, we present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy through value comparison, and uses it as an adaptive constraint to guarantee stronger policy learning performance. Our experiments demonstrate that OBAC outperforms other popular model-free RL baselines and rivals advanced model-based RL methods in terms of sample efficiency and asymptotic performance across 53 tasks spanning 6 task suites.
