
Conservative State Value Estimation for Offline Reinforcement Learning (2302.06884v2)

Published 14 Feb 2023 in cs.LG and cs.AI

Abstract: Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the current learned policy, leading to learning failure in practice. The common remedy is to incorporate a penalty term into the reward or value estimation in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function by directly imposing a penalty on OOD states. Compared to prior work, CSVE enables more effective state-value estimation with conservative guarantees and, in turn, better policy optimization. Building on CSVE, we develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states "around" the dataset, and the actor applies advantage-weighted updates extended with state exploration to improve the policy. We evaluate on the classic continuous control tasks of D4RL, showing that our method outperforms conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
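The abstract describes two coupled updates: a critic that regresses V toward Bellman targets while penalizing values on states sampled around the dataset, and an actor trained by advantage-weighted regression. The sketch below illustrates one plausible reading in PyTorch; the Gaussian state sampler (perturb_states), the penalty weight beta, and the temperature lam are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of CSVE-style losses as described in the abstract.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


def perturb_states(states: torch.Tensor, noise_scale: float = 0.1) -> torch.Tensor:
    # Sample states "around" the dataset with Gaussian noise; the paper may
    # instead use a learned dynamics model or another sampling scheme.
    return states + noise_scale * torch.randn_like(states)


def conservative_v_loss(v_net: nn.Module,
                        states: torch.Tensor,
                        bellman_targets: torch.Tensor,
                        beta: float = 5.0) -> torch.Tensor:
    # Standard Bellman regression on dataset states ...
    v_data = v_net(states)
    bellman_err = ((v_data - bellman_targets) ** 2).mean()
    # ... plus a conservatism penalty: push V down on sampled OOD states and
    # up on dataset states, so the learned V tends to lower-bound true values.
    v_ood = v_net(perturb_states(states))
    penalty = v_ood.mean() - v_data.mean()
    return bellman_err + beta * penalty


def advantage_weighted_actor_loss(policy: nn.Module,
                                  states: torch.Tensor,
                                  actions: torch.Tensor,
                                  advantages: torch.Tensor,
                                  lam: float = 1.0) -> torch.Tensor:
    # Advantage-weighted regression: weight the log-likelihood of dataset
    # actions by exp(A / lam), clipped for numerical stability.
    dist = policy(states)  # assumed to return a torch.distributions object
    log_probs = dist.log_prob(actions).sum(-1)
    weights = torch.clamp((advantages / lam).exp(), max=100.0).detach()
    return -(weights * log_probs).mean()
```

In a full training loop one would alternate these two losses, compute advantages from the learned V-function (e.g. a TD estimate of Q(s, a) - V(s)), and tune beta and lam per task; the abstract's "state exploration" extension to the actor is omitted from this sketch.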

References (35)
  1. Off-policy deep reinforcement learning without exploration. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2052–2062. PMLR, 09–15 Jun 2019.
  2. Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer, 2012.
  3. Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 19967–20025. PMLR, 17–23 Jul 2022.
  4. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  5. The importance of pessimism in fixed-dataset policy optimization. In International Conference on Learning Representations, 2020.
  6. COMBO: Conservative offline model-based policy optimization. Advances in Neural Information Processing Systems, 34:28954–28967, 2021.
  7. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  8. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
  9. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  10. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. In International Conference on Learning Representations, 2021.
  11. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  12. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
  13. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  14. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  15. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  16. MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33:21810–21823, 2020.
  17. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 32, 2019.
  18. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. CoRR, abs/1910.00177, 2019.
  19. Jonathon Shlens. Notes on Kullback-Leibler divergence and likelihood. CoRR, abs/1404.2000, 2014.
  20. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
  21. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2021.
  22. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
  23. RAMBO-RL: Robust adversarial model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 2022.
  24. Uncertainty weighted actor-critic for offline reinforcement learning. arXiv preprint arXiv:2105.08140, 2021.
  25. Stabilizing off-policy Q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
  26. CoinDICE: Off-policy confidence interval estimation. Advances in Neural Information Processing Systems, 33:9398–9411, 2020.
  27. Off-policy evaluation via the regularized lagrangian. Advances in Neural Information Processing Systems, 33:6551–6561, 2020.
  28. Offline reinforcement learning with Fisher divergence critic regularization. In International Conference on Machine Learning, pages 5774–5783. PMLR, 2021.
  29. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768–7778, 2020.
  30. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR, 2020.
  31. Offline model-based adaptable policy learning. Advances in Neural Information Processing Systems, 34:8432–8443, 2021.
  32. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pages 3652–3661. PMLR, 2019.
  33. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3647–3655, 2019.
  34. d3rlpy: An offline deep reinforcement learning library. arXiv preprint arXiv:2111.03788, 2021.
  35. A closer look at offline RL agents. In Advances in Neural Information Processing Systems, 2022.
Authors (8)
  1. Liting Chen (6 papers)
  2. Jie Yan (25 papers)
  3. Zhengdao Shao (1 paper)
  4. Lu Wang (329 papers)
  5. Qingwei Lin (81 papers)
  6. Saravan Rajmohan (85 papers)
  7. Thomas Moscibroda (8 papers)
  8. Dongmei Zhang (193 papers)
Citations (4)
