Offline Reinforcement Learning with OOD State Correction and OOD Action Suppression (2410.19400v4)

Published 25 Oct 2024 in cs.LG and cs.AI

Abstract: In offline reinforcement learning (RL), addressing the out-of-distribution (OOD) action issue has been a major focus, but we argue that there is also an OOD state issue that impairs performance yet has been underexplored. This issue arises when the agent encounters states outside the offline dataset at test time, leading to uncontrolled behavior and performance degradation. To address it, we propose SCAS, a simple yet effective approach that unifies OOD state correction and OOD action suppression in offline RL. Technically, SCAS achieves value-aware OOD state correction, steering the agent from OOD states back to high-value in-distribution states. Theoretical and empirical results show that SCAS also has the effect of suppressing OOD actions. On standard offline RL benchmarks, SCAS achieves excellent performance without additional hyperparameter tuning. Moreover, owing to its OOD state correction, SCAS demonstrates enhanced robustness against environmental perturbations.
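
For intuition only, the sketch below illustrates what "value-aware OOD state correction" can look like in code. It is not the paper's actual algorithm: SCAS folds the correction into policy learning, whereas this sketch simply maps a possibly OOD state to a nearby, high-value in-distribution state. The names correct_ood_state, dataset_states, and state_values are hypothetical and introduced here purely for illustration.

```python
# Minimal conceptual sketch (not the SCAS objective) of value-aware
# OOD state correction: given a possibly out-of-distribution state,
# choose a dataset state that is both close to it and has a high
# estimated value.
#
# Assumptions (hypothetical): `dataset_states` is an (N, d) array of
# states from the offline dataset; `state_values` is an (N,) array of
# value estimates for those states (e.g., from a learned V-function).

import numpy as np

def correct_ood_state(state, dataset_states, state_values,
                      dist_temp=1.0, value_weight=0.5):
    """Return an in-distribution state that trades off proximity to
    `state` against estimated value. A softmax over the same scores
    could be used for a stochastic correction instead of the argmax."""
    dists = np.linalg.norm(dataset_states - state, axis=1)
    # Higher score = closer to the current state and higher estimated value.
    scores = -dists / dist_temp + value_weight * state_values
    return dataset_states[np.argmax(scores)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dataset_states = rng.normal(size=(100, 3))   # in-distribution states
    state_values = rng.uniform(size=100)         # value estimates for those states
    ood_state = np.array([5.0, 5.0, 5.0])        # a state far from the dataset
    target = correct_ood_state(ood_state, dataset_states, state_values)
    print("corrected target state:", target)
```

In the paper's setting, such a correction target would be used to train the agent so that, from OOD states, its transitions lead back toward high-value in-distribution states; per the abstract, this also has the side effect of suppressing OOD actions.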
