Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning (2401.00629v2)

Published 1 Jan 2024 in cs.LG

Abstract: We propose WSAC (Weighted Safe Actor-Critic), a novel algorithm for Safe Offline Reinforcement Learning (RL) under functional approximation, which can robustly optimize policies to improve upon an arbitrary reference policy with limited data coverage. WSAC is designed as a two-player Stackelberg game to optimize a refined objective function. The actor optimizes the policy against two adversarially trained value critics with small importance-weighted Bellman errors, which focus on scenarios where the actor's performance is inferior to the reference policy. In theory, we demonstrate that when the actor employs a no-regret optimization oracle, WSAC achieves a number of guarantees: (i) For the first time in the safe offline RL setting, we establish that WSAC can produce a policy that outperforms any reference policy while maintaining the same level of safety, which is critical to designing a safe algorithm for offline RL. (ii) WSAC achieves the optimal statistical convergence rate of $1/\sqrt{N}$ to the reference policy, where $N$ is the size of the offline dataset. (iii) We theoretically show that WSAC guarantees a safe policy improvement across a broad range of hyperparameters that control the degree of pessimism, indicating its practical robustness. Additionally, we offer a practical version of WSAC and compare it with existing state-of-the-art safe offline RL algorithms in several continuous control environments. WSAC outperforms all baselines across a range of tasks, supporting the theoretical results.
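
The abstract describes WSAC only at a high level: an actor playing against two adversarially trained critics (one for reward, one for cost) inside a Stackelberg game. As a rough illustration of what one such update could look like, the following is a minimal PyTorch sketch of a Lagrangian-style adversarial actor-critic step for safe offline RL. It is not the authors' implementation: the network sizes, the simple policy-versus-reference gap term standing in for the paper's importance-weighted Bellman objective, the Lagrange-multiplier handling of the cost constraint, and all hyperparameters (cost_limit, beta, gamma, learning rates) are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    """Small MLP used for both the policy and the two critics (an assumption)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class WSACStyleAgent:
    """Hypothetical sketch of one adversarial actor-critic update for safe offline RL.

    q_r estimates return, q_c estimates cumulative cost; lam is a Lagrange
    multiplier enforcing a cost budget. Names and hyperparameters are
    illustrative, not taken from the paper.
    """

    def __init__(self, obs_dim, act_dim, cost_limit=0.1, beta=1.0, gamma=0.99):
        self.actor = mlp(obs_dim, act_dim)
        self.q_r = mlp(obs_dim + act_dim, 1)   # reward critic
        self.q_c = mlp(obs_dim + act_dim, 1)   # cost critic
        self.lam = torch.tensor(1.0, requires_grad=True)
        self.beta, self.gamma, self.cost_limit = beta, gamma, cost_limit
        self.critic_opt = torch.optim.Adam(
            list(self.q_r.parameters()) + list(self.q_c.parameters()), lr=3e-4)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=3e-4)
        self.lam_opt = torch.optim.Adam([self.lam], lr=3e-4)

    def act(self, obs):
        # Deterministic tanh-squashed policy, purely for illustration.
        return torch.tanh(self.actor(obs))

    def update(self, obs, act, rew, cost, next_obs, done, ref_act):
        """One gradient step. rew, cost, done are (N, 1) tensors; ref_act comes
        from a hypothetical reference/behavior policy."""
        next_act = self.act(next_obs).detach()
        pi_act_detached = self.act(obs).detach()

        # Critics (inner player): Bellman residual plus an adversarial gap term
        # that makes the reward critic pessimistic, and the cost critic
        # conservative (cost-overestimating), on the learned policy relative to
        # the reference actions. This gap is a crude stand-in for the paper's
        # importance-weighted Bellman objective.
        critic_loss = 0.0
        for q, signal, sign in ((self.q_r, rew, 1.0), (self.q_c, cost, -1.0)):
            target = signal + self.gamma * (1.0 - done) * q(
                torch.cat([next_obs, next_act], dim=-1)).detach()
            bellman = ((q(torch.cat([obs, act], dim=-1)) - target) ** 2).mean()
            gap = (q(torch.cat([obs, pi_act_detached], dim=-1))
                   - q(torch.cat([obs, ref_act], dim=-1))).mean()
            critic_loss = critic_loss + bellman + sign * self.beta * gap
        self.critic_opt.zero_grad()
        critic_loss.backward()
        self.critic_opt.step()

        # Actor (outer player): maximize the pessimistic reward estimate while
        # penalizing the estimated cost through the multiplier.
        sa = torch.cat([obs, self.act(obs)], dim=-1)
        actor_loss = (-self.q_r(sa) + self.lam.detach() * self.q_c(sa)).mean()
        self.actor_opt.zero_grad()
        actor_loss.backward()
        self.actor_opt.step()

        # Multiplier: grow lam when the estimated cost exceeds the budget.
        lam_loss = -self.lam * (self.q_c(sa).mean().detach() - self.cost_limit)
        self.lam_opt.zero_grad()
        lam_loss.backward()
        self.lam_opt.step()
        with torch.no_grad():
            self.lam.clamp_(min=0.0)
```

In WSAC proper, the critics are trained with small importance-weighted Bellman errors and the degree of pessimism is controlled by hyperparameters the paper analyzes theoretically; the gap term and the multiplier update above are deliberate simplifications to keep the example short and runnable.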
