
Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning (2311.18684v1)

Published 30 Nov 2023 in cs.LG

Abstract: By reusing data throughout training, off-policy deep reinforcement learning algorithms offer improved sample efficiency relative to on-policy approaches. For continuous action spaces, the most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data. These updates are often paired with regularization to combat associated overestimation of $Q$ values. With an eye toward safety, we revisit this strategy in environments with "mixed-sign" reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. We find the combination of function approximation and a term that maximizes $Q$ in the policy update to be problematic in such environments, because systematic errors in value estimation impact the contributions from the competing terms asymmetrically. This results in overemphasis of either incentives or costs and may severely limit learning. We explore two remedies to this issue. First, consistent with prior work, we find that periodic resetting of $Q$ and policy networks can be used to reduce value estimation error and improve learning in this setting. Second, we formulate novel off-policy actor-critic methods for both unconstrained and constrained learning that do not explicitly maximize $Q$ in the policy update. We find that this second approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting. We further find that our approach produces agents that are both competitive with popular methods overall and more reliably competent on frequently-studied control problems that do not have mixed-sign rewards.
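To make the mechanism discussed in the abstract concrete, the sketch below shows (i) a "mixed-sign" reward built from independent incentive and cost terms, (ii) the standard off-policy policy-improvement step that explicitly maximizes a learned Q value over a sampled batch (in the style of DDPG/TD3), and (iii) the periodic network resetting mentioned as the first remedy. This is a minimal PyTorch sketch for illustration only, not the paper's proposed method; the network sizes, function names, and training-loop details are assumptions.

```python
# Illustrative sketch (PyTorch) of the setup described in the abstract.
# NOT the paper's proposed method; names and hyperparameters are assumptions.
import torch
import torch.nn as nn


def mixed_sign_reward(incentive: torch.Tensor, cost: torch.Tensor) -> torch.Tensor:
    """Reward with independent positive (incentive) and negative (cost) terms."""
    return incentive - cost


class Actor(nn.Module):
    """Deterministic policy for a continuous action space."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # actions bounded in [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class Critic(nn.Module):
    """Learned state-action (Q) value function."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


def q_maximizing_actor_update(actor, critic, actor_opt, obs_batch):
    """Policy-improvement step that explicitly maximizes Q over a batch.

    The abstract argues that systematic Q-estimation errors affect the
    incentive and cost contributions asymmetrically under this kind of
    update, overemphasizing one of them in mixed-sign environments.
    """
    actor_loss = -critic(obs_batch, actor(obs_batch)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()


def periodic_reset(module: nn.Module) -> None:
    """Remedy 1 from the abstract: periodically re-initialize Q and policy
    networks to reduce accumulated value-estimation error."""
    for layer in module.modules():
        if isinstance(layer, nn.Linear):
            layer.reset_parameters()


if __name__ == "__main__":
    obs_dim, act_dim = 8, 2
    actor, critic = Actor(obs_dim, act_dim), Critic(obs_dim, act_dim)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
    obs_batch = torch.randn(32, obs_dim)  # stand-in for a replay-buffer sample
    print("actor loss:", q_maximizing_actor_update(actor, critic, actor_opt, obs_batch))
```

The negated mean of Q is the "maximize Q in the policy update" term the abstract identifies as problematic; the paper's second remedy replaces this step with actor-critic updates that avoid explicit Q maximization, which is not sketched here.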

Authors (3)
  1. Jared Markowitz
  2. Jesse Silverberg
  3. Gary Collins