Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step Q-learning: A Novel Correction Approach (2208.00755v4)

Published 1 Aug 2022 in cs.LG and cs.AI

Abstract: Compared to its on-policy counterparts, off-policy model-free deep reinforcement learning can improve data efficiency by repeatedly reusing previously gathered data. However, off-policy learning becomes challenging when the discrepancy between the agent's current policy and the distribution of the collected data grows. Although well-studied importance sampling and off-policy policy gradient techniques compensate for this discrepancy, they usually require long trajectories and introduce further problems, such as vanishing or exploding gradients or the discarding of many useful experiences, which ultimately increases computational complexity. Moreover, their generalization to continuous action domains or to policies approximated by deterministic deep neural networks is severely limited. To overcome these limitations, we introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control. Our method offers a single-step off-policy correction that is applicable to deterministic policy networks. Theoretical and empirical studies demonstrate that it can achieve "safe" off-policy learning and substantially improve on the state of the art by attaining higher returns in fewer steps than competing methods, through an effective schedule of the learning rate in Q-learning and policy optimization.
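
The abstract does not specify the paper's actual similarity measure or learning-rate schedule, so the following is only a minimal sketch of the general idea it describes: a single-step off-policy correction for a deterministic actor, in which each replayed transition's one-step TD update is scaled by how close the stored behavior action is to the action the current policy would take. The Gaussian kernel, network sizes, and all names below are illustrative assumptions, not the paper's method.

# Hypothetical sketch, NOT the paper's exact algorithm: the similarity kernel,
# its bandwidth, and the network architectures are placeholder assumptions.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy: state -> action in [-1, 1]^action_dim."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """Action-value function Q(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def policy_similarity(policy_action, behavior_action, sigma=0.5):
    """Gaussian kernel on the action gap: near 1 when the current policy would
    reproduce the stored action, decaying toward 0 as the policies diverge.
    The kernel form and bandwidth sigma are illustrative choices."""
    gap = (policy_action - behavior_action).pow(2).sum(dim=-1)
    return torch.exp(-gap / (2.0 * sigma ** 2))


def corrected_critic_loss(critic, target_critic, actor, target_actor,
                          batch, gamma=0.99):
    """One-step Q-learning loss with a per-sample similarity weight."""
    state, action, reward, next_state, done = batch
    with torch.no_grad():
        next_action = target_actor(next_state)
        target_q = reward + gamma * (1.0 - done) * target_critic(next_state, next_action)
        # Off-policy correction: down-weight transitions whose behavior action
        # is far from what the current deterministic policy would produce.
        weight = policy_similarity(actor(state), action)
    td_error = critic(state, action) - target_q
    return (weight * td_error.pow(2)).mean()


if __name__ == "__main__":
    # Smoke test with random data (shapes only, no environment interaction).
    state_dim, action_dim, batch_size = 8, 2, 32
    actor, target_actor = Actor(state_dim, action_dim), Actor(state_dim, action_dim)
    critic, target_critic = Critic(state_dim, action_dim), Critic(state_dim, action_dim)
    batch = (torch.randn(batch_size, state_dim),
             torch.rand(batch_size, action_dim) * 2 - 1,
             torch.randn(batch_size),
             torch.randn(batch_size, state_dim),
             torch.zeros(batch_size))
    loss = corrected_critic_loss(critic, target_critic, actor, target_actor, batch)
    loss.backward()
    print(f"weighted critic loss: {loss.item():.4f}")

The motivation for a kernel over actions rather than a density ratio is that a deterministic policy has no action distribution to form importance weights from, which is the limitation the abstract points to; the paper's actual similarity measure and its learning-rate schedule should be taken from the full text.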
