Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control (2309.14597v3)

Published 26 Sep 2023 in cs.LG

Abstract: Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
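
For intuition, here is a minimal, self-contained sketch of the kind of object the abstract describes: treating the returns obtained one stochastic step away from a fixed policy as a distribution, and summarizing both its mean and its lower tail. The toy environment, the linear policy, the perturbation-based stand-in for a policy update, and the step size are all illustrative assumptions, not the paper's procedure.

```python
# Illustrative sketch (not the paper's method): map a policy's neighborhood to a
# distribution of returns. From fixed parameters theta, apply many independent
# single perturbation steps (standing in for stochastic policy updates) and
# record the return of each perturbed policy. A wide spread of returns marks a
# "noisy neighborhood" of the return landscape.

import numpy as np

rng = np.random.default_rng(0)


def rollout_return(theta, episode_len=50):
    """Return of one episode in a toy 1-D control task under a linear policy.

    The state drifts; the agent is rewarded for keeping it near zero.
    Purely illustrative.
    """
    state = rng.normal()
    total = 0.0
    for _ in range(episode_len):
        action = float(np.clip(theta @ np.array([state, 1.0]), -1.0, 1.0))
        state = 0.9 * state + action + 0.1 * rng.normal()
        total += -state ** 2
    return total


def neighborhood_returns(theta, n_updates=200, step_size=0.05):
    """Sample the return distribution one perturbation step away from theta."""
    returns = []
    for _ in range(n_updates):
        perturbed = theta + step_size * rng.normal(size=theta.shape)
        returns.append(rollout_return(perturbed))
    return np.array(returns)


theta = np.array([-0.5, 0.0])  # a fixed "current" policy
returns = neighborhood_returns(theta)

# The mean can hide instability that the lower tail of the distribution reveals.
print(f"mean return        : {returns.mean():.2f}")
print(f"5th-percentile tail: {np.percentile(returns, 5):.2f}")
print(f"std of returns     : {returns.std():.2f}")
```

In this toy setting, comparing the mean against the 5th-percentile tail is one way to surface the "hidden dimension of policy quality" the abstract alludes to: two policies with similar mean returns can differ sharply in how often a single nearby update produces a failure.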
