Dissecting Deep RL with High Update Ratios: Combatting Value Divergence (2403.05996v3)
Abstract: We show that, by combatting value function divergence, deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples. Under large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we investigate the phenomena leading to the primacy bias. We inspect the early stages of training that were conjectured to cause the failure to learn and find that one fundamental challenge is a long-standing acquaintance: value function divergence. Overinflated Q-values are found not only on out-of-distribution but also on in-distribution data, and can be linked to overestimation of predictions for unseen actions, propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in part, the prior explanation for sub-optimal learning due to overfitting early data.
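To make the normalization concrete, below is a minimal sketch of what a unit-ball normalized critic could look like, assuming the unit-ball normalization simply rescales the penultimate-layer features to unit L2 norm before the final linear layer. The class and argument names (`UnitBallCritic`, `obs_dim`, `act_dim`, `hidden`) are illustrative, and the exact placement of the normalization in the authors' architecture may differ.

```python
import torch
import torch.nn as nn


class UnitBallCritic(nn.Module):
    """Q-network whose penultimate features are projected onto the unit ball.

    Hypothetical sketch: dividing the penultimate activations by their L2 norm
    bounds the scale of the features feeding the final linear layer, which
    limits how far Q-value predictions can inflate under many gradient updates.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        feats = self.trunk(torch.cat([obs, act], dim=-1))
        # Unit-ball normalization: rescale features to unit L2 norm.
        feats = feats / feats.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        return self.head(feats)


# Illustrative usage: a batch of 32 observations and actions yields (32, 1) Q-values.
q_net = UnitBallCritic(obs_dim=24, act_dim=6)
q_values = q_net(torch.randn(32, 24), torch.randn(32, 6))
```

The design intuition, under these assumptions, is that bounding the feature norm caps the magnitude of the critic's outputs (up to the final layer's weights), curbing value divergence without resetting network parameters.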
- The primacy bias in deep reinforcement learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16828–16847. PMLR, 17–23 Jul 2022.
- When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.
- Revisiting fundamentals of experience replay. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
- Randomized ensembled double q-learning: Learning fast without a model. In International Conference on Learning Representations, 2021.
- Dropout q-functions for doubly efficient reinforcement learning. In International Conference on Learning Representations, 2022.
- Sample-efficient reinforcement learning by breaking the replay ratio barrier. In The Eleventh International Conference on Learning Representations, 2023.
- Bigger, better, faster: Human-level Atari with human-level efficiency. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 30365–30380. PMLR, 23–29 Jul 2023.
- Sample-efficient and safe deep reinforcement learning via reset deep ensemble agents. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 53239–53260. Curran Associates, Inc., 2023.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
- Addressing function approximation error in actor-critic methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596. PMLR, 10–15 Jul 2018.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.
- A minimalist approach to offline reinforcement learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
- A simple weight decay can improve generalization. In J. Moody, S. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4. Morgan-Kaufmann, 1991.
- Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
- Root mean square layer normalization. In Advances in Neural Information Processing Systems 32, Vancouver, Canada, 2019.
- Striving for simplicity and performance in off-policy DRL: Output normalization and non-uniform sampling. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10070–10080. PMLR, 13–18 Jul 2020.
- Is high variance unavoidable in RL? a case study in continuous control. In International Conference on Learning Representations, 2022.
- dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. doi: 10.1016/j.simpa.2020.100022.
- TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024.
- Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
- Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779.
- Continuous control with deep reinforcement learning. In Yoshua Bengio and Yann LeCun, editors, International Conference on Learning Representations, 2016.
- Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870. PMLR, 10–15 Jul 2018.
- Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. doi: 10.1137/0330046.
- Understanding plasticity in neural networks. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 23190–23211. PMLR, 23–29 Jul 2023.
- Harnessing structures for value-based planning and reinforcement learning. In International Conference on Learning Representations, 2020.
- Implicit under-parameterization inhibits data-efficient deep reinforcement learning. In International Conference on Learning Representations, 2021.
- Dean A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988.
- Robot learning from demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 12–20, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. ISBN 1558604863.
- Generalization and regularization in DQN. CoRR, abs/1810.00123, 2018.
- Regularization matters in policy optimization - an empirical study on continuous control. In International Conference on Learning Representations, 2021.
- Efficient deep reinforcement learning requires regulating overfitting. In The Eleventh International Conference on Learning Representations, 2023.
- Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.
- Layer normalization, 2016.
- Understanding and improving layer normalization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
- Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, ICML ’10, pages 807–814, 2010.
- Deep reinforcement learning at the edge of the statistical precipice. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
- Reducing estimation bias via triplet-average deep deterministic policy gradient. IEEE Transactions on Neural Networks and Learning Systems, 31(11):4933–4945, 2020. doi: 10.1109/TNNLS.2019.2959129.
- Maxmin q-learning: Controlling the estimation bias of q-learning. In International Conference on Learning Representations, 2020.
- Estimation error correction in deep reinforcement learning for deterministic actor-critic methods. In 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI), pages 137–144, 2021. doi: 10.1109/ICTAI52525.2021.00027.
- Bridging RL theory and practice with the effective horizon. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Charles Anderson. Q-learning with hidden-unit restarting. In S. Hanson, J. Cowan, and C. Giles, editors, Advances in Neural Information Processing Systems, volume 5. Morgan-Kaufmann, 1992.
- Issues in using function approximation for reinforcement learning. In Michael Mozer, Paul Smolensky, David Touretzky, Jeffrey Elman, and Andreas Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, pages 255–263. Lawrence Erlbaum, 1993.
- Off-policy temporal difference learning with function approximation. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 417–424, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558607781.
- Estimator variance in reinforcement learning: Theoretical problems and practical solutions. 1997.
- Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, February 2007.
- Hado van Hasselt. Double q-learning. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.
- Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2094–2100. AAAI Press, 2016.
- Weighted double q-learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3455–3461, 2017. doi: 10.24963/ijcai.2017/483.
- Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 176–185. PMLR, 06–11 Aug 2017.
- Bias-corrected q-learning to control max-operator bias in q-learning. In Proceedings of the 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 93–99, 2013. ISBN 9781467359252. doi: 10.1109/ADPRL.2013.6614994.
- Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 202–211, Arlington, Virginia, USA, 2016. AUAI Press. ISBN 9780996643115.
- Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In International Conference on Machine Learning. PMLR, 2021.
- Ensemble bootstrapping for q-learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8454–8463. PMLR, 18–24 Jul 2021.
- CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PczQtTsTIX.
- Transient non-stationarity and generalisation in deep reinforcement learning. In International Conference on Learning Representations, 2021.
- Understanding and preventing capacity loss in reinforcement learning. In International Conference on Learning Representations, 2021.
- Loss of plasticity in continual deep reinforcement learning, 2023.
- Deep reinforcement learning with plasticity injection. Advances in Neural Information Processing Systems, 36, 2024.
- The dormant neuron phenomenon in deep reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Disentangling the causes of plasticity loss in neural networks, 2024.
- Fast and accurate deep network learning by exponential linear units (elus). In Yoshua Bengio and Yann LeCun, editors, International Conference on Learning Representations, 2016.
- A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951. doi: 10.1214/aoms/1177729586.
- Learning internal representations by error propagation, pages 318–362. MIT Press, Cambridge, MA, USA, 1986. ISBN 026268053X.
- On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
- William C Dabney. Adaptive step-sizes for reinforcement learning. 2014.
- Resetting the optimizer in deep RL: An empirical study. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Tactical optimism and pessimism for deep reinforcement learning. Advances in Neural Information Processing Systems, 2021.
- Playing Atari with deep reinforcement learning. In NeurIPS Deep Learning Workshop, 2013.
- Momentum in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, 2020.
- PID accelerated value iteration algorithm. In International Conference on Machine Learning. PMLR, 2021.
- The phenomenon of policy churn. In Advances in Neural Information Processing Systems, 2022.