A Case for Validation Buffer in Pessimistic Actor-Critic (2403.01014v1)
Abstract: In this paper, we investigate the issue of error accumulation in critic networks updated via pessimistic temporal difference objectives. We show that the critic approximation error can be approximated via a recursive fixed-point model similar to that of the Bellman value. We use this recursive definition to derive the conditions under which the pessimistic critic is unbiased. Building on these insights, we propose the Validation Pessimism Learning (VPL) algorithm. VPL uses a small validation buffer to adjust the level of pessimism throughout agent training, setting the pessimism so that the approximation error of the critic targets is minimized. We evaluate the proposed approach on a variety of locomotion and manipulation tasks and report improvements in sample efficiency and performance.
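The sketch below illustrates the core idea described in the abstract: a pessimistic bootstrap target built from an ensemble of critics, with the pessimism coefficient adjusted using errors measured on a small held-out validation buffer. This is a minimal sketch under assumptions; the function names (`pessimistic_target`, `update_beta`), the lower-confidence-bound form of the target, and the simple gradient-free update rule are illustrative and are not taken from the paper's exact VPL algorithm.

```python
import numpy as np

def pessimistic_target(next_q_values, reward, done, beta, gamma=0.99):
    """Lower-confidence-bound bootstrap target from an ensemble of Q-estimates.

    next_q_values: array of shape (n_critics,) with next-state Q estimates.
    beta: scalar pessimism coefficient (larger => more pessimistic target).
    """
    q_mean, q_std = next_q_values.mean(), next_q_values.std()
    return reward + gamma * (1.0 - done) * (q_mean - beta * q_std)

def update_beta(beta, validation_errors, lr=1e-2):
    """Nudge beta so that the mean target error on a held-out validation
    buffer moves toward zero: positive mean error (overestimation) raises
    pessimism, negative mean error (underestimation) lowers it."""
    mean_error = np.mean(validation_errors)
    return max(0.0, beta + lr * mean_error)

# Toy usage with a fictitious validation batch of
# (predicted target - observed return) errors.
beta = 1.0
validation_errors = np.array([0.3, -0.1, 0.2, 0.4])  # critic overestimates on average
beta = update_beta(beta, validation_errors)
target = pessimistic_target(np.array([1.2, 0.9]), reward=0.5, done=0.0, beta=beta)
print(f"beta={beta:.3f}, target={target:.3f}")
```

The design choice mirrored here is that the validation buffer is used only to measure bias in the critic targets, not to train the critic itself, so the pessimism level can be tuned online without consuming additional environment interactions.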