Normality-Guided Distributional Reinforcement Learning for Continuous Control (2208.13125v3)
Abstract: Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) has been shown to improve performance by modeling the value distribution, not just the mean. We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal. We design a method that exploits this property, employing variances predicted from a variance network, along with returns, to analytically compute target quantile bars representing the normal distribution for our distributional value function. In addition, we propose a policy update strategy based on correctness as measured by structural characteristics of the value distribution that are not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds. Our method yields statistically significant improvements in 10 out of 16 continuous task settings, while using fewer weights and less training time than an ensemble-based method for quantifying value distribution uncertainty.
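A minimal sketch of the idea described above, assuming the value distribution is modeled as normal with a predicted mean return and a variance from a variance network; the function name, the quantile-midpoint scheme, and the batch shapes are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.stats import norm

def normal_quantile_targets(mean_return, std, n_quantiles=32):
    """Analytically compute target quantile values of N(mean_return, std^2).

    mean_return, std: arrays of shape (batch,), assumed to come from the
    return estimate and a variance network (hypothetical names).
    Returns an array of shape (batch, n_quantiles).
    """
    # Quantile midpoints tau_i = (i + 0.5) / N, a common choice in
    # quantile-based DRL.
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles
    # Inverse CDF of the standard normal at each midpoint, shifted and
    # scaled by the predicted mean and standard deviation.
    return mean_return[:, None] + std[:, None] * norm.ppf(taus)[None, :]

# Example usage: targets for a batch of two states.
targets = normal_quantile_targets(np.array([1.0, -0.5]), np.array([0.5, 2.0]))
print(targets.shape)  # (2, 32)
```

These targets could then serve as regression targets for the quantile outputs of a distributional value head; the choice of midpoints and the number of quantiles are tunable assumptions here.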