Provably Robust Temporal Difference Learning for Heavy-Tailed Rewards (2306.11455v1)
Abstract: In a broad class of reinforcement learning applications, stochastic rewards have heavy-tailed distributions, which lead to infinite second-order moments for stochastic (semi)gradients in policy evaluation and direct policy optimization. In such instances, existing RL methods may fail miserably due to frequent statistical outliers. In this work, we establish that temporal difference (TD) learning with a dynamic gradient clipping mechanism, and a natural actor-critic (NAC) algorithm built on it, can be provably robustified against heavy-tailed reward distributions. It is shown in the framework of linear function approximation that a favorable tradeoff between bias and variability of the stochastic gradients can be achieved with this dynamic gradient clipping mechanism. In particular, we prove that robust versions of TD learning achieve sample complexities of order $\mathcal{O}(\varepsilon^{-\frac{1}{p}})$ and $\mathcal{O}(\varepsilon^{-1-\frac{1}{p}})$ with and without the full-rank assumption on the feature matrix, respectively, under heavy-tailed rewards with finite moments of order $(1+p)$ for some $p\in(0,1]$, both in expectation and with high probability. We show that a robust variant of NAC based on Robust TD learning achieves $\tilde{\mathcal{O}}(\varepsilon^{-4-\frac{2}{p}})$ sample complexity. We corroborate our theoretical results with numerical experiments.
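To make the mechanism concrete, below is a minimal sketch (not the paper's exact algorithm) of TD(0) with linear function approximation in which the stochastic semi-gradient is truncated to a radius that grows over time. The sampling interface (`sample_transition`, `phi`) and the clipping schedule $\tau_t = c\, t^{1/(1+p)}$ are illustrative assumptions made for this example.

```python
import numpy as np

def robust_td_linear(sample_transition, phi, d, num_iters,
                     gamma=0.95, step_size=0.01, p=0.5, clip_c=1.0):
    """Sketch of TD(0) with linear value approximation and dynamic clipping.

    sample_transition() -> (s, r, s_next) is assumed to sample a transition
    from the Markov chain induced by the evaluated policy; phi(s) returns a
    feature vector of dimension d. The clipping radius tau_t = clip_c * t^{1/(1+p)}
    is an illustrative choice, not necessarily the schedule analyzed in the paper.
    """
    theta = np.zeros(d)
    for t in range(1, num_iters + 1):
        s, r, s_next = sample_transition()
        # TD error under the current linear value estimate.
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        g = delta * phi(s)  # stochastic semi-gradient direction
        # Dynamic clipping: shrink the update when its norm exceeds tau_t,
        # trading a small bias for bounded variability under heavy-tailed rewards.
        tau = clip_c * t ** (1.0 / (1.0 + p))
        g_norm = np.linalg.norm(g)
        if g_norm > tau:
            g *= tau / g_norm
        theta = theta + step_size * g
    return theta
```

The growing radius is the key design choice: a fixed small threshold would keep the variance bounded but incur a persistent bias, whereas letting the threshold expand with the iteration count lets the bias vanish while the truncation still suppresses the rare, outlier-sized updates caused by heavy-tailed rewards.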