Warm-Start Actor-Critic: From Approximation Error to Sub-optimality Gap (2306.11271v1)
Abstract: Warm-Start reinforcement learning (RL), aided by a prior policy obtained from offline training, is emerging as a promising approach for practical applications. Recent empirical studies have shown that the performance of Warm-Start RL improves quickly in some cases but becomes stagnant in others, especially when function approximation is used. The primary objective of this work is therefore to build a fundamental understanding of whether and when online learning can be significantly accelerated by a warm-start policy from offline RL. Specifically, we consider the widely used Actor-Critic (A-C) method with a prior policy. We first quantify the approximation errors in the Actor update and the Critic update, respectively. Next, we cast the Warm-Start A-C algorithm as Newton's method with perturbation and study the impact of the approximation errors on the finite-time learning performance under inaccurate Actor/Critic updates. Under some general technical conditions, we derive upper bounds that shed light on achieving the desired finite-time learning performance of the Warm-Start A-C algorithm. In particular, our findings reveal that it is essential to reduce the algorithm bias in online learning. We also obtain lower bounds on the sub-optimality gap of the Warm-Start A-C algorithm to quantify the impact of bias and error propagation.
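To make the warm-start setup concrete, the following is a minimal illustrative sketch (not the paper's exact algorithm or analysis): a softmax actor is initialized from a hypothetical offline-trained policy (here a stand-in `theta_offline`), and then refined online with a tabular TD(0) critic. The random MDP, the step sizes, and all other hyperparameters are assumptions made purely for illustration.

```python
# Illustrative warm-start actor-critic on a tiny random MDP (assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP: transition kernel P[s, a] over next states and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def softmax_policy(theta, s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

# Warm start: actor parameters copied from an (assumed) offline-trained policy.
theta_offline = rng.normal(size=(n_states, n_actions))  # stand-in for offline RL output
theta = theta_offline.copy()
Q = np.zeros((n_states, n_actions))  # tabular critic

alpha_critic, alpha_actor, n_steps = 0.1, 0.05, 5000
s = 0
for _ in range(n_steps):
    pi_s = softmax_policy(theta, s)
    a = rng.choice(n_actions, p=pi_s)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Critic update: TD(0) toward the expected value of the next state under the
    # current policy; the critic is inexact in general, which is one source of the
    # approximation error studied in the paper.
    pi_next = softmax_policy(theta, s_next)
    td_target = r + gamma * pi_next @ Q[s_next]
    Q[s, a] += alpha_critic * (td_target - Q[s, a])

    # Actor update: one-sample policy-gradient step using the approximate critic,
    # so the actor step inherits the critic's bias.
    grad_log = -pi_s
    grad_log[a] += 1.0
    theta[s] += alpha_actor * Q[s, a] * grad_log

    s = s_next
```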