Deflated Dynamics Value Iteration (2407.10454v1)

Published 15 Jul 2024 in cs.LG and math.OC

Abstract: The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and is the basis of many reinforcement learning (RL) algorithms as well. Since the error of VI converges at rate $O(\gamma^k)$ in the iteration count $k$, it is slow when the discount factor $\gamma$ is close to $1$. To accelerate the computation of the value function, we propose Deflated Dynamics Value Iteration (DDVI). DDVI uses matrix splitting and matrix deflation techniques to effectively remove (deflate) the top $s$ dominant eigen-structure of the transition matrix $\mathcal{P}^{\pi}$. We prove that this leads to a $\tilde{O}(\gamma^k |\lambda_{s+1}|^k)$ convergence rate, where $\lambda_{s+1}$ is the $(s+1)$-th largest eigenvalue of the dynamics matrix. We then extend DDVI to the RL setting and present the Deflated Dynamics Temporal Difference (DDTD) algorithm. We empirically show the effectiveness of the proposed algorithms.
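
To make the deflation idea concrete, here is a minimal NumPy sketch of a rank-$s$ deflated splitting for policy evaluation ($V = r + \gamma \mathcal{P}^{\pi} V$). It illustrates the mechanism described in the abstract, not the paper's exact DDVI update: the function name `ddvi_sketch`, the eigendecomposition-based construction of the rank-$s$ deflation matrix, and the convergence check are all illustrative choices, and the sketch assumes a diagonalizable transition matrix. Removing the top-$s$ eigen-structure leaves an iteration whose contraction factor is $\gamma|\lambda_{s+1}|$ rather than $\gamma$.

```python
import numpy as np

def ddvi_sketch(P, r, gamma=0.99, s=2, iters=2000, tol=1e-10):
    """Policy evaluation V = r + gamma * P @ V via a rank-s deflated splitting.

    P : (n, n) row-stochastic transition matrix of the evaluated policy.
    r : (n,) expected one-step rewards.
    Illustrative sketch only: assumes P is diagonalizable so its rank-s
    dominant eigen-part E_s can be read off P = U diag(lam) U^{-1}.
    """
    n = P.shape[0]
    lam, U = np.linalg.eig(P)
    U_inv = np.linalg.inv(U)
    idx = np.argsort(-np.abs(lam))[:s]              # top-s eigenvalues by modulus
    E_s = (U[:, idx] * lam[idx]) @ U_inv[idx, :]    # rank-s deflation matrix
    D = P - E_s                                     # deflated dynamics

    # Splitting: (I - gamma * E_s) V_{k+1} = r + gamma * D V_k.
    # The iteration matrix has spectral radius gamma * |lam_{s+1}|.
    A_inv = np.linalg.inv(np.eye(n) - gamma * E_s)  # computed once, reused
    V = np.zeros(n, dtype=complex)                  # complex: E_s may split a conjugate pair
    for _ in range(iters):
        V_new = A_inv @ (r + gamma * (D @ V))
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V.real                                   # the exact V^pi is real

# Quick check against the closed-form solution on a random MDP.
rng = np.random.default_rng(0)
P = rng.random((20, 20)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(20)
V_exact = np.linalg.solve(np.eye(20) - 0.99 * P, r)
assert np.allclose(ddvi_sketch(P, r, gamma=0.99, s=2), V_exact, atol=1e-6)
```

The final check solves $(I - \gamma P)V = r$ directly to confirm that the deflated iteration reaches the same fixed point; in practice one would obtain the top-$s$ eigen-pairs with an iterative eigensolver rather than a full eigendecomposition.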
