Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory (2006.04761v2)

Published 8 Jun 2020 in cs.LG, math.OC, and stat.ML

Abstract: Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks. At the core of their empirical successes is the learned feature representation, which embeds rich observations, e.g., images and texts, into the latent space that encodes semantic structures. Meanwhile, the evolution of such a feature representation is crucial to the convergence of temporal-difference and Q-learning. In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise. We aim to answer the following questions: When the function approximator is a neural network, how does the associated feature representation evolve? If it converges, does it converge to the optimal one? We prove that, utilizing an overparameterized two-layer neural network, temporal-difference and Q-learning globally minimize the mean-squared projected Bellman error at a sublinear rate. Moreover, the associated feature representation converges to the optimal one, generalizing the previous analysis of Cai et al. (2019) in the neural tangent kernel regime, where the associated feature representation stabilizes at the initial one. The key to our analysis is a mean-field perspective, which connects the evolution of a finite-dimensional parameter to its limiting counterpart over an infinite-dimensional Wasserstein space. Our analysis generalizes to soft Q-learning, which is further connected to policy gradient.
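
For context, the mean-squared projected Bellman error referenced above is the standard objective MSPBE(θ) = ‖Π_F(T^π V_θ) − V_θ‖²_μ, where T^π is the Bellman operator, Π_F projects onto the function class, and μ is the state-visitation distribution. The sketch below is a minimal, illustrative semi-gradient TD(0) loop with an overparameterized two-layer ReLU network under the mean-field 1/m scaling on a toy Markov reward process. The toy chain, the width, and the step size are assumptions for illustration only (not the paper's setup); the point is that both layers are trained, so the hidden-layer feature representation itself evolves during learning.

```python
# Minimal sketch (assumptions, not the paper's implementation): semi-gradient
# TD(0) with an overparameterized two-layer ReLU network, mean-field 1/m scaling.
import numpy as np

rng = np.random.default_rng(0)

n_states, width, gamma, lr = 10, 512, 0.9, 1e-2
P = rng.dirichlet(np.ones(n_states), size=n_states)  # random transition matrix (rows sum to 1)
r = rng.normal(size=n_states)                        # per-state rewards
X = np.eye(n_states)                                 # one-hot state encoding

# Two-layer network V(s) = (1/width) * a^T relu(W x_s); note the 1/width (mean-field) scaling
W = rng.normal(size=(width, n_states))
a = rng.normal(size=width)

def features(x, W):
    """Hidden-layer feature representation phi(s); this is what evolves during TD."""
    return np.maximum(W @ x, 0.0)

def value(x, W, a):
    return a @ features(x, W) / width

s = 0
for t in range(20_000):
    s_next = rng.choice(n_states, p=P[s])
    x, x_next = X[s], X[s_next]
    # Semi-gradient TD error: the bootstrapped target is treated as a constant
    delta = r[s] + gamma * value(x_next, W, a) - value(x, W, a)
    phi = features(x, W)
    # Gradients of V(s) w.r.t. a and W under the 1/width scaling
    a += lr * delta * phi / width
    W += lr * delta * np.outer(a * (phi > 0), x) / width
    s = s_next
```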

References (78)
  1. Temporal-difference learning for nonlinear value function approximation in the lazy training regime. arXiv preprint arXiv:1905.10917.
  2. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.
  3. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.
  4. A user’s guide to optimal transport. In Modelling and Optimisation of Flows on Networks. Springer, 1–155.
  5. Gradient flows: In metric spaces and in the space of probability measures. Springer.
  6. A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193.
  7. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.
  8. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584.
  9. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619.
  10. Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning.
  11. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39 930–945.
  12. Bengio, Y. (2012). Deep learning of representations for unsupervised and transfer learning. In ICML Workshop on Unsupervised and Transfer Learning.
  13. Bertsekas, D. P. (2019). Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6 1–31.
  14. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450.
  15. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems.
  16. Borkar, V. S. (2009). Stochastic approximation: A dynamical systems viewpoint. Springer.
  17. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38 447–469.
  18. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems.
  19. Geometric insights into the convergence of nonlinear TD learning. arXiv preprint arXiv:1905.12185.
  20. On the expected dynamics of nonlinear TD learning. arXiv preprint arXiv:1905.12185.
  21. Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems.
  22. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210.
  23. Mean-field analysis of two-layer neural networks: Non-asymptotic rates and generalization bounds. arXiv preprint arXiv:2002.04026.
  24. How much over-parameterization is sufficient to learn deep ReLU networks? arXiv preprint arXiv:1911.12360.
  25. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425.
  26. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.
  27. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems.
  28. Conway, J. B. (2019). A course in functional analysis, vol. 96. Springer.
  29. Finite sample analyses for TD(0) with function approximation. In AAAI Conference on Artificial Intelligence.
  30. Daniely, A. (2017). SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems.
  31. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15 809–883.
  32. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.
  33. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
  34. Over parameterized two-level neural networks can learn near optimal feature representations. arXiv preprint arXiv:1910.11508.
  35. Convex formulation of overparameterized deep neural networks. arXiv preprint arXiv:1911.07626.
  36. Algorithmic survey of parametric value function approximation. IEEE Transactions on Neural Networks and Learning Systems, 24 845–867.
  37. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning.
  38. Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Mathematical Programming, 48 161–220.
  39. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636.
  40. Hinton, G. (1986). Learning distributed representations of concepts. In Annual Conference of the Cognitive Science Society.
  41. Holte, J. M. (2009). Discrete Gronwall lemma and applications. In MAA-NCS meeting at the University of North Dakota, vol. 24.
  42. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems.
  43. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems.
  44. Analysis of a two-layer neural network via displacement convexity. arXiv preprint arXiv:1901.01375.
  45. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv preprint arXiv:1909.12292.
  46. Actor-critic algorithms. In Advances in Neural Information Processing Systems.
  47. Stochastic approximation and recursive algorithms and applications. Springer.
  48. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics.
  49. Deep learning. Nature, 521 436–444.
  50. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720.
  51. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17 1334–1373.
  52. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems.
  53. Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit. arXiv preprint arXiv:1902.06015.
  54. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115 E7665–E7671.
  55. An analysis of reinforcement learning with function approximation. In International Conference on Machine Learning.
  56. Human-level control through deep reinforcement learning. Nature, 518 529–533.
  57. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems.
  58. Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626.
  59. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173 361–400.
  60. Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8 143–195.
  61. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.
  62. Asymptotics of reinforcement learning with neural networks. arXiv preprint arXiv:1911.07304.
  63. Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923.
  64. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3 9–44.
  65. Reinforcement learning: An introduction. MIT press.
  66. Sznitman, A.-S. (1991). Topics in propagation of chaos. In Ecole d’Été de Probabilités de Saint-Flour XIX—1989. Springer, 165–251.
  67. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems.
  68. Villani, C. (2003). Topics in optimal transportation. American Mathematical Society.
  69. Villani, C. (2008). Optimal transport: Old and new. Springer.
  70. Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press.
  71. Q-learning. Machine Learning, 8 279–292.
  72. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems.
  73. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8 229–256.
  74. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Advances in Neural Information Processing Systems.
  75. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems.
  76. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888.
  77. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems.
  78. Finite-sample analysis for SARSA and Q-learning with linear function approximation. arXiv preprint arXiv:1902.02234.