
Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic (2112.13530v2)

Published 27 Dec 2021 in cs.LG, math.OC, and stat.ML

Abstract: Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years. However, most of the existing theoretical support for AC algorithms focuses on the case of linear function approximations, or linearized neural networks, where the feature representation is fixed throughout training. Such a limitation fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC where the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. The critic is updated by temporal-difference (TD) learning with a larger stepsize while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one.
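
The two dynamics named in the title have standard generic forms, written here only for orientation and not as the paper's exact mean-field equations: the critic's parameter distribution $\nu_t$ evolves by a continuity (Wasserstein-flow) equation driven by a mean-field TD update direction $g$, while the policy follows replicator dynamics with the action value acting as fitness,

$$\partial_t \nu_t = -\nabla_\theta \cdot \big(\nu_t\, g(\theta;\nu_t)\big), \qquad \partial_t \pi_t(a\mid s) = \beta\,\pi_t(a\mid s)\Big(Q_t(s,a) - \sum_{a'}\pi_t(a'\mid s)\,Q_t(s,a')\Big),$$

where $g$ and the actor rate $\beta$ are placeholders introduced here for illustration.

For concreteness, the sketch below instantiates the two-timescale scheme described in the abstract on a toy random MDP: both actor and critic are width-m two-layer networks with mean-field 1/m scaling, the critic takes semi-gradient TD(0) steps with a larger stepsize, and the actor takes smaller policy-improvement steps. The toy MDP, the tanh activation, the width, and all stepsizes are illustrative assumptions; the actor step is a single-sample policy-gradient stand-in for the paper's PPO update (with one on-policy sample the PPO ratio is 1, so the clipped surrogate reduces to this step), and nothing below reproduces the paper's continuous-time, infinite-width analysis.

```python
# Minimal, self-contained sketch (not the paper's algorithm or analysis): a
# two-timescale actor-critic on a toy random MDP with two-layer networks.
import numpy as np

rng = np.random.default_rng(0)
S, A, d, m = 5, 3, 8, 256               # states, actions, feature dim, network width
gamma = 0.9
eta_critic, eta_actor = 1e-1, 1e-3      # two-timescale stepsizes: critic >> actor

phi = rng.normal(size=(S, d))                 # fixed state features (assumption)
P = rng.dirichlet(np.ones(S), size=(S, A))    # toy transition kernel P[s, a] over next states
R = rng.normal(size=(S, A))                   # toy reward table

def two_layer(x, W, a):
    """Overparameterized two-layer net with mean-field scaling: f(x) = (1/m) a^T tanh(W x)."""
    return a @ np.tanh(W @ x) / m

# Critic parameters (state-value function) and actor parameters (one logit head per action).
Wc, ac = rng.normal(size=(m, d)), rng.normal(size=m)
Wa, aa = rng.normal(size=(A, m, d)), rng.normal(size=(A, m))

def policy(s):
    """Softmax policy over the A two-layer logit heads."""
    logits = np.array([two_layer(phi[s], Wa[b], aa[b]) for b in range(A)])
    z = np.exp(logits - logits.max())
    return z / z.sum()

s = 0
for t in range(20000):
    pi = policy(s)
    a_t = rng.choice(A, p=pi)
    s_next = rng.choice(S, p=P[s, a_t])
    r = R[s, a_t]

    # --- Critic: semi-gradient TD(0) with the larger stepsize ---
    v, v_next = two_layer(phi[s], Wc, ac), two_layer(phi[s_next], Wc, ac)
    delta = r + gamma * v_next - v                       # TD error
    h = np.tanh(Wc @ phi[s])
    grad_ac = delta * h / m                              # d f / d a
    grad_Wc = delta * np.outer(ac * (1 - h**2), phi[s]) / m   # d f / d W
    ac += eta_critic * grad_ac
    Wc += eta_critic * grad_Wc

    # --- Actor: single-sample policy-gradient step with the smaller stepsize,
    # --- using the TD error as the advantage estimate ---
    for b in range(A):
        grad_logit = (1.0 if b == a_t else 0.0) - pi[b]  # d log pi(a_t|s) / d logit_b
        hb = np.tanh(Wa[b] @ phi[s])
        grad_ab = delta * grad_logit * hb / m
        grad_Wb = delta * grad_logit * np.outer(aa[b] * (1 - hb**2), phi[s]) / m
        aa[b] += eta_actor * grad_ab
        Wa[b] += eta_actor * grad_Wb

    s = s_next
```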
