Unnatural Algorithms in Machine Learning (2312.04739v1)
Abstract: Natural gradient descent has the remarkable property that, in the small learning rate limit, it is invariant with respect to network reparameterizations, leading to robust training behavior even for highly covariant network parameterizations. We show that optimization algorithms with this property can be viewed as discrete approximations of natural transformations from the functor determining an optimizer's state space from the diffeomorphism group of its configuration manifold, to the functor determining that state space's tangent bundle from this group. Algorithms with this property are more efficient when used to train poorly parameterized networks, as the network evolution they generate is approximately invariant to network reparameterizations. More specifically, the flow generated by these algorithms in the limit of vanishing learning rate is invariant under smooth reparameterizations, with the respective parameter flows related by equivariant maps. Casting this property as a natural transformation allows for generalizations beyond equivariance with respect to group actions: the framework accommodates non-invertible maps such as projections, enabling direct comparison of training behavior across non-isomorphic network architectures, as well as formal examination of limiting behavior as network size increases via inverse limits of these projections, when they exist. We describe a simple method for imposing this naturality more generally and examine a number of popular machine learning training algorithms, finding that most are unnatural.
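The following is a minimal numerical sketch (not taken from the paper) of the invariance the abstract describes: a natural-gradient step computed in one coordinate system agrees, up to O(η²), with the same step computed after a smooth reparameterization, while a plain gradient step disagrees at O(η). The quadratic loss, the hand-picked metric standing in for the Fisher matrix, and the map `phi` are illustrative assumptions, not constructions from the paper.

```python
# Sketch: natural gradient steps are approximately reparameterization-invariant
# in the small-step limit, whereas plain gradient steps are not.
import numpy as np

def loss_grad_theta(theta):
    # gradient of an illustrative loss L(theta) = 0.5*(theta[0]**2 + 10*theta[1]**2)
    return np.array([theta[0], 10.0 * theta[1]])

def metric_theta(theta):
    # a position-dependent metric standing in for the Fisher information matrix
    return np.diag([1.0 + theta[0]**2, 2.0 + theta[1]**2])

def phi(theta):
    # a smooth change of coordinates psi = phi(theta)
    return np.array([np.exp(theta[0]), theta[1] + theta[0]**3])

def jacobian_phi(theta):
    return np.array([[np.exp(theta[0]), 0.0],
                     [3.0 * theta[0]**2, 1.0]])

eta = 1e-3
theta = np.array([0.7, -0.4])
psi = phi(theta)
J = jacobian_phi(theta)

# pull the gradient and metric over to psi-coordinates
grad_theta = loss_grad_theta(theta)
grad_psi = np.linalg.solve(J.T, grad_theta)                # gradient transforms by J^{-T}
G_theta = metric_theta(theta)
G_psi = np.linalg.solve(J.T, G_theta) @ np.linalg.inv(J)   # G_psi = J^{-T} G_theta J^{-1}

# one step of each optimizer in each coordinate system
ngd_theta = theta - eta * np.linalg.solve(G_theta, grad_theta)
ngd_psi   = psi   - eta * np.linalg.solve(G_psi, grad_psi)
gd_theta  = theta - eta * grad_theta
gd_psi    = psi   - eta * grad_psi

# compare the psi-coordinate iterates with the mapped-forward theta iterates
print("NGD mismatch:", np.linalg.norm(phi(ngd_theta) - ngd_psi))  # ~O(eta^2)
print("GD  mismatch:", np.linalg.norm(phi(gd_theta) - gd_psi))    # ~O(eta)
```

In this sketch the natural-gradient mismatch vanishes quadratically as eta shrinks, reflecting the invariance of the underlying flow, while the plain-gradient mismatch shrinks only linearly.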