Improving Convergence and Generalization Using Parameter Symmetries (2305.13404v3)
Abstract: In many neural networks, different values of the parameters may result in the same loss value. Parameter space symmetries are loss-invariant transformations that change the model parameters. Teleportation applies such transformations to accelerate optimization. However, the exact mechanism behind this algorithm's success is not well understood. In this paper, we show that teleportation not only speeds up optimization in the short term but also gives an overall faster time to convergence. Additionally, teleporting to minima with different curvatures improves generalization, which suggests a connection between the curvature of the minimum and generalization ability. Finally, we show that integrating teleportation into a wide range of optimization algorithms and optimization-based meta-learning improves convergence. Our results showcase the versatility of teleportation and demonstrate the potential of incorporating symmetry in optimization.
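To make the idea of a loss-invariant teleportation concrete, the following is a minimal sketch (not the authors' implementation) on a two-layer linear network with an MSE loss. The toy dimensions, the diagonal rescaling used for the group element g, and the helper names `loss`, `grads`, and `teleport` are all illustrative assumptions; the point is only that the transformation leaves the loss unchanged while altering the gradients, and hence the subsequent optimization trajectory.

```python
# Sketch of parameter-space symmetry "teleportation" on a two-layer
# linear model y_hat = W2 @ W1 @ x with MSE loss. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: columns are samples.
X = rng.normal(size=(5, 20))     # input_dim x num_samples
Y = rng.normal(size=(3, 20))     # output_dim x num_samples
W1 = rng.normal(size=(4, 5))     # hidden_dim x input_dim
W2 = rng.normal(size=(3, 4))     # output_dim x hidden_dim

def loss(W1, W2):
    R = W2 @ W1 @ X - Y
    return 0.5 * np.mean(R ** 2)

def grads(W1, W2):
    R = W2 @ W1 @ X - Y
    G = R / R.size               # d loss / d (W2 @ W1 @ X)
    return W2.T @ G @ X.T, G @ (W1 @ X).T

def teleport(W1, W2, scale=2.0):
    """Apply a GL(hidden_dim) symmetry: W1 -> g W1, W2 -> W2 g^{-1}.

    The product W2 @ W1 (and hence the loss) is unchanged, but the
    gradients are not. Here g is a fixed diagonal rescaling; in the
    paper, the group element is instead chosen by optimizing a
    quantity such as the gradient norm or the curvature at the
    teleported point.
    """
    g = np.diag(np.full(W1.shape[0], scale))
    return g @ W1, W2 @ np.linalg.inv(g)

W1_t, W2_t = teleport(W1, W2)
print("loss before/after teleport:", loss(W1, W2), loss(W1_t, W2_t))
g1, g2 = grads(W1, W2)
t1, t2 = grads(W1_t, W2_t)
print("grad norm before:", np.sqrt((g1 ** 2).sum() + (g2 ** 2).sum()))
print("grad norm after: ", np.sqrt((t1 ** 2).sum() + (t2 ** 2).sum()))
```

Running the sketch prints identical losses before and after the transformation but different gradient norms, which is the property teleportation exploits: a standard optimizer can be restarted from the teleported parameters without changing the objective value.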