
Proving Linear Mode Connectivity of Neural Networks via Optimal Transport (2310.19103v2)

Published 29 Oct 2023 in cs.LG

Abstract: The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have shown experimentally that two different solutions found after two runs of stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework that theoretically explains this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. We also derive upper and lower bounds on the width each layer of two deep neural networks with independent neuron weights must have for the networks to be linearly connected. Finally, we empirically validate our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates, is correlated with linear mode connectivity.
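The abstract's central quantity can be made concrete. Two parameter vectors θ_A and θ_B are linearly mode connected when the loss barrier max_{λ∈[0,1]} L((1−λ)θ_A + λθ_B) − [(1−λ)L(θ_A) + λL(θ_B)] is close to zero, possibly after permuting hidden neurons. The sketch below is a minimal, hypothetical illustration (not the paper's code): it measures this barrier for two two-layer ReLU networks, aligning hidden units with the Hungarian algorithm before interpolating. The alignment heuristic (matching incoming weight vectors) and the random data are assumptions made for the sake of a runnable example.

```python
# Hypothetical sketch: measure the linear-interpolation loss barrier between two
# two-layer ReLU networks, before and after permuting the second network's
# hidden neurons to best match the first (Hungarian algorithm alignment).

import numpy as np
from scipy.optimize import linear_sum_assignment

def mse_loss(W1, b1, W2, X, y):
    """Loss of the two-layer ReLU network y_hat = W2 @ relu(W1 @ X + b1)."""
    h = np.maximum(W1 @ X + b1[:, None], 0.0)
    return float(np.mean((W2 @ h - y) ** 2))

def align_hidden_neurons(A, B):
    """Permute B's hidden units to maximize similarity with A's.

    A, B: dicts with 'W1' (h x d), 'b1' (h,), 'W2' (o x h)."""
    sim = A['W1'] @ B['W1'].T            # similarity of incoming weight vectors
    _, col = linear_sum_assignment(-sim)  # negate: maximize total similarity
    return {'W1': B['W1'][col], 'b1': B['b1'][col], 'W2': B['W2'][:, col]}

def barrier(A, B, X, y, n_points=11):
    """Max loss excess along the linear path over the interpolated endpoint losses."""
    loss_a = mse_loss(A['W1'], A['b1'], A['W2'], X, y)
    loss_b = mse_loss(B['W1'], B['b1'], B['W2'], X, y)
    gaps = []
    for lam in np.linspace(0.0, 1.0, n_points):
        W1 = (1 - lam) * A['W1'] + lam * B['W1']
        b1 = (1 - lam) * A['b1'] + lam * B['b1']
        W2 = (1 - lam) * A['W2'] + lam * B['W2']
        gaps.append(mse_loss(W1, b1, W2, X, y)
                    - ((1 - lam) * loss_a + lam * loss_b))
    return max(gaps)

# Demo on random (untrained) networks, just to exercise the procedure;
# in practice A and B would be two independently SGD-trained networks.
rng = np.random.default_rng(0)
d, h, o, n = 5, 256, 1, 200
X, y = rng.normal(size=(d, n)), rng.normal(size=(o, n))
make = lambda: {'W1': rng.normal(size=(h, d)), 'b1': rng.normal(size=h),
                'W2': rng.normal(size=(o, h)) / h}
A, B = make(), make()
print('barrier without alignment:', barrier(A, B, X, y))
print('barrier with alignment:   ', barrier(A, align_hidden_neurons(A, B), X, y))
```

Run on two independently trained networks, the paper's result predicts the post-alignment barrier shrinks as the hidden width h grows, at a rate governed by the Wasserstein convergence of the empirical neuron-weight measures.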
