Layer-wise Linear Mode Connectivity (2307.06966v3)
Abstract: Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models. It is most prominently used in federated learning. If models are averaged at the end of training, this can only yield a well-performing model if the loss surface of interest is very particular, i.e., the loss at the midpoint between the two models must be sufficiently low. This is impossible to guarantee for the non-convex losses of state-of-the-art networks. For averaging models trained on vastly different datasets, it has been proposed to average only the parameters of particular layers or combinations of layers, which results in better-performing models. To better understand the effect of layer-wise averaging, we analyse the performance of the models obtained by averaging single layers or groups of layers. Based on our empirical and theoretical investigation, we introduce a novel notion of layer-wise linear connectivity and show that deep networks do not have layer-wise barriers between them.
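To make the layer-wise averaging studied in the abstract concrete, the following is a minimal sketch, assuming two PyTorch models with identical architectures. The helper name `average_layers`, the `layer_names` selection, and the `alpha` parameter are illustrative conventions, not the paper's exact protocol.

```python
# Minimal sketch: average only the parameters of selected layers between two
# models with identical architectures, keeping all other parameters from model_a.
import copy
import torch

def average_layers(model_a, model_b, layer_names, alpha=0.5):
    """Return a copy of model_a in which the parameters of the selected layers
    are replaced by the convex combination (1 - alpha) * theta_a + alpha * theta_b."""
    fused = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    fused_state = fused.state_dict()
    for name, tensor_a in state_a.items():
        # Interpolate only floating-point tensors belonging to the selected layers;
        # integer buffers and all non-selected layers are left as in model_a.
        if torch.is_floating_point(tensor_a) and any(
            name.startswith(layer) for layer in layer_names
        ):
            fused_state[name] = (1 - alpha) * tensor_a + alpha * state_b[name]
    fused.load_state_dict(fused_state)
    return fused
```

Under this sketch, a layer-wise barrier for a given layer (or group of layers) can be estimated by evaluating the loss of `average_layers(model_a, model_b, [layer], alpha=0.5)` and comparing it against the losses of the two endpoint models; a negligible gap corresponds to the absence of a layer-wise barrier.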