Transferring Learning Trajectories of Neural Networks (2305.14122v2)

Published 23 May 2023 in cs.LG, cs.AI, and stat.ML

Abstract: Training deep neural networks (DNNs) is computationally expensive, which is especially problematic when performing repeated or similar training runs, for example in model ensembling or when fine-tuning pre-trained models. Once we have trained one DNN on some dataset, we have its learning trajectory (i.e., the sequence of intermediate parameters produced during training), which may contain useful information for learning that dataset. However, there has been no attempt to utilize the information in a given learning trajectory for another training run. In this paper, we formulate the problem of "transferring" a given learning trajectory from one initial parameter to another (the learning transfer problem) and derive the first algorithm to approximately solve it by matching gradients successively along the trajectory via permutation symmetry. We empirically show that the transferred parameters achieve non-trivial accuracy before any direct training and can be trained significantly faster than training from scratch.

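The core idea stated in the abstract is to replay the source trajectory's parameter updates on a new initialization, choosing at each step a hidden-unit permutation that aligns the source gradient with the gradient observed at the target parameters. Below is a minimal, single-layer sketch of that idea, not the paper's released implementation: the function names (`match_permutation`, `transfer_step`), the use of SciPy's Hungarian solver for the assignment, and the inner-product alignment objective are illustrative assumptions.

```python
# Minimal sketch of one learning-transfer step for a single weight matrix of
# shape (hidden, input). Illustrative only; not the authors' code.
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian method

def match_permutation(grad_src, grad_tgt):
    """Find a hidden-unit permutation P maximizing <P @ grad_src, grad_tgt>."""
    sim = grad_tgt @ grad_src.T                      # (hidden, hidden) row similarities
    row_ind, col_ind = linear_sum_assignment(-sim)   # negate to maximize similarity
    perm = np.zeros_like(sim)
    perm[row_ind, col_ind] = 1.0                     # permutation matrix
    return perm

def transfer_step(theta_tgt, delta_src, grad_src, grad_tgt):
    """Apply the source update delta_src to the target parameters after
    permuting its hidden units to align the source and target gradients."""
    perm = match_permutation(grad_src, grad_tgt)
    return theta_tgt + perm @ delta_src

# Toy usage with random stand-ins for a (hidden x input) layer.
rng = np.random.default_rng(0)
hidden, inp = 8, 4
theta_tgt = rng.normal(size=(hidden, inp))           # target parameters at step t
delta_src = 0.1 * rng.normal(size=(hidden, inp))     # source update: theta_src[t+1] - theta_src[t]
grad_src = rng.normal(size=(hidden, inp))            # gradient on the source trajectory
grad_tgt = rng.normal(size=(hidden, inp))            # gradient at the target parameters
theta_tgt_next = transfer_step(theta_tgt, delta_src, grad_src, grad_tgt)
```

In a full network, the permutation of one layer's hidden units also permutes the columns of the next layer's weights, so the permutations of adjacent layers are coupled and must be chosen consistently; the single-layer sketch above ignores that coupling.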