
The Expressive Power of Tuning Only the Normalization Layers

Published 15 Feb 2023 in cs.LG, cs.AI, and stat.ML | (2302.07937v2)

Abstract: Feature normalization transforms such as Batch and Layer-Normalization have become indispensable ingredients of state-of-the-art deep neural networks. Recent studies on fine-tuning large pretrained models indicate that tuning only the parameters of these affine transforms can achieve high accuracy on downstream tasks. These findings raise the question of how much expressive power is gained by tuning the normalization layers of frozen networks. In this work, we take a first step towards answering this question and show that, for random ReLU networks, fine-tuning only the normalization layers can reconstruct any target network that is $O(\sqrt{\text{width}})$ times smaller. We show that this holds even for randomly sparsified networks, under sufficient overparameterization, in agreement with prior empirical work.
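
To make the setting in the abstract concrete, the following is a minimal sketch of a random ReLU network whose weight matrices stay frozen while only the per-feature scale and shift of each normalization layer are treated as trainable. It is an illustration, not the paper's construction: the layer widths, the Gaussian initialization, the use of Layer-Normalization, and every function name (`make_frozen_network`, `make_norm_params`, `forward`, `count_params`) are assumptions chosen for this example.

```python
import math
import random


def make_frozen_network(widths, seed=0):
    """Random weight matrices (assumed Gaussian init); never updated afterwards."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]


def make_norm_params(widths):
    """Trainable per-feature scale (gamma) and shift (beta) for each layer."""
    return [{"gamma": [1.0] * n, "beta": [0.0] * n} for n in widths[1:]]


def layer_norm(h, gamma, beta, eps=1e-5):
    """Standardize h across features, then apply the affine transform."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(h, gamma, beta)]


def forward(weights, norm_params, x):
    """Frozen linear map -> normalization with tunable affine -> ReLU, per layer."""
    h = list(x)
    for W, p in zip(weights, norm_params):
        pre = [sum(w * v for w, v in zip(row, h)) for row in W]
        h = [max(0.0, v) for v in layer_norm(pre, p["gamma"], p["beta"])]
    return h


def count_params(weights, norm_params):
    """Return (frozen weight count, trainable normalization parameter count)."""
    frozen = sum(len(row) for W in weights for row in W)
    trainable = sum(len(p["gamma"]) + len(p["beta"]) for p in norm_params)
    return frozen, trainable


# Hypothetical widths, chosen only for illustration.
widths = [8, 64, 64, 4]
weights = make_frozen_network(widths)
norm_params = make_norm_params(widths)
frozen, trainable = count_params(weights, norm_params)
print(f"frozen weights: {frozen}, trainable normalization parameters: {trainable}")
```

In this sketch the tunable parameters grow linearly with layer width while the frozen weights grow quadratically, which is why the question of how much such a small parameter set can express is non-trivial.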
