Differentiable Learning of Generalized Structured Matrices for Efficient Deep Neural Networks (2310.18882v2)

Published 29 Oct 2023 in cs.LG, cs.AI, cs.CV, eess.IV, and eess.SP

Abstract: This paper investigates efficient deep neural networks (DNNs) in which dense unstructured weight matrices are replaced with structured ones that possess desired properties. The challenge is that the optimal weight-matrix structure in popular neural network models is unclear in most cases and may vary from layer to layer even within the same network. Prior structured matrices proposed for efficient DNNs were mostly hand-crafted, without a generalized framework for learning them systematically. To address this issue, we propose a generalized and differentiable framework that learns efficient structures of weight matrices by gradient descent. We first define a new class of structured matrices that covers a wide range of structured matrices in the literature by adjusting its structural parameters. A frequency-domain differentiable parameterization scheme based on the Gaussian-Dirichlet kernel is then adopted to learn the structural parameters by proximal gradient descent. On image and language tasks, our method learns efficient DNNs with structured matrices, achieving lower complexity and/or higher performance than prior approaches that employ low-rank, block-sparse, or block-low-rank matrices.
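
As a rough illustration of the general recipe only (not the paper's actual Gaussian-Dirichlet frequency-domain parameterization), the hypothetical PyTorch sketch below relaxes one structural parameter per weight-matrix row, a soft band half-width, into a differentiable mask, trains it jointly with the weights by gradient descent, and applies a proximal shrinkage step that pushes the learned structure toward lower complexity. The class and method names (SoftStructuredLinear, prox_step) and the banded-mask choice are illustrative assumptions, not taken from the paper.

# Hypothetical sketch, not the paper's parameterization: a linear layer whose weight
# matrix is gated by a differentiable structural mask. Each output row i has a
# learnable half-width; entries far from the row's "center" column are smoothly
# suppressed, so the structure (here, a bandwidth) is learned by gradient descent,
# with a proximal shrinkage step applied to the structural parameters after each
# optimizer update.
import torch
import torch.nn as nn


class SoftStructuredLinear(nn.Module):
    def __init__(self, in_features, out_features, temperature=8.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # One structural parameter per output row: the soft band half-width.
        self.width = nn.Parameter(torch.full((out_features, 1), float(in_features) / 4))
        self.temperature = temperature
        cols = torch.arange(in_features).float()
        rows = torch.arange(out_features).float() * in_features / out_features
        # Distance of each entry (i, j) from row i's center column.
        self.register_buffer("dist", (cols[None, :] - rows[:, None]).abs())

    def mask(self):
        # Smooth sigmoid gate: close to 1 inside the learned band, decays to 0 outside.
        return torch.sigmoid(self.temperature * (self.width - self.dist)
                             / (self.width.abs() + 1e-6))

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask(), self.bias)

    @torch.no_grad()
    def prox_step(self, lam):
        # Proximal shrinkage on the structural parameters: penalizes wide bands,
        # i.e. high-complexity structures, while keeping a minimum width of 1.
        self.width.sub_(lam).clamp_(min=1.0)


# Usage: alternate ordinary gradient steps on (weight, width) with the proximal step.
layer = SoftStructuredLinear(256, 256)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3)
x, target = torch.randn(32, 256), torch.randn(32, 256)
for _ in range(10):
    loss = ((layer(x) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    layer.prox_step(lam=0.05)  # proximal gradient descent on the structure

The design point this sketch tries to convey is that discrete structural choices (which entries of the weight matrix to keep) are replaced by smooth, learnable parameters, so the structure itself receives gradients and can differ from layer to layer.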
