Compressing Transformer-based self-supervised models for speech processing (2211.09949v2)
Abstract: Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices. Several isolated attempts have been made to compress Transformers, but the settings and metrics differ across studies. Trade-offs at various compression rates are also largely missing in prior work, making it difficult to compare compression techniques. In this work, we aim to provide context for the isolated results, studying several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation. We report trade-offs at various compression rates, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations. Our results show that, compared to recent approaches, basic compression techniques are strong baselines. We further present several applications of our results, revealing properties of Transformers, such as the significance of diagonal attention heads. In addition, our results lead to a simple combination of compression techniques that improves the trade-off over recent approaches. We hope the results will promote more diverse comparisons among model compression techniques and encourage the use of model compression as a tool for analyzing models. Our code for compressing speech self-supervised models is available at https://github.com/nervjack2/Speech-SSL-Compression/.
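To make the reported metrics concrete, the following is a minimal sketch of how the number of parameters and multiply-accumulate operations (MACs) change under two of the techniques named in the abstract, low-rank approximation and weight pruning. It is an illustration only and is not taken from the authors' repository: the function names, the layer sizes (768 and 3072), and the rank (128) are assumptions chosen for the example.

```python
# Illustrative sketch only; names and sizes are assumptions, not the paper's code.

def dense_cost(d_in, d_out):
    """Parameters and per-frame MACs of a bias-free linear layer y = W x."""
    params = d_in * d_out
    macs = d_in * d_out  # one multiply-accumulate per weight
    return params, macs


def low_rank_cost(d_in, d_out, rank):
    """Cost after replacing W (d_out x d_in) with U (d_out x rank) @ V (rank x d_in)."""
    params = rank * (d_in + d_out)
    macs = rank * (d_in + d_out)  # apply V first, then U
    return params, macs


def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero the `sparsity` fraction of entries
    with the smallest absolute value. `weights` is a list of equal-length rows."""
    n_cols = len(weights[0])
    flat = [w for row in weights for w in row]
    k = int(len(flat) * sparsity)  # number of entries to zero
    order = sorted(range(len(flat)), key=lambda i: abs(flat[i]))
    for i in order[:k]:
        flat[i] = 0.0
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(len(weights))]


if __name__ == "__main__":
    # Assumed sizes for one feed-forward projection of a base-size Transformer.
    print("dense    (params, MACs):", dense_cost(768, 3072))
    print("low-rank (params, MACs):", low_rank_cost(768, 3072, rank=128))
    print("pruned:", magnitude_prune([[0.5, -0.1], [0.05, -2.0]], sparsity=0.5))
```

Under these assumed sizes, the factorized layer needs 491,520 parameters and MACs instead of 2,359,296. Unstructured pruning, by contrast, lowers the count of nonzero weights but generally reduces dense MACs or wall-clock time only when sparse kernels are used, which is one reason reporting several metrics side by side is informative.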