SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning (2402.16830v1)
Abstract: Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To enhance its efficiency, previous works often leverage compression techniques. A notable recent attempt is DPHuBERT, which applies joint knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain by introducing SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual, arbitrarily selected layers within the teacher network. The layers to distill are identified through a hierarchical clustering procedure applied to layer similarity measures. Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M-parameter model class across several SUPERB tasks.
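The layer-grouping step described in the abstract can be pictured as clustering teacher layers by the similarity of their hidden representations. The sketch below is a minimal illustration under stated assumptions rather than the paper's exact recipe: it assumes linear CKA as the layer-similarity measure, SciPy average-linkage agglomerative clustering, and a hypothetical fixed number of groups `n_groups`; the helper names and the calibration-set setup are illustrative only.

```python
# Minimal sketch: group teacher layers by representation similarity.
# Assumptions (not taken from the paper text): linear CKA as the similarity
# measure, average-linkage hierarchical clustering, and a fixed `n_groups`.
# Each layer's activations are stored as an [n_frames, dim] matrix collected
# on a small calibration set.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape [n, d1] and [n, d2]."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(x.T @ y, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return hsic / (norm_x * norm_y)


def group_layers(layer_acts: list[np.ndarray], n_groups: int) -> list[int]:
    """Cluster teacher layers into `n_groups` groups of similar layers."""
    n = len(layer_acts)
    sim = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(layer_acts[i], layer_acts[j])
    dist = 1.0 - sim                      # convert similarity into a distance
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=n_groups, criterion="maxclust").tolist()


# Usage example: 13 hidden states (CNN output + 12 transformer layers)
# from a Base-sized teacher, grouped into 4 clusters of similar layers.
acts = [np.random.randn(2000, 768) for _ in range(13)]
print(group_layers(acts, n_groups=4))     # e.g. [1, 1, 2, 2, 2, 3, ...]
```

Distillation targets would then be defined per group (for example, by aggregating the layers within each cluster) instead of per hand-picked layer; the exact aggregation used by SKILL is not specified in this excerpt.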
- A. Babu et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. of Interspeech, 2021.
- “vq-wav2vec: Self-supervised learning of discrete speech representations,” in Proc. of ICLR, 2020.
- “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. of NeurIPS, 2020.
- S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- “Multi-task self-supervised learning for robust speech recognition,” in Proc. of ICASSP, 2020.
- “MetricGAN-U: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech,” in Proc. of ICASSP, 2022.
- S. Sadhu et al., “Wav2vec-C: A self-supervised model for speech representation learning,” in Proc. of Interspeech, 2021.
- Y.-A. Chung et al., “W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in Proc. of ASRU, 2021.
- “Speech self-supervised representation benchmarking: Are we doing it right?,” in Proc. of Interspeech, 2023.
- S. Evain et al., “LeBenchmark: A reproducible framework for assessing self-supervised representation learning from speech,” in Proc. of Interspeech, 2021.
- S. W. Yang et al., “SUPERB: Speech processing Universal PERformance Benchmark,” in Proc. of Interspeech, 2021.
- M. Ravanelli et al., “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.
- M. Ravanelli and Y. Bengio, “Learning speaker representations with mutual information,” in Proc. of Interspeech, 2019.
- “Exploring wav2vec 2.0 on speaker verification and language identification,” in Proc. of Interspeech, 2021.
- “Multi-task voice activated framework using self-supervised learning,” in Proc. of ICASSP, 2022.
- “Speech emotion diarization: Which emotion appears when?,” in Proc. of ASRU, 2023.
- “Optimal brain damage,” in Proc. of NIPS, 1990.
- “DNN Quantization with Attention,” arXiv preprint arXiv:2103.13322, 2021.
- “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Proc. of NIPS, 2015.
- “QLoRA: Efficient Finetuning of Quantized LLMs,” arXiv preprint arXiv:2305.14314, 2023.
- “Fine-tuning strategies for faster inference using speech self-supervised models: a comparative study,” in Proc. of ICASSP, 2023.
- “DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT,” in Proc. of ICASSP, 2022.
- “FitHuBERT: Going thinner and deeper for knowledge distillation of speech self-supervised learning,” in Proc. of Interspeech, 2022.
- “Deep versus wide: An analysis of student architectures for task-agnostic knowledge distillation of self-supervised speech models,” in Proc. of Interspeech, 2022.
- “DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models,” in Proc. of Interspeech, 2023.
- “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. of ICASSP, 2015.
- “Distilling knowledge via knowledge review,” in Proc. of CVPR, 2021.
- “Structured pruning of large language models,” in Proc. of EMNLP, 2020.
- “Structured pruning learns compact and accurate models,” in Proc. of ACL, 2022.
- “Structured pruning of self-supervised pre-trained models for speech recognition and understanding,” in Proc. of ICASSP, 2023.
- “Learning sparse neural networks through L0 regularization,” in Proc. of ICLR, 2018.
- “Similarity of neural network representations revisited,” in Proc. of ICML, 2019.
- “A kernel statistical test of independence,” in Proc. of NIPS, 2007.
- A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. of NeurIPS, 2019.
- J. Hwang et al., “TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch,” in Proc. of ASRU, 2023.
- M. Ott et al., “fairseq: A fast, extensible toolkit for sequence modeling,” in Proc. of NAACL (Demonstrations), 2019.
- T. Wolf et al., “HuggingFace’s Transformers: State-of-the-art natural language processing,” in Proc. of EMNLP, 2020.
- “Recycle-and-Distill: Universal compression strategy for Transformer-based speech SSL models with attention map reusing and masking distillation,” in Proc. of Interspeech, 2023.