SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning (2402.16830v1)

Published 26 Feb 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To enhance its efficiency, previous works often apply compression techniques. A notable recent attempt is DPHuBERT, which jointly applies knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain by introducing SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual, arbitrarily selected layers within the teacher network. The layers to distill are identified through a hierarchical clustering procedure applied to layer similarity measures. Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M-parameter model class across several SUPERB tasks.
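
At the core of the method, as the abstract describes it, is a layer-grouping step: teacher layers are clustered by representational similarity, and distillation targets are then defined per group rather than per hand-picked layer. The sketch below shows one way such a grouping could be computed, assuming linear CKA as the layer-similarity measure and SciPy's average-linkage hierarchical clustering; the abstract specifies neither choice, and all names here (linear_cka, group_layers, n_groups) are illustrative rather than taken from the paper's code.

import torch
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    # Linear CKA between two layer representations of shape (frames, dim).
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    num = torch.linalg.norm(x.T @ y) ** 2
    den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (num / den).item()


def group_layers(layer_outputs, n_groups):
    # layer_outputs: one (frames, dim) activation matrix per teacher layer,
    # collected on a held-out batch of speech. Returns one cluster id per layer.
    n = len(layer_outputs)
    sim = torch.zeros(n, n)
    for i in range(n):
        for j in range(i, n):
            s = linear_cka(layer_outputs[i], layer_outputs[j])
            sim[i, j] = sim[j, i] = s
    # Turn similarities into distances and run average-linkage clustering.
    dist = (1.0 - sim).numpy()
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_groups, criterion="maxclust").tolist()

Given the resulting cluster ids, a compact student could be trained to match one distillation target per group (for example, an average of the grouped layers' outputs); the exact grouping loss and number of groups used by SKILL are not stated in the abstract.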
