STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models (2312.09040v2)
Abstract: Despite the strong performance of Transformer-based speech self-supervised learning (SSL) models, their large parameter size and computational cost make them unfavorable to utilize. In this study, we propose to compress speech SSL models by distilling speech temporal relation (STaR). Unlike previous works that directly match the representation of each speech frame, STaR distillation transfers the temporal relation between speech frames, which is more suitable for a lightweight student with limited capacity. We explore three STaR distillation objectives and select the best combination as the final STaR loss. Our model distilled from HuBERT BASE achieves an overall score of 79.8 on the SUPERB benchmark, the best performance among models with up to 27 million parameters. We show that our method is applicable across different speech SSL models and maintains robust performance with further reduced parameters.
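Since the abstract only states the core idea at a high level, the following is a minimal sketch of what distilling "temporal relation" rather than per-frame representations could look like, assuming the relation is captured by a frame-by-frame cosine similarity matrix. This is one plausible instantiation for illustration; the paper's three STaR objectives are not detailed in the abstract, and the function names, tensor shapes, and loss choice below are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a temporal-relation distillation loss (not the
# paper's exact STaR objectives, which are not specified in the abstract).
import torch
import torch.nn.functional as F


def temporal_relation(features: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between speech frames.

    features: (batch, frames, dim) hidden representations.
    returns:  (batch, frames, frames) temporal-relation matrix.
    """
    normed = F.normalize(features, dim=-1)
    return normed @ normed.transpose(1, 2)


def star_like_loss(student_feats: torch.Tensor,
                   teacher_feats: torch.Tensor) -> torch.Tensor:
    """Match the student's temporal relation to the teacher's instead of
    matching per-frame representations directly. Because only the
    (frames x frames) relation is compared, student and teacher can have
    different hidden dimensions without an extra projection layer."""
    return F.mse_loss(temporal_relation(student_feats),
                      temporal_relation(teacher_feats.detach()))


# Toy usage: a 768-dim teacher (e.g., HuBERT BASE) vs. a narrower student.
teacher = torch.randn(2, 100, 768)   # (batch, frames, teacher_dim)
student = torch.randn(2, 100, 256)   # (batch, frames, student_dim)
loss = star_like_loss(student, teacher)
```

One appeal of relation-based distillation for a limited-capacity student is visible in the sketch: the loss operates on frame-to-frame structure, so the student is not forced to reproduce the teacher's high-dimensional feature space frame by frame.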