SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations (2403.06260v1)

Published 10 Mar 2024 in cs.CL, cs.SD, and eess.AS

Abstract: There is growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations, which are then fine-tuned on labelled data for robust performance on various downstream tasks. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy that aims to learn similar representations from perturbed and original speech, where the perturbed speech is obtained with data augmentation techniques commonly used for content-related tasks (ASR). SCORE fine-tuned HuBERT outperforms vanilla HuBERT on the SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU, yielding relative improvements of 1.09%, 3.58%, and 12.65% on automatic speech recognition, phoneme recognition, and query-by-example tasks, respectively. SCORE is competitive with the recently proposed SSFT method SPIN while using only 1/3 of the processed speech.
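The correspondence objective described in the abstract lends itself to a compact implementation: run the original utterance and an augmented copy through the SSL encoder and minimise a distance between the two representation sequences. The sketch below is illustrative only, not the paper's implementation: it assumes a frame-wise cosine loss with length interpolation as a stand-in for the actual correspondence loss, and the file path and speed-perturbation recipe are hypothetical choices.

```python
import torch
import torch.nn.functional as F
import torchaudio

# Pretrained HuBERT encoder from torchaudio (expects 16 kHz mono audio).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()

def perturb(wave: torch.Tensor, sample_rate: int) -> torch.Tensor:
    # Speed perturbation, one commonly used ASR augmentation; the exact
    # recipe here is a hypothetical choice, not taken from the paper.
    effects = [["speed", "0.9"], ["rate", str(sample_rate)]]
    out, _ = torchaudio.sox_effects.apply_effects_tensor(wave, sample_rate, effects)
    return out

def correspondence_loss(h_orig: torch.Tensor, h_pert: torch.Tensor) -> torch.Tensor:
    # Stand-in correspondence loss: interpolate both sequences to a common
    # length, then pull matching frames together via cosine similarity.
    # An alignment-aware loss such as soft-DTW would avoid the interpolation.
    T = min(h_orig.size(1), h_pert.size(1))
    a = F.interpolate(h_orig.transpose(1, 2), size=T).transpose(1, 2)
    b = F.interpolate(h_pert.transpose(1, 2), size=T).transpose(1, 2)
    return (1.0 - F.cosine_similarity(a, b, dim=-1)).mean()

wave, sr = torchaudio.load("utt.flac")  # hypothetical 16 kHz utterance
feats_orig, _ = model.extract_features(wave)            # list of layer outputs
feats_pert, _ = model.extract_features(perturb(wave, sr))
loss = correspondence_loss(feats_orig[-1], feats_pert[-1])
loss.backward()  # update the encoder with any optimizer, e.g. AdamW
```

Note that the clean and perturbed sequences generally differ in length (speed perturbation changes frame count), which is why an alignment-aware distance such as soft-DTW is a natural fit for this kind of correspondence training; the interpolation above is only a simple approximation.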
