
Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification (2403.00293v1)

Published 1 Mar 2024 in eess.AS, cs.LG, and cs.SD

Abstract: With excellent generalization ability, self-supervised speech models have shown impressive performance on various downstream speech tasks in the pre-training and fine-tuning paradigm. However, with the growing size of pre-trained models, fine-tuning becomes practically infeasible due to heavy computation and storage overhead, as well as the risk of overfitting. Adapters are lightweight modules inserted into pre-trained models to facilitate parameter-efficient adaptation. In this paper, we propose an effective adapter framework designed for adapting self-supervised speech models to the speaker verification task. With a parallel adapter design, our framework inserts two types of adapters into the pre-trained model, allowing the adaptation of latent features within intermediate Transformer layers and of output embeddings from all Transformer layers. We conduct comprehensive experiments to validate the efficiency and effectiveness of the proposed framework. Experimental results on the VoxCeleb1 dataset demonstrate that the proposed adapters surpass fine-tuning and other parameter-efficient transfer learning methods, achieving superior performance while updating only 5% of the parameters.
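
The parallel adapter design described in the abstract can be illustrated with a short sketch. The PyTorch code below shows the general technique only (a frozen Transformer layer with a trainable bottleneck branch added in parallel, plus a learnable weighted sum over per-layer outputs); the module names, bottleneck width, and wrapped layer are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    # Bottleneck branch: down-project, nonlinearity, up-project.
    # The up-projection is zero-initialized so training starts from
    # the unmodified pre-trained behavior.
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))

class AdaptedLayer(nn.Module):
    # Wraps a frozen Transformer layer; the adapter reads the layer
    # input and its output is added to the layer output, i.e. the
    # adapter runs in parallel with the layer rather than after it.
    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False  # only adapter weights are updated
        self.adapter = ParallelAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer(x) + self.adapter(x)

class WeightedLayerSum(nn.Module):
    # Learnable softmax-weighted sum over the hidden states of all
    # Transformer layers, one way to exploit the per-layer output
    # embeddings the abstract refers to.
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * h for wi, h in zip(w, hidden_states))

# Hypothetical usage on one layer: shapes are (batch, frames, features).
base = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
adapted = AdaptedLayer(base, dim=768)
y = adapted(torch.randn(2, 100, 768))  # same shape; only the adapter trains

Zero-initializing the up-projection is a common trick in adapter tuning: the adapted model starts exactly at the pre-trained solution, and the trainable branch only departs from it as training proceeds.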

Authors (2)
  1. Mufan Sang (7 papers)
  2. John H. L. Hansen (58 papers)
Citations (4)
