Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments (2211.08774v3)

Published 16 Nov 2022 in cs.SD and eess.AS

Abstract: We analyze the impact of speaker adaptation in end-to-end automatic speech recognition models based on transformers and wav2vec 2.0 under different noise conditions. By including speaker embeddings obtained from x-vector and ECAPA-TDNN systems, as well as i-vectors, we achieve relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard. We show that the proven method of concatenating speaker vectors to the acoustic features and supplying them as auxiliary model inputs remains a viable option for increasing the robustness of end-to-end architectures. The effect on transformer models is stronger when more noise is added to the input speech. The most substantial benefits for systems based on wav2vec 2.0 are achieved under moderate-noise or noise-free conditions. Both x-vectors and ECAPA-TDNN embeddings outperform i-vectors as speaker representations. The optimal embedding size depends on the dataset and also varies with the noise condition.
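
As a rough illustration of the concatenation-based adaptation the abstract describes, the minimal PyTorch sketch below appends a fixed-size, utterance-level speaker embedding to every acoustic frame before the encoder. The tensor shapes, the 80-dim filterbank features, and the 192-dim embedding size are illustrative assumptions, not values taken from the paper.

```python
import torch

def add_speaker_embedding(feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate an utterance-level speaker embedding to each acoustic frame.

    feats:   (batch, time, feat_dim)  e.g. log-Mel filterbank features (assumed)
    spk_emb: (batch, emb_dim)         e.g. x-vector / ECAPA-TDNN / i-vector
    returns: (batch, time, feat_dim + emb_dim)
    """
    # Broadcast the fixed-size speaker vector along the time axis ...
    expanded = spk_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
    # ... and append it to the per-frame acoustic features.
    return torch.cat([feats, expanded], dim=-1)

# Hypothetical example: 80-dim filterbanks, 192-dim ECAPA-TDNN-style embedding
feats = torch.randn(4, 600, 80)
spk_emb = torch.randn(4, 192)
augmented = add_speaker_embedding(feats, spk_emb)  # shape: (4, 600, 272)
```

The augmented features would then be fed to the end-to-end model as its input in place of the plain acoustic features; the paper's actual architectures and embedding sizes vary by dataset and noise condition.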
