FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition (2311.17790v1)

Published 29 Nov 2023 in cs.SD and eess.AS

Abstract: Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. To address the distortions introduced by the SE frontends, we introduce layer-wise fusion modules that incorporate features extracted from both observed noisy signals and enhanced signals. During training, the SE frontend is randomly selected from a pool of models. We evaluate the performance of FAT-HuBERT on simulated noisy speech generated from LibriSpeech as well as real-world noisy speech from the CHiME-4 1-channel dataset. The experimental results demonstrate a significant relative reduction in word error rate (WER).
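To make the training scheme in the abstract concrete, here is a minimal PyTorch sketch of the two ideas it describes: a per-layer fusion of noisy and enhanced features, and an SE frontend sampled from a pool at each training step. Everything here is an assumption for illustration; the module names (LayerwiseFusion, training_step), the gated fusion design, the extract_layerwise interface, and the frozen-frontend choice are not taken from the paper.

```python
# Illustrative sketch only: names, dims, and fusion design are assumptions,
# not the paper's actual FAT-HuBERT implementation.
import random
import torch
import torch.nn as nn

class LayerwiseFusion(nn.Module):
    """Fuses same-layer features from the noisy and enhanced branches.

    A simple learned sigmoid gate; the paper's exact fusion module may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, noisy_feat: torch.Tensor, enh_feat: torch.Tensor) -> torch.Tensor:
        # Gate decides, per dimension, how much to trust the enhanced features.
        g = self.gate(torch.cat([noisy_feat, enh_feat], dim=-1))
        return g * enh_feat + (1.0 - g) * noisy_feat

def training_step(noisy_wav, se_frontends, encoder, fusion_modules):
    """One FAT-style step: sample an SE frontend from the pool, encode both
    the noisy and enhanced signals, and fuse features layer by layer.
    `encoder.extract_layerwise` is a hypothetical API standing in for a
    HuBERT-style encoder that returns per-layer hidden states."""
    se = random.choice(se_frontends)       # random frontend per step, as in the abstract
    with torch.no_grad():                  # assumption: SE frontends are kept frozen
        enhanced_wav = se(noisy_wav)
    noisy_layers = encoder.extract_layerwise(noisy_wav)
    enh_layers = encoder.extract_layerwise(enhanced_wav)
    fused = [f(n, e) for f, n, e in zip(fusion_modules, noisy_layers, enh_layers)]
    return fused[-1]                       # would feed the SSL masked-prediction loss
```

A gate of this kind is one simple way to let the model fall back on the raw noisy features wherever the enhanced signal is distorted, which matches the paper's motivation; the abstract does not specify the actual fusion architecture.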

Authors (3)
  1. Dongning Yang (1 paper)
  2. Wei Wang (1793 papers)
  3. Yanmin Qian (96 papers)
Citations (3)