Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR (2403.06387v1)
Abstract: It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions, compared to an ASR model trained directly on noisy speech. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an attentive recurrent network (ARN) time-domain enhancement model and a CrossNet time-frequency domain enhancement model. The proposed systems fully decouple the enhancement frontend from the ASR backend, which is trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that both ARN- and CrossNet-enhanced speech translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms baselines trained directly on corrupted speech. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by $28.4\%$ relatively, achieving a $5.57\%$ WER, and reaches $3.32\%/4.44\%$ WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
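To make the decoupled design concrete, below is a minimal sketch of the enhance-then-recognize pipeline the abstract describes. The names `ARNEnhancer`, `transcribe_decoupled`, and `asr_model.transcribe` are hypothetical placeholders rather than the authors' released code; any monaural enhancement frontend and any ASR backend trained only on clean speech could fill these roles.

```python
# Minimal sketch of a decoupled robust-ASR pipeline (hypothetical names).
import torch


class ARNEnhancer(torch.nn.Module):
    """Stand-in for a time-domain enhancement frontend such as ARN."""

    def forward(self, noisy_waveform: torch.Tensor) -> torch.Tensor:
        # A trained model would map noisy/reverberant speech to an estimate
        # of the clean waveform; the identity here is only a placeholder.
        return noisy_waveform


def transcribe_decoupled(noisy_waveform: torch.Tensor, enhancer, asr_model) -> str:
    """Enhance first, then recognize; the two models are trained separately."""
    with torch.no_grad():
        enhanced = enhancer(noisy_waveform)  # frontend: trained on noisy/clean pairs
    return asr_model.transcribe(enhanced)    # backend: trained on clean speech only
```

Because the backend never sees noisy or enhanced speech during training, a better frontend (e.g., swapping ARN for CrossNet) carries over without retraining the recognizer. For context on the reported numbers, a $28.4\%$ relative reduction down to $5.57\%$ WER implies a previous best of roughly $5.57 / (1 - 0.284) \approx 7.78\%$ WER on CHiME-2.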
- Yufeng Yang
- Ashutosh Pandey
- DeLiang Wang