SuperM2M: Supervised and Mixture-to-Mixture Co-Learning for Speech Enhancement and Robust ASR (2403.10271v2)
Abstract: The currently dominant approach to neural speech enhancement is supervised learning on simulated training data. The trained models, however, often generalize poorly to real-recorded data. To address this, this paper investigates training enhancement models directly on real target-domain data. We propose to adapt mixture-to-mixture (M2M) training, originally designed for speaker separation, to speech enhancement by modeling multi-source noise signals as a single, combined source. In addition, we propose a co-learning algorithm that improves M2M with the help of supervised algorithms. When paired close-talk and far-field mixtures are available for training, M2M realizes speech enhancement by training a deep neural network (DNN) to produce speech and noise estimates such that they can be linearly filtered to reconstruct the close-talk and far-field mixtures. This way, the DNN can be trained directly on real mixtures and can leverage close-talk and far-field mixtures as weak supervision for enhancing far-field mixtures. To improve M2M, we combine it with supervised approaches to co-train the DNN: mini-batches of real close-talk and far-field mixture pairs and mini-batches of simulated mixture and clean-speech pairs are alternately fed to the DNN, and the loss functions are, respectively, (a) the mixture-reconstruction loss on the real close-talk and far-field mixtures and (b) the regular enhancement loss on the simulated clean speech and noise. We find that, in this way, the DNN can learn from both real and simulated data and achieve better generalization to real data. We name this algorithm SuperM2M (supervised and mixture-to-mixture co-learning). Evaluation results on the CHiME-4 dataset show its effectiveness and potential.
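To make the co-learning recipe above concrete, here is a minimal PyTorch sketch of the alternating training loop. The network (ToyEnhancer), the FIR filter length, the regularized least-squares filtering step, the L1 losses, and the random stand-in data are all illustrative assumptions rather than the paper's actual configuration; only the overall structure, alternating a mixture-reconstruction loss on real close-talk/far-field pairs with a regular enhancement loss on simulated pairs fed to one shared DNN, follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEnhancer(nn.Module):
    """Hypothetical time-domain DNN mapping a far-field mixture to speech and noise estimates."""

    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=9, padding=4),
        )

    def forward(self, mix):                      # mix: (batch, samples)
        out = self.net(mix.unsqueeze(1))         # (batch, 2, samples)
        return out[:, 0], out[:, 1]              # speech estimate, noise estimate


def filter_and_sum(estimates, target, taps=20, eps=1e-4):
    """Fit per-source FIR filters by regularized least squares so that the
    filtered estimates sum to the target mixture; return the reconstruction."""
    T = target.shape[-1]
    cols = []
    for est in estimates:                        # delayed copies form the design matrix
        cols += [F.pad(est, (k, 0))[..., :T] for k in range(taps)]
    A = torch.stack(cols, dim=-1)                # (batch, T, taps * n_sources)
    AtA = A.transpose(1, 2) @ A + eps * torch.eye(A.shape[-1])
    Atb = A.transpose(1, 2) @ target.unsqueeze(-1)
    coef = torch.linalg.solve(AtA, Atb)          # differentiable w.r.t. the estimates
    return (A @ coef).squeeze(-1)                # (batch, T)


def m2m_loss(speech_est, noise_est, close_talk, far_field):
    """Mixture-reconstruction loss on a real pair. Assumption: the close-talk
    mixture is reconstructed from the speech estimate alone, the far-field
    mixture from both the speech and noise estimates."""
    close_rec = filter_and_sum([speech_est], close_talk)
    far_rec = filter_and_sum([speech_est, noise_est], far_field)
    return F.l1_loss(close_rec, close_talk) + F.l1_loss(far_rec, far_field)


def supervised_loss(speech_est, noise_est, clean_speech, clean_noise):
    """Regular enhancement loss on a simulated pair with known clean sources."""
    return F.l1_loss(speech_est, clean_speech) + F.l1_loss(noise_est, clean_noise)


# Co-learning loop: alternate real and simulated mini-batches (random tensors
# stand in for actual paired and simulated data here).
model = ToyEnhancer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batch, samples = 2, 16000
for step in range(4):
    if step % 2 == 0:                            # real close-talk/far-field pair -> M2M loss
        close, far = torch.randn(batch, samples), torch.randn(batch, samples)
        speech, noise = model(far)
        loss = m2m_loss(speech, noise, close, far)
    else:                                        # simulated mixture/clean pair -> supervised loss
        clean, noi = torch.randn(batch, samples), torch.randn(batch, samples)
        speech, noise = model(clean + noi)
        loss = supervised_loss(speech, noise, clean, noi)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The actual method would likely operate on time-frequency representations and possibly multi-channel far-field signals; the sketch only shows where the two losses plug into a single shared DNN under the alternating mini-batch scheme described in the abstract.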