Noise-Aware Speech Separation with Contrastive Learning (2305.10761v3)
Abstract: Recently, the speech separation (SS) task has achieved remarkable progress driven by deep learning techniques. However, it remains challenging to separate target speech from a noisy mixture, as neural models tend to assign part of the background noise to each speaker. In this paper, we propose a noise-aware SS (NASS) method that aims to improve the speech quality of separated signals under noisy conditions. Specifically, NASS treats background noise as an additional output and predicts it along with the speakers in a mask-based manner. For effective denoising, we introduce patch-wise contrastive learning (PCL) between the noise and speaker representations obtained from the decoder input and the encoder output. The PCL loss minimizes the mutual information between the predicted noise and the speakers at the patch level, suppressing residual noise in the separated signals. Experimental results show that NASS achieves 1 to 2 dB SI-SNRi or SDRi improvements over DPRNN and Sepformer on the noisy WHAM! and LibriMix datasets, with a parameter increase of less than 0.1M.
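The patch-wise contrastive objective described above can be sketched as an InfoNCE-style loss over patch embeddings. The following is a minimal NumPy sketch, not the paper's implementation: the patch extraction scheme, the pairing of anchors with same-position positives and noise-drawn negatives, and the temperature value are all assumptions made here for illustration.

```python
import numpy as np

def patch_infonce_loss(anchors, positives, negatives, tau=0.07):
    """Patch-wise InfoNCE loss (illustrative sketch).

    anchors:   (P, D) speaker patch embeddings (e.g. from the decoder input)
    positives: (P, D) same-position patches (e.g. from the encoder output)
    negatives: (P, N, D) patches drawn from the predicted noise representation
    tau:       softmax temperature (assumed value, not from the paper)
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = l2norm(anchors), l2norm(positives), l2norm(negatives)
    # Cosine similarities: one positive logit and N negative logits per patch.
    pos = np.sum(a * p, axis=-1) / tau            # (P,)
    neg = np.einsum('pd,pnd->pn', a, n) / tau     # (P, N)
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # Numerically stable log-softmax; the positive sits at index 0.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())
```

Minimizing this loss pulls each speaker patch toward its positive while pushing it away from the noise patches, which is one concrete way to realize the mutual-information suppression the abstract describes.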
- Adelbert W Bronkhorst, “The cocktail-party problem revisited: early processing and selection of multi-talker speech,” Attention, Perception, & Psychophysics, vol. 77, no. 5, pp. 1465–1487, 2015.
- Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
- “Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 46–50.
- “Attention is all you need in speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 21–25.
- “Monaural speech separation using dual-output deep neural network with multiple joint constraint,” Chinese Journal of Electronics, vol. 32, no. 3, pp. 493–506, 2023.
- “Monaural speech separation method based on deep learning feature fusion and joint constraints,” Journal of Electronics and Information Technology, vol. 44, no. 9, pp. 1–11, 2022.
- “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
- “LibriMix: An open-source dataset for generalizable speech separation,” arXiv preprint arXiv:2005.11262, 2020.
- “Metric-oriented speech enhancement using diffusion probabilistic model,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- “Multitask-based joint learning approach to robust ASR for radio communication speech,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2021, pp. 497–502.
- “Interactive feature fusion for end-to-end noise-robust speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6292–6296.
- “Mutual information neural estimation,” in International conference on machine learning. PMLR, 2018, pp. 531–540.
- “Contrastive learning for unpaired image-to-image translation,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer, 2020, pp. 319–345.
- “Noise-robust speech recognition with 10 minutes unparalleled in-domain data,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4298–4302.
- “WHAM!: Extending speech separation to noisy environments,” arXiv preprint arXiv:1907.01160, 2019.
- “SDR – half-baked or well done?,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626–630.
- “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
- “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
- “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
- “Speechbrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.