ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning (2403.01792v1)
Abstract: Speech separation has recently made significant progress thanks to the fine-grained view provided by time-domain methods. However, several studies have shown that adopting the Short-Time Fourier Transform (STFT) for feature extraction can be beneficial under harsher conditions, such as noise or reverberation. We therefore propose a magnitude-conditioned time-domain framework, ConSep, to inherit these beneficial characteristics. Experiments show that ConSep improves performance in anechoic, noisy, and reverberant settings compared to two celebrated methods, SepFormer and Bi-Sep. Furthermore, we visualize the components of ConSep to illustrate its advantages and to corroborate the findings of our preliminary studies.
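The core idea of magnitude conditioning can be sketched with a FiLM-style layer: an STFT magnitude spectrogram predicts per-frame scale and shift parameters that modulate time-domain features. The sketch below is illustrative only, assuming NumPy and hypothetical projection matrices `W_gamma`/`W_beta` and frame settings; it is not the authors' implementation.

```python
import numpy as np

def stft_magnitude(x, n_fft=256, hop=128):
    """Magnitude spectrogram via a simple framed FFT with a Hann window.
    Frame length and hop are illustrative choices, not ConSep's settings."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # (n_frames, n_fft // 2 + 1)

def film_condition(features, magnitude, W_gamma, W_beta):
    """FiLM-style conditioning (Perez et al., 2018): the magnitude
    spectrogram predicts a per-frame scale (gamma) and shift (beta)
    that modulate the time-domain feature sequence."""
    gamma = magnitude @ W_gamma  # (n_frames, feat_dim)
    beta = magnitude @ W_beta    # (n_frames, feat_dim)
    return gamma * features + beta

# Toy usage: a 1024-sample signal, 64-dim features per frame.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
mag = stft_magnitude(x)                       # (7, 129)
feats = rng.standard_normal((mag.shape[0], 64))
W_g = rng.standard_normal((mag.shape[1], 64)) * 0.01
W_b = rng.standard_normal((mag.shape[1], 64)) * 0.01
conditioned = film_condition(feats, mag, W_g, W_b)
```

In a trained system the projections would be learned layers inside the separator; here they are random matrices purely to show the shapes involved.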
- Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in INTERSPEECH, 2016.
- M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017.
- Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
- Y. Luo, Z. Chen, and T. Yoshioka, “Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation,” in ICASSP, 2020.
- N. Zeghidour and D. Grangier, “Wavesplit: End-to-End speech separation by speaker clustering,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 2840–2849, 2021.
- J. Chen, Q. Mao, and D. Liu, “Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation,” in INTERSPEECH, 2020.
- C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention Is All You Need In Speech Separation,” in ICASSP, 2021.
- M. Maciejewski, G. Wichern, E. McQuinn, and J. L. Roux, “WHAMR!: Noisy and reverberant single-channel speech separation,” in ICASSP, 2020.
- C. Subakan, M. Ravanelli, S. Cornell, F. Grondin, and M. Bronzi, “On using transformers for speech-separation,” arXiv preprint arXiv:2202.02884, 2022.
- J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in ICASSP, 2019.
- S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. Hershey, “Unsupervised sound separation using mixture invariant training,” in Advances in Neural Information Processing Systems, 2020.
- J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, “Demystifying TasNet: A dissecting approach,” in ICASSP, 2020.
- D. Ditter and T. Gerkmann, “A Multi-Phase Gammatone Filterbank for Speech Separation Via TasNet,” in ICASSP, 2020.
- Y. Avargel and I. Cohen, “On multiplicative transfer function approximation in the short-time Fourier transform domain,” IEEE Signal Process. Lett., vol. 14, no. 5, pp. 337–340, 2007.
- D. Wang and J. Lim, “The unimportance of phase in speech enhancement,” IEEE Trans. Acoust. Speech Signal Process., vol. 30, no. 4, pp. 679–681, Aug. 1982.
- T. Peer and T. Gerkmann, “Intelligibility Prediction of Speech Reconstructed From Its Magnitude or Phase,” in ITG Conference on Speech Communication, 2021.
- T. Peer and T. Gerkmann, “Phase-aware deep speech enhancement: It’s all about the frame length,” arXiv preprint arXiv:2203.16222, 2022.
- N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in WASPAA, 2017.
- R. Chao, C. Yu, S.-W. Fu, X. Lu, and Y. Tsao, “Perceptual contrast stretching on target feature for speech enhancement,” in INTERSPEECH, 2022.
- E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in AAAI, 2018.