ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning (2403.01792v1)

Published 4 Mar 2024 in cs.SD and eess.AS

Abstract: Speech separation has recently made significant progress thanks to the fine-grained temporal resolution of time-domain methods. However, several studies have shown that adopting the Short-Time Fourier Transform (STFT) for feature extraction can be beneficial under harsher conditions, such as noise or reverberation. We therefore propose a magnitude-conditioned time-domain framework, ConSep, that inherits the beneficial characteristics of both approaches. Experiments show that ConSep improves performance in anechoic, noisy, and reverberant settings compared to two well-known methods, SepFormer and Bi-Sep. Furthermore, we visualize the components of ConSep to illustrate its advantages and to corroborate the findings of our preliminary studies.
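The conditioning mechanism referenced in the abstract builds on FiLM (reference 20): per-frame scale and shift parameters are predicted from a conditioning signal and applied to intermediate features. The paper's exact architecture is not reproduced here, so the following is only a minimal NumPy sketch of the general idea as it would apply to magnitude conditioning — an STFT magnitude drives an affine modulation of hypothetical time-domain encoder features. All names, dimensions, and the simple linear projections are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stft_magnitude(x, n_fft=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed STFT (frames x bins)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def film_condition(features, magnitude, W_gamma, W_beta):
    """FiLM-style conditioning: scale and shift time-domain features
    with parameters predicted (here, linearly) from the STFT magnitude."""
    gamma = magnitude @ W_gamma   # (frames, feat_dim) scale
    beta = magnitude @ W_beta     # (frames, feat_dim) shift
    return gamma * features + beta

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                     # toy mixture waveform
mag = stft_magnitude(x)                           # (31, 129)
feats = rng.standard_normal((mag.shape[0], 64))   # hypothetical encoder output
W_g = rng.standard_normal((mag.shape[1], 64)) * 0.01
W_b = rng.standard_normal((mag.shape[1], 64)) * 0.01
cond = film_condition(feats, mag, W_g, W_b)
print(cond.shape)  # (31, 64)
```

In the actual framework the modulation parameters would come from a learned network rather than fixed random projections; the sketch only shows how a magnitude representation can condition time-domain features frame by frame.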

References (20)
  1. Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in INTERSPEECH, 2016.
  2. M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 10, pp. 1901–1913, 2017.
  3. Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
  4. Y. Luo, Z. Chen, and T. Yoshioka, “Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation,” in ICASSP, 2020.
  5. N. Zeghidour and D. Grangier, “Wavesplit: End-to-End speech separation by speaker clustering,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 2840–2849, 2021.
  6. J. Chen, Q. Mao, and D. Liu, “Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation,” in INTERSPEECH, 2020.
  7. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention Is All You Need In Speech Separation,” in ICASSP, 2021.
  8. M. Maciejewski, G. Wichern, E. McQuinn, and J. L. Roux, “WHAMR!: Noisy and reverberant single-channel speech separation,” in ICASSP, 2020.
  9. C. Subakan, M. Ravanelli, S. Cornell, F. Grondin, and M. Bronzi, “On using transformers for speech-separation,” arXiv preprint arXiv:2202.02884, 2022.
  10. J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – Half-baked or Well Done?” in ICASSP, 2019.
  11. S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. Hershey, “Unsupervised sound separation using mixture invariant training,” in Advances in Neural Information Processing Systems, 2020.
  12. J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, “Demystifying TasNet: A dissecting approach,” in ICASSP, 2020.
  13. D. Ditter and T. Gerkmann, “A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet,” in ICASSP, 2020.
  14. Y. Avargel and I. Cohen, “On multiplicative transfer function approximation in the short-time Fourier transform domain,” IEEE Signal Process. Lett., vol. 14, no. 5, pp. 337–340, 2007.
  15. D. Wang and J. Lim, “The unimportance of phase in speech enhancement,” IEEE Trans. Audio Speech Lang. Process., vol. 30, no. 4, pp. 679–681, Aug. 1982.
  16. T. Peer and T. Gerkmann, “Intelligibility Prediction of Speech Reconstructed From Its Magnitude or Phase,” in ITG Conference on Speech Communication, 2021.
  17. T. Peer and T. Gerkmann, “Phase-aware deep speech enhancement: It’s all about the frame length,” arXiv preprint arXiv:2203.16222, 2022.
  18. N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in WASPAA, 2017.
  19. R. Chao, C. Yu, S.-W. Fu, X. Lu, and Y. Tsao, “Perceptual contrast stretching on target feature for speech enhancement,” in INTERSPEECH, 2022.
  20. E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in AAAI, 2018.