
Robustness of Speech Separation Models for Similar-pitch Speakers (2407.15749v1)

Published 22 Jul 2024 in eess.AS, cs.LG, and eess.SP

Abstract: Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments. This paper investigates the robustness of state-of-the-art neural network models in scenarios where the pitch differences between speakers are minimal. Building on earlier findings by Ditter and Gerkmann, which identified a significant performance drop for the 2018 Chimera++ model under similar-pitch conditions, our study extends the analysis to more recent and sophisticated neural network models. Our experiments reveal that modern models have substantially reduced the performance gap for matched training and testing conditions. However, a substantial performance gap persists under mismatched conditions, with models performing well when pitch differences are large but degrading when the speakers' pitches are similar. These findings motivate further research into the generalizability of speech separation models to similar-pitch speakers and unseen data.
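The abstract's central variable is the pitch (fundamental frequency, F0) difference between the two speakers, typically expressed in semitones. As a rough illustration only (not the paper's method; the references point to the pYIN algorithm for F0 estimation), the sketch below estimates F0 with a simple autocorrelation peak-pick on two synthetic "speakers" and converts their F0 ratio to semitones. All names and parameter values here are illustrative assumptions.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=60.0, fmax=500.0):
    """Crude autocorrelation-based F0 estimate (Hz) for one voiced frame.

    Searches for the autocorrelation peak between the lags that correspond
    to fmax (shortest period) and fmin (longest period).
    """
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)  # shortest period considered, in samples
    lag_max = int(sr / fmin)  # longest period considered, in samples
    lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sr / lag

sr = 16000
t = np.arange(2048) / sr                 # one short analysis frame (~128 ms)
spk_a = np.sin(2 * np.pi * 220.0 * t)    # synthetic "speaker" at 220 Hz
spk_b = np.sin(2 * np.pi * 233.1 * t)    # roughly one semitone higher

f0_a = estimate_f0(spk_a, sr)
f0_b = estimate_f0(spk_b, sr)
# Pitch difference in semitones: the axis along which robustness is probed
delta_semitones = 12 * np.log2(f0_b / f0_a)
```

On real speech, F0 varies over time and frames can be unvoiced, which is why robust estimators such as pYIN (reference 17 below) use probabilistic thresholding rather than a single global autocorrelation peak.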

References (19)
  1. “Deep clustering: Discriminative embeddings for segmentation and separation,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 31–35, 2016.
  2. “Deep clustering and conventional networks for music separation: Stronger together,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 61–65, 2017.
  3. “Alternative objective functions for deep clustering,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 686–690, 2018.
  4. Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE Trans. on Audio, Speech, and Language Proc., vol. 27, no. 8, pp. 1256–1266, 2019.
  5. “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 241–245, 2017.
  6. “Attention is all you need in speech separation,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2021.
  7. “Mossformer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2023.
  8. “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  9. “Influence of speaker-specific parameters on speech separation systems,” ISCA Interspeech, pp. 4584–4588, 2019.
  10. “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Trans. on Audio, Speech, and Language Proc., vol. 31, pp. 2351–2364, 2023.
  11. “Reducing the prior mismatch of stochastic differential equations for diffusion-based speech enhancement,” ISCA Interspeech, 2023.
  12. “Single and few-step diffusion for generative speech enhancement,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2024.
  13. “A multi-phase gammatone filterbank for speech separation via TasNet,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), May 2020.
  14. “Single-channel multi-speaker separation using deep clustering,” ISCA Interspeech, pp. 545–549, Sep. 2016.
  15. “The design for the Wall Street Journal-based CSR corpus,” in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, 1992. Morgan Kaufmann Publishers.
  16. “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019.
  17. “pYIN: A fundamental frequency estimator using probabilistic threshold distributions,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2014.
  18. “Performance measurement in blind audio source separation,” IEEE Trans. on Audio, Speech, and Language Proc., vol. 14, no. 4, pp. 1462–1469, 2006.
  19. “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), vol. 2, pp. 749–752, 2001.
