On Data Sampling Strategies for Training Neural Network Speech Separation Models (2304.07142v2)

Published 14 Apr 2023 in cs.SD, cs.AI, cs.LG, cs.NE, and eess.AS

Abstract: Speech separation remains an important area of multi-speaker signal processing. Deep neural network (DNN) models have attained the best performance on many speech separation benchmarks. Some of these models can take significant time to train and have high memory requirements. Previous work has proposed shortening training examples to address these issues, but the impact of this on model performance is not yet well understood. In this work, the impact of applying these training signal length (TSL) limits is analysed for two speech separation models: SepFormer, a transformer model, and Conv-TasNet, a convolutional model. The WSJ0-2Mix, WHAMR and Libri2Mix datasets are analysed in terms of signal length distribution and its impact on training efficiency. It is demonstrated that, for specific distributions, applying specific TSL limits results in better performance. This is shown to be mainly due to randomly sampling the start index of the waveforms, which results in more unique examples for training. A SepFormer model trained using a TSL limit of 4.42s and dynamic mixing (DM) is shown to match the best-performing SepFormer model trained with DM and unlimited signal lengths. Furthermore, the 4.42s TSL limit results in a 44% reduction in training time with WHAMR.
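
The mechanism the abstract credits for the gains, random-start cropping under a TSL limit, can be made concrete with a minimal sketch (not the authors' implementation): when an example is longer than the limit, a start index is drawn at random, so different epochs see different segments of the same mixture and the pool of unique training examples effectively grows. The function name, the 8 kHz sample rate, and the synthetic waveform below are illustrative assumptions; only the 4.42 s limit comes from the abstract.

```python
import numpy as np

def sample_tsl_segment(waveform: np.ndarray, sample_rate: int,
                       tsl_limit_s: float, rng: np.random.Generator) -> np.ndarray:
    """Crop a training example to a TSL limit using a random start index.

    Signals shorter than the limit are returned unchanged.
    """
    limit = int(tsl_limit_s * sample_rate)
    if waveform.shape[-1] <= limit:
        return waveform
    # Drawing a fresh start index each time the example is sampled means
    # successive epochs train on different segments of the same mixture.
    start = rng.integers(0, waveform.shape[-1] - limit + 1)
    return waveform[..., start:start + limit]

# Example: crop a synthetic 8 s mixture at 8 kHz to the 4.42 s limit.
rng = np.random.default_rng(0)
mixture = rng.standard_normal(8 * 8000)
segment = sample_tsl_segment(mixture, sample_rate=8000, tsl_limit_s=4.42, rng=rng)
print(segment.shape)  # (35360,)
```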
