SICRN: Advancing Speech Enhancement through State Space Model and Inplace Convolution Techniques (2402.14225v1)

Published 22 Feb 2024 in eess.AS and cs.SD

Abstract: Speech enhancement aims to improve speech quality and intelligibility, especially in noisy environments where background noise degrades speech signals. Currently, deep learning methods achieve great success in speech enhancement, e.g. the representative convolutional recurrent neural network (CRN) and its variants. However, CRN typically employs consecutive downsampling and upsampling convolutions for frequency modeling, which destroy the inherent structure of the signal over frequency. Additionally, convolutional layers lack temporal modeling ability. To address these issues, we propose an innovative module combining a State space model and Inplace Convolution (SIC) to replace the conventional convolution in CRN; the resulting network is called SICRN. Specifically, a dual-path multidimensional state space model captures global frequency dependencies and long-term temporal dependencies, while a 2D inplace convolution captures local structure, abandoning downsampling and upsampling entirely. Systematic evaluations on the public INTERSPEECH 2020 DNS challenge dataset demonstrate SICRN's efficacy. Compared to strong baselines, SICRN achieves performance close to state-of-the-art while having advantages in model parameters, computation, and algorithmic delay. The proposed SICRN shows great promise for improved speech enhancement.
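
To make the SIC idea concrete, below is a minimal PyTorch sketch of one such block, assuming only what the abstract describes: a stride-1 "inplace" 2D convolution that preserves the time-frequency grid, followed by a dual-path state space scan over time and over frequency. The state space layer here is a simplified diagonal linear recurrence standing in for the paper's multidimensional S4-style model, and the class names (`SimpleSSM`, `SICBlock`) are hypothetical, not taken from the authors' code.

```python
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Diagonal linear SSM: h[k] = a * h[k-1] + b * x[k], y[k] = sum(c * h[k]).

    A toy stand-in for the paper's structured state space model.
    """

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.a_logit = nn.Parameter(torch.randn(dim, state))
        self.b = nn.Parameter(0.1 * torch.randn(dim, state))
        self.c = nn.Parameter(0.1 * torch.randn(dim, state))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim); sequential scan along `length`.
        a = torch.sigmoid(self.a_logit)            # decay in (0, 1) keeps the scan stable
        h = x.new_zeros(x.size(0), x.size(2), self.b.size(1))
        ys = []
        for k in range(x.size(1)):
            h = a * h + self.b * x[:, k, :, None]  # update state: (batch, dim, state)
            ys.append((h * self.c).sum(-1))        # read out to (batch, dim)
        return torch.stack(ys, dim=1)


class SICBlock(nn.Module):
    """State space + Inplace Convolution block on (batch, channels, time, freq)."""

    def __init__(self, channels: int, state: int = 16):
        super().__init__()
        # "Inplace" 2D convolution: stride 1 with same padding, so the
        # frequency axis is never down- or upsampled and spectral structure
        # is preserved.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.ssm_time = SimpleSSM(channels, state)  # long-term temporal dependencies
        self.ssm_freq = SimpleSSM(channels, state)  # global frequency dependencies
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        x = self.conv(x)                            # local time-frequency structure
        z = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        z = self.ssm_time(z)                        # path 1: scan over time
        z = z.reshape(b, f, t, c).permute(0, 2, 1, 3).reshape(b * t, f, c)
        z = self.ssm_freq(z)                        # path 2: scan over frequency
        z = self.norm(z).reshape(b, t, f, c).permute(0, 3, 1, 2)
        return x + z                                # residual connection


block = SICBlock(channels=16)
spec = torch.randn(2, 16, 100, 161)  # (batch, channels, frames, frequency bins)
out = block(spec)                    # same shape as input: no resampling anywhere
```

Note that the per-step Python loop in `SimpleSSM` is written for clarity; practical S4-style models evaluate the recurrence as a long convolution for efficiency. The key property the sketch does share with SICRN is that the output keeps the full time-frequency resolution of the input, unlike a CRN encoder-decoder.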
