A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement (2403.01369v1)
Abstract: Self-supervised learning (SSL) models have proven very effective for speech tasks such as automatic speech recognition, speaker identification, and keyword spotting. While these features are undeniably useful in speech recognition and related tasks, their utility in speech enhancement systems has yet to be firmly established, and is perhaps not properly understood. In this paper, we investigate the use of SSL representations for single-channel speech enhancement in challenging conditions and find that they add very little value to the enhancement task. Our constraints are designed around on-device real-time speech enhancement: the model must be causal and the compute footprint small. Additionally, we focus on low-SNR conditions, where such models struggle to provide good enhancement. To systematically examine how SSL representations affect the performance of such enhancement models, we propose a variety of techniques for utilizing these embeddings, including different forms of knowledge distillation and pre-training.
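The abstract names knowledge distillation as one way the SSL embeddings are injected into the enhancement model. The sketch below shows one plausible shape such a setup could take: a frozen wav2vec2 teacher whose embeddings supervise a small causal student through a distillation term added to the waveform loss. This is a minimal illustration, not the paper's recipe; `CausalEnhancer`, `distill_loss`, the teacher layer index, and the 0.1 loss weight are all assumptions for illustration.

```python
# A minimal sketch (assumed, not the paper's exact method): a frozen wav2vec2
# model acts as a "teacher" whose embeddings supervise a small causal
# enhancement "student" via a knowledge-distillation loss term.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

class CausalEnhancer(nn.Module):
    """Hypothetical low-footprint causal mask estimator on magnitude spectra."""
    def __init__(self, n_fft=512, hidden=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, n_fft // 2
        n_bins = n_fft // 2 + 1
        # Unidirectional LSTM keeps the model causal (no lookahead).
        self.rnn = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, wav):
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=window, return_complex=True)
        mag, phase = spec.abs(), spec.angle()
        h, _ = self.rnn(mag.transpose(1, 2))          # (batch, frames, bins)
        m = self.mask(h).transpose(1, 2)              # (batch, bins, frames)
        est = torch.polar(mag * m, phase)             # masked magnitude, noisy phase
        return torch.istft(est, self.n_fft, self.hop, window=window, length=wav.shape[-1])

# Frozen SSL teacher: wav2vec2 base bundled with torchaudio's pipelines.
teacher = torchaudio.pipelines.WAV2VEC2_BASE.get_model().eval()
for p in teacher.parameters():
    p.requires_grad_(False)

def distill_loss(enhanced, clean, layer=6):
    """L1 distance between teacher embeddings of enhanced and clean speech.
    The layer index is an arbitrary illustrative choice."""
    feats_e, _ = teacher.extract_features(enhanced, num_layers=layer)
    with torch.no_grad():
        feats_c, _ = teacher.extract_features(clean, num_layers=layer)
    return F.l1_loss(feats_e[-1], feats_c[-1])

# One illustrative training step on random stand-in audio (1 s at 16 kHz).
student = CausalEnhancer()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
noisy, clean = torch.randn(2, 16000), torch.randn(2, 16000)
enhanced = student(noisy)
loss = F.l1_loss(enhanced, clean) + 0.1 * distill_loss(enhanced, clean)
loss.backward()
opt.step()
```

Note that distilling against an intermediate teacher layer rather than the final one is itself a design choice; the paper examines several such variants (distillation and pre-training), of which this sketch shows only one.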