RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation (2309.17189v4)

Published 29 Sep 2023 in cs.SD, cs.CV, and eess.AS

Abstract: Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which operates on the complex time-frequency bins yielded by the Short-Time Fourier Transform (STFT). We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
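
The abstract's core mechanism, scanning the complex STFT spectrogram with separate multi-layer RNNs along the frequency axis and along the time axis, can be illustrated with a short sketch. Below is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the module name, layer sizes, residual wiring, and the simple real/imaginary channel encoding are all illustrative choices.

```python
import torch
import torch.nn as nn


class TimeFrequencyRNNBlock(nn.Module):
    """Bidirectional RNN over frequency bins, then another over time frames.

    Illustrative sketch of dual-dimension recurrent modelling; RTFS-Net's
    actual block design may differ.
    """

    def __init__(self, channels: int, hidden: int = 64, layers: int = 2):
        super().__init__()
        self.freq_rnn = nn.LSTM(channels, hidden, num_layers=layers,
                                batch_first=True, bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.time_rnn = nn.LSTM(channels, hidden, num_layers=layers,
                                batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, t, f = x.shape

        # Frequency path: fold time into the batch, scan across frequency bins.
        y = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        y = self.freq_proj(self.freq_rnn(y)[0])
        x = x + y.reshape(b, t, f, c).permute(0, 3, 1, 2)  # residual connection

        # Time path: fold frequency into the batch, scan across time frames.
        y = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        y = self.time_proj(self.time_rnn(y)[0])
        return x + y.reshape(b, f, t, c).permute(0, 3, 2, 1)  # residual connection


# Usage: stack the STFT's real/imaginary parts as channels, embed, and process.
wav = torch.randn(1, 16000)                                # 1 s of 16 kHz audio
spec = torch.stft(wav, n_fft=256, hop_length=128,
                  window=torch.hann_window(256),
                  return_complex=True)                     # (batch, freq, time)
feats = torch.stack([spec.real, spec.imag], dim=1)         # (batch, 2, freq, time)
feats = feats.permute(0, 1, 3, 2)                          # (batch, 2, time, freq)
feats = nn.Conv2d(2, 32, kernel_size=1)(feats)             # toy embedding layer
print(TimeFrequencyRNNBlock(channels=32)(feats).shape)     # (1, 32, 126, 129)
```

The abstract also mentions an attention-based audio-visual fusion step. A generic cross-attention formulation, again an assumption rather than the paper's specific design, would let the audio features query the visual (lip-motion) features:

```python
import torch
import torch.nn as nn

audio = torch.randn(1, 126, 32)  # (batch, audio frames, channels)
video = torch.randn(1, 25, 32)   # (batch, video frames, channels)

fusion = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
fused, _ = fusion(query=audio, key=video, value=video)  # audio attends to video
audio = audio + fused                                   # residual fusion
```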
