Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Published 19 Apr 2024 in cs.SD, cs.CV, cs.LG, cs.MM, and eess.AS (arXiv:2404.12725v2)

Abstract: Integrating visual cues has substantially improved target speech extraction, bringing the task to the forefront of the field. Nevertheless, this multi-modal learning paradigm often suffers from modality imbalance: in audio-visual target speech extraction, the audio modality tends to dominate, potentially overshadowing the visual guidance. To address this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions audio-visual target speech extraction into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality while visual information acts as the conditional modality; in the speech production stage, the roles are reversed. This swap of modality roles is designed to alleviate the modality imbalance. Additionally, we introduce a contrastive semantic matching loss that encourages the semantic content of the generated speech to align with the semantic content conveyed by the lip movements during the speech production stage. Extensive experiments on multiple benchmark datasets for audio-visual target speech extraction demonstrate the superior performance of the proposed method.
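
The abstract describes two concrete mechanisms: a two-stage pipeline in which the dominant and conditional modalities swap between speech perception and speech production, and a contrastive semantic matching loss tying generated-speech semantics to lip-movement semantics. The paper's implementation is not reproduced here, so the following PyTorch sketch is only a plausible reading of those ideas: the stage modules, the FiLM-style conditioning, and all layer sizes are illustrative assumptions, and the InfoNCE-style loss is one common way to realize a contrastive matching objective.

```python
# Illustrative sketch of the AVSepChain ideas in the abstract. FiLM
# conditioning, layer choices, and dimensions are assumptions, not the
# authors' architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechPerceptionStage(nn.Module):
    """Stage 1: audio is dominant; lip features act as the condition."""

    def __init__(self, dim=256):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, dim, kernel_size=16, stride=8, padding=4)
        self.film = nn.Linear(dim, 2 * dim)  # visual -> (scale, shift)
        self.separator = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, mixture, visual):
        # mixture: (B, T) waveform; visual: (B, Tv, dim) lip embeddings
        a = self.audio_enc(mixture.unsqueeze(1)).transpose(1, 2)  # (B, Ta, dim)
        v = F.interpolate(visual.transpose(1, 2), size=a.size(1)).transpose(1, 2)
        scale, shift = self.film(v).chunk(2, dim=-1)   # lips condition audio
        return self.separator(a * (1 + scale) + shift)  # coarse target features


class SpeechProductionStage(nn.Module):
    """Stage 2: roles reversed -- lips are dominant; stage-1 audio conditions."""

    def __init__(self, dim=256):
        super().__init__()
        self.film = nn.Linear(dim, 2 * dim)  # audio -> (scale, shift)
        self.generator = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8, padding=4)

    def forward(self, visual, audio_feats):
        # visual: (B, Ta, dim) lip features upsampled to the audio frame rate
        scale, shift = self.film(audio_feats).chunk(2, dim=-1)  # audio conditions lips
        h = self.generator(visual * (1 + scale) + shift)
        return self.decoder(h.transpose(1, 2)).squeeze(1)  # (B, T) waveform


def contrastive_semantic_matching_loss(speech_sem, lip_sem, temperature=0.07):
    """InfoNCE-style loss pulling each utterance's speech semantics toward its
    own lip semantics and away from other batch items (an assumed realization
    of the paper's contrastive semantic matching loss)."""
    s = F.normalize(speech_sem, dim=-1)  # (B, D) pooled semantic embeddings
    v = F.normalize(lip_sem, dim=-1)     # (B, D)
    logits = s @ v.t() / temperature     # (B, B) cosine-similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy end-to-end pass: 1 s of 8 kHz audio with 25 fps lip embeddings.
B, T, dim = 2, 8000, 256
mixture, visual = torch.randn(B, T), torch.randn(B, 25, dim)
stage1, stage2 = SpeechPerceptionStage(dim), SpeechProductionStage(dim)
coarse = stage1(mixture, visual)                                    # audio-led
v_up = F.interpolate(visual.transpose(1, 2), size=coarse.size(1)).transpose(1, 2)
estimate = stage2(v_up, coarse)                                     # vision-led
loss = contrastive_semantic_matching_loss(coarse.mean(1), v_up.mean(1))
```

The key design point the sketch tries to capture is the role swap: the same conditioning mechanism is used in both stages, but which modality passes through the main pathway and which supplies the scale/shift is inverted, so neither modality can dominate end to end.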
