
Text-to-Audio Generation Synchronized with Videos (2403.07938v1)

Published 8 Mar 2024 in cs.SD, cs.AI, cs.CV, cs.LG, cs.MM, and eess.AS

Abstract: Text-to-audio (TTA) generation, the synthesis of audio from textual descriptions, has attracted growing attention. Most existing methods use latent diffusion models to learn the correlation between audio and text embeddings, but they fail to keep the generated audio synchronized with its accompanying video, producing noticeable audio-visual mismatches. To bridge this gap, we introduce T2AV-Bench, a benchmark for video-aligned text-to-audio generation with three novel metrics for evaluating visual alignment and temporal consistency. We also present T2AV, a simple yet effective video-aligned TTA generation model. T2AV refines the latent diffusion approach by conditioning on visual-aligned text embeddings. It employs a temporal multi-head attention transformer to extract temporal cues from video data, and an Audio-Visual ControlNet that merges these temporal visual representations with the text embeddings. A contrastive learning objective further ensures that the visual-aligned text embeddings closely match the corresponding audio features. Extensive evaluations on AudioCaps and T2AV-Bench demonstrate that T2AV sets a new standard for video-aligned TTA generation in both visual alignment and temporal consistency.
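The abstract names two concrete mechanisms: a temporal multi-head attention transformer that extracts temporal cues from per-frame video features, and a contrastive objective that pulls visual-aligned text embeddings toward the paired audio features. The sketch below is a minimal, hypothetical reconstruction of those two pieces in PyTorch; the module names, dimensions, and the assumption of CLIP-style per-frame features are illustrative choices, not the authors' released implementation.

```python
# Hypothetical sketch of two components described in the T2AV abstract.
# All names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalVisualEncoder(nn.Module):
    """Temporal multi-head self-attention over per-frame video features."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) embeddings from a frozen
        # visual backbone (e.g., a CLIP-style image encoder).
        out, _ = self.attn(frames, frames, frames)
        # Residual connection + normalization over the temporal sequence.
        return self.norm(frames + out)


def contrastive_alignment_loss(text_emb: torch.Tensor,
                               audio_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling visual-aligned text embeddings toward
    their paired audio embeddings, using in-batch negatives."""
    text_emb = F.normalize(text_emb, dim=-1)    # (batch, dim)
    audio_emb = F.normalize(audio_emb, dim=-1)  # (batch, dim)
    logits = text_emb @ audio_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In the full model, the temporally attended visual features would additionally be injected into the latent diffusion backbone through the Audio-Visual ControlNet branch; that wiring, and the diffusion training loop itself, are omitted from this sketch.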

Authors (3)
  1. Shentong Mo (56 papers)
  2. Jing Shi (123 papers)
  3. Yapeng Tian (80 papers)