TAVGBench: Benchmarking Text to Audible-Video Generation (2404.14381v1)

Published 22 Apr 2024 in cs.CV and cs.MM

Abstract: The Text to Audible-Video Generation (TAVG) task involves generating videos with accompanying audio based on text descriptions. Achieving this requires skillful alignment of both audio and video elements. To support research in this field, we have developed a comprehensive Text to Audible-Video Generation Benchmark (TAVGBench), which contains over 1.7 million clips with a total duration of 11.8 thousand hours. We propose an automatic annotation pipeline to ensure each audible video has detailed descriptions for both its audio and video contents. We also introduce the Audio-Visual Harmony score (AVHScore) to provide a quantitative measure of the alignment between the generated audio and video modalities. Additionally, we present a baseline model for TAVG called TAVDiffusion, which uses a two-stream latent diffusion model to provide a fundamental starting point for further research in this area. We achieve the alignment of audio and video by employing cross-attention and contrastive learning. Through extensive experiments and evaluations on TAVGBench, we demonstrate the effectiveness of our proposed model under both conventional metrics and our proposed metrics.
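The abstract does not spell out how AVHScore is computed. As a rough illustration only, the minimal sketch below (hypothetical function avh_score_sketch) scores audio-video alignment as the mean cosine similarity between per-frame video embeddings and a clip-level audio embedding, assuming both are taken from a shared audio-visual embedding space such as ImageBind's; the paper's actual AVHScore formulation may differ.

```python
import torch
import torch.nn.functional as F

def avh_score_sketch(video_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical AVHScore-style alignment measure.

    video_emb: (T, D) tensor, one embedding per sampled video frame
    audio_emb: (D,)   tensor, one embedding for the clip's audio track
    Both are assumed to come from a joint audio-visual embedding model
    (e.g. ImageBind); this is an illustrative assumption, not the paper's recipe.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    # Cosine similarity of every frame against the audio, averaged over frames.
    return (video_emb @ audio_emb).mean()

# Toy usage with random features standing in for real embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    score = avh_score_sketch(torch.randn(8, 1024), torch.randn(1024))
    print(f"AVHScore (sketch): {score.item():.4f}")
```

A higher value under this sketch means the sampled frames and the audio land closer together in the shared embedding space, which is the intuition the abstract attributes to AVHScore.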

Authors (8)
  1. Yuxin Mao
  2. Xuyang Shen
  3. Jing Zhang
  4. Zhen Qin
  5. Jinxing Zhou
  6. Mochu Xiang
  7. Yiran Zhong
  8. Yuchao Dai