Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs (2308.13812v2)

Published 26 Aug 2023 in cs.AI and cs.CV

Abstract: Text-to-video (T2V) synthesis has gained increasing attention in the community, where the recently emerged diffusion models (DMs) have shown promisingly stronger performance than past approaches. While existing state-of-the-art DMs are competent at high-resolution video generation, they can suffer from key limitations (e.g., action occurrence disorders, crude video motions) with respect to intricate temporal dynamics modeling, one of the cruxes of video synthesis. In this work, we investigate strengthening the awareness of video dynamics in DMs for high-quality T2V generation. Inspired by human intuition, we design an innovative dynamic scene manager (dubbed Dysen) module, which (step 1) extracts from the input text the key actions with proper time-order arrangement, (step 2) transforms the action schedule into a dynamic scene graph (DSG) representation, and (step 3) enriches the scenes in the DSG with sufficient and reasonable details. By leveraging existing powerful LLMs (e.g., ChatGPT) via in-context learning, Dysen achieves (nearly) human-level temporal dynamics understanding. Finally, the resulting video DSG with rich action-scene details is encoded as fine-grained spatio-temporal features and integrated into the backbone T2V DM for video generation. Experiments on popular T2V datasets suggest that our Dysen-VDM consistently outperforms prior art by significant margins, especially in scenarios with complex actions. Code at https://haofei.vip/Dysen-VDM

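The abstract's three Dysen steps map naturally onto a small pipeline. The sketch below is a minimal illustration of that flow, not the paper's implementation: the function names (extract_actions, build_dsg, enrich_dsg), the SceneNode schema, the prompt wording, and the uniform frame scheduling are all assumptions made here for clarity; the actual prompts and DSG format are in the authors' code release at the URL above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SceneNode:
    """One (subject, action, object) triplet active over a span of frames."""
    subject: str
    action: str
    obj: str
    start: int                      # first frame index
    end: int                        # last frame index
    details: List[str] = field(default_factory=list)

def extract_actions(llm: Callable[[str], str], caption: str) -> List[str]:
    """Step 1: ask the LLM for the key actions in temporal order."""
    reply = llm(
        "List the key actions in this caption, one per line, "
        f"in the order they occur:\n{caption}"
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def build_dsg(actions: List[str], n_frames: int = 16) -> List[SceneNode]:
    """Step 2: spread the ordered actions evenly over the frame span.

    A stand-in for the paper's LLM-produced schedule, which assigns
    start/end times per action rather than splitting uniformly.
    """
    span = max(1, n_frames // max(1, len(actions)))
    nodes = []
    for i, line in enumerate(actions):
        parts = line.split()
        if len(parts) < 2:
            continue
        subj, act = parts[0], parts[1]
        obj = " ".join(parts[2:]) or "scene"
        nodes.append(SceneNode(subj, act, obj, i * span, (i + 1) * span - 1))
    return nodes

def enrich_dsg(llm: Callable[[str], str], nodes: List[SceneNode]) -> List[SceneNode]:
    """Step 3: ask the LLM to add plausible visual details to each scene."""
    for node in nodes:
        reply = llm(
            f"Give two visual details for: {node.subject} {node.action} {node.obj}"
        )
        node.details = [d.strip() for d in reply.splitlines() if d.strip()]
    return nodes

if __name__ == "__main__":
    # Offline stub standing in for a real LLM call (e.g., the ChatGPT API).
    def stub_llm(prompt: str) -> str:
        return "dog runs across lawn\ndog catches frisbee"

    actions = extract_actions(stub_llm, "a dog catching a frisbee on a lawn")
    dsg = enrich_dsg(stub_llm, build_dsg(actions))
    for node in dsg:
        print(node)
```

Note that the sketch stops at the DSG itself: per the abstract, the enriched DSG is subsequently encoded into fine-grained spatio-temporal features that condition the backbone diffusion model, a stage the paper's released code implements.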
Authors (5)
  1. Hao Fei (105 papers)
  2. Shengqiong Wu (36 papers)
  3. Wei Ji (202 papers)
  4. Hanwang Zhang (161 papers)
  5. Tat-Seng Chua (360 papers)
Citations (10)
