
GVDIFF: Grounded Text-to-Video Generation with Diffusion Models (2407.01921v2)

Published 2 Jul 2024 in cs.CV

Abstract: Text-to-video (T2V) generation has received significant attention, yet unifying discrete and continuous grounding conditions in T2V generation remains under-explored. This paper proposes a Grounded text-to-Video generation framework, termed GVDIFF. First, we inject the grounding condition into self-attention through an uncertainty-based representation to explicitly guide the focus of the network. Second, we introduce a spatial-temporal grounding layer that connects the grounding condition with target objects and equips the model with grounded generation capability in the spatial-temporal domain. Third, a dynamic gate network adaptively skips the redundant grounding process, selectively extracting grounding information and semantics while improving efficiency. We extensively evaluate the grounded generation capacity of GVDIFF and demonstrate its versatility in applications including long-range video generation, sequential prompts, and object-specific editing.
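The abstract outlines two of the framework's components that lend themselves to a concrete illustration: a spatial-temporal grounding layer that ties grounding conditions to target objects, and a dynamic gate that skips redundant grounding. The PyTorch sketch below is a minimal, hypothetical reading of that idea, not the authors' implementation: flattened space-time latent tokens cross-attend to grounding tokens (e.g., encoded boxes or reference features), and a small gate network scales the grounding residual so it can be effectively skipped. All class, function, and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GatedSTGroundingLayer(nn.Module):
    """Illustrative sketch: cross-attend video latent tokens to grounding
    tokens and add the result as a gated residual. The scalar gate lets the
    layer down-weight (effectively skip) grounding when it adds little."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Tiny gate network: pools latent tokens and predicts a 0..1 weight.
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.SiLU(),
            nn.Linear(dim // 4, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, ground_tokens: torch.Tensor) -> torch.Tensor:
        # x:             (B, T, L, C) latent tokens per frame
        # ground_tokens: (B, N, C)    grounding-condition tokens
        B, T, L, C = x.shape
        q = self.norm(x).reshape(B, T * L, C)          # flatten space-time
        attended, _ = self.attn(q, ground_tokens, ground_tokens)
        g = self.gate(q.mean(dim=1, keepdim=True))     # (B, 1, 1) gate weight
        return x + (g * attended).reshape(B, T, L, C)  # gated residual


if __name__ == "__main__":
    layer = GatedSTGroundingLayer(dim=320)
    latents = torch.randn(2, 8, 64, 320)   # 2 videos, 8 frames, 8x8 latent grid
    grounding = torch.randn(2, 4, 320)     # 4 grounding tokens per video
    print(layer(latents, grounding).shape)  # torch.Size([2, 8, 64, 320])
```

In this reading, a gate value near zero reduces the layer to an identity mapping over the latent tokens, which is one plausible way a "dynamic gate" could trade grounding fidelity for efficiency; the paper's actual gating and uncertainty-based representation may differ.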
