Dual-Stream Diffusion Net for Text-to-Video Generation (2308.08316v3)

Published 16 Aug 2023 in cs.CV

Abstract: With the emergence of diffusion models, text-to-video generation has recently attracted increasing attention. An important bottleneck, however, is that generated videos often carry flickers and artifacts. In this work, we propose a dual-stream diffusion net (DSDN) to improve the consistency of content variations in generated videos. In particular, the two designed diffusion streams, a video content branch and a motion branch, not only run separately in their own spaces to produce personalized video content and variations, but are also aligned across the content and motion domains through our designed cross-transformer interaction module, which benefits the smoothness of the generated videos. In addition, we introduce a motion decomposer and combiner to facilitate operations on video motion. Qualitative and quantitative experiments demonstrate that our method produces smooth, continuous videos with fewer flickers.
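The abstract names the key components (two diffusion streams, a cross-transformer interaction module, and a motion decomposer/combiner) but gives no implementation details. The PyTorch sketch below is therefore only a guess at how such pieces might fit together: the residual cross-attention layout, the temporal-mean motion decomposition, and all dimensions are assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class CrossTransformerInteraction(nn.Module):
    """Bidirectional cross-attention between the content and motion streams.

    Hypothetical layer: the abstract names a "cross-transformer interaction
    module" but gives no internals, so everything here is an assumption.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One attention block per direction of information flow.
        self.motion_to_content = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.content_to_motion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_content = nn.LayerNorm(dim)
        self.norm_motion = nn.LayerNorm(dim)

    def forward(self, content: torch.Tensor, motion: torch.Tensor):
        # content, motion: (batch, tokens, dim) latent features from each stream.
        # Each stream queries the other and adds the result residually, which
        # is one plausible way to align the content and motion domains.
        c_attn, _ = self.motion_to_content(self.norm_content(content), motion, motion)
        m_attn, _ = self.content_to_motion(self.norm_motion(motion), content, content)
        return content + c_attn, motion + m_attn


def decompose_motion(latents: torch.Tensor):
    """Motion decomposer sketch (assumed): split per-frame latents into a
    shared content component (temporal mean) and per-frame motion residuals."""
    # latents: (batch, frames, dim)
    content = latents.mean(dim=1, keepdim=True)  # (batch, 1, dim) content base
    motion = latents - content                   # (batch, frames, dim) residuals
    return content, motion


def combine_motion(content: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
    """Motion combiner sketch (assumed): reattach motion residuals to content."""
    return content + motion


if __name__ == "__main__":
    batch, frames, dim = 2, 16, 320

    # Decompose a toy batch of per-frame latents into content + motion.
    latents = torch.randn(batch, frames, dim)
    content, motion = decompose_motion(latents)

    # Let the two streams exchange information via cross-attention.
    interact = CrossTransformerInteraction(dim)
    content_tokens = content.expand(-1, frames, -1)  # broadcast content over frames
    content_out, motion_out = interact(content_tokens, motion)

    # Recombine into aligned per-frame latents.
    recombined = combine_motion(content_out, motion_out)
    print(recombined.shape)  # torch.Size([2, 16, 320])
```

Residual cross-attention is a common way to couple two feature streams without collapsing them into one, which matches the abstract's claim that the branches run separately yet stay aligned; the actual DSDN module may differ substantially.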
