A Survey on Long Video Generation: Challenges, Methods, and Prospects (2403.16407v1)

Published 25 Mar 2024 in cs.CV

Abstract: Video generation is a rapidly advancing research area, garnering significant attention due to its broad range of applications. One critical aspect of this field is the generation of long-duration videos, which presents unique challenges and opportunities. This paper presents the first survey of recent advancements in long video generation and summarises them into two key paradigms: divide and conquer and temporal autoregressive. We delve into the common models employed in each paradigm, including aspects of network design and conditioning techniques. Furthermore, we offer a comprehensive overview and classification of the datasets and evaluation metrics that are crucial for advancing long video generation research. Concluding with a summary of existing studies, we also discuss the emerging challenges and future directions in this dynamic field. We hope that this survey will serve as an essential reference for researchers and practitioners in the realm of long video generation.
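
To make the two paradigms named in the abstract concrete, here is a minimal, purely illustrative Python sketch; it is not taken from the paper, and `generate_keyframes`, `fill_between`, and `generate_next_chunk` are hypothetical stand-ins for learned models. Divide and conquer first plans sparse keyframes over the whole video and then fills each gap (a step that can run in parallel), while temporal autoregressive extends the video chunk by chunk, conditioning each new chunk on previously generated frames.

```python
"""Illustrative sketch of the two long-video-generation paradigms.
Frames are plain strings standing in for image tensors; the three
generator functions are hypothetical placeholders, not real model APIs."""

from typing import List

Frame = str  # stand-in for an actual frame tensor or latent


def generate_keyframes(prompt: str, num_keyframes: int) -> List[Frame]:
    # Hypothetical global model: sparse keyframes spanning the full video.
    return [f"{prompt}-key{i}" for i in range(num_keyframes)]


def fill_between(a: Frame, b: Frame, n: int) -> List[Frame]:
    # Hypothetical local model: generates n frames between two keyframes.
    return [f"({a}->{b})#{i}" for i in range(n)]


def divide_and_conquer(prompt: str, num_keyframes: int, frames_per_gap: int) -> List[Frame]:
    """Divide and conquer: plan keyframes for the whole video first,
    then fill each gap independently (the gaps are parallelizable)."""
    keys = generate_keyframes(prompt, num_keyframes)
    video: List[Frame] = []
    for a, b in zip(keys, keys[1:]):
        video.append(a)
        video.extend(fill_between(a, b, frames_per_gap))
    video.append(keys[-1])
    return video


def generate_next_chunk(context: List[Frame], prompt: str, chunk_len: int) -> List[Frame]:
    # Hypothetical short-clip model conditioned on previously generated frames.
    start = len(context)
    return [f"{prompt}-f{start + i}" for i in range(chunk_len)]


def temporal_autoregressive(prompt: str, total_frames: int, chunk_len: int) -> List[Frame]:
    """Temporal autoregressive: extend the video chunk by chunk, each chunk
    conditioned on what has already been generated (inherently sequential)."""
    video: List[Frame] = []
    while len(video) < total_frames:
        video.extend(generate_next_chunk(video, prompt, chunk_len))
    return video[:total_frames]


if __name__ == "__main__":
    print(len(divide_and_conquer("sunset", num_keyframes=5, frames_per_gap=3)))   # 17 frames
    print(len(temporal_autoregressive("sunset", total_frames=17, chunk_len=4)))   # 17 frames
```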

