Vlogger: Make Your Dream A Vlog (2401.09414v1)

Published 17 Jan 2024 in cs.CV, cs.AI, cs.LG, and cs.MM

Abstract: In this work, we present Vlogger, a generic AI system for generating minute-level video blogs (i.e., vlogs) from user descriptions. Unlike short videos of a few seconds, a vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, our Vlogger smartly leverages an LLM as Director and decomposes the long video generation task of vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, our Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, ShowMaker, which serves as a videographer in our Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. Besides, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without loss of video coherence on script and actor. The code and model are available at https://github.com/zhuangshaobin/Vlogger.

Summary

  • The paper presents Vlogger, an AI system that transforms simple stories into coherent vlogs using a four-stage process emulating human video production roles.
  • It integrates a large language model as a director and a novel diffusion model, ShowMaker, to generate visually consistent and temporally coherent video snippets.
  • Benchmark results show state-of-the-art zero-shot text-to-video generation and long-video prediction performance, and Vlogger can generate vlogs exceeding five minutes.

Introducing Vlogger: AI-Driven Video Blog Creation

Unveiling Vlogger

Envision transforming narrative descriptions into cohesive video blogs, or vlogs. Vlogger takes a step in this direction: an AI system designed to convert user-provided stories into minute-level vlogs. Vlogs differ from typical short generated clips in their complex storylines, diversified scenes, and length, posing challenges for automated generation that Vlogger sets out to address.

Systematic Vlog Generation

Vlogger approaches the intricacies of vlog creation by emulating the real-world video production process. It integrates an LLM as a 'Director', which guides production through four distinct stages, mimicking the roles of scriptwriter, actor, videographer, and voice-over artist usually filled by humans in professional settings (a pipeline sketch follows the list below). These stages are:

  1. Script: Crafting a narrative script from the user story.
  2. Actor: Designing visual references for characters.
  3. ShowMaker: Generating individual video snippets with spatial-temporal coherence.
  4. Voicer: Adding voice dubbing that aligns with the created script.
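
To make the "top-down planning, bottom-up shooting" idea concrete, here is a minimal, hypothetical sketch of how such a four-stage pipeline could be orchestrated in Python. All class and method names (plan_script, design_actors, generate, synthesize) are illustrative assumptions for exposition, not the authors' actual API; the real system wires specific foundation models into each role.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    description: str   # one shooting scene planned by the LLM Director
    duration_s: float  # target length of the snippet for this scene


def generate_vlog(user_story: str, llm, actor_model, showmaker, voicer):
    # Stage 1 (Script): the LLM Director decomposes the story into scenes (top-down planning).
    scenes: List[Scene] = llm.plan_script(user_story)

    # Stage 2 (Actor): create visual reference images for recurring characters.
    actor_refs = actor_model.design_actors(user_story, scenes)

    # Stage 3 (ShowMaker): shoot each scene as a short snippet (bottom-up shooting),
    # conditioned on the scene's script text and the actor references.
    snippets = [
        showmaker.generate(scene.description, actor_refs, scene.duration_s)
        for scene in scenes
    ]

    # Stage 4 (Voicer): dub narration that follows each scene's script.
    audio_tracks = [voicer.synthesize(scene.description) for scene in scenes]

    # Final assembly (muxing snippets with their audio) is omitted here.
    return list(zip(snippets, audio_tracks))
```

Because each stage is an explicit call to a separate model, the intermediate outputs (script, actor references, snippets) remain inspectable, which is what makes the cooperation between planning and shooting explainable.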

The Role of ShowMaker

Central to the process is ShowMaker, a novel video diffusion model that produces the video snippet for each scene. By attending to the script as a textual prompt and the actor references as visual prompts, ShowMaker keeps the script coherent and the actors' appearances consistent throughout the vlog. The model is trained with a mixed paradigm that improves its performance both at generating video from text descriptions (T2V) and at predicting subsequent video frames; a sketch of this idea follows.
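
The mixed training paradigm can be illustrated with a minimal sketch, assuming a diffusers-style noise scheduler and a 3D UNet conditioned through cross-attention. The function name, the probability p_predict, and the choice to concatenate text and actor embeddings are illustrative assumptions, not the paper's actual implementation; the point is that one backbone is trained for both pure T2V generation and prediction by sometimes keeping the first few frames clean as conditioning.

```python
import torch
import torch.nn.functional as F


def mixed_training_step(unet, scheduler, video_latents, text_emb, actor_emb,
                        p_predict=0.5, max_cond_frames=4):
    """One training step over latent videos shaped (B, C, T, H, W)."""
    b = video_latents.shape[0]
    noise = torch.randn_like(video_latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (b,),
                              device=video_latents.device)
    noisy = scheduler.add_noise(video_latents, noise, timesteps)

    # Mixed paradigm: with probability p_predict keep the first k frames clean,
    # so the model learns prediction; otherwise train pure text-to-video.
    loss_mask = torch.ones_like(noise)
    if torch.rand(()).item() < p_predict:
        k = int(torch.randint(1, max_cond_frames + 1, ()).item())
        noisy[:, :, :k] = video_latents[:, :, :k]
        loss_mask[:, :, :k] = 0.0  # no denoising loss on the clean conditioning frames

    # Script (text) and Actor (reference image) embeddings enter via cross-attention.
    cond = torch.cat([text_emb, actor_emb], dim=1)
    pred = unet(noisy, timesteps, encoder_hidden_states=cond).sample

    return (F.mse_loss(pred, noise, reduction="none") * loss_mask).mean()
```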

Benchmarking Vlogger

When benchmarked against other models, Vlogger achieves state-of-the-art performance on zero-shot T2V generation and long-video prediction tasks. Notably, it can generate vlogs exceeding five minutes from open-world descriptions while maintaining narrative and visual coherence in both script and actor appearance.

Conclusion

Representing an evolution in video generation, Vlogger empowers end-users to populate the digital world with rich and coherent vlogs drawn from simple descriptions. Open-source access to the code and model stimulates further innovation in the field. With its intelligent design and superior performance, Vlogger stands out as an impressive achievement in AI-powered content creation, bridging the gap between textual imagination and visual storytelling.
