Vlogger: Make Your Dream A Vlog

Published 17 Jan 2024 in cs.CV, cs.AI, cs.LG, and cs.MM | (2401.09414v1)

Abstract: In this work, we present Vlogger, a generic AI system for generating a minute-level video blog (i.e., vlog) from user descriptions. Unlike short videos of a few seconds, a vlog often contains a complex storyline with diversified scenes, which is challenging for most existing video generation approaches. To break through this bottleneck, our Vlogger smartly leverages an LLM as Director and decomposes the long video generation task of a vlog into four key stages, where we invoke various foundation models to play the critical roles of vlog professionals, including (1) Script, (2) Actor, (3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, our Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover, we introduce a novel video diffusion model, ShowMaker, which serves as a videographer in our Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts, it can effectively enhance spatial-temporal coherence in the snippet. Besides, we design a concise mixed training paradigm for ShowMaker, boosting its capacity for both T2V generation and prediction. Finally, extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs from open-world descriptions, without losing video coherence on script and actor. The code and models are available at https://github.com/zhuangshaobin/Vlogger.


Summary

  • The paper's main contribution is introducing a four-stage AI framework that decomposes long-form video generation into script creation, actor design, video shooting, and dubbing.
  • It employs innovative top-down planning with LLM dialogue and bottom-up shooting through a spatial-temporal diffusion model to ensure coherent scene transitions.
  • Experimental results on benchmarks such as UCF-101 and Kinetics-400 demonstrate superior performance, with lower FVD and higher CLIP similarity scores than previous methods.

Introduction

The paper "Vlogger: Make Your Dream A Vlog" (2401.09414) introduces a comprehensive AI system designed for generating complex minute-level video blogs (vlogs) from user descriptions. Unlike traditional short video generation methods that produce simple, short clips, vlogs require intricate narratives with diverse scenes. The Vlogger framework innovatively utilizes a LLM to decompose the video generation task into essential stages, engaging various foundational models that mimic the roles of film production professionals.

Methodology

The Vlogger framework operates in four key stages, each executed by distinct components acting as vlog professionals: Script, Actor, ShowMaker, and Voicer. The system's unique approach involves a combination of top-down planning and bottom-up shooting strategies.
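
Read as a pipeline, the four stages compose roughly as in the following sketch; every name, type, and interface in it is a hypothetical placeholder for illustration, not the repository's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Scene:
    text: str            # scene description written by the LLM Director
    actors: List[str]    # names of actors appearing in this scene
    duration: float      # seconds allotted to this scene

def make_vlog(user_story: str,
              write_script: Callable[[str], List[Scene]],
              extract_actors: Callable[[List[Scene]], Dict[str, str]],
              draw_actor: Callable[[str, str], object],
              shoot: Callable[[str, list, float], object],
              speak: Callable[[str], object],
              mux: Callable[[list, list], object]) -> object:
    # (1) Script: top-down planning, turning the user story into timed scenes.
    script = write_script(user_story)
    # (2) Actor: one reference image per role extracted from the script.
    portraits = {name: draw_actor(name, desc)
                 for name, desc in extract_actors(script).items()}
    # (3) ShowMaker: bottom-up shooting of a video snippet for each scene,
    #     conditioned on its text and the reference images of its actors.
    snippets = [shoot(s.text, [portraits[a] for a in s.actors], s.duration)
                for s in script]
    # (4) Voicer: narrate each scene, then mux audio with the snippets.
    narration = [speak(s.text) for s in script]
    return mux(snippets, narration)
```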

Top-Down Planning

The top-down planning phase begins with the LLM acting as a Director. Given a user story, the Director decomposes the video generation task into a series of scripted scenes using four rounds of dialogue to gradually refine the script.

Figure 1: Top-Down Planning converts a user story into a final script through dialogue with the LLM.

  • Script Creation: The LLM Director follows a progressive script-creation paradigm, translating the user story into a detailed script that breaks the content into individual scenes with designated durations (a minimal sketch of this planning dialogue follows this list).
  • Actor Design: After script creation, the LLM extracts the actor roles from the script and invokes a character designer to generate a reference image for each actor. It also assigns actors to specific scenes based on its analysis of the script.
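
As an illustration of this progressive planning, the sketch below assumes a generic `chat` callable so that no particular LLM API is implied; the four round prompts and the JSON schema are invented for illustration and are not the paper's actual prompts.

```python
import json
from typing import Callable, Dict, List

def plan_script(user_story: str,
                chat: Callable[[List[Dict[str, str]]], str]) -> list:
    """Four-round dialogue with the LLM Director. `chat` is any callable that
    maps a message list to the assistant's reply; the round prompts and JSON
    schema below are illustrative only."""
    history = [{"role": "system",
                "content": "You are a vlog director. Answer in JSON only."}]
    rounds = [
        f"Split this story into a sequence of shooting scenes: {user_story}",
        "Rewrite each scene as a concrete visual description suitable for a "
        "text-to-video model.",
        "Assign each scene a duration in seconds so the whole vlog is "
        "minute-level.",
        "List the actors appearing in each scene and return the final script "
        'as a JSON list of {"text", "actors", "duration"} objects.',
    ]
    reply = ""
    for prompt in rounds:                                # progressive refinement
        history.append({"role": "user", "content": prompt})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
    return json.loads(reply)                             # the scene-level script
```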

Bottom-Up Shooting

With a detailed script and actor references, the system transitions to the bottom-up shooting phase, primarily driven by the ShowMaker model to produce video snippets for each scene.

Figure 2: Bottom-Up Shooting generates video snippets with script and actor coherence using ShowMaker.

  • ShowMaker Shooting: ShowMaker, a novel video diffusion model, acts as the videographer: it takes the scene's script description and the relevant actor images as textual and visual prompts to keep each snippet spatially and temporally coherent. It also supports variable snippet durations by combining its generation and prediction modes at inference (see the shooting-loop sketch after this list).
  • Voicer Dubbing: Finally, the Voicer, a text-to-speech model such as Bark, reads the script to produce the dubbing audio, which is synchronized with the video snippets to complete the vlog.
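
The shooting-loop sketch below illustrates how the two ShowMaker modes could be chained to reach an arbitrary scene duration; the `generate` and `predict` callables, the frame rate, clip length, and overlap size are all assumptions rather than the paper's exact settings.

```python
def shoot_scene(text, actor_images, duration_s, generate, predict,
                fps=8, clip_len=16):
    """Chain ShowMaker's two inference modes to reach an arbitrary duration.
    `generate(text, actors, n_frames)` samples the opening clip (T2V mode);
    `predict(text, actors, context, n_frames)` continues it, conditioned on
    the most recent frames (prediction mode)."""
    target = int(duration_s * fps)
    frames = generate(text, actor_images, clip_len)      # T2V generation mode
    while len(frames) < target:
        context = frames[-clip_len // 2:]                # recent frames as condition
        frames.extend(predict(text, actor_images, context, clip_len))
    return frames[:target]
```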

ShowMaker: Design and Training

ShowMaker is central to the Vlogger system, introduced as a new video diffusion model incorporating two distinct features: the Spatial-Temporal Enhanced Block (STEB) and a mixed training paradigm.

Figure 3: (a) ShowMaker's architecture, (b) STEB for actor and script coherence enhancement, (c) Mixed training paradigm for T2V generation.

  • Spatial-Temporal Enhanced Block (STEB): The STEB enhances coherence by applying spatial cross-attention to actor reference images and temporal cross-attention to script descriptions, keeping actors and content consistent across frames (a rough sketch follows this list).
  • Mixed Training Paradigm: Training randomly switches between generation and prediction modes by probabilistically masking conditioning frames, which boosts both text-to-video (T2V) generation and video prediction.
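
A rough PyTorch sketch of what such a block could look like is shown below; the tensor layout, normalization, and residual structure are assumptions, and the paper's exact STEB design may differ.

```python
import torch
import torch.nn as nn

class STEBSketch(nn.Module):
    """Illustrative Spatial-Temporal Enhanced Block: spatial cross-attention
    over actor-image tokens and temporal cross-attention over script-text
    tokens. Dimensions and layout are assumptions, not the paper's exact design."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, actor_tokens, text_tokens):
        # x: (B, T, HW, C) video latents; actor_tokens: (B, Na, C); text_tokens: (B, Nt, C)
        B, T, HW, C = x.shape
        # Spatial cross-attention: each frame's spatial tokens attend to the actor image.
        xs = x.reshape(B * T, HW, C)
        a = actor_tokens.repeat_interleave(T, dim=0)
        xs = xs + self.spatial_attn(self.norm1(xs), a, a, need_weights=False)[0]
        x = xs.reshape(B, T, HW, C)
        # Temporal cross-attention: each spatial location's frame sequence attends to the script.
        xt = x.permute(0, 2, 1, 3).reshape(B * HW, T, C)
        t = text_tokens.repeat_interleave(HW, dim=0)
        xt = xt + self.temporal_attn(self.norm2(xt), t, t, need_weights=False)[0]
        return xt.reshape(B, HW, T, C).permute(0, 2, 1, 3)
```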

Experimental Results

Extensive experiments demonstrate Vlogger's strength in long video generation, with state-of-the-art performance on zero-shot T2V generation and prediction tasks. It surpasses existing methods such as Phenaki while using fewer training resources and maintaining coherence over extended durations.

  • Performance Metrics: Comparisons on UCF-101, Kinetics-400, and MSR-VTT show that Vlogger achieves lower Fréchet Video Distance (FVD) and higher CLIP similarity scores than prior methods, confirming its efficacy (a minimal CLIP-similarity sketch follows the figures below).
  • Visual Comparisons: Qualitative analyses illustrate that Vlogger generates more diverse and coherent video content than other approaches.

    Figure 4: Comparison with state-of-the-art methods on long video generation shows Vlogger's superior performance.

    Figure 5: Qualitative ablation for STEB and the training paradigm illustrates the impact of the proposed enhancements.
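
For concreteness, the CLIP similarity between generated frames and the prompt can be computed along the following lines with the open_clip toolkit; the backbone, checkpoint, and per-frame averaging here are assumptions rather than the paper's exact evaluation protocol.

```python
import torch
import open_clip

# Hypothetical choice of backbone/checkpoint; the paper's evaluation setup may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clipsim(frames, prompt):
    """Average cosine similarity between each generated frame (a PIL image)
    and the text prompt; averaging per-frame scores is an assumption here."""
    imgs = torch.stack([preprocess(f) for f in frames])
    img_feat = model.encode_image(imgs)
    txt_feat = model.encode_text(tokenizer([prompt]))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).mean().item()
```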

Conclusion

The Vlogger system offers a substantial advancement in autonomous video blog generation by efficiently managing the complexity inherent in long video formats. It effectively blends top-down and bottom-up methodologies, ensuring coherent narrative expression in the generated vlogs. Future work may focus on improving actor realism and exploring new domains for script and actor representations, further bridging the gap between AI-generated content and human production standards.
