MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation (2311.18829v2)

Published 30 Nov 2023 in cs.CV

Abstract: We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly, MicroCinema introduces a Divide-and-Conquer strategy which divides the text-to-video into a two-stage process: text-to-image generation and image&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of the recent advances in text-to-image models, such as Stable Diffusion, Midjourney, and DALLE, to generate photorealistic and highly detailed images. b) Leveraging the generated image, the model can allocate less focus to fine-grained appearance details, prioritizing the efficient learning of motion dynamics. To implement this strategy effectively, we introduce two core designs. First, we propose the Appearance Injection Network, enhancing the preservation of the appearance of the given image. Second, we introduce the Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion, guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. See https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.

The paper presents a two-stage framework for text-to-video generation that decomposes the problem into a text-to-image stage and an image–text-to-video stage, thereby leveraging existing high-fidelity text-to-image (T2I) models to produce a key-frame image. This image then conditions a dedicated video synthesis module, which is based on an extended 3D diffusion architecture. The proposed approach effectively decouples the challenges of detailed appearance modeling from motion dynamics, enabling improved temporal coherence and appearance preservation.
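
As a rough illustration of this decomposition, the inference flow can be sketched as below. The stage-1 call uses a real off-the-shelf T2I pipeline (SDXL via diffusers), while `image_text_to_video` is a hypothetical stand-in for the paper's image&text-to-video generator; this is an illustrative sketch, not the authors' code.

```python
# Sketch of the two-stage divide-and-conquer inference flow.
# Stage 1 is a real off-the-shelf T2I model; stage 2 is a hypothetical
# stand-in for MicroCinema's image&text-to-video generator.
import torch
from diffusers import StableDiffusionXLPipeline

prompt = "a corgi running on the beach at sunset"

# Stage 1: text-to-image with a pre-trained T2I model (e.g., SDXL).
t2i = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
keyframe = t2i(prompt).images[0]  # PIL image serving as the appearance anchor

# Stage 2: image&text-to-video (hypothetical interface). The 3D diffusion
# model is conditioned on both the key frame and the original text prompt,
# producing a low-frame-rate clip (e.g., 9 frames at 2 fps) that is later
# temporally interpolated and spatially super-resolved.
video_frames = image_text_to_video(keyframe, prompt, num_frames=9, fps=2)
```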

Key technical contributions include:

  • Two-Stage Divide-and-Conquer Pipeline:
    • In the first stage, an off-the-shelf T2I model (e.g., SDXL or Stable Diffusion 2.1) generates a high-quality key-frame image.
    • In the second stage, the image–text-to-video generator takes both the key-frame and the provided textual prompt as conditioning inputs to synthesize a video sequence. This strategy significantly reduces the burden of explicitly learning fine-grained appearance details from scratch in a video generation model.
  • 3D Diffusion Model Architecture:
    • The model extends the standard 2D U-Net used in diffusion models into a 3D structure by interleaving 1D temporal convolutional layers and 1D temporal attention layers after the spatial layers.
    • Temporal layers are initialized to zero and incorporated via a skip connection that preserves the pre-trained spatial module’s learned representation, thereby ensuring stable fine-tuning for video synthesis (a rough PyTorch sketch of such a zero-initialized temporal block follows this list).
  • Appearance Injection Network (AppearNet):
    • AppearNet enhances the conditioning of the generated video on the key frame by extracting multi-scale features from a replicated version of the key-frame latent and injecting them densely into both encoder and decoder branches of the main U-Net.
    • A de-normalization strategy, inspired by SPADE (Spatially-Adaptive Denormalization), is employed to fuse these features within the Group Normalization operations, which improves the coherence of appearance across generated frames (a SPADE-style modulation sketch follows this list).
  • Appearance Noise Prior:
    • Recognizing that the conventional diffusion process relies on i.i.d. Gaussian noise, the paper introduces an appearance-aware modification where a fraction (controlled by a coefficient λ) of the key-frame latent is added to the noise vector.
    • The revised noise is given by $\bm{\epsilon} = \lambda\,\bm{z}^{c} + \bm{\epsilon}_{n}$, where $\bm{z}^{c}$ is the center-frame latent and $\bm{\epsilon}_{n}$ is drawn from $\mathcal{N}(0, \bm{I})$.
    • This adaptation preserves the appearance information from the key frame during both training and inference while enabling the diffusion model to focus more on learning motion dynamics. Extensive ablation studies validate that the optimal configuration (λ = 0.03, with an additional inference offset γ = 0.02) yields significant improvements, reducing the Fréchet Video Distance (FVD) and increasing the Inception Score (IS); a minimal sketch of this noise construction follows this list.
  • Temporal Interpolation and Super-Resolution:
    • The base image–text-to-video model produces low frame rate outputs (e.g., 2 fps, 9 frames at 320×320 resolution). A temporal interpolation model is then employed to increase the frame rate (up to 32 fps) through latent-space interpolation, while an off-the-shelf spatial super-resolution module enhances the visual fidelity of the final output.
  • Empirical Performance and Ablation Studies:
    • The method achieves state-of-the-art zero-shot performance with an FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT.
    • Detailed ablation studies compare several methods for integrating appearance information (direct concatenation, decoder-only injection, encoder–decoder fusion, and SPADE-based fusion), demonstrating that the additive injection into both encoder and decoder with de-normalization yields superior results.
    • Further analysis shows that the proposed appearance noise prior not only improves appearance consistency but also enhances efficiency, as it allows the model to produce satisfactory outputs with a reduced number of sampling steps.
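
The 3D-extension bullet above can be made concrete with a rough PyTorch sketch, assuming a block in which a 1D temporal convolution and 1D temporal attention follow the spatial layers and their output projections are zero-initialized, so the residual path initially reproduces the pre-trained 2D behavior. This is an illustration of the idea, not the authors' implementation; layer shapes and hyperparameters are assumptions.

```python
# Rough sketch of the temporal extension: 1D temporal convolution and 1D
# temporal attention added after the spatial layers, with zero-initialized
# output projections so the block starts as an identity mapping.
import torch
import torch.nn as nn


def zero_init(module: nn.Module) -> nn.Module:
    """Zero the parameters so the layer initially contributes nothing."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module


class TemporalBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.temporal_conv = zero_init(nn.Conv1d(channels, channels, kernel_size=3, padding=1))
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.attn_out = zero_init(nn.Linear(channels, channels))  # zeroed skip projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width) features from the spatial layers
        b, c, t, h, w = x.shape

        # 1D temporal convolution over the frame axis, per spatial location.
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        x = x + self.temporal_conv(seq).reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

        # Temporal self-attention over the frame axis; the zero-initialized
        # output projection feeds the residual skip connection.
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        attn, _ = self.temporal_attn(tokens, tokens, tokens)
        attn = self.attn_out(attn).reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + attn
```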
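
For the Appearance Injection Network, the SPADE-inspired fusion into Group Normalization can be sketched as follows. Only the de-normalization step is shown, with appearance features predicting a per-pixel scale and shift for the normalized activations; the feature resolutions and the routing of AppearNet features to each U-Net level are assumptions, not the authors' code.

```python
# Sketch of SPADE-style de-normalization fusing AppearNet features into the
# main U-Net's Group Normalization (assumed formulation).
import torch
import torch.nn as nn


class AppearanceSPADEGroupNorm(nn.Module):
    def __init__(self, channels: int, appear_channels: int, num_groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, channels, affine=False)
        self.to_gamma = nn.Conv2d(appear_channels, channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(appear_channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, appear_feat: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, channels, h, w) U-Net activations
        # appear_feat: (batch * frames, appear_channels, h, w) feature map derived
        #              from the replicated key-frame latent (AppearNet branch)
        gamma = self.to_gamma(appear_feat)  # per-pixel scale
        beta = self.to_beta(appear_feat)    # per-pixel shift
        return self.norm(x) * (1 + gamma) + beta
```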
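
Finally, the appearance noise prior is essentially a one-line change to the noise used by the diffusion process. The minimal sketch below assumes λ = 0.03 at training time and that the inference offset γ = 0.02 simply increases the key-frame coefficient when the initial latent is drawn; the precise inference-time usage of γ is an assumption here.

```python
# Minimal sketch of the appearance noise prior: epsilon = lambda * z_c + epsilon_n.
# lam = 0.03 and gamma = 0.02 follow the ablation above; how gamma enters
# inference is an assumption (here it biases the initial latent further toward
# the key-frame latent). Not the authors' code.
import torch


def appearance_noise(z_center: torch.Tensor, lam: float = 0.03) -> torch.Tensor:
    """Training-time noise: a scaled key-frame latent plus i.i.d. Gaussian noise."""
    return lam * z_center + torch.randn_like(z_center)


def initial_latent(z_center: torch.Tensor, lam: float = 0.03, gamma: float = 0.02) -> torch.Tensor:
    """Inference-time starting noise, biased slightly more toward the key frame."""
    return (lam + gamma) * z_center + torch.randn_like(z_center)
```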

In summary, the paper introduces a well-founded divide-and-conquer strategy for text-to-video generation that effectively decouples appearance synthesis from motion modeling. By capitalizing on the strengths of pre-trained T2I models and incorporating novel conditioning mechanisms—namely, the Appearance Injection Network and the Appearance Noise Prior—the framework significantly advances the generation of coherent videos with high-fidelity appearances and accurate motion.

Authors (16)
  1. Yanhui Wang (13 papers)
  2. Jianmin Bao (65 papers)
  3. Wenming Weng (7 papers)
  4. Ruoyu Feng (16 papers)
  5. Dacheng Yin (13 papers)
  6. Tao Yang (520 papers)
  7. Jingxu Zhang (3 papers)
  8. Qi Dai
  9. Zhiyuan Zhao (1 paper)
  10. Chunyu Wang (43 papers)
  11. Kai Qiu (19 papers)
  12. Yuhui Yuan (42 papers)
  13. Chuanxin Tang (13 papers)
  14. Xiaoyan Sun (46 papers)
  15. Chong Luo (58 papers)
  16. Baining Guo (53 papers)
Citations (7)