The paper presents a two-stage framework for text-to-video generation that decomposes the problem into a text-to-image stage and an image–text-to-video stage, thereby leveraging existing high-fidelity text-to-image (T2I) models to produce a key-frame image. This image then conditions a dedicated video synthesis module, which is based on an extended 3D diffusion architecture. The proposed approach effectively decouples the challenges of detailed appearance modeling from motion dynamics, enabling improved temporal coherence and appearance preservation.
Key technical contributions include:
- Two-Stage Divide-and-Conquer Pipeline:
- In the first stage, an off-the-shelf T2I model (e.g., SDXL or Stable Diffusion 2.1) generates a high-quality key-frame image.
- In the second stage, the image–text-to-video generator takes both the key-frame and the provided textual prompt as conditioning inputs to synthesize a video sequence. This strategy significantly reduces the burden of explicitly learning fine-grained appearance details from scratch in a video generation model.
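A minimal sketch of this two-stage orchestration is given below, assuming Hugging Face diffusers for the first stage; `image_text_to_video` is a hypothetical placeholder for the paper's second-stage model, which has no public API referenced here.

```python
import torch
from diffusers import StableDiffusionXLPipeline  # off-the-shelf T2I model for stage 1


def image_text_to_video(image, prompt: str, num_frames: int = 9):
    """Hypothetical stand-in for the paper's image-and-text-conditioned video generator."""
    raise NotImplementedError("Replace with the second-stage video synthesis model.")


def generate_video(prompt: str):
    # Stage 1: produce a high-quality key frame with a pre-trained T2I model.
    t2i = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    key_frame = t2i(prompt=prompt).images[0]

    # Stage 2: condition the video generator on both the key frame and the prompt.
    return image_text_to_video(image=key_frame, prompt=prompt, num_frames=9)
```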
- 3D Diffusion Model Architecture:
- The model extends the standard 2D U-Net used in diffusion models into a 3D structure by interleaving 1D temporal convolutional layers and 1D temporal attention layers after the spatial layers.
- Temporal layers are initialized to zero and incorporated via a skip connection that preserves the pre-trained spatial module’s learned representation, thereby ensuring stable fine-tuning for video synthesis.
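A rough PyTorch sketch of such a zero-initialized temporal layer with a residual skip connection follows; layer placement, kernel size, and tensor layout are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class TemporalConv(nn.Module):
    """1D temporal convolution inserted after a spatial block of the U-Net.
    Zero initialization makes the layer an identity mapping at the start of
    fine-tuning, preserving the pre-trained spatial representation."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, f)  # fold space into batch
        y = self.conv(y)                                        # mix information across frames
        y = y.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x + y  # skip connection: output equals input until the layer learns motion
```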
- Appearance Injection Network (AppearNet):
- AppearNet enhances the conditioning of the generated video on the key frame by extracting multi-scale features from a replicated version of the key-frame latent and injecting them densely into both encoder and decoder branches of the main U-Net.
- A de-normalization strategy, inspired by SPADE (Spatially-Adaptive Denormalization), is employed to fuse these features within the Group Normalization operations, which improves the coherence of appearance across generated frames.
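A minimal sketch of SPADE-style de-normalization applied inside a Group Normalization layer is shown below; the module and parameter names (`AppearanceDenorm`, `appear_channels`, group count) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AppearanceDenorm(nn.Module):
    """GroupNorm whose scale and shift are predicted from key-frame (appearance)
    features, in the spirit of SPADE."""

    def __init__(self, channels: int, appear_channels: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_gamma = nn.Conv2d(appear_channels, channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(appear_channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, appear_feat: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, channels, h, w); appear_feat: key-frame feature map
        appear_feat = F.interpolate(appear_feat, size=x.shape[-2:], mode="nearest")
        gamma = self.to_gamma(appear_feat)
        beta = self.to_beta(appear_feat)
        return self.norm(x) * (1 + gamma) + beta  # spatially adaptive de-normalization
```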
- Appearance Noise Prior:
- Recognizing that the conventional diffusion process relies on i.i.d. Gaussian noise, the paper introduces an appearance-aware modification where a fraction (controlled by a coefficient λ) of the key-frame latent is added to the noise vector.
- The revised noise is given by ε′ = ε + λ · z_c, where z_c is the center-frame (key-frame) latent and ε is drawn from 𝒩(0, I).
- This adaptation preserves the appearance information from the key frame during both training and inference while letting the diffusion model focus on learning motion dynamics. Extensive ablation studies show that the best configuration (λ = 0.03, with an additional inference-time offset γ = 0.02) yields significant improvements, reducing the Fréchet Video Distance (FVD) and increasing the Inception Score (IS).
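A short sketch of this noise construction is given below, using the λ = 0.03 and γ = 0.02 values from the ablation; exactly how the inference offset γ is combined with λ is an assumption made here for illustration.

```python
import torch


def appearance_noise(key_latent: torch.Tensor, num_frames: int, lam: float = 0.03) -> torch.Tensor:
    """Training-time noise: i.i.d. Gaussian noise plus a small fraction of the
    key-frame latent, broadcast across all frames."""
    # key_latent: (batch, channels, h, w)
    eps = torch.randn(
        key_latent.shape[0], key_latent.shape[1], num_frames, *key_latent.shape[2:],
        device=key_latent.device, dtype=key_latent.dtype,
    )
    prior = key_latent.unsqueeze(2).expand_as(eps)  # replicate the key frame over time
    return eps + lam * prior


def inference_init_noise(key_latent: torch.Tensor, num_frames: int,
                         lam: float = 0.03, gamma: float = 0.02) -> torch.Tensor:
    """Inference-time variant with an extra offset gamma; adding gamma to lam is an
    assumption for illustration, not the paper's verbatim formulation."""
    return appearance_noise(key_latent, num_frames, lam + gamma)
```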
- Temporal Interpolation and Super-Resolution:
- The base image–text-to-video model produces low-frame-rate outputs (e.g., 9 frames at 2 fps and 320×320 resolution). A temporal interpolation model then raises the frame rate (up to 32 fps) through latent-space interpolation, while an off-the-shelf spatial super-resolution module enhances the visual fidelity of the final output.
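The interpolation module itself is a learned model; as a placeholder only, the sketch below uses naive linear interpolation between consecutive frame latents to illustrate where such a step sits in the pipeline (a 2 fps → 32 fps upsampling corresponds to roughly 16× more frames).

```python
import torch


def naive_latent_interpolation(latents: torch.Tensor, factor: int = 16) -> torch.Tensor:
    """Linearly interpolate between consecutive frame latents.
    latents: (frames, channels, h, w). This is NOT the paper's learned temporal
    interpolation model, only a naive stand-in for illustration."""
    frames = [latents[0]]
    for prev, nxt in zip(latents[:-1], latents[1:]):
        for k in range(1, factor + 1):
            alpha = k / factor
            frames.append((1 - alpha) * prev + alpha * nxt)
    return torch.stack(frames)  # 9 input frames with factor=16 -> 129 frames (~32 fps over 4 s)
```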
- Empirical Performance and Ablation Studies:
- The method achieves state-of-the-art zero-shot performance with an FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT.
- Detailed ablation studies compare several methods for integrating appearance information (direct concatenation, decoder-only injection, encoder–decoder fusion, and SPADE-based fusion), demonstrating that the additive injection into both encoder and decoder with de-normalization yields superior results.
- Further analysis shows that the proposed appearance noise prior not only improves appearance consistency but also enhances efficiency, as it allows the model to produce satisfactory outputs with a reduced number of sampling steps.
In summary, the paper introduces a well-founded divide-and-conquer strategy for text-to-video generation that effectively decouples appearance synthesis from motion modeling. By capitalizing on the strengths of pre-trained T2I models and incorporating novel conditioning mechanisms—namely, the Appearance Injection Network and the Appearance Noise Prior—the framework significantly advances the generation of coherent videos with high-fidelity appearances and accurate motion.