PixelDance proposes a diffusion-based approach to high-dynamic video generation that augments text-to-video synthesis with two image instructions, one for the first frame and one for the last frame, to address the shortcomings of text-only conditioning. The method explicitly incorporates these image conditions into the latent diffusion framework to encourage richer and more temporally coherent motion dynamics.
The model architecture is built upon a latent diffusion model that leverages a modified 2D UNet extended to 3D through the integration of temporal convolution and self-attention layers. The text instruction is embedded using a pre-trained CLIP (Contrastive Language-Image Pretraining) encoder and injected via cross-attention modules, while the image instructions are encoded by a pre-trained Variational Autoencoder (VAE) to generate visual conditioning signals for the first and last frames. The conditioning is applied by concatenating the encoded features with the perturbed video latents, ensuring that both spatial details and temporal context are preserved.
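The concatenation-based conditioning can be illustrated with a minimal sketch. This is not the paper's released code; the function name, tensor shapes, and the convention of placing the two encoded image instructions in the first and last temporal slots of a zero-filled condition tensor are illustrative assumptions consistent with the description above.

```python
import torch

def build_conditioned_input(noisy_latents, first_latent, last_latent):
    """Assemble the UNet input by concatenating image conditions with noisy latents.

    noisy_latents: (B, C, T, H, W) perturbed video latents
    first_latent:  (B, C, H, W) VAE-encoded first-frame instruction
    last_latent:   (B, C, H, W) VAE-encoded last-frame instruction (zeros if dropped)

    Sketch only: the image conditions occupy the first and last temporal slots of a
    separate condition tensor (zeros elsewhere), which is then concatenated with the
    noisy latents along the channel dimension.
    """
    cond = torch.zeros_like(noisy_latents)
    cond[:, :, 0] = first_latent
    cond[:, :, -1] = last_latent
    return torch.cat([noisy_latents, cond], dim=1)  # (B, 2C, T, H, W)

# Example shapes: 16-frame clip, 4 latent channels, 32x32 latent resolution.
x = torch.randn(2, 4, 16, 32, 32)
f0 = torch.randn(2, 4, 32, 32)
fT = torch.randn(2, 4, 32, 32)
unet_input = build_conditioned_input(x, f0, fT)
print(unet_input.shape)  # torch.Size([2, 8, 16, 32, 32])
```

The text embedding is not part of this tensor; it enters the UNet separately through the cross-attention layers mentioned above.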
Key components and innovations include:
- Dual Image Instruction Scheme:
- The first frame instruction, derived from ground-truth video frames, enforces fidelity to the initial scene, enabling the generation of visually detailed and contextually consistent video sequences.
- The last frame instruction guides the ending configuration without forcing an exact replication. During training this is achieved by randomly selecting the instruction from the last three ground-truth frames, perturbing the encoded guidance with noise, and dropping it stochastically (with a probability of around 25%). This design prevents abrupt or mechanically repeated endings and instead promotes smooth temporal transitions; a sketch of this scheme follows the list.
- Adaptive Inference Strategy:
- PixelDance employs a two-phase denoising process during inference: the last frame instruction is applied only during the first τ denoising steps and is dropped thereafter, so the remaining steps can refine a temporally consistent ending rather than copying the draft (illustrated in the sketch after this list).
- Classifier-free diffusion guidance balances the contributions of text and image conditioning during sampling, making the approach robust to rough or imprecise user-provided last-frame drafts.
- Training on Public Data:
- The model is trained primarily on the WebVid-10M dataset, supplemented by approximately 500K watermark-free clips to mitigate the issue of watermarks in generated outputs.
- Additionally, joint training on video-text data and image-text data (the latter from LAION-400M) broadens the model's coverage of visual concepts and improves generalization, which is particularly valuable given that detailed text annotations for video are often unavailable.
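The last-frame handling during training and the two-phase inference can be made concrete with a short sketch. The helper names, the noise scale, and the simplified Euler-style update are assumptions for illustration, not the paper's exact implementation; the second function reuses build_conditioned_input from the block above.

```python
import random
import torch

def make_last_frame_instruction(gt_frame_latents, noise_std=0.1, drop_prob=0.25):
    """Training-time construction of the last-frame instruction (sketch).

    gt_frame_latents: (B, C, T, H, W) VAE-encoded ground-truth frames.
    Picks the instruction randomly from the last three frames, perturbs it with
    Gaussian noise, and drops it entirely with probability drop_prob.
    noise_std is an illustrative choice.
    """
    B, C, T, H, W = gt_frame_latents.shape
    idx = random.randint(T - 3, T - 1)            # one of the last three frames
    instr = gt_frame_latents[:, :, idx].clone()
    instr += noise_std * torch.randn_like(instr)  # perturb the guidance
    if random.random() < drop_prob:               # stochastic dropping
        instr = torch.zeros_like(instr)
    return instr

def sample_with_adaptive_guidance(eps_model, x_T, text_emb, first_latent, last_latent,
                                  num_steps=50, tau=25, guidance_scale=7.5):
    """Two-phase sampling sketch: the last-frame instruction is used only for the
    first tau of num_steps denoising steps, then zeroed out. Classifier-free
    guidance mixes conditional and unconditional predictions. eps_model and the
    placeholder update rule are stand-ins, not the paper's exact sampler.
    Requires build_conditioned_input from the earlier sketch."""
    x = x_T
    null_text = torch.zeros_like(text_emb)
    for i in range(num_steps):
        last = last_latent if i < tau else torch.zeros_like(last_latent)
        inp = build_conditioned_input(x, first_latent, last)
        eps_cond = eps_model(inp, text_emb)
        eps_uncond = eps_model(
            build_conditioned_input(x, torch.zeros_like(first_latent), torch.zeros_like(last)),
            null_text)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x = x - eps / num_steps  # placeholder update; a real sampler (e.g. DDIM) goes here
    return x

# Toy usage with a dummy noise predictor standing in for the 3D UNet.
toy_eps = lambda inp, txt: inp[:, :4]  # returns something shaped like the video latents
x0 = sample_with_adaptive_guidance(toy_eps,
                                   torch.randn(1, 4, 16, 32, 32),   # initial noise
                                   torch.randn(1, 77, 768),         # text embedding
                                   torch.randn(1, 4, 32, 32),       # first-frame latent
                                   torch.randn(1, 4, 32, 32))       # last-frame latent
```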
The empirical evaluations demonstrate that PixelDance substantially outperforms previous approaches on leading benchmarks. In particular, on the MSR-VTT dataset, it achieves an FVD (Fréchet Video Distance) score of 381 and a CLIPSIM score of 0.3125, marking significant improvements over prior models such as ModelScope (FVD of 550). On UCF-101, PixelDance exhibits superior performance across Inception Score (IS), Fréchet Inception Distance (FID), and FVD. These results underscore the model’s capacity to synthesize motion-rich scenes with intricate transitions and detailed dynamics, despite relying on relatively coarse textual descriptions.
Additional analyses include:
- Qualitative Analysis:
- When conditioned solely on text, the model can already control aspects of motion such as body movements, camera zooming, and rotation. Adding the first frame instruction, however, markedly refines spatial detail and enforces continuity across successive video clips.
- The last frame instruction is shown to be critical for modeling actions in out-of-domain scenarios or generating natural shot transitions. Visual comparisons indicate that without proper handling of the last frame guidance, videos tend to terminate abruptly.
- Ablation Studies:
- Omitting either the text or the last frame instruction results in degradation in video quality, as measured by increases in FVD and FID, demonstrating that the combined conditioning strategy is essential for managing the diverse dynamics present in high-dynamic video sequences.
- Long Video Generation:
- The autoregressive generation strategy, in which the last frame of a generated clip serves as the first frame instruction for the next clip, enables the synthesis of long videos (e.g., 1024 frames) with smooth temporal variation and a consistent visual narrative (a rollout sketch follows this list).
- Comparative evaluations show that PixelDance produces lower FVD scores and smoother transitions than both traditional autoregressive methods (such as TATS-AR and LVDM-AR) and hierarchical approaches.
- Extended Applications:
- The framework is adaptable to alternative image instructions. For instance, when conditioned with semantic maps, image sketches, human poses, or bounding boxes, the system retains its ability to produce temporally coherent and stylistically varied video outputs.
- Zero-shot video editing is also supported by transforming the editing task into an image editing problem, wherein modifications to the first and last frame instructions yield coherently edited video sequences.
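The autoregressive long-video strategy mentioned above amounts to a simple rollout loop. The sketch below assumes a hypothetical generate_clip(first_frame, text) callable that wraps a single-clip generation pass; the chaining of clips via the last generated frame is the point being illustrated.

```python
import torch

def generate_long_video(generate_clip, first_frame, text_prompts):
    """Autoregressive rollout sketch for long video generation.

    generate_clip(first_frame, text) is assumed to return a (T, C, H, W) clip
    conditioned on a first-frame instruction. The last generated frame of each
    clip becomes the first-frame instruction of the next, chaining clips into
    one long video.
    """
    frames = []
    current = first_frame
    for prompt in text_prompts:
        clip = generate_clip(current, prompt)  # (T, C, H, W)
        frames.append(clip)
        current = clip[-1]                     # seed the next clip
    return torch.cat(frames, dim=0)

# Toy usage with a dummy generator that just repeats the seed frame.
dummy = lambda frame, text: frame.unsqueeze(0).repeat(16, 1, 1, 1)
video = generate_long_video(dummy, torch.randn(3, 256, 256), ["a prompt"] * 4)
print(video.shape)  # torch.Size([64, 3, 256, 256])
```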
In summary, PixelDance demonstrates that integrating image instructions for both the initial and terminal frames within a latent diffusion model framework significantly enhances the synthesis of high-dynamic, motion-rich videos. The combination of advanced conditioning techniques, adaptive sampling strategies, and joint training on diverse datasets supports state-of-the-art performance on standard benchmarks, offering a promising direction for future research in controllable video synthesis and long video generation.