- The paper presents a novel cI2V approach, a two-stage Latent Flow Diffusion Model (LFDM) that decouples appearance from motion to improve video coherence.
- It pairs unsupervised latent flow auto-encoder training with a conditional 3D U-Net diffusion model to preserve spatial detail and produce smooth temporal transitions.
- Experiments show LFDM outperforming baselines, achieving lower Fréchet Video Distance (FVD) and better temporal consistency.
Conditional Image-to-Video Generation with Latent Flow Diffusion Models
The paper presents a novel approach to conditional image-to-video (cI2V) generation by introducing Latent Flow Diffusion Models (LFDM). The goal of cI2V is to synthesize a realistic video whose spatial content comes from a given image and whose temporal dynamics follow a specified condition, such as an action class label.
Methodology
LFDM differs from previous image-to-video methods by generating a latent optical flow sequence and using it to warp the given image in latent space, rather than synthesizing frames directly. This addresses a common weakness of direct-synthesis methods, which struggle to preserve spatial detail and temporal coherence at the same time. The LFDM framework decouples generation into two distinct stages:
- Unsupervised Latent Flow Auto-Encoder Training:
- A latent flow auto-encoder (LFAE) is trained without labels to reconstruct one video frame from another: an encoder maps a reference frame to a spatial latent, a flow predictor estimates the latent flow (and an occlusion map) between the reference and a driving frame, and a decoder reconstructs the driving frame from the warped latent. This decouples appearance from motion and preserves spatial detail through flow-based warping (see the first sketch after this list).
- Conditional 3D U-Net Diffusion Model Training:
- In the second stage, a diffusion model (DM) built on a 3D U-Net generates temporally coherent latent flow sequences conditioned on the given image and the action class label. Because the DM operates in a low-dimensional latent flow space and only has to model motion, it is computationally efficient (see the second sketch after this list).
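To make the first stage concrete, the sketch below shows an unsupervised LFAE training step in PyTorch: an encoder maps a reference frame to a spatial latent, a flow predictor estimates a backward latent flow plus an occlusion map from a reference/driving frame pair, and a decoder reconstructs the driving frame from the warped latent. The module names, layer sizes, and the simple L1 reconstruction loss are illustrative assumptions, not the paper's exact architecture or training losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentFlowAutoEncoder(nn.Module):
    """Minimal sketch of stage 1: encoder, flow predictor, and decoder."""

    def __init__(self, channels=3, latent_dim=64):
        super().__init__()
        # Encode the reference frame into a spatial latent map (H/4 x W/4).
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(latent_dim, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Predict backward latent flow (2 ch) + occlusion map (1 ch)
        # from the concatenated reference/driving frame pair.
        self.flow_predictor = nn.Sequential(
            nn.Conv2d(2 * channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )
        # Decode the warped latent back to an image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, latent_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(latent_dim, channels, 4, stride=2, padding=1),
        )

    @staticmethod
    def warp(z, flow):
        """Backward-warp latent z with a normalized flow field via grid_sample."""
        b, _, h, w = z.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=z.device),
            torch.linspace(-1, 1, w, device=z.device),
            indexing="ij",
        )
        base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = base_grid + flow.permute(0, 2, 3, 1)  # flow assumed in normalized coords
        return F.grid_sample(z, grid, align_corners=True)

    def forward(self, x_ref, x_dri):
        z_ref = self.encoder(x_ref)
        pred = self.flow_predictor(torch.cat([x_ref, x_dri], dim=1))
        flow, occlusion = pred[:, :2], torch.sigmoid(pred[:, 2:])
        z_warp = self.warp(z_ref, flow) * occlusion  # mask regions the flow cannot explain
        return self.decoder(z_warp), flow, occlusion


# Unsupervised training step: reconstruct a driving frame from a reference frame
# sampled from the same video; no action labels are needed at this stage.
model = LatentFlowAutoEncoder()
x_ref, x_dri = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
x_rec, _, _ = model(x_ref, x_dri)
loss = F.l1_loss(x_rec, x_dri)  # placeholder loss; the paper's losses are more elaborate
loss.backward()
```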
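The second stage can be sketched as a standard epsilon-prediction diffusion objective applied to a latent flow sequence rather than to pixels. The tiny 3D convolutional network below stands in for the conditional 3D U-Net; the channel sizes, conditioning scheme, and noise schedule are illustrative assumptions chosen to keep the example self-contained and runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFlowDenoiser3D(nn.Module):
    """Stand-in for the conditional 3D U-Net: predicts the noise added to a
    latent flow sequence, conditioned on a timestep, a class label, and the
    latent of the given image. Layer sizes are illustrative only."""

    def __init__(self, flow_ch=3, cond_ch=64, n_classes=10, emb_dim=64):
        super().__init__()
        self.t_emb = nn.Embedding(1000, emb_dim)
        self.y_emb = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Conv3d(flow_ch + cond_ch + emb_dim, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, flow_ch, 3, padding=1),
        )

    def forward(self, noisy_flow, t, y, z0):
        # noisy_flow: (B, flow_ch, T, H, W); z0: (B, cond_ch, H, W)
        b, _, n_frames, h, w = noisy_flow.shape
        cond = z0.unsqueeze(2).expand(-1, -1, n_frames, -1, -1)     # tile image latent over time
        emb = (self.t_emb(t) + self.y_emb(y)).view(b, -1, 1, 1, 1)  # fuse timestep + class label
        emb = emb.expand(-1, -1, n_frames, h, w)
        return self.net(torch.cat([noisy_flow, cond, emb], dim=1))


# DDPM-style training step in the low-dimensional latent flow space.
T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

model = TinyFlowDenoiser3D()
flow_seq = torch.randn(2, 3, 8, 16, 16)   # flow (2 ch) + occlusion (1 ch), 8 frames, 16x16 latent grid
z0 = torch.randn(2, 64, 16, 16)           # latent of the conditioning image (from the frozen stage-1 encoder)
y = torch.randint(0, 10, (2,))            # action class label

t = torch.randint(0, T_steps, (2,))
noise = torch.randn_like(flow_seq)
a = alpha_bar[t].view(-1, 1, 1, 1, 1)
noisy = a.sqrt() * flow_seq + (1 - a).sqrt() * noise  # forward diffusion q(x_t | x_0)
loss = F.mse_loss(model(noisy, t, y, z0), noise)      # epsilon-prediction objective
loss.backward()
```

At sampling time, the DM is run in reverse from pure noise to produce a latent flow sequence, which the frozen stage-1 decoder then turns into video frames by warping the latent of the given image.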
Experimental Evaluation
Extensive experiments demonstrate LFDM's superiority over existing methods across multiple video datasets, including MUG (facial expressions), MHAD (human actions), and NATOPS (aviation gestures). LFDM consistently outperforms baselines such as ImaGINator, VDM, and LDM in terms of Fréchet Video Distance (FVD), a metric that jointly reflects visual quality, temporal coherence, and sample diversity (a sketch of the distance computation follows the list below).
- Quantitative Results: LFDM shows marked improvements in FVD across datasets, highlighting its effectiveness in synthesizing diverse and coherent video content.
- Qualitative Analysis: Videos generated by LFDM retain better spatial details and exhibit smoother temporal transitions compared to baseline methods. This suggests the effectiveness of latent space operations over direct synthesis approaches.
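For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos, conventionally extracted with a pretrained I3D network. The helper below computes that distance; the random placeholder arrays merely stand in for I3D features so the sketch runs on its own.

```python
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


# Illustrative call with placeholder features (e.g., 400-dim I3D outputs).
real = np.random.randn(256, 400)
fake = np.random.randn(256, 400)
print(frechet_distance(real, fake))
```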
Implications and Future Work
The implications of LFDM's approach are significant for fields requiring realistic and coherent video generation from static images and simple conditions. Potential applications include data augmentation for machine learning, virtual avatar animation, and enhanced multimedia creation.
Future research directions could explore:
- Extending LFDM to handle scenes with multiple moving subjects.
- Enhancing cI2V generation with more complex conditions like natural language descriptions, opening paths towards text-to-video generation.
- Implementing fast sampling methods to reduce generation time and improve scalability for real-world applications.
Conclusion
LFDM offers a structured and efficient cI2V generation framework by applying diffusion models in a latent flow space dedicated to motion generation. It delivers clear improvements in video quality and temporal coherence over existing conditional video synthesis methods.