Customized Video Generation without Customized Video Data
The paper "Still-Moving: Customized Video Generation without Customized Video Data" by Chefer et al. introduces an innovative approach to customizing text-to-video (T2V) models without the need for customized video data, leveraging advancements in text-to-image (T2I) models. The core challenge addressed is the effective integration of a customized T2I model into a T2V framework without significant artifacts or loss of adherence to customization. This is achieved through a novel architectural component termed as "Spatial Adapters," complemented by "Motion Adapters" to maintain the motion priors intrinsic to video models.
Introduction
The substantial progress in customizing T2I models for tasks such as personalization, stylization, and conditional generation has not been paralleled in T2V models, primarily due to the scarcity of customized video data. The authors propose "Still-Moving" to bridge this gap by enabling a T2V model to inherit the customization of a T2I model while preserving the motion characteristics (or priors) learned from extensive video data.
Methodology
The proposed methodology hinges on a two-tiered adaptation framework:
- Spatial Adapters: Lightweight modules trained to adjust the feature distribution of the customized T2I weights once they are injected into the T2V model. This adaptation is critical for mitigating the feature mismatch that arises when the customized T2I features are blended with the temporal dynamics of the T2V model.
- Motion Adapters: Modules that preserve the motion priors by allowing the T2V model to be trained on static images ("frozen videos," i.e., still images repeated over time) without losing its inherent ability to generate motion. They modify the temporal blocks during training and are removed at inference to restore the original motion priors of the T2V model (a minimal sketch of both adapters follows this list).
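The paper does not ship reference code, so the following PyTorch sketch only illustrates how such adapters could be attached to a frozen T2V backbone. The LoRA-style low-rank residual form, the `LoRAAdapter`, `wrap_linear`, and `set_motion_adapters` names, and the use of "temporal" in module names to identify motion layers are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Residual low-rank adapter wrapped around a frozen linear layer (assumed form)."""
    def __init__(self, base: nn.Linear, rank: int = 16, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained T2V weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # start as an identity-preserving residual
        self.scale = scale
        self.enabled = True                  # motion adapters flip this off at inference

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * self.up(self.down(x))
        return out

def wrap_linear(parent: nn.Module, attr: str, rank: int = 16) -> LoRAAdapter:
    """Replace a linear submodule with its adapter-wrapped version (illustrative helper)."""
    adapter = LoRAAdapter(getattr(parent, attr), rank=rank)
    setattr(parent, attr, adapter)
    return adapter

def set_motion_adapters(t2v_model: nn.Module, enabled: bool) -> None:
    """Toggle adapters wrapping temporal layers: on while training on still images,
    off at inference so the original motion priors are restored."""
    for name, module in t2v_model.named_modules():
        if isinstance(module, LoRAAdapter) and "temporal" in name:
            module.enabled = enabled
```

Under these assumptions, spatial adapters (wrapped around spatial layers) stay enabled at inference, while motion adapters (wrapped around temporal layers) are trained alongside them and then switched off with `set_motion_adapters(model, False)`.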
This dual-adaptation ensures that the customized spatial properties from the T2I model are seamlessly integrated with the temporal dynamics from the T2V model, circumventing the need for customized video data.
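Because no customized videos exist, both adapter sets are trained on "frozen videos" built from still images. The snippet below sketches one such training step under common latent-diffusion assumptions; the `t2v_model.add_noise` scheduler call, the forward signature, and the noise-prediction MSE loss are stand-ins for whatever objective the actual backbone uses.

```python
import torch
import torch.nn.functional as F

def make_frozen_videos(images: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Repeat still images (B, C, H, W) along a new time axis -> (B, F, C, H, W)."""
    return images.unsqueeze(1).repeat(1, num_frames, 1, 1, 1)

def adapter_training_step(t2v_model, optimizer, images, text_emb, num_frames=16):
    """One denoising step on frozen videos; only adapter parameters are in `optimizer`."""
    videos = make_frozen_videos(images, num_frames)
    noise = torch.randn_like(videos)
    timesteps = torch.randint(0, 1000, (videos.shape[0],), device=videos.device)
    noisy = t2v_model.add_noise(videos, noise, timesteps)   # assumed scheduler interface
    pred = t2v_model(noisy, timesteps, text_emb)            # assumed forward signature
    loss = F.mse_loss(pred, noise)                          # standard noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference, the motion adapters are disabled and the T2V model is sampled as usual with the spatial adapters left in place, so the customized appearance comes from the adapters while the motion comes from the unmodified temporal blocks.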
Experimental Results
The authors validate their approach on two state-of-the-art T2V models, Lumiere and AnimateDiff, which are built upon T2I frameworks. The experiments cover various customization tasks such as personalized video generation, stylized video generation, and conditional video generation facilitated by ControlNet. The results demonstrate that the Still-Moving framework successfully integrates the spatial priors of customized T2I models into T2V models while maintaining robust motion priors.
Key numerical results show improvements in both fidelity to the customized reference and alignment with the text prompt, as measured by CLIP scores. For personalized video generation, CLIP-Image (CLIP-I) scores reach 0.772, compared to 0.680 for naive injection of the customized T2I weights, reflecting superior consistency with the reference images. CLIP-Text (CLIP-T) scores similarly indicate better adherence to the text prompts.
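For reference, CLIP-I is the average cosine similarity between CLIP embeddings of generated frames and the reference images, and CLIP-T is the similarity between frame embeddings and the prompt embedding. The sketch below computes both with the Hugging Face `transformers` CLIP model; the particular checkpoint and the frame-averaging choices are assumptions, since the paper's exact evaluation protocol is not reproduced here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(frames, reference_images, prompt):
    """frames / reference_images: lists of PIL images; returns (CLIP-I, CLIP-T)."""
    frame_in = processor(images=frames, return_tensors="pt")
    ref_in = processor(images=reference_images, return_tensors="pt")
    text_in = processor(text=[prompt], return_tensors="pt", padding=True)

    f = model.get_image_features(**frame_in)
    r = model.get_image_features(**ref_in)
    t = model.get_text_features(**text_in)

    f = f / f.norm(dim=-1, keepdim=True)
    r = r / r.norm(dim=-1, keepdim=True)
    t = t / t.norm(dim=-1, keepdim=True)

    clip_i = (f @ r.T).mean().item()   # frames vs. reference images
    clip_t = (f @ t.T).mean().item()   # frames vs. text prompt
    return clip_i, clip_t
```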
Qualitative Analysis
The qualitative results emphasize the efficacy of the proposed method. Examples include realistic motions tailored to specific subjects and styles, such as a "plasticine cat" exhibiting both accurate spatial features and dynamic expressions. This synergy between spatial and temporal priors is consistent across various domains, evident in the creative yet contextually coherent outputs showcased in the paper.
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, it enables customization of video content without the overhead of collecting customized video datasets, expanding the utility of generative models in creative industries and user-driven content creation. Theoretically, it advances our understanding of feature integration across generative domains, particularly of how static image priors can be carried into temporally coherent video outputs without compromising the model's inherent motion characteristics.
Future research could explore finer-grained control through further refinement of the adapter modules, or end-to-end optimization that adapts spatial and temporal features jointly. Extending the framework to multi-modal generative tasks could also open new possibilities in interactive and immersive content generation.
In conclusion, the paper presents a well-founded and empirically validated framework that overcomes a significant barrier in video customization, paving the way for broader and more sophisticated applications of generative models in dynamic content creation.