Multi-subject Open-set Personalization in Video Generation: A Comprehensive Analysis
The paper introduces a new approach to personalized video generation that addresses significant limitations of existing methods. The proposed system, Video Alchemist, enables multi-subject, open-set video personalization, covering both foreground subjects and backgrounds, without time-consuming test-time optimization. The architecture is built on a Diffusion Transformer that synthesizes videos by fusing reference images with subject-level text prompts through cross-attention layers.
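The paper does not include reference code, so the following minimal PyTorch sketch only illustrates the general pattern described above: video latent tokens attend to fused subject conditioning tokens through a cross-attention layer. The module name, dimensions, and residual layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubjectCrossAttention(nn.Module):
    """Sketch: video latent tokens attend to per-subject conditioning tokens
    (text word embeddings fused with reference-image embeddings).
    Names and dimensions are illustrative, not taken from the paper."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video_tokens: torch.Tensor, subject_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens:   (B, N_video, d_model) -- noisy spatiotemporal latents
        # subject_tokens: (B, N_cond,  d_model) -- fused text + image conditioning
        attended, _ = self.attn(
            query=self.norm(video_tokens),
            key=subject_tokens,
            value=subject_tokens,
        )
        return video_tokens + attended  # residual update of the video latents
```

In a full model, a layer like this would sit inside each transformer block alongside self-attention over the video tokens; the exact placement used in Video Alchemist is not reproduced here.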
Key Challenges in Video Personalization
Existing work in video personalization suffers from three major challenges:
- Domain Limitation: Most methods are restricted to closed-set object categories or a single subject, limiting their applicability to diverse real-world scenarios.
- Optimization Requirements: Many contemporary techniques require subject-specific optimization or fine-tuning, which is computationally expensive and inefficient.
- Evaluation Metrics: Current evaluation frameworks fail to comprehensively assess personalization models, particularly when multiple subjects or open-set subjects and backgrounds are involved.
The paper addresses these challenges by introducing Video Alchemist, whose Diffusion Transformer backbone accepts multiple image and text inputs, enabling it to render complex scenes with several subjects and intricate backgrounds.
Methodological Advancements
The architecture of Video Alchemist distinguishes itself through several innovative components:
- Diffusion Transformer Module: Building the model on a Diffusion Transformer allows text and image conditions to be handled within a single backbone; its cross-attention layers let the model bind image tokens to the corresponding text descriptors.
- Data Construction Pipeline: An automatic data construction pipeline based on video frame sampling and image augmentations is crucial for mitigating overfitting: the model learns to preserve subject identity rather than copy the conditioning image, enabling flexible, context-aware video synthesis (see the data-pipeline sketch after this list).
- Multi-subject Conditioning: This is achieved through a subject-level fusion step that ties each subject's word embeddings to its reference-image embeddings, maintaining subject fidelity across diverse contexts (see the fusion sketch after this list).
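As a rough illustration of the kind of sampling and augmentation such a data pipeline might perform, the sketch below draws the target frame and the reference subject crops from different frames of the same video and augments the crops. The function names, data layout, and specific transforms are assumptions, not the paper's actual pipeline.

```python
import random
from torchvision import transforms

# Hypothetical augmentation for each reference subject crop, so the model
# cannot simply copy pixels from the conditioning image into the output.
reference_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

def build_training_pair(video_frames, subject_crops_per_frame):
    """Pick the target frame and the reference crops from different frames,
    then augment the references. `subject_crops_per_frame` maps a frame index
    to a list of PIL crops for that frame (an assumed data layout)."""
    target_idx = random.randrange(len(video_frames))
    # Use a different frame for the references so the model cannot rely on
    # matching pose, lighting, or background between condition and target.
    ref_idx = random.choice([i for i in range(len(video_frames)) if i != target_idx])
    references = [reference_augment(crop) for crop in subject_crops_per_frame[ref_idx]]
    return video_frames[target_idx], references
```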
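The subject-level fusion can be pictured as joining, for each subject, the word embeddings of the phrase that names it in the prompt with the embeddings of its reference image, and then concatenating all subjects into one conditioning sequence for the cross-attention layers. The projection layers and dimensions in the sketch below are assumptions; the paper's exact fusion operator is not reproduced here.

```python
import torch
import torch.nn as nn

class SubjectFusion(nn.Module):
    """Sketch: fuse each subject's text tokens with its image tokens and
    concatenate all subjects into a single conditioning sequence."""

    def __init__(self, d_image: int = 768, d_text: int = 1024, d_model: int = 1024):
        super().__init__()
        self.image_proj = nn.Linear(d_image, d_model)
        self.text_proj = nn.Linear(d_text, d_model)

    def forward(self, subjects):
        # subjects: list of (word_emb, image_emb) pairs, one per subject, with
        #   word_emb  of shape (N_words,   d_text)  -- tokens of the subject phrase
        #   image_emb of shape (N_patches, d_image) -- encoder features of the crop
        fused = []
        for word_emb, image_emb in subjects:
            tokens = torch.cat(
                [self.text_proj(word_emb), self.image_proj(image_emb)], dim=0
            )
            fused.append(tokens)
        # One flat sequence consumed by cross-attention in the video backbone.
        return torch.cat(fused, dim=0)
```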
Evaluation and Benchmarking
To benchmark the model's efficacy, the paper introduces the MSRVTT-Personalization benchmark, which sets a new standard for evaluating open-set video personalization, emphasizing accurate subject fidelity and supporting diverse conditioning scenarios. It provides a framework for evaluating multi-subject video personalization with metrics that include text similarity, video similarity, subject similarity, and dynamic degree.
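As a rough sketch of how such metrics can be computed once frame, prompt, and subject embeddings are available (for example, from a CLIP-style encoder), the snippet below implements simple cosine-similarity versions. The benchmark's exact definitions, in particular for dynamic degree, may differ.

```python
import torch
import torch.nn.functional as F

def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(a, b, dim=-1)

def text_similarity(frame_embs: torch.Tensor, prompt_emb: torch.Tensor) -> float:
    """Mean similarity between each generated frame and the text prompt.
    frame_embs: (T, D), prompt_emb: (D,), both from a shared image/text encoder."""
    return cosine(frame_embs, prompt_emb.unsqueeze(0)).mean().item()

def subject_similarity(gen_subject_embs: torch.Tensor, ref_subject_emb: torch.Tensor) -> float:
    """Similarity between the subject as it appears in generated frames and its
    reference crop; detecting and cropping the subject per frame is assumed to
    happen upstream."""
    return cosine(gen_subject_embs, ref_subject_emb.unsqueeze(0)).mean().item()

def dynamic_degree(frame_embs: torch.Tensor) -> float:
    """Crude motion proxy: average embedding change between consecutive frames."""
    return (1.0 - cosine(frame_embs[1:], frame_embs[:-1])).mean().item()
```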
Results and Insights
Extensive experiments show that Video Alchemist significantly outperforms existing methods, both quantitatively and qualitatively. The model excels not only at preserving subject identity but also at generating videos with realistic and varied dynamics. By forgoing subject-specific optimization in favor of a single generalized architecture, the system achieves notable flexibility and efficiency.
Future Directions
The implications of this work are profound, opening new avenues for personalized video synthesis in artificial intelligence. Future research could focus on enhancing background realism and subject interactions, thereby broadening the potential applications in virtual reality and digital media storytelling. Additionally, exploring optimization strategies for increased scalability and incorporating more sophisticated semantic understanding could further enhance video generation quality.
In summary, the paper presents a significant advancement in video personalization technology, overcoming existing limitations and setting a new benchmark for the field. The innovative methodologies and compelling results demonstrate the practical and theoretical potential of diffusion transformers in future AI-driven video applications.