- The paper introduces a dual-level embedding approach that leverages coarse and fine image prompt representations to ensure precise subject appearance and temporal consistency.
- The methodology integrates a feed-forward video generation pipeline with attention injection and a sequential coarse-to-fine training strategy to mitigate overfitting and enhance frame quality.
- Quantitative evaluations, including CLIP-Image and DINO scores, along with user studies, show that VideoBooth outperforms competitive baselines in image alignment while remaining competitive in text alignment.
The paper presents a framework for diffusion-based video generation that uses image prompts to precisely control subject appearance while maintaining temporal consistency across frames. The methodology centers on two complementary embedding strategies integrated into a feed-forward video generation pipeline built on pretrained text-to-video diffusion models.
Key Contributions and Methodology
- Dual-Level Visual Embedding:
- At the coarse level, the approach employs a pretrained CLIP image encoder to extract high-level semantic features from the image prompt. A mapping network (a few MLP layers) projects these features into the text embedding space, replacing the target subject's word embeddings within the text prompt; this replacement directs the model to incorporate the desired subject's appearance robustly (a minimal code sketch of this mapping follows this list).
- At the fine level, the model injects multi-scale latent representations derived from the image prompt into cross-frame attention modules within the U-Net. The image prompt is first encoded by the VAE of a Stable Diffusion model and then noised according to the diffusion schedule; the resulting latents are appended as additional keys and values in the cross-frame attention computations. The mechanism first refines the first frame and then propagates the updated values to subsequent frames, so that detailed visual cues are maintained over time.
- Attention Injection Mechanism:
- The fine embedding strategy refines synthesized details at multiple spatial resolutions by integrating multi-scale features at distinct cross-frame attention layers. These layers receive image prompt representations of varied granularities, which helps fine-tune the appearance details initially encoded by the coarse module.
- The design leverages separate trainable key (K) and value (V) projections for the image latent representations to accommodate differences arising from clean backgrounds in image prompts compared to noisy video frames.
- Coarse-to-Fine Training Strategy:
- A notable design choice is the sequential training scheme. The image encoder and the associated MLP mapping are first trained to establish reliable coarse embeddings. Following this, the attention injection module is trained separately so as not to overshadow the coarse signal. Ablation studies show that unified training degrades the encoder’s capacity, leading to overfitting in fine modules and distortion in later frames.
- Data and Evaluation:
- To support this task, a dedicated dataset (the VideoBooth dataset) is constructed by extracting image prompts from the first frame of videos in the WebVid dataset using segmentation (Grounded-SAM with noun chunks from spaCy). Rigorous data filtering ensures that only clips with adequately sized, moving objects are retained (a sketch of this extraction step appears after this list).
- Quantitative results (e.g., CLIP-Image and DINO similarity scores) show that the proposed framework outperforms state-of-the-art methods in image alignment while maintaining competitive text alignment. User studies corroborate the quantitative findings, with the VideoBooth framework receiving high preference ratings in overall quality, image alignment, and text alignment.
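To make the coarse-level mapping concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the `ImagePromptMapper` layer sizes, the choice of `openai/clip-vit-large-patch14`, and the token-index handling are assumptions for illustration.

```python
# Minimal sketch of the coarse-level visual embedding (illustrative, not the authors' code).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor

class ImagePromptMapper(nn.Module):
    """Projects CLIP image features into the text-embedding space (hypothetical layer sizes)."""
    def __init__(self, clip_dim: int = 1024, text_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_feats)

# Frozen CLIP image encoder provides the high-level semantic features.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
mapper = ImagePromptMapper()  # trained in stage 1; frozen in stage 2 while the
                              # attention-injection projections are trained.

def inject_coarse_embedding(image, text_embeddings: torch.Tensor, subject_token_idx: slice):
    """Replace the subject word embedding(s) in the text prompt with the mapped image feature."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        image_feats = encoder(pixel_values).pooler_output   # (1, clip_dim) semantic feature
    coarse = mapper(image_feats)                             # (1, text_dim) coarse embedding
    text_embeddings = text_embeddings.clone()
    text_embeddings[:, subject_token_idx] = coarse           # overwrite the subject token(s)
    return text_embeddings
```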
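The dataset-construction step can likewise be sketched. Here `segment_with_grounded_sam` is a hypothetical wrapper standing in for Grounded-SAM, and the area threshold is an illustrative stand-in for the paper's filtering criteria.

```python
# Sketch of extracting an image prompt from a video's first frame (assumptions noted inline).
import spacy
import numpy as np

nlp = spacy.load("en_core_web_sm")

def segment_with_grounded_sam(frame: np.ndarray, phrase: str):
    """Hypothetical wrapper around Grounded-SAM: returns a binary mask for `phrase`, or None.
    The real interface depends on the Grounded-SAM release used by the authors."""
    raise NotImplementedError("plug in a text-prompted segmentation call here")

def extract_noun_chunks(caption: str) -> list[str]:
    """Noun chunks from the caption serve as text queries for segmentation."""
    return [chunk.text for chunk in nlp(caption).noun_chunks]

def build_image_prompt(first_frame: np.ndarray, caption: str,
                       min_area_ratio: float = 0.05):
    """Segment the captioned subject in the first frame; keep it only if large enough.
    `min_area_ratio` is an illustrative threshold, not the paper's value."""
    for phrase in extract_noun_chunks(caption):
        mask = segment_with_grounded_sam(first_frame, phrase)
        if mask is not None and mask.mean() >= min_area_ratio:
            return first_frame * mask[..., None], phrase   # masked subject crop + its phrase
    return None, None                                       # clip filtered out
```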
Technical Details and Quantitative Results
- The system replaces target subject word embeddings in the text prompt with the mapped coarse image embedding, thereby bridging the gap between visual content and textual guidance.
- The attention injection for the first frame is formulated as $F_0' = \mathrm{softmax}\big(Q_0\,[K_0, K_I]^{\top} / \sqrt{d}\big)\,[V_0, V_I]$, where:
- $Q_0$: query of the first frame;
- $d$: dimension used for scaling the dot product;
- $K_0, V_0$: original key and value projections from the frame;
- $K_I, V_I$: key and value projections of the (noised) image-prompt latents, computed by the separate trainable projections. Subsequent frames are updated using the refined first-frame values, ensuring temporal consistency (a schematic sketch of this computation follows this list).
- Extensive ablations show that models using only coarse or only fine embeddings suffer from detail loss or temporal distortions, respectively. The full model achieves statistically significant improvements over competitive baselines such as Textual Inversion, DreamBooth, and ELITE (e.g., a CLIP-Image score of approximately 74.80 and a DINO score of 65.10).
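A schematic PyTorch rendering of the injected cross-frame attention is given below. The single-head formulation, tensor shapes, and the names `to_k_image`/`to_v_image` for the separate trainable projections are assumptions; the propagation to later frames is simplified so that each frame attends over both its own tokens and the refined first frame, which may differ from the paper's exact update rule.

```python
# Schematic of the fine-level attention injection (single head, illustrative shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InjectedCrossFrameAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Separate trainable projections for the clean-background image-prompt latents.
        self.to_k_image = nn.Linear(dim, dim, bias=False)
        self.to_v_image = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5  # 1 / sqrt(d)

    def forward(self, frames: torch.Tensor, image_latent: torch.Tensor) -> torch.Tensor:
        # frames: (T, N, dim) frame tokens; image_latent: (M, dim) noised prompt latents.
        f0 = frames[0]
        q0 = self.to_q(f0)
        k0 = torch.cat([self.to_k(f0), self.to_k_image(image_latent)], dim=0)
        v0 = torch.cat([self.to_v(f0), self.to_v_image(image_latent)], dim=0)
        attn0 = F.softmax(q0 @ k0.t() * self.scale, dim=-1)
        f0_refined = attn0 @ v0               # first frame absorbs the prompt details

        out = [f0_refined]
        for t in range(1, frames.shape[0]):
            qt = self.to_q(frames[t])
            kt = torch.cat([self.to_k(frames[t]), self.to_k(f0_refined)], dim=0)
            vt = torch.cat([self.to_v(frames[t]), self.to_v(f0_refined)], dim=0)
            attn = F.softmax(qt @ kt.t() * self.scale, dim=-1)
            out.append(attn @ vt)             # later frames attend to the refined first frame
        return torch.stack(out, dim=0)
```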
Additional Considerations
- The paper also elaborates on a watermark removal module that is appended to the base video model after training. This module, inspired by U-Net architectures, is fine-tuned on separate video data to remove watermarks without compromising the underlying generation quality.
- The work situates itself within a broader trend of personalized content synthesis by combining encoder-based customization techniques with diffusion models, thereby overcoming limitations of long text prompts in specifying visual details.
Overall, the approach enables the generation of high-quality, temporally coherent videos that accurately reflect the appearance details specified by image prompts, all within a feed-forward inference paradigm. This framework addresses key challenges in subject-driven video synthesis and contributes a comprehensive dataset along with extensive ablation studies and user evaluations to validate its design choices.