- The paper proposes a modular diffusion framework that disentangles camera motion, object motion, and lighting control for coherent video generation.
- It leverages an Image2Cloud module for robust 3D structure extraction and an ObjMotionNet for precise object motion tracking via dense and sparse trajectory encoding.
- Quantitative results, including improved FID, FVD, PSNR, SSIM, and ObjMC scores, validate the framework's superior control precision and visual consistency.
The paper addresses the challenge of synthesizing temporally coherent videos from still images while providing user-controllable manipulation of multiple visual elements simultaneously—specifically camera motion, object motion, and lighting direction. The approach leverages a diffusion-based image-to-video generation framework that is disentangled into specialized modules to handle each control signal independently and effectively.
Key Components and Architecture
- Image2Cloud Module:
This module transforms a single input image into a 3D point cloud using an unconstrained stereo reconstruction method, applying point regression and global alignment to obtain a robust 3D structure. The point cloud is then rendered along a user-specified camera trajectory, ensuring spatial consistency and enabling fine-grained camera motion control (see the rendering sketch below). The rendered point cloud serves as an additional condition in a dual-stream injection within the UNet encoder, where the reference image (encoded by a CLIP Image Encoder) is merged with the point cloud renderings through cross-attention. This helps maintain fidelity to the input image across the generated video frames.
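The summary does not include the renderer itself; purely as an illustration of the conditioning signal, the hypothetical sketch below projects a reconstructed point cloud through a pinhole camera at each pose of a user-specified trajectory (the function name, tensor shapes, and nearest-point handling are assumptions; a real implementation would use a proper point rasterizer with z-buffering and splatting):

```python
import torch

def render_point_cloud(points, colors, K, poses, H, W):
    """Project a world-space point cloud into every camera of a trajectory.

    points: (N, 3) 3D points, colors: (N, 3) RGB values in [0, 1],
    K: (3, 3) camera intrinsics, poses: (F, 4, 4) world-to-camera matrices.
    Returns (F, 3, H, W) renderings used as conditioning frames.
    """
    frames = []
    for T in poses:                                   # one camera pose per frame
        cam = (T[:3, :3] @ points.T + T[:3, 3:]).T    # world -> camera coordinates
        keep = cam[:, 2] > 1e-6                       # keep points in front of the camera
        pix = (K @ cam[keep].T).T
        uv = (pix[:, :2] / pix[:, 2:]).long()         # perspective divide to pixel coords
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        uv, col = uv[inside], colors[keep][inside]
        img = torch.zeros(3, H, W)
        img[:, uv[:, 1], uv[:, 0]] = col.T            # scatter point colors into the image
        # (an arbitrary point wins where several land on one pixel; a real
        #  renderer would z-buffer and splat, but this conveys the idea)
        frames.append(img)
    return torch.stack(frames)
```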
- ObjMotionNet:
To guide object motion, the method encodes sparse spatial trajectories defined by pixel coordinates across video frames. Sparse motion vectors, computed from the optical flow between corresponding trajectory points, initially sit at irregular locations on the image grid; Gaussian smoothing spreads them into a dense, smoothed optical-flow tensor. This tensor is then processed by a series of convolutional layers with downsampling to extract multi-scale motion features, which are injected into the UNet encoder via element-wise addition so that localized object motion stays aligned with the global video synthesis process (see the encoder sketch below).
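The exact kernel sizes, channel widths, and smoothing parameters are not given in this summary; the sketch below is therefore only a schematic of the described pipeline (rasterize sparse per-frame motion vectors, densify with Gaussian smoothing, then extract multi-scale features with strided convolutions), with the class name and hyperparameters assumed for illustration:

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

class TrajectoryEncoder(nn.Module):
    """Encode sparse object trajectories into multi-scale motion features."""

    def __init__(self, channels=(64, 128, 256), sigma=3.0):
        super().__init__()
        self.sigma = sigma
        layers, in_ch = [], 2                       # 2 channels: (dx, dy) flow
        for out_ch in channels:
            layers.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.SiLU()))
            in_ch = out_ch
        self.stages = nn.ModuleList(layers)

    def forward(self, tracks, H, W):
        """tracks: (F, P, 2) pixel positions of P tracked points over F frames."""
        flows = []
        for t in range(tracks.shape[0] - 1):
            flow = torch.zeros(2, H, W)
            delta = tracks[t + 1] - tracks[t]       # sparse motion vectors
            x = tracks[t, :, 0].long().clamp(0, W - 1)
            y = tracks[t, :, 1].long().clamp(0, H - 1)
            flow[:, y, x] = delta.T                 # rasterize at track locations
            k = int(4 * self.sigma) | 1             # odd Gaussian kernel size
            flow = TF.gaussian_blur(flow.unsqueeze(0), k, self.sigma)[0]
            flows.append(flow)
        x = torch.stack(flows)                      # (F-1, 2, H, W) dense flow
        feats = []
        for stage in self.stages:                   # multi-scale features that would
            x = stage(x)                            # be added element-wise to the
            feats.append(x)                         # UNet encoder activations
        return feats
```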
- Spatial Triple-Attention Transformer:
Lighting control is achieved by encoding the lighting direction (a unit 3D vector) with Spherical Harmonic (SH) encoding, which yields 16 coefficients. An MLP maps this SH representation into a high-dimensional feature space, and the resulting lighting embedding is integrated into the generation process through a dedicated lighting cross-attention module. This module operates in parallel with the image cross-attention and text cross-attention inside the Spatial Triple-Attention Transformer, and the aggregated outputs form a fused feature representation (see the sketch below). This design ensures that generated frames exhibit consistent, realistic illumination that adheres to the specified lighting condition.
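The particular SH convention is not spelled out here; assuming standard real spherical harmonics up to degree 3 (which produces the stated 16 coefficients) and an assumed hidden width, a minimal sketch of the lighting encoder could look like this:

```python
import torch
import torch.nn as nn

def sh_encode(d):
    """Real spherical harmonics of a unit direction d = (x, y, z) up to degree 3 (16 coefficients)."""
    x, y, z = (float(v) for v in d)
    return torch.tensor([
        0.282095,                                              # degree 0
        0.488603 * y, 0.488603 * z, 0.488603 * x,              # degree 1
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z, 0.546274 * (x * x - y * y),          # degree 2
        0.590044 * y * (3 * x * x - y * y), 2.890611 * x * y * z,
        0.457046 * y * (5 * z * z - 1), 0.373176 * z * (5 * z * z - 3),
        0.457046 * x * (5 * z * z - 1), 1.445306 * z * (x * x - y * y),
        0.590044 * x * (x * x - 3 * y * y),                    # degree 3
    ])

class LightingEmbedder(nn.Module):
    """Map the 16 SH coefficients to a high-dimensional lighting embedding E_light."""

    def __init__(self, dim=768):                # hidden width is an assumption
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(16, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, direction):
        return self.mlp(sh_encode(direction))   # fed to the lighting cross-attention

# example: light arriving roughly from the upper left of the scene
e_light = LightingEmbedder()(torch.tensor([-0.577, 0.577, 0.577]))
```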
Training Strategy and Data Utilization
A three-stage training strategy is proposed to handle the absence of real-world videos with simultaneous annotations for camera, object, and lighting control:
- Stage 1 (Camera Motion Control): The entire UNet architecture is fine-tuned using a dataset derived from RealEstate10K, which contains 62,000 25-frame clips. The focus here is on learning spatial structures and temporal dynamics by using point cloud renderings along annotated camera trajectories. The model is trained for 40,000 iterations, ensuring robust camera motion control.
- Stage 2 (Dense Object Trajectories and Lighting Mixed Fine-tuning): A mixed dataset containing both object-motion and lighting-direction annotations is used. Dense object motion trajectories are combined with lighting cues to guide the model toward learning detailed spatial structures, such as object contours, shadows, and highlights, while preserving temporal consistency. The temporal layers are frozen during this stage to retain the motion dynamics learned earlier (see the sketch after this list). Training runs for 20,000 iterations.
- Stage 3 (Sparse Object Trajectories and Lighting Fine-tuning): This stage adapts the model to work on sparse object motion controls—more reflective of real-world inputs—by using Gaussian-filtered sparse trajectories. This adjustment enables the network to generalize better during inference. An additional 20,000 iterations are used to fine-tune the model.
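How layers are grouped for freezing is not detailed in this summary; the following schematic shows how such stage-wise freezing is typically wired in PyTorch (the "temporal" naming convention and the learning rate are placeholders, not the authors' configuration):

```python
import torch

def configure_stage(unet, stage):
    """Schematic stage selection for the three-stage fine-tuning strategy."""
    for p in unet.parameters():                     # start from a fully trainable UNet
        p.requires_grad = True
    if stage >= 2:
        # Stages 2 and 3: freeze temporal layers so the motion dynamics learned
        # in Stage 1 on camera-annotated clips are preserved
        for name, p in unet.named_parameters():
            if "temporal" in name:                  # placeholder naming convention
                p.requires_grad = False
    trainable = (p for p in unet.parameters() if p.requires_grad)
    return torch.optim.AdamW(trainable, lr=1e-5)    # learning rate is illustrative

# Stage 1: full fine-tuning on RealEstate10K point-cloud renderings (40,000 iterations)
# Stage 2: dense trajectories + lighting, temporal layers frozen (20,000 iterations)
# Stage 3: Gaussian-filtered sparse trajectories + lighting (20,000 iterations)
```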
Additionally, the paper introduces a dedicated synthetic VideoLightingDirection (VLD) dataset. This dataset, generated using Blender with HDR environment maps and artist-designed 3D models, provides precise per-frame lighting annotations and facilitates the learning of complex illumination effects, including reflections and strong light transmissions.
Quantitative and Qualitative Results
The proposed method demonstrates strong performance improvements over state-of-the-art techniques on multiple benchmarks:
- On a RealEstate10K evaluation set, the proposed framework achieves a FID of 75.62, FVD of 49.77, CLIPSIM of 32.32, and a notably higher PSNR (18.04) and SSIM (0.63) compared to competing methods. In terms of camera motion control, it registers a lower CamMC score (4.07), indicating higher accuracy in aligning generated camera trajectories with ground truth.
- On the WebVid-10M dataset used for object motion evaluation, the framework attains an ObjMC score of 3.51, outperforming approaches that rely on optical flow or bounding-box trajectories and demonstrating superior object motion fidelity and visual consistency (a sketch of how such a trajectory-distance metric is typically computed follows this list).
- The paper also includes extensive ablation studies showing that a training strategy combining dense trajectories for robust learning with sparse trajectories for generalization yields the best performance. Furthermore, among different lighting embedding integration methods, the incorporation of a dedicated lighting cross-attention mechanism produces a PSNR of 19.49, SSIM of 0.74, and LPIPS of 0.11, leading to a more faithful representation of illumination effects compared with alternative strategies.
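ObjMC is not defined in this summary; in prior trajectory-controlled generation work it is typically the mean Euclidean distance between the conditioning trajectories and the trajectories re-tracked from the generated video, and the sketch below assumes that definition:

```python
import torch

def objmc(generated_tracks, target_tracks):
    """Mean Euclidean distance between re-tracked and target trajectories.

    Both tensors have shape (F, P, 2): F frames, P tracked points, (x, y) pixels.
    Lower is better; the generated tracks would come from a point tracker run
    on the synthesized video (the tracker itself is outside this sketch).
    """
    return (generated_tracks - target_tracks).norm(dim=-1).mean()
```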
Additional Architectural Details
The generation process is modeled via a diffusion framework wherein video synthesis is conditioned on multiple modalities:
- The input image I is encoded via a VAE encoder E, with subsequent decoding by D to generate the final video sequence I_1, …, I_F.
- Video diffusion is performed on a latent space that is modulated by dual attention streams, where the lighting cross-attention integrates the lighting embedding E_light into the feature space. The lighting attention operation is defined as

  $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$

  where:
  - Q: query embeddings from the UNet features,
  - K: key embeddings from the lighting direction,
  - V: value embeddings from the lighting direction,
  - d: the embedding dimension.
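For concreteness, a minimal single-head version of this lighting cross-attention might look as follows (the dimensions and the single-head simplification are assumptions; in the model it runs in parallel with the image and text cross-attention inside the Spatial Triple-Attention Transformer and the three outputs are aggregated):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightingCrossAttention(nn.Module):
    """Cross-attention from UNet spatial tokens (queries) to the lighting embedding (keys/values)."""

    def __init__(self, dim=320, light_dim=768):     # dimensions are illustrative
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(light_dim, dim, bias=False)
        self.to_v = nn.Linear(light_dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, feats, e_light):
        """feats: (B, N, dim) UNet tokens; e_light: (B, L, light_dim) lighting tokens (L may be 1)."""
        q, k, v = self.to_q(feats), self.to_k(e_light), self.to_v(e_light)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # softmax(QK^T / sqrt(d))
        return attn @ v   # summed with the image- and text-attention outputs in the fused block
```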
Conclusion
In summary, the paper presents a comprehensive framework that advances the state of image-to-video generation by enabling simultaneous and disentangled control of camera, object, and lighting parameters. Through a combination of 3D point cloud rendering, trajectory-based motion encoding, and multi-modal attention for lighting control, the work effectively bridges architectural design and data synthesis challenges. The quantitative results and the ablation studies provide compelling evidence that the proposed framework outperforms previous methods across several metrics related to visual quality, temporal coherence, and control precision, thus contributing a significant step towards more versatile and controllable video synthesis frameworks.