
CamContextI2V: Context-aware Controllable Video Generation (2504.06022v1)

Published 8 Apr 2025 in cs.CV

Abstract: Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrades visual quality, limiting their applicability for tasks requiring faithful scene representation. We propose CamContextI2V, an I2V model that integrates multiple image conditions with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates improvements in visual quality and camera controllability. We make our code and models publicly available at: https://github.com/LDenninger/CamContextI2V.

Summary

  • The paper presents CamContextI2V, which enhances video generation by integrating multiple context images via a dual-stream encoder that combines semantic and visual cues.
  • It employs epipolar cross-attention and temporal embeddings to enforce 3D consistency and frame-specific conditioning, reducing visual artifacts during wide camera movements.
  • Experimental results show a 24% FVD improvement and better MSE/SSIM metrics, confirming the method's effectiveness in generating coherent and realistic videos.

This paper introduces CamContextI2V, a method for improving camera-controllable image-to-video (I2V) generation by incorporating multiple context images alongside the primary reference image. The core problem addressed is that standard I2V models, even with camera control, struggle to generate high-quality, consistent video when the camera moves significantly beyond the view presented in the single reference frame, leading to visual degradation and unrealistic scene invention. CamContextI2V aims to provide richer scene context to the diffusion model, enabling more coherent generation for wider camera movements.

Core Idea

The central innovation is a Context-aware Encoder module that processes multiple additional context images ($c_{ctx}^0, \dots, c_{ctx}^N$) and their corresponding camera poses ($P_{ctx}^0, \dots, P_{ctx}^N$). This module works in conjunction with the standard inputs (reference image $c_{img}$, optional text $c_{txt}$, target camera trajectory $[P_{cam}^0, \dots, P_{cam}^{16}]$) and injects learned context features into a base camera-controllable I2V diffusion model (specifically, building on CamI2V (2410.15957), which extends DynamiCrafter (2310.12190)).
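To make these inputs concrete, here is a minimal sketch of the conditioning bundle such an encoder consumes. The field names, tensor shapes, and the 4×4 extrinsics representation are assumptions for illustration, not the repository's actual API (the paper's camera encoder consumes Plücker embeddings derived from poses and intrinsics).

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class CamContextCondition:
    """Hypothetical container for the conditioning inputs described above."""
    ref_image: torch.Tensor        # c_img: primary reference frame, (3, H, W)
    text: Optional[str]            # c_txt: optional text prompt
    target_poses: torch.Tensor     # [P_cam^0, ..., P_cam^16]: per-frame camera extrinsics, (T, 4, 4)
    context_images: torch.Tensor   # [c_ctx^0, ..., c_ctx^N]: additional posed views, (N, 3, H, W)
    context_poses: torch.Tensor    # [P_ctx^0, ..., P_ctx^N]: their extrinsics, (N, 4, 4)
    intrinsics: torch.Tensor       # camera intrinsics, (3, 3), needed for epipolar geometry
```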

Method Implementation Details

The Context-aware Encoder uses a dual-stream approach to process the context images:

  1. Semantic Stream:
    • Takes CLIP embeddings of the reference image, text prompt, and all context images ($[\mathbf{F}_{img}, \mathbf{F}_{txt}, \mathbf{F}_{ctx}]$).
    • Uses a query transformer ($\mathcal{E}_{sem}$) with learnable latent query tokens ($\mathbf{T}_{sem}$) to aggregate information across these modalities.
    • Outputs a global semantic representation ($\mathbf{F}_{sem}$) capturing high-level concepts.
    • This stream is initialized from and fine-tuned based on DynamiCrafter's existing cross-modal attention mechanism.
  2. Visual Stream:
    • Focuses on providing detailed, pixel-level visual cues.
    • Context images are first encoded into the latent space using the VAE encoder ($\mathbf{Z}_{ctx} = [z_{ctx}^0, \dots, z_{ctx}^{N}]$).
    • Introduces pixel-wise learnable context tokens $\mathbf{T}_{vis} \in \mathbb{R}^{T \times h \times w \times D}$ (where $T = 16$ is the video length, $h, w$ are the latent spatial dimensions, and $D$ is the feature dimension). These tokens act as queries, one for each pixel location in each frame of the target video.
    • Uses a query transformer ($\mathcal{E}_{vis}$) to allow these tokens to retrieve features from the latent context frames $\mathbf{Z}_{ctx}$; a generic sketch of this query-transformer pattern follows the list.
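Both streams rely on the same pattern of learnable query tokens cross-attending to context features. Below is a minimal sketch of this pattern; the depth, width, and use of nn.MultiheadAttention are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextQueryTransformer(nn.Module):
    """Sketch of a query transformer: learnable query tokens (analogous to
    T_sem / T_vis) attend to context features and return an aggregated
    representation. Layer count and dimensions are illustrative only."""

    def __init__(self, num_queries: int, dim: int = 1024, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "norm_q": nn.LayerNorm(dim),
                "norm_kv": nn.LayerNorm(dim),
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ff": nn.Sequential(
                    nn.LayerNorm(dim),
                    nn.Linear(dim, 4 * dim),
                    nn.GELU(),
                    nn.Linear(4 * dim, dim),
                ),
            })
            for _ in range(num_layers)
        ])

    def forward(self, context_feats: torch.Tensor) -> torch.Tensor:
        # context_feats: (B, L, dim), e.g. concatenated CLIP tokens [F_img, F_txt, F_ctx]
        # for the semantic stream, or flattened latent context frames Z_ctx for the visual stream.
        q = self.queries.unsqueeze(0).expand(context_feats.shape[0], -1, -1)
        for layer in self.layers:
            kv = layer["norm_kv"](context_feats)
            attn_out, _ = layer["attn"](layer["norm_q"](q), kv, kv)
            q = q + attn_out           # queries pull information from the context
            q = q + layer["ff"](q)     # position-wise refinement
        return q                        # (B, num_queries, dim)
```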

  • 3D Awareness (Epipolar Cross-Attention): This is a key component of the visual stream.
    • For each query token $t_i$ (representing pixel $(u,v)$ at target timestep $t$), its corresponding target camera pose $P_{cam}^t$ and the context poses $P_{ctx}^j$ are used to compute the epipolar line $l_{ij}$ in each latent context frame $z_{ctx}^j$.
    • An epipolar mask $m$ is calculated based on the distance of pixels $(u', v')$ in the context latent features to their corresponding epipolar line. Pixels far from the line (distance $> \delta$, where $\delta$ is half the latent diagonal) are masked out.
    • The cross-attention mechanism within $\mathcal{E}_{vis}$ is modified to incorporate this mask:

$$\text{EpiCrossAttn}(q, k, v, m) = \text{softmax}\left(\frac{qk^\intercal}{\sqrt{d} \odot m}\right)v$$

    • This forces query tokens to attend only to geometrically plausible regions in the context views, improving 3D consistency.
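A minimal sketch of how such an epipolar mask and the masked cross-attention can be realised follows. The construction of the fundamental matrix from intrinsics and relative poses is assumed done upstream, and the exact way the paper folds the mask into the softmax is simplified here.

```python
import torch

def epipolar_mask(F_mat: torch.Tensor, query_uv: torch.Tensor,
                  ctx_uv: torch.Tensor, delta: float) -> torch.Tensor:
    """Boolean mask keeping context pixels within `delta` of the epipolar line
    induced by each query pixel (standard epipolar geometry).

    F_mat:    (3, 3) fundamental matrix mapping target-frame pixels to lines in the context frame
    query_uv: (Q, 2) target-frame pixel coordinates (u, v)
    ctx_uv:   (K, 2) context-frame pixel coordinates (u', v')
    returns:  (Q, K) boolean mask
    """
    ones = torch.ones_like(query_uv[:, :1])
    q_h = torch.cat([query_uv, ones], dim=-1)                 # (Q, 3) homogeneous coordinates
    lines = q_h @ F_mat.T                                     # (Q, 3) epipolar lines (a, b, c)
    a, b, c = lines[:, 0:1], lines[:, 1:2], lines[:, 2:3]
    u, v = ctx_uv[:, 0].unsqueeze(0), ctx_uv[:, 1].unsqueeze(0)
    dist = (a * u + b * v + c).abs() / torch.sqrt(a ** 2 + b ** 2 + 1e-8)
    return dist <= delta                                      # delta: half the latent diagonal

def epi_cross_attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Masked cross-attention. Setting masked logits to -inf before the softmax,
    as done here, is a common way to suppress attention to masked positions and
    is used only as an illustration of the effect of the epipolar mask."""
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / (d ** 0.5)           # (Q, K) attention logits
    logits = logits.masked_fill(~mask, float("-inf"))         # drop geometrically implausible pixels
    return torch.softmax(logits, dim=-1) @ v                  # (Q, d_v)
```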

  • Temporal Awareness: To make the conditioning timestep-specific (unlike DynamiCrafter's static image condition), sinusoidal timestep embeddings are concatenated to both the semantic and visual context features before further processing (e.g., the feed-forward layers in the query transformers).
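The sketch below illustrates this frame-specific conditioning with a standard sinusoidal embedding; all tensor sizes are placeholders rather than the paper's exact values.

```python
import math
import torch

def sinusoidal_timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard Transformer-style sinusoidal embedding of frame indices."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float().unsqueeze(-1) * freqs                        # (T, dim // 2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (T, dim)

# Illustrative use: concatenate each target frame's embedding to the per-frame
# visual context features before the query transformer's feed-forward layers.
T, h, w, D, E = 16, 32, 32, 320, 128                          # placeholder sizes
F_vis = torch.randn(T, h, w, D)                               # per-frame visual context features
t_emb = sinusoidal_timestep_embedding(torch.arange(T), E)     # (T, E)
t_emb = t_emb[:, None, None, :].expand(T, h, w, E)            # broadcast over spatial positions
F_vis_temporal = torch.cat([F_vis, t_emb], dim=-1)            # (T, h, w, D + E)
```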

  • Integration:
    • The visual stream output $\mathbf{F}_{vis}$ is combined with the original reference image latent $z_{ref}$ using a 3D zero-convolution. This learnable layer controls the weighting between the original reference and the new context information before they are added together and fed into the U-Net's input.
    • The semantic stream output $\mathbf{F}_{sem}$ is injected into the U-Net layers via spatial cross-attention, similar to the baseline model.
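A minimal sketch of the zero-convolution fusion on the visual side is given below; the 1×1×1 kernel and the residual-style addition are assumptions about the exact design.

```python
import torch
import torch.nn as nn

class ZeroConv3dFusion(nn.Module):
    """Sketch of fusing the visual context features F_vis with the reference
    latent z_ref via a zero-initialised 3D convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.zero_conv = nn.Conv3d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)   # zero init: the context branch contributes
        nn.init.zeros_(self.zero_conv.bias)     # nothing at the start of training

    def forward(self, z_ref: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
        # z_ref, f_vis: (B, C, T, h, w) latent-space tensors fed to the U-Net input
        return z_ref + self.zero_conv(f_vis)    # weighting of context vs. reference is learned
```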

Training and Implementation Considerations

  • Base Model: CamContextI2V builds upon a pre-trained CamI2V model.
  • Parameter Freezing: Crucially, the parameters of the base diffusion U-Net and the camera pose encoder (using Plücker coordinates) are frozen during CamContextI2V training. Only the newly introduced Context-aware Encoder modules (the semantic stream's query transformer adaptation, the visual stream's query transformer, the learnable tokens $\mathbf{T}_{vis}$, and the 3D zero-convolution) are trained; a rough sketch of this setup follows the list below.
  • Dataset: Trained and evaluated on RealEstate10K (1805.09817), which provides videos with camera pose annotations. Videos are processed into 16-frame clips at 256x256 resolution.
  • Training Details:
    • Initialized from a CamI2V checkpoint (trained for 50k steps).
    • Trained the context encoder for 50k iterations.
    • Adam optimizer, LR=1e-4, Batch Size=64.
    • Hardware: 4x A100 GPUs (~7 days).
    • Mixed-precision training with DeepSpeed ZeRO-1.
    • During training, 1-4 context frames are sampled uniformly from the entire source video clip.
  • Evaluation:
    • Context frames are sampled from outside the 16-frame generated window to ensure they provide genuinely new information.
    • Uses standard metrics: FVD for video quality, per-frame MSE/SSIM for context faithfulness, and GLOMAP-based RotErr/TransErr/CamMC for camera trajectory accuracy.
  • Inference: Uses DDIM sampling with 25 steps. Notably, it performs best with a lower CFG value (3.5) compared to baselines (7.5), suggesting the context conditioning provides strong guidance.
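As a rough illustration of the parameter-freezing strategy described above, here is a minimal sketch; the module attribute names (context_encoder, zero_conv) are placeholders, not the repository's actual API.

```python
import torch

def configure_trainable(model) -> torch.optim.Optimizer:
    """Freeze the base model, train only the newly added context modules."""
    for p in model.parameters():
        p.requires_grad_(False)                  # freeze base U-Net and camera pose encoder
    trainable = []
    for module in (model.context_encoder, model.zero_conv):
        for p in module.parameters():
            p.requires_grad_(True)               # train only the Context-aware Encoder parts
            trainable.append(p)
    # Optimizer settings reported in the paper: Adam with learning rate 1e-4
    # (batch size 64, 50k iterations); DeepSpeed ZeRO-1 and mixed precision are omitted here.
    return torch.optim.Adam(trainable, lr=1e-4)
```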

Practical Applications & Key Results

  • Application: Generating videos from images with significant camera motion where the scene extends beyond the initial view. Useful for virtual walkthroughs, architectural previews, cinematic shots, or any application requiring consistent scene generation under camera control.
  • Improved Visual Quality: Achieves a 24.09% improvement in FVD (VideoGPT variant) over the CamI2V baseline on RealEstate10K. Qualitative results show fewer artifacts and more plausible rendering of areas outside the reference frame's view.
  • Enhanced Consistency: Lower MSE and higher SSIM compared to baselines, especially in later frames of the video (Figure 4), demonstrating the effectiveness of using context views to maintain quality over time.
  • Better 3D Structure: Slight improvements in camera trajectory metrics (RotErr, TransErr, CamMC) suggest the generated videos have better underlying 3D consistency, which aids the SfM process (GLOMAP) used for evaluation.
  • Robustness: Ablation studies show the importance of both context streams (semantic and visual), the epipolar attention (3D awareness), and temporal awareness. The method is somewhat robust to the choice of context frames, effectively filtering irrelevant information when provided with distant views (Table 3, Figure 7).

Code and Model Availability

The authors provide code and models publicly at: https://github.com/LDenninger/CamContextI2V

Summary Table

| Feature | Implementation Detail | Practical Implication |
| --- | --- | --- |
| Core Problem | Visual degradation in I2V models when the camera moves beyond the initial reference frame's context. | Limits usability for dynamic camera shots or exploring scenes from a single image. |
| Solution | CamContextI2V: inject context from multiple posed images via a dual-stream context encoder into a base I2V model. | Enables more consistent and higher-quality video generation for wider camera motion. |
| Context Encoder | Dual-stream (semantic + visual) using query transformers. | Captures both high-level concepts and fine-grained visual details from context. |
| 3D Awareness | Epipolar cross-attention in the visual stream, using camera poses to mask attention based on epipolar geometry. | Enforces geometric consistency, filters irrelevant context, improves 3D structure. |
| Temporal Awareness | Sinusoidal timestep embeddings added to context features. | Makes context conditioning specific to each frame of the generated video. |
| Training Strategy | Fine-tune only the context encoder on top of a frozen pre-trained CamI2V model. | Efficient training that leverages existing powerful generative models. |
| Data Requirement | Video dataset with camera pose annotations (e.g., RealEstate10K). | Needs suitable data for training and for selecting context views during inference. |
| Key Result | 24% FVD improvement, better MSE/SSIM (especially in later frames), improved qualitative consistency. | Significantly enhances video quality and coherence for camera-controlled generation. |
| Inference Tuning | Performs better with a lower CFG value (3.5). | Suggests strong conditioning from the context features. |