- The paper introduces a novel method that synthesizes temporally consistent videos by integrating 3D geometry proxies and 2D diffusion models.
- It employs advanced attention feature injection and UV-space blending techniques to maintain global coherence and preserve local details.
- Experimental results demonstrate high frame consistency and robust prompt fidelity, validating the approach for controllable video synthesis.
The paper presents a method for synthesizing temporally consistent animations by bridging low-fidelity 3D geometry with high-quality 2D diffusion-based rendering, thus enabling controllable 4D-guided video generation. The approach leverages a pre-trained T2I (text-to-image) diffusion model augmented with depth-based conditioning (via a ControlNet) to transform a sequence of rendered depth and UV (texture coordinate) maps into stylized video frames. The key contributions and technical novelties can be summarized as follows:
4D Guidance via Proxy Geometry
- The method accepts a 3D scene represented as an animated mesh along with its associated guiding channels: depth maps and UV coordinate maps that encode both texture coordinates and object identity.
- By utilizing the canonical UV space as a persistent representation, the method facilitates temporal correspondence across frames, which is crucial for maintaining consistency in the final output.
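For reference, the sketch below shows what naive, frame-by-frame use of the depth-conditioned backbone would look like. It is an assumption-laden illustration using the `diffusers` library (the model identifiers, prompt, and frame count are placeholders, not the authors' setup), and it deliberately omits the paper's attention injection and UV machinery, so its outputs would flicker between frames.

```python
# Minimal sketch (not the authors' code): stylizing rendered depth maps with a
# depth ControlNet on top of a pretrained T2I model, one frame at a time.
# Model IDs, file paths, and the prompt are illustrative assumptions.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Depth maps rendered from the animated proxy mesh (hypothetical paths).
depth_frames = [Image.open(f"render/depth_{i:04d}.png").convert("RGB")
                for i in range(24)]

frames = []
for depth in depth_frames:
    # Each frame is stylized independently here; without shared attention
    # features or UV-correlated noise, the results are not temporally consistent.
    out = pipe(prompt="a bronze robot walking through a misty forest",
               image=depth, num_inference_steps=30).images[0]
    frames.append(out)
```

The rest of the pipeline can be read as replacing the independent per-frame calls in this loop with a jointly processed batch whose self-attention features are shared and blended as described in the next section.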
Attention Feature Injection and UV-space Blending
- The diffusion model’s self-attention modules serve as the primary mechanism for capturing spatial self-similarity. The paper presents two complementary operations that manipulate these modules:
- Pre-Attention Feature Injection: the key and value features of each self-attention module are extended by concatenating the corresponding features from a sparse set of keyframes. This extended attention reinforces global consistency without the prohibitive cost of processing all frames simultaneously (a minimal sketch of this operation follows this list).
- Post-Attention Feature Injection: following the attention operation, features are re-projected between frames using the ground-truth correspondences given by the UV maps. This operation can be written as
- $F^{(j,l)} = \pi_{i,j}\big(F^{(i,l)}\big)$,
- where $F^{(i,l)}$ denotes the features of frame $i$ at layer $l$ and $\pi_{i,j}$ is the pixel-level reprojection from frame $i$ to frame $j$ guided by the UV coordinates. The authors note, however, that raw re-projection used in isolation leads to blur, motivating the blended approach described next.
- The proposed UV-space feature blending projects both pre- and post-attention features into the canonical UV space, where they are sequentially aggregated across frames. A weighted combination then fuses the aggregated UV features with the current frame’s features, balancing global structural consistency against local detail preservation (see the second sketch after this list).
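A minimal sketch of the pre-attention (extended) self-attention step, assuming PyTorch and a Stable-Diffusion-style UNet; the function name, tensor shapes, and the way keyframe tokens are passed in are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of pre-attention feature injection: the current frame's queries
# attend over its own tokens plus tokens from a sparse set of keyframes.
import torch
import torch.nn.functional as F

def extended_self_attention(x_frame, x_keyframes, to_q, to_k, to_v, num_heads):
    """x_frame:     (1, N, C) tokens of the current frame.
    x_keyframes: (K, N, C) tokens of K keyframes.
    to_q/to_k/to_v: the frozen linear projections of the UNet's self-attention."""
    q = to_q(x_frame)                                            # (1, N, C)
    # Keys/values are computed over the current frame plus all keyframe tokens,
    # so every query can match corresponding content in the keyframes.
    kv = torch.cat([x_frame,
                    x_keyframes.reshape(1, -1, x_frame.shape[-1])], dim=1)
    k, v = to_k(kv), to_v(kv)                                    # (1, N*(K+1), C)

    def heads(t):                                                # (1, T, C) -> (1, h, T, C/h)
        b, n, c = t.shape
        return t.view(b, n, num_heads, c // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    return out.transpose(1, 2).reshape(1, x_frame.shape[1], -1)  # back to (1, N, C)
```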
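And a minimal sketch of the post-attention reprojection plus UV-space blending, under the same caveats: the feature and UV-texture resolutions, the nearest-pixel scatter, and the blend weight `alpha` are illustrative choices, not the paper's exact settings.

```python
# Minimal sketch: per-frame features are scattered into a shared canonical UV
# texture, sequentially aggregated (averaged here), sampled back into each
# frame, and blended with that frame's own features to keep local detail.
import torch
import torch.nn.functional as F

def blend_through_uv(features, uv, mask, uv_res=64, alpha=0.8):
    """features: (T, C, H, W) post-attention features per frame.
    uv:       (T, 2, H, W) UV coordinates in [0, 1] rendered from the mesh.
    mask:     (T, 1, H, W) foreground mask (pixels with valid UVs)."""
    T, C, H, W = features.shape
    canvas = features.new_zeros(C, uv_res, uv_res)
    weight = features.new_zeros(1, uv_res, uv_res)
    # Sequentially aggregate every frame's features into the UV texture.
    for t in range(T):
        u = (uv[t, 0] * (uv_res - 1)).long().clamp(0, uv_res - 1)
        v = (uv[t, 1] * (uv_res - 1)).long().clamp(0, uv_res - 1)
        valid = mask[t, 0] > 0.5
        idx = v[valid] * uv_res + u[valid]                # flat UV indices
        canvas.view(C, -1).index_add_(1, idx, features[t][:, valid])
        weight.view(1, -1).index_add_(1, idx, features.new_ones(1, idx.numel()))
    canvas = canvas / weight.clamp(min=1.0)               # averaged UV features
    # Sample the aggregated UV texture back into each frame and mix it with the
    # frame's own features; background pixels are left untouched.
    grid = uv.permute(0, 2, 3, 1) * 2 - 1                 # grid_sample expects [-1, 1]
    resampled = F.grid_sample(canvas.expand(T, -1, -1, -1), grid, align_corners=True)
    return torch.where(mask > 0.5,
                       alpha * resampled + (1 - alpha) * features,
                       features)
```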
UV Noise Initialization
- Traditional video editing strategies based on DDIM inversion are ineffective on untextured renders. In contrast, the method introduces a noise initialization strategy by sampling Gaussian noise in the canonical UV space.
- Once generated, this noise is projected to each frame using the UV correspondences, resulting in initial latent codes that preserve temporal correlations across the video sequence. This initialization is critical to mitigate texture sticking and unwanted artifacts.
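A minimal sketch of this idea, assuming PyTorch latents of shape (T, C, H, W); the helper name and the fallback to independent background noise are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of UV noise initialization: Gaussian noise is drawn once in the
# canonical UV texture and warped into every frame with the rendered UV maps, so
# the same surface point starts from the same noise in all frames. Background
# pixels, which have no UV correspondence, fall back to independent noise.
import torch
import torch.nn.functional as F

def uv_noise_init(uv, mask, latent_channels=4, uv_res=64):
    """uv:   (T, 2, H, W) per-frame UV coordinates in [0, 1].
    mask: (T, 1, H, W) foreground mask; returns (T, C, H, W) initial latents."""
    T, _, H, W = uv.shape
    uv_noise = torch.randn(1, latent_channels, uv_res, uv_res, device=uv.device)
    grid = uv.permute(0, 2, 3, 1) * 2 - 1                      # to [-1, 1]
    # Nearest sampling avoids averaging noise values, keeping per-pixel
    # statistics close to a unit Gaussian.
    warped = F.grid_sample(uv_noise.expand(T, -1, -1, -1), grid,
                           mode="nearest", align_corners=True)
    background = torch.randn(T, latent_channels, H, W, device=uv.device)
    return torch.where(mask > 0.5, warped, background)
```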
Latent Normalization for Color Stability
- Observing significant color and contrast shifts across frames, the authors incorporate an Adaptive Instance Normalization (AdaIN) process. By computing statistics solely on the background region, the approach stabilizes the latent distributions and thereby improves the consistency of the decoded RGB outputs.
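A minimal sketch of such a background-statistics AdaIN step, applied to the latents as described above; the choice of reference frame and the exact place in the sampling loop where the correction is applied are assumptions here.

```python
# Minimal sketch: renormalize each frame's latent so that its background
# mean/std match those of a reference frame, damping color and contrast drift.
import torch

def background_adain(latents, bg_mask, ref_index=0, eps=1e-5):
    """latents: (T, C, H, W) latents; bg_mask: (T, 1, H, W) background mask."""
    def bg_stats(x, m):
        # Per-channel mean/std computed over background pixels only.
        w = m.expand_as(x)
        count = w.sum(dim=(2, 3)).clamp(min=1.0)
        mean = (x * w).sum(dim=(2, 3)) / count
        var = ((x - mean[..., None, None]) ** 2 * w).sum(dim=(2, 3)) / count
        return mean[..., None, None], var.sqrt()[..., None, None] + eps

    ref_mean, ref_std = bg_stats(latents[ref_index:ref_index + 1],
                                 bg_mask[ref_index:ref_index + 1])
    mean, std = bg_stats(latents, bg_mask)
    # Shift/scale every frame toward the reference frame's background statistics.
    return (latents - mean) / std * ref_std + ref_mean
```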
Experimental Validation and Quantitative Evaluations
- The method is evaluated on diverse scenarios including camera rotations, physical simulations, and character animations. Qualitatively, the approach demonstrates robustness to a variety of prompts while accurately capturing interaction cues such as shadows and lighting changes.
- Quantitatively, the method reports competitive frame consistency and prompt fidelity metrics based on CLIP embedding cosine similarities. For instance, frame consistency reaches roughly 0.9845, competitive with prior baselines that either rely solely on pre-attention propagation or suffer significant blurring when using token propagation.
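For concreteness, the sketch below computes the two CLIP-based metrics in the standard way (mean cosine similarity between consecutive frame embeddings for frame consistency, mean image-text similarity for prompt fidelity); the CLIP checkpoint and the averaging protocol are assumptions and may differ from the paper's exact evaluation.

```python
# Minimal sketch of the CLIP-based evaluation metrics.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_metrics(frames, prompt):
    """frames: list of PIL images; prompt: the text prompt used for generation."""
    img_in = processor(images=frames, return_tensors="pt")
    img_emb = model.get_image_features(**img_in)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_in = processor(text=[prompt], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_in)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    # Frame consistency: cosine similarity between consecutive frame embeddings.
    frame_consistency = (img_emb[:-1] * img_emb[1:]).sum(-1).mean().item()
    # Prompt fidelity: image-text cosine similarity averaged over frames.
    prompt_fidelity = (img_emb @ txt_emb.T).mean().item()
    return frame_consistency, prompt_fidelity
```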
Ablation Studies and Limitations
- Extensive ablation studies show that neither pre-attention nor post-attention feature injection alone is sufficient. Their combination is essential to balance temporal smoothness with high-fidelity renderings.
- The effectiveness of the UV noise initialization is further confirmed when compared to fixed noise strategies, where the latter results in noticeable artifacts.
- The authors also discuss limitations stemming from the low spatial dimensionality of the latent representation (e.g., 64×64) and the sensitivity of the VAE decoder to minute latent inconsistencies, which occasionally results in misalignments or perspective-related artifacts.
Overall, the paper thoroughly details a pipeline that converts low-fidelity 3D animations into high-quality, consistent stylized videos. By combining geometric consistency through canonical UV mappings with advanced manipulation of self-attention mechanisms and tailored noise initialization, the approach pushes the boundaries of controllable video synthesis using 2D diffusion models. This novel integration of 3D-guided correspondences into the diffusion process offers a promising avenue for bridging traditional rendering pipelines with state-of-the-art generative techniques.