Diffusion as Shader (DaS) Framework
- DaS is a 3D-aware video diffusion framework that uses dynamic 3D tracking signals to maintain temporal and geometric consistency.
- It leverages colored 3D point clouds to encode depth and motion, facilitating diverse controls like mesh-to-video rendering and camera trajectory manipulation.
- The framework outperforms 2D-conditioned models by achieving up to 60% improvement in pose accuracy while ensuring fine-grained temporal coherence.
Diffusion as Shader (DaS) is a unified, 3D-aware video diffusion framework designed to enable precise and versatile control over video generation. DaS integrates 3D tracking video signals as conditioning inputs for a diffusion transformer (DiT) backbone, reframing the video generation process as a form of temporal “shading” on point clouds. This paradigm allows DaS to naturally support diverse control tasks such as mesh-to-video rendering, motion transfer, camera trajectory control, and targeted object manipulation, all within a single architecture. Central to DaS is the use of dynamically tracked 3D points colored by their 3D coordinates, captured over time, which provides the necessary geometric and temporal coherence for fine-grained video generation control (Gu et al., 7 Jan 2025).
1. Motivation and Problem Landscape
Video diffusion models conditioned only on 2D guidance signals—such as optical flow, 2D keypoints, bounding boxes, or depth maps—struggle to enforce strict temporal and geometric consistency. These 2D-based controllers fail to represent the intrinsic 3D structure underpinning most real-world video content, resulting in temporal flickering and an inability to perform localized, geometry-aware edits (e.g., moving one limb independently in 3D space). As a result, precise control over attributes such as camera pose, subject motion, or object-specific manipulation has been limited by the lack of 3D representations within the diffusion process.
DaS addresses these limitations by leveraging 3D tracking videos—sequences of colored 3D points tracked under known camera trajectories—as control signals for the diffusion model. Each 3D point is consistently labeled (typically by its mapped RGB color from the initial frame), enabling the model to maintain consistent appearance and motion across generated frames. This reframing of the video generation task as “shading” on a mutable 3D substrate allows for richer, multi-modal user control and bridges the gap between 3D graphics pipelines and diffusion-based synthesis (Gu et al., 7 Jan 2025).
2. Methodological Pipeline
The DaS approach is instantiated as an image-to-video latent diffusion model, obtained by finetuning CogVideoX and adding a ControlNet-style branch for the 3D tracking data. The workflow for preprocessing and generative inference is as follows:
- 3D Tracking Signal Preparation
- Sample or estimate a set of 3D points and track them across the video frames.
- Encode each point's initial 3D location as an RGB color (e.g., by normalizing its coordinates), so every point keeps a consistent color over time.
- For each frame $t$, render the colored 3D points under that frame's camera pose, yielding a 2D "3D tracking video" (see the sketch below).
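A minimal sketch of this preparation step, assuming a pinhole camera model, numpy arrays, and illustrative function names (not the released DaS code): each point is colored once from its first-frame coordinates, then splatted into every frame under that frame's pose.

```python
import numpy as np

def color_points_by_position(points_xyz):
    """Map first-frame 3D coordinates to RGB colors in [0, 1] (one fixed color per point)."""
    lo, hi = points_xyz.min(axis=0), points_xyz.max(axis=0)
    return (points_xyz - lo) / (hi - lo + 1e-8)             # (N, 3) normalized to RGB

def render_tracking_frame(points_xyz, colors, K, w2c, height, width):
    """Project one frame's tracked 3D points into the image plane as colored dots."""
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]                           # world -> camera coordinates
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                              # perspective divide
    frame = np.zeros((height, width, 3), dtype=np.float32)
    valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < width) \
                            & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    u, v = uv[valid, 0].astype(int), uv[valid, 1].astype(int)
    frame[v, u] = colors[valid]                              # nearest-pixel splatting
    return frame
```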
- Video Generation Pipeline
- Input a keyframe image $I$ and pad it to a $T$-frame clip.
- Encode both the padded keyframe clip and the 3D tracking video via a shared VAE encoder to produce latents $z_I$ and $z_{3D}$.
- Initialize the noise latent $x_T \sim \mathcal{N}(0, I)$.
- Perform iterative reverse diffusion with the DiT architecture, injecting 3D conditioning features at each residual/attention block via the ControlNet mechanism.
- Decode the final latent to produce the RGB output video.
High-level pseudocode:
```python
# Encode the padded keyframe clip and the 3D tracking video with a shared VAE
z_I  = VAE.encode(pad_to_T(I))            # [T/4, H/8, W/8, C]
z_3D = VAE.encode(tracking_video)         # [T/4, H/8, W/8, C]

# The condition DiT turns z_3D into per-block guidance features (ControlNet-style)
cond_features = CondDiT(z_3D)

# Iterative reverse diffusion with the main DiT
x_t = sample_standard_normal(z_I.shape)
for t in reversed(range(1, num_steps + 1)):
    x_t = DiT(x_t, z_I, cond_features, t)   # cond_features injected at each block

video = VAE.decode(x_t)
```
This architecture enables the injection of 3D guidance signals at multiple model depths, facilitating both global and local control effects across the generated sequence (Gu et al., 7 Jan 2025).
3. Model Architecture and Conditioning Dynamics
The DaS backbone is a latent-space DiT derived from CogVideoX, consisting of 42 residual and attention blocks. To inject 3D control signals:
- The first 18 layers of the main DiT are duplicated as a "condition DiT" that processes $z_{3D}$.
- At each corresponding block, features from the condition DiT are passed through a zero-initialized linear projection and added to the main DiT's features (see the sketch below).
- This approach, inspired by ControlNet, leaves the pretrained diffusion weights undisturbed at the start of finetuning while allowing the 3D tracking signal to modulate the reverse process.
- Self-attention modules operate over the entire $T$-frame temporal axis, supporting global coherence.
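A minimal PyTorch sketch of this zero-initialized, ControlNet-style injection; the block modules, feature shapes, and call signatures here are placeholders rather than the released implementation:

```python
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """Wraps one main DiT block and its duplicated condition-DiT block."""
    def __init__(self, main_block: nn.Module, cond_block: nn.Module, dim: int):
        super().__init__()
        self.main_block = main_block            # frozen pretrained CogVideoX block
        self.cond_block = cond_block            # trainable copy processing z_3D features
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)        # zero init: injection is a no-op at step 0,
        nn.init.zeros_(self.proj.bias)          # so pretrained behaviour is preserved

    def forward(self, x, cond, t_emb):
        cond = self.cond_block(cond, t_emb)                 # 3D tracking features
        x = self.main_block(x, t_emb) + self.proj(cond)     # additive modulation
        return x, cond
```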
Key equations for the diffusion process:
- Noising step (forward process): $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$
- Denoising step (reverse process): $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, z_I, z_{3D}),\ \Sigma_t\big)$
3D geometric encoding is achieved implicitly; the colored 3D point cloud encodes depth and motion, obviating the need for an explicit pose network (Gu et al., 7 Jan 2025).
4. Training Regimen and Loss Specification
The finetuning process updates only the condition DiT branch. The training set combines real videos (from MiraData) and synthetic sequences (from Mixamo), with real-world content tracked at approximately 4.9k points per frame.
- Loss Function: The sole training objective is the standard diffusion (DDPM) noise-prediction loss, $\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon,\,t}\big[\|\epsilon - \epsilon_\theta(x_t, t, z_I, z_{3D})\|^2\big]$.
- No mask or geometry-specific losses are applied; all 3D consistency emerges from the structure of the colored 3D tracking video.
- Classifier guidance is omitted during training but employed at inference as needed.
Training utilized 8×H800 GPUs for 3 days, running 2,000 gradient-accumulated steps (batch size 64, learning rate 1e-4, AdamW optimizer) over approximately 10k videos of 49 frames each at a fixed resolution (Gu et al., 7 Jan 2025).
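A schematic training step consistent with this recipe, assuming epsilon-prediction and placeholder `vae`, `model`, and `scheduler` objects (only the condition branch is expected to carry gradients):

```python
import torch
import torch.nn.functional as F

def training_step(model, vae, scheduler, optimizer, batch):
    """One DDPM finetuning step; the frozen main DiT receives no parameter updates."""
    with torch.no_grad():
        z_I  = vae.encode(batch["keyframe_clip"])      # padded keyframe latents
        z_3D = vae.encode(batch["tracking_video"])     # 3D tracking latents
        x0   = vae.encode(batch["target_video"])       # clean target-video latents

    t = torch.randint(0, scheduler.num_train_timesteps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)            # forward (noising) process

    pred = model(x_t, t, z_I, z_3D)                    # predicted noise epsilon_theta
    loss = F.mse_loss(pred, noise)                     # standard DDPM objective

    loss.backward()                                    # gradients flow only to the condition DiT
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```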
5. Supported Control Modalities via 3D Tracking Manipulation
Control is determined by how the 3D tracking video is constructed and edited before it is provided to the diffusion model. Four major use cases are implemented as follows:
| Control Task | 3D Tracking Signal Construction | Output |
|---|---|---|
| Mesh → Video | Mesh sequence in Blender; dense surface sampling | Photorealistic animation from stylized mesh input |
| Motion Transfer | 3D track from source video; repaint keyframe | New-action video with target style/content |
| Camera Control | Depth estimated on the keyframe, re-projected along a new camera path | Video from novel 3D viewpoints |
| Object Manipulation | Depth + object segmentation; 3D transform object points | Videos with user-specified object movement |
In each scenario, altering the 3D tracking video directly determines the spatiotemporal or semantic guidance afforded to the model. For example, camera trajectories or object-centric motions can be defined programmatically or interactively, which DaS translates into temporally coherent, photorealistic video renderings (Gu et al., 7 Jan 2025).
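As one concrete example, the camera-control signal can be built by unprojecting estimated depth on the keyframe into a colored point cloud and re-rendering it along a user-defined camera path. A hypothetical sketch, reusing the helpers from the earlier sketch and assuming depth comes from an off-the-shelf estimator:

```python
import numpy as np

def camera_control_tracking(keyframe_depth, K, camera_path, height, width):
    """Build a 3D tracking video for a static scene seen along a new camera path."""
    # Unproject the keyframe depth map into a 3D point cloud
    # (the keyframe camera is treated as the world frame).
    v, u = np.mgrid[0:height, 0:width]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = (np.linalg.inv(K) @ pix.T).T
    points = rays * keyframe_depth.reshape(-1, 1)

    colors = color_points_by_position(points)               # fixed per-point colors
    frames = [render_tracking_frame(points, colors, K, w2c, height, width)
              for w2c in camera_path]                        # one 4x4 world-to-camera pose per frame
    return np.stack(frames)                                  # (T, H, W, 3) tracking video
```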
6. Quantitative and Qualitative Evaluation
DaS demonstrates significant advantages over prior art across several tasks:
- Camera Control: On 100 RealEstate10K and synthetic camera trajectories:
- Translational Error (TransErr): DaS ≈ 37° versus MotionCtrl/CameraCtrl 67°
- Rotational Error (RotErr): DaS ≈ 10° versus MotionCtrl 39°, CameraCtrl 30°
- This results in a 40–60% improvement in pose accuracy.
- Motion Transfer: CLIP-based metrics (Text/Temp) on 50 DAVIS and MiraData videos:
- Text: 32.6 (DaS) versus 16.9 (CCEdit), 31.9 (TokenFlow)
- Temporal Consistency: 0.971 (DaS) versus 0.932 (CCEdit), 0.956 (TokenFlow)
- Mesh-to-Video & Object Manipulation: Qualitative persistence of 3D structure, texture, and style, notably outperforming CHAMP on human SMPL animations.
- Ablation: Replacing full 3D tracking with depth-only reduces PSNR by 1.2 dB, SSIM by 0.08, and increases FVD by 90. Optimal point cloud density observed at ~4.9k points/frame, with diminishing returns beyond this density (Gu et al., 7 Jan 2025).
7. Discussion, Limitations, and Future Directions
DaS's unification of multiple video control modalities through 3D-aware conditioning enables strong geometric and temporal coherence, with practical data efficiency—fine-tuning on under 10k videos over 3 days suffices to surpass specialized alternatives.
Limitations include sensitivity to mismatches between the tracked 3D points and the keyframe image. If regions are untracked or inconsistent, the model may hallucinate new scene content or allow unconstrained drift. The system relies on complete and compatible 3D tracking data for optimal performance.
Future research opportunities include learning to hallucinate missing or incomplete 3D tracklets, performing end-to-end diffusion in the 3D tracking domain (e.g., sketch- or text-based 3D motion prompting), and integrating physical simulation priors (e.g., rigidity, fluid dynamics) as auxiliary “shader”-like controls (Gu et al., 7 Jan 2025).