DRAW2ACT: Depth-Aware Video Synthesis
- DRAW2ACT is a depth-aware video synthesis framework that uses 3D trajectory conditioning and multi-modal diffusion to create high-fidelity robotic demonstration videos.
- It employs orthogonal representations from RGB frames, depth cues, and DINOv2 object features to ensure robust spatio-temporal consistency in video outputs.
- Joint synthesis of synchronized RGB–depth videos feeds a lightweight policy regressor, significantly boosting robotic manipulation success over prior baselines.
DRAW2ACT is a depth-aware, trajectory-conditioned video generation framework designed to produce high-fidelity, controllable robotic demonstration videos for manipulation tasks. Unlike prior approaches that condition on 2D trajectories or single modalities, DRAW2ACT encodes a 3D control trajectory and injects multiple, orthogonal representations—capturing depth, semantics, shape, and motion—into a video diffusion model. The system jointly synthesizes spatially aligned RGB and depth videos via a Latent Diffusion Transformer, using cross-modality attention and depth supervision to guarantee spatio-temporal consistency. A lightweight multimodal policy regressor consumes these generated videos to produce the robot’s joint commands, resulting in downstream manipulation that achieves significantly higher success rates and visual consistency compared to baseline methods (Bai et al., 16 Dec 2025).
1. Orthogonal Representation of 3D Trajectory Conditioning
At the core of DRAW2ACT is an explicit 3D "control trajectory"

$\tau = \{(u_t, v_t, d_t)\}_{t=1}^{T},$

where $(u_t, v_t)$ indicates the per-frame object center in pixel coordinates and $d_t$ the object's relative depth, computed using Video Depth Anything at each frame $t$. From $\tau$, three orthogonal representations are extracted and used as conditioning inputs:
- Reference-Frame Latent $z_{\text{ref}}$: The entire color-coded trajectory (with start/end points) is overlaid on the initial RGB frame $I_0$, forming $\tilde{I}_0$. This is embedded by a pretrained 3D-causal VAE $\mathcal{E}$ to yield $z_{\text{ref}} = \mathcal{E}(\tilde{I}_0)$, which is concatenated along the temporal axis to the noisy latent at inference for diffusion-based generation.
- DINOv2 Object Features $F_{\text{obj}}$: The object mask is segmented on the first frame via Grounded-SAM + TrackAnything. The cropped object is processed with DINOv2, producing a $D$-dimensional feature, which is pasted spatially onto each frame at the corresponding trajectory location (bilinearly interpolated), aligning features with depth and suppressing background noise. This feature sequence is temporally compressed to $T'$ frames to form $F_{\text{obj}}$.
- Coordinate-Augmented Text Prompt $c_{\text{text}}$: A natural-language robotic task description with explicit coordinates is encoded with a T5 model, producing $c_{\text{text}}$ for cross-attention in the diffusion transformer.
The complete conditioning set is $c = \{z_{\text{ref}}, F_{\text{obj}}, c_{\text{text}}\}$, providing complementary cues for depth, semantics, and high-level intent.
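The bilinear pasting of object features onto frames at trajectory locations can be sketched as a splatting operation. This is an illustrative numpy implementation under assumed conventions (the paper does not give the exact operator); `paste_feature` and its `(H, W, D)` layout are hypothetical names:

```python
import numpy as np

def paste_feature(feat_map: np.ndarray, vec: np.ndarray, u: float, v: float) -> np.ndarray:
    """Bilinearly splat a D-dim object feature vector onto an (H, W, D) map
    at subpixel location (u, v); the four neighboring cells share the mass."""
    H, W, _ = feat_map.shape
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    out = feat_map.copy()
    for (uu, vv, w) in [(u0, v0, (1 - du) * (1 - dv)),
                        (u0 + 1, v0, du * (1 - dv)),
                        (u0, v0 + 1, (1 - du) * dv),
                        (u0 + 1, v0 + 1, du * dv)]:
        if 0 <= uu < W and 0 <= vv < H:
            out[vv, uu] += w * vec  # weighted copy of the object feature
    return out

# Paste a 4-dim feature halfway between columns 2 and 3, row 3.
fmap = paste_feature(np.zeros((8, 8, 4)), np.ones(4), u=2.5, v=3.0)
```

Repeating this per frame at each $(u_t, v_t)$ yields the spatially aligned feature sequence that is then temporally compressed.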
2. Latent Diffusion Transformer with Multi-Modal Conditioning
The generative core is a latent diffusion model leveraging a DiT denoiser $\epsilon_\theta$. The VAE encoder $\mathcal{E}$ maps a ground-truth video $x$ to a latent $z_0$; a forward SDE corrupts it to $z_t$, and $\epsilon_\theta$ is trained to predict the added noise. The objective is:

$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\right].$
Within each DiT block, conditioning is performed via:
- Self-attention on $z_t$ to propagate depth and motion cues across all time steps.
- Cross-attention to $c_{\text{text}}$: queries from DiT hidden states, keys/values from the text-prompt encoder outputs.
- Gated residual fusion of $F_{\text{obj}}$: a gating vector $g$ modulates the object features and injects them into the hidden states at every block:

$h \leftarrow h + \sigma(g) \odot F_{\text{obj}},$

where $h$ is the flattened spatio-temporal hidden state, $\odot$ denotes the elementwise product, and $\sigma$ is the sigmoid.
This architecture supports sustained semantic, spatial, and temporal alignment of generated content with the input trajectory and object.
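The gated residual fusion step above can be sketched in a few lines of numpy. How the gate $g$ is produced is not specified here, so predicting it linearly from the hidden state (`W_g`) is an illustrative assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h: np.ndarray, f_obj: np.ndarray, W_g: np.ndarray) -> np.ndarray:
    """Gated residual injection of object features into the flattened
    spatio-temporal hidden state: h <- h + sigmoid(g) * f_obj.
    The gate g is predicted from h here (one illustrative choice)."""
    g = h @ W_g                        # (N, D) gating logits
    return h + sigmoid(g) * f_obj      # elementwise modulation, residual add

rng = np.random.default_rng(0)
h = rng.standard_normal((6, 16))       # N tokens, D channels
f = rng.standard_normal((6, 16))       # spatially aligned object features
out = gated_fusion(h, f, np.zeros((16, 16)))  # zero gate -> sigmoid(0) = 0.5
```

With a learned `W_g`, the network can suppress or amplify the object features per token, which is what lets the fusion stay aligned with the trajectory across blocks.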
3. Joint RGB and Depth Video Generation
DRAW2ACT enforces geometric consistency by synthesizing RGB and depth videos together in the latent space. At training time, the latents of the RGB ($z^{\text{rgb}}$) and depth ($z^{\text{depth}}$) videos are concatenated temporally:

$z_0 = [z_0^{\text{rgb}}; z_0^{\text{depth}}].$
This sequence is processed by the DiT, which applies self- and cross-attention jointly over both modalities. A single diffusion loss $\mathcal{L}_{\text{diff}}$ suffices, with no explicit depth-supervision term required. Upon decoding, the VAE produces temporally and spatially aligned RGB/depth sequences, ensuring spatio-temporal and cross-modal fidelity.
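The temporal concatenation and its inverse are simple tensor operations; a minimal sketch, assuming a `(T, C, H, W)` latent layout (the actual layout is not stated):

```python
import numpy as np

def concat_modalities(z_rgb: np.ndarray, z_depth: np.ndarray) -> np.ndarray:
    """Concatenate RGB and depth latents along the temporal axis (axis 0)
    so a single DiT pass attends jointly over both modalities."""
    return np.concatenate([z_rgb, z_depth], axis=0)

def split_modalities(z: np.ndarray, t_rgb: int):
    """Undo the temporal concatenation after denoising."""
    return z[:t_rgb], z[t_rgb:]

z_rgb = np.zeros((8, 4, 16, 16))    # (T, C, H, W) latent video
z_depth = np.ones((8, 4, 16, 16))
z = concat_modalities(z_rgb, z_depth)
r, d = split_modalities(z, t_rgb=8)
```

Because both halves live in one token sequence, vanilla self-attention in the DiT already acts as the cross-modality attention the text describes.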
4. Multimodal Policy Regression for Robotic Control
The synthesized (RGB, depth) demonstration video is the input to a lightweight policy regressor, predicting joint angles and gripper state:
- Both video streams are encoded with the same 3D VAE, providing modality-specific latents $z^{\text{rgb}}$ and $z^{\text{depth}}$.
- Patch embeddings convert these into sequences of tokens.
- Each modality proceeds through a spatial Transformer (per-frame spatial dependencies) and a temporal Transformer (temporal dynamics).
- Cross-attention allows, for example, RGB stream queries to attend to depth stream keys/values and vice versa, integrating both modalities.
- The fused feature representations are summed and decoded via a ResNet-style head, outputting joint angles $q \in \mathbb{R}^{T \times J}$ (where $J$ is the number of joints) plus a gripper open/close probability.
The policy is trained using a regression loss between predicted and ground-truth actions:

$\mathcal{L}_{\text{policy}} = \sum_t \lVert \hat{a}_t - a_t \rVert^2,$

where $a_t$ stacks the joint angles and gripper state at timestep $t$.
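A minimal numpy sketch of this regression loss, assuming an $\ell_2$ form since the exact norm is not specified here:

```python
import numpy as np

def policy_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Regression loss between predicted and ground-truth action
    sequences (joint angles + gripper state), shape (T, J + 1).
    An l2 form is assumed; the paper's exact norm may differ."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

pred = np.array([[0.1, 0.2], [0.3, 0.4]])   # toy 2-step, 2-dim actions
target = np.zeros((2, 2))
loss = policy_loss(pred, target)            # mean over t of per-step squared error
```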
5. Training and Inference Pipeline
The end-to-end workflow is divided into training and inference phases.
Training Protocol
- Input: Dataset $\mathcal{D} = \{(x_i, m_i, \tau_i, p_i, a_i)\}$, where $x_i$ is an RGB video; $m_i$, object mask; $\tau_i$, 3D trajectory; $p_i$, text prompt; $a_i$, joint angles.
- For each sample, $\tau$ is obtained from mask-tracking and depth estimation. The overlay $\tilde{I}_0$ places $\tau$ on $I_0$ and is encoded to $z_{\text{ref}}$. Object cropping plus DINOv2 yields features encoded as $F_{\text{obj}}$.
- The full RGB and depth videos are encoded as $z_0 = [z_0^{\text{rgb}}; z_0^{\text{depth}}]$; a random timestep $t$ and Gaussian noise $\epsilon$ generate $z_t$.
- The diffusion loss $\mathcal{L}_{\text{diff}}$ is backpropagated to train $\epsilon_\theta$.
- Once converged, $\epsilon_\theta$ is frozen; it generates synthetic demonstrations $(\hat{x}^{\text{rgb}}, \hat{x}^{\text{depth}})$, which train the downstream policy net using $\mathcal{L}_{\text{policy}}$.
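The noising half of a training step can be sketched as a standard DDPM-style corruption; the noise schedule below is an illustrative placeholder, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_step(z0: np.ndarray, alpha_bar: np.ndarray):
    """One training step's forward corruption: sample a timestep t and
    Gaussian noise eps, form the noisy latent z_t, and return the noise
    target the denoiser must predict."""
    t = rng.integers(len(alpha_bar))
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    return z_t, eps, t   # loss would be ||eps_theta(z_t, t, c) - eps||^2

alpha_bar = np.linspace(0.999, 0.01, 1000)   # illustrative noise schedule
z0 = np.zeros((4, 8))                        # toy clean latent
z_t, eps, t = diffusion_step(z0, alpha_bar)
```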
Inference Protocol
- Given a new scene $(I_0, \tau, p)$, compute the conditioning set $c = \{z_{\text{ref}}, F_{\text{obj}}, c_{\text{text}}\}$. Sample initial noise $z_T \sim \mathcal{N}(0, I)$ and run the reverse diffusion conditioned on $c$ to obtain $\hat{z}_0$.
- Decode $\hat{z}_0$ to generate synchronized RGB and depth demonstration videos $(\hat{x}^{\text{rgb}}, \hat{x}^{\text{depth}})$.
- Input $(\hat{x}^{\text{rgb}}, \hat{x}^{\text{depth}})$ to the policy net to infer joint commands $\hat{a}$.
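The inference steps above can be wired together as a short orchestration function. All four callables are stand-ins for the paper's learned modules (conditioning encoder, DiT denoiser, VAE decoder, policy net); the toy lambdas only exercise the control flow:

```python
import numpy as np

def run_inference(encode_cond, denoise, decode, policy, shape, steps=4):
    """End-to-end sketch: build conditioning c, run reverse diffusion from
    Gaussian noise z_T, decode the RGB/depth videos, regress joint commands."""
    c = encode_cond()
    z = np.random.default_rng(0).standard_normal(shape)  # z_T ~ N(0, I)
    for t in reversed(range(steps)):
        z = denoise(z, t, c)          # one reverse-diffusion update
    rgb, depth = decode(z)
    return policy(rgb, depth)

actions = run_inference(
    encode_cond=lambda: {"z_ref": 0, "f_obj": 0, "c_text": 0},
    denoise=lambda z, t, c: 0.5 * z,  # toy contraction toward the data mean
    decode=lambda z: (z, z),
    policy=lambda rgb, depth: np.zeros((rgb.shape[0], 7)),  # 7 joints assumed
    shape=(8, 16),
)
```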
6. Experimental Evaluation and Ablation
DRAW2ACT was evaluated on BridgeData V2 (WidowX), Berkeley UR5, and MuJoCo-simulated Franka Panda, comprising approximately 50.6K video clips and 100 test tasks per dataset. The model was benchmarked against LeviTor, Tora, MotionCtrl, and DragAnything baselines. Evaluation metrics included:
- Video quality (VBench-2.0): Motion Smoothness, Background Consistency, Subject Consistency, Temporal Flicker.
- Trajectory Error: mean pixel distance between the ground-truth trajectory and the trajectory extracted from the generated video.
- Depth-video fidelity: LPIPS, SSIM, PSNR, and FVD, all computed on depth (higher SSIM/PSNR and lower LPIPS/FVD are better).
- Downstream task success: Fraction of generated videos enabling the policy to complete a successful pick-and-place.
Key simulator results:
| Metric | DRAW2ACT | Tora |
|---|---|---|
| Motion Consistency | 0.9865 | 0.9844 |
| Object Trajectory Error | 19.88 px | 35.44 px |
| Downstream Success | 65.2% | 36.8% |
Ablation experiments demonstrated that each module contributes: joint depth–RGB generation improved video fidelity; the 3D trajectory reduced average trajectory error from approximately 36 px to 21 px; and DINOv2 object features further decreased error to 19.9 px while improving manipulation success to 65.2%.
7. Summary and Principal Contributions
DRAW2ACT establishes a new state of the art in controllable, visually consistent, and manipulation-relevant robotic demonstration video synthesis. Its key advances are multi-stream trajectory encoding, gated DINOv2 fusion within a diffusion transformer, and spatio-temporally consistent joint RGB–depth generation. These design choices yield more accurate, stable, and manipulable demonstrations and translate directly to increased downstream robotic task performance compared to leading baselines (Bai et al., 16 Dec 2025).