DRAW2ACT: Depth-Aware Video Synthesis

Updated 23 December 2025
  • DRAW2ACT is a depth-aware video synthesis framework that uses 3D trajectory conditioning and multi-modal diffusion to create high-fidelity robotic demonstration videos.
  • It employs orthogonal representations from RGB frames, depth cues, and DINOv2 object features to ensure robust spatio-temporal consistency in video outputs.
  • Joint synthesis of synchronized RGB–depth videos feeds a lightweight policy regressor, significantly boosting robotic manipulation success over prior baselines.

DRAW2ACT is a depth-aware, trajectory-conditioned video generation framework designed to produce high-fidelity, controllable robotic demonstration videos for manipulation tasks. Unlike prior approaches that condition on 2D trajectories or single modalities, DRAW2ACT encodes a 3D control trajectory and injects multiple, orthogonal representations—capturing depth, semantics, shape, and motion—into a video diffusion model. The system jointly synthesizes spatially aligned RGB and depth videos via a Latent Diffusion Transformer, using cross-modality attention and depth supervision to guarantee spatio-temporal consistency. A lightweight multimodal policy regressor consumes these generated videos to produce the robot’s joint commands, resulting in downstream manipulation that achieves significantly higher success rates and visual consistency compared to baseline methods (Bai et al., 16 Dec 2025).

1. Orthogonal Representation of 3D Trajectory Conditioning

At the core of DRAW2ACT is an explicit 3D “control trajectory”

q = \{ (x_0, y_0, d_0), \ldots, (x_{N-1}, y_{N-1}, d_{N-1}) \}

where (x_i, y_i) denotes the per-frame object-center pixel coordinates and d_i the object's relative depth, computed with Video Depth Anything at each (x_i, y_i). From q, three orthogonal representations are extracted and used as conditioning inputs:

  1. Reference-Frame Latent z_0^{ref}: The entire color-coded trajectory (with start/end points) is overlaid on the initial RGB frame, forming I_0^{ref}. This is embedded by a pretrained 3D-causal VAE \mathcal{E} to yield z_0^{ref} \in \mathbb{R}^{16 \times 1 \times h \times w}, which is concatenated along the temporal axis with the noisy latent z_t at inference for diffusion-based generation.
  2. DINOv2 Object Features y_{dino}: The object mask M_0 is segmented on the first frame via Grounded-SAM + TrackAnything. The cropped object is processed with DINOv2, producing a C_{dino}-dimensional feature (typically C_{dino} = 1024), which is pasted spatially onto each frame at the corresponding trajectory location (bilinearly interpolated), aligning features with depth and suppressing background noise. This feature sequence is temporally compressed to n frames to form y_{dino} \in \mathbb{R}^{C_{dino} \times n \times h \times w}.
  3. Coordinate-Augmented Text Prompt y_c: A natural-language robotic task description with explicit coordinates is encoded with a T5 model, producing y_c for cross-attention in the diffusion transformer.

The complete conditioning set is D = \{z_0^{ref}, y_{dino}, y_c\}, providing complementary cues for depth, semantics, and high-level intent.
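
The trajectory-extraction step can be sketched as follows. This is a minimal illustration, assuming per-frame object masks and relative depth maps are already available (the paper obtains them from Grounded-SAM + TrackAnything and Video Depth Anything); the function name and array shapes are hypothetical:

```python
import numpy as np

def extract_control_trajectory(masks, depth_maps):
    """Build the 3D control trajectory q = [(x_i, y_i, d_i), ...].

    masks:      (N, H, W) boolean per-frame object masks
    depth_maps: (N, H, W) per-frame relative depth (e.g. in [0, 1])
    """
    q = []
    for mask, depth in zip(masks, depth_maps):
        ys, xs = np.nonzero(mask)                # pixels belonging to the object
        x, y = int(xs.mean()), int(ys.mean())    # object-center pixel coordinates
        d = float(depth[y, x])                   # relative depth at the center
        q.append((x, y, d))
    return q
```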

2. Latent Diffusion Transformer with Multi-Modal Conditioning

The generative core is a latent diffusion model with a DiT denoiser \epsilon_\theta. The VAE encoder \mathcal{E} maps a ground-truth video V \in \mathbb{R}^{3 \times N \times H \times W} to a latent z_0; a forward SDE corrupts it to z_t, and \epsilon_\theta is trained to predict the added noise. The objective is:

L_{diffusion} = \mathbb{E}_{z_0, D, \epsilon \sim \mathcal{N}(0, I), t} \| \epsilon - \epsilon_\theta(z_t, D, t) \|_2^2
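
Under a standard DDPM-style discretization (an assumption on our part; the paper formulates the corruption as a forward SDE), one training step can be sketched as follows, with the denoiser passed in as a stand-in for \epsilon_\theta and the conditioning set D omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(z0, denoiser, alpha_bar, t):
    """Noise-prediction loss for one sample at timestep t.

    z0:        clean video latent, e.g. shape (16, n, h, w)
    denoiser:  stand-in for eps_theta(z_t, D, t); conditioning omitted here
    alpha_bar: cumulative noise schedule, shape (T,)
    """
    eps = rng.standard_normal(z0.shape)                       # Gaussian noise
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = denoiser(z_t, t)                                # predicted noise
    return float(np.mean((eps - eps_hat) ** 2))               # MSE objective
```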

Within each DiT block, conditioning is performed via:

  • Self-attention on z_0^{ref} to propagate depth and motion cues across all time steps.
  • Cross-attention to y_c: queries from the DiT hidden states, keys/values from the text-encoder outputs.
  • Gated residual fusion of y_{dino}: a gating vector G = \sigma(W_g y_{dino} + b_g) modulates the object features and injects them into the hidden states at every block:

h' = h + \text{LayerNorm}(y_{dino} \odot G) \odot (y_{dino} \odot G)

where h is the flattened spatio-temporal hidden state, \odot denotes the elementwise product, and \sigma is the sigmoid.
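
A minimal numpy sketch of this gated residual fusion, following the equation above term by term; the parameter shapes (a per-channel projection W_g, bias b_g) are our assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the feature (last) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_fusion(h, y_dino, W_g, b_g):
    """h' = h + LayerNorm(y ⊙ G) ⊙ (y ⊙ G), with G = sigmoid(W_g y + b_g).

    h, y_dino: (tokens, C) flattened spatio-temporal features
    W_g:       (C, C) gating projection; b_g: (C,) bias
    """
    G = sigmoid(y_dino @ W_g + b_g)   # gating vector in (0, 1)
    u = y_dino * G                    # gated object features
    return h + layer_norm(u) * u      # residual injection into hidden states
```

With the gate driven to zero (G ≈ 0), the update reduces to the identity, so the object features can be injected or suppressed per channel.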

This architecture supports sustained semantic, spatial, and temporal alignment of generated content with the input trajectory and object.

3. Joint RGB and Depth Video Generation

DRAW2ACT enforces geometric consistency by synthesizing RGB and depth videos together in latent space. During training, the latents of the RGB (z_{rgb}) and depth (z_{depth}) videos are concatenated temporally:

[z_{rgb} \parallel z_{depth}] \in \mathbb{R}^{16 \times 2n \times h \times w}

This sequence is processed by the DiT, which applies self- and cross-attention jointly over both modalities. A single loss L_{diffusion} suffices, with no explicit depth-supervision term required. Upon decoding, the VAE decoder \mathcal{E}^{-1} produces temporally and spatially aligned RGB/depth sequences, ensuring spatio-temporal and cross-modal fidelity.
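
The modality concatenation and its inverse are simple shape bookkeeping; a sketch with hypothetical latent shapes:

```python
import numpy as np

def concat_modalities(z_rgb, z_depth):
    """Concatenate RGB and depth latents along the temporal axis.

    z_rgb, z_depth: (16, n, h, w)  ->  joint latent (16, 2n, h, w)
    """
    return np.concatenate([z_rgb, z_depth], axis=1)

def split_modalities(z_joint):
    """Invert the concatenation after denoising, before VAE decoding."""
    n = z_joint.shape[1] // 2
    return z_joint[:, :n], z_joint[:, n:]
```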

4. Multimodal Policy Regression for Robotic Control

The synthesized (RGB, depth) demonstration video is the input to a lightweight policy regressor, predicting joint angles and gripper state:

  • Both video streams are encoded with the same 3D VAE, providing modality-specific latents \ell_{rgb} and \ell_{depth}.
  • Patch embeddings convert these into sequences of tokens.
  • Each modality proceeds through a spatial Transformer (per-frame spatial dependencies) and a temporal Transformer (temporal dynamics).
  • Cross-attention allows, for example, RGB stream queries to attend to depth stream keys/values and vice versa, integrating both modalities.
  • The fused feature representations are summed and decoded via a ResNet-style head, outputting joint angles \hat{a} \in \mathbb{R}^K (where K is the number of joints) plus a gripper open/close probability.

The policy is trained using an \ell_2 regression loss:

L_{policy} = \mathbb{E}_{(V, a^*)} \| \hat{a}(V) - a^* \|_2^2
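
The two central operations of the regressor, cross-modal attention and the \ell_2 loss, can be sketched as follows; this is a single-head, unprojected simplification (no learned Q/K/V matrices), so treat it as illustrative only:

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens):
    """Single-head cross-attention: e.g. RGB queries attend to depth keys/values.

    q_tokens:  (Tq, C) query tokens; kv_tokens: (Tkv, C) key/value tokens
    """
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)               # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over keys
    return weights @ kv_tokens                                 # attention readout

def policy_loss(a_hat, a_star):
    """l2 regression loss between predicted and ground-truth joint angles (K,)."""
    return float(np.sum((a_hat - a_star) ** 2))
```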

5. Training and Inference Pipeline

The end-to-end workflow is divided into training and inference phases.

Training Protocol

  • Input: Dataset \{(V_i, M_i, q_i, c_i, a_i)\}, where V_i is an RGB video, M_i the object mask, q_i the 3D trajectory, c_i the text prompt, and a_i the joint angles.
  • For each sample, q is obtained from mask tracking and depth estimation. I_0^{ref} overlays q on the first frame V_0 and is encoded to z_0^{ref}. Object cropping plus DINOv2 yields y_{dino}; T5 encodes c_i as y_c.
  • The full RGB and depth videos are encoded as z_0; a random timestep t and Gaussian noise \epsilon generate z_t.
  • The diffusion loss L_{diffusion} is backpropagated to train \epsilon_\theta.
  • Once converged, \epsilon_\theta is frozen; it generates synthetic demonstrations \tilde{V}, which train the downstream policy network using L_{policy}.

Inference Protocol

  • Given a new scene (I_0, M, q, c), compute (z_0^{ref}, y_{dino}, y_c). Sample initial noise z_T \sim \mathcal{N}(0, I) and run the reverse diffusion conditioned on \{z_0^{ref}, y_{dino}, y_c\} to obtain z_0.
  • Decode z_0 to generate synchronized RGB and depth demonstration videos \tilde{V}.
  • Feed \tilde{V} to the policy network to infer joint commands \hat{a}.
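
The reverse diffusion can be sketched as a generic ancestral sampling loop. This assumes DDPM-style discrete sampling (the paper does not specify the sampler), with the conditioning set D threaded through a stand-in denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(denoiser, cond, shape, betas):
    """Generic DDPM-style ancestral sampling, conditioned on D = cond.

    denoiser: stand-in for eps_theta(z_t, D, t)
    betas:    noise schedule, shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    z = rng.standard_normal(shape)               # z_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):      # T-1, ..., 0
        eps_hat = denoiser(z, cond, t)
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            z = z + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return z                                      # denoised latent z_0
```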

6. Experimental Evaluation and Ablation

DRAW2ACT was evaluated on BridgeData V2 (WidowX), Berkeley UR5, and MuJoCo-simulated Franka Panda, comprising approximately 50.6K video clips and 100 test tasks per dataset. The model was benchmarked against LeviTor, Tora, MotionCtrl, and DragAnything baselines. Evaluation metrics included:

  • Video quality (VBench-2.0): Motion Smoothness, Background Consistency, Subject Consistency, Temporal Flicker.
  • Trajectory Error: mean L_1 distance between the ground-truth trajectory q and the trajectory extracted from the generated video.
  • Depth-video fidelity: SSIM, PSNR, LPIPS, and FVD, all computed on the depth videos (higher SSIM/PSNR and lower LPIPS/FVD are better).
  • Downstream task success: Fraction of generated videos enabling the policy to complete a successful pick-and-place.
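
The trajectory-error metric can be computed directly from the two point sequences; a sketch assuming both trajectories are given as per-frame (x, y) pixel coordinates:

```python
import numpy as np

def trajectory_error(q_gt, q_gen):
    """Mean L1 distance (in pixels) between ground-truth and extracted trajectories.

    q_gt, q_gen: (N, 2) arrays of per-frame (x, y) object centers
    """
    diffs = np.abs(np.asarray(q_gt, float) - np.asarray(q_gen, float))
    return float(diffs.sum(axis=1).mean())   # L1 per frame, averaged over frames
```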

Key simulator results:

Metric                     DRAW2ACT    Tora
Motion Consistency         0.9865      0.9844
Object Trajectory Error    19.88 px    35.44 px
Downstream Success         65.2%       36.8%

Ablation experiments showed that each module contributes to the final performance: joint depth–RGB generation improves video fidelity; the 3D trajectory reduces average trajectory error from approximately 36 px to 21 px; and DINOv2 object features further decrease the error to 19.9 px while raising manipulation success to 65.2%.

7. Summary and Principal Contributions

DRAW2ACT establishes a new state of the art in controllable, visually consistent, and manipulation-relevant robotic demonstration video synthesis. Its key advances are multi-stream trajectory encoding, gated DINOv2 fusion within a diffusion transformer, and spatio-temporally consistent joint RGB–depth generation. These design choices yield more accurate, stable, and manipulable demonstrations and translate directly to increased downstream robotic task performance compared to leading baselines (Bai et al., 16 Dec 2025).
