DRAW2ACT: Depth-Aware Video Synthesis

Updated 23 December 2025
  • DRAW2ACT is a depth-aware video synthesis framework that uses 3D trajectory conditioning and multi-modal diffusion to create high-fidelity robotic demonstration videos.
  • It employs orthogonal representations from RGB frames, depth cues, and DINOv2 object features to ensure robust spatio-temporal consistency in video outputs.
  • Joint synthesis of synchronized RGB–depth videos feeds a lightweight policy regressor, significantly boosting robotic manipulation success over prior baselines.

DRAW2ACT is a depth-aware, trajectory-conditioned video generation framework designed to produce high-fidelity, controllable robotic demonstration videos for manipulation tasks. Unlike prior approaches that condition on 2D trajectories or single modalities, DRAW2ACT encodes a 3D control trajectory and injects multiple, orthogonal representations—capturing depth, semantics, shape, and motion—into a video diffusion model. The system jointly synthesizes spatially aligned RGB and depth videos via a Latent Diffusion Transformer, using cross-modality attention and depth supervision to guarantee spatio-temporal consistency. A lightweight multimodal policy regressor consumes these generated videos to produce the robot’s joint commands, resulting in downstream manipulation that achieves significantly higher success rates and visual consistency compared to baseline methods (Bai et al., 16 Dec 2025).

1. Orthogonal Representation of 3D Trajectory Conditioning

At the core of DRAW2ACT is an explicit 3D “control trajectory”

q = \{ (x_0, y_0, d_0), \ldots, (x_{N-1}, y_{N-1}, d_{N-1}) \}

where (x_i, y_i) denotes the per-frame object-center pixel coordinates and d_i the object's relative depth, computed with Video Depth Anything at each (x_i, y_i). From q, three orthogonal representations are extracted and used as conditioning inputs:

  1. Reference-Frame Latent z_0^{ref}: The entire color-coded trajectory (with start/end points) is overlaid on the initial RGB frame, forming I_0^{ref}. This is embedded by a pretrained 3D-causal VAE \mathcal{E} to yield z_0^{ref} \in \mathbb{R}^{16 \times 1 \times h \times w}, which is concatenated along the temporal axis with the noisy latent z_t at inference for diffusion-based generation.
  2. DINOv2 Object Features y_{dino}: The object mask M_0 is segmented on the first frame via Grounded-SAM + TrackAnything. The cropped object is processed with DINOv2, producing a C_{dino}-dimensional feature (typically C_{dino} = 1024), which is pasted spatially onto each frame at the corresponding trajectory location (bilinearly interpolated), aligning features with depth and suppressing background noise. This feature sequence is temporally compressed to n frames to form y_{dino} \in \mathbb{R}^{C_{dino} \times n \times h \times w}.
  3. Coordinate-Augmented Text Prompt y_c: A natural-language robotic task description with explicit coordinates is encoded with a T5 model, producing y_c for cross-attention in the diffusion transformer.

The complete conditioning set is D = \{z_0^{ref}, y_{dino}, y_c\}, providing complementary cues for depth, semantics, and high-level intent.
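
The trajectory-extraction step can be sketched as follows. This is a minimal illustration, assuming per-frame object masks and relative depth maps are already available (the paper obtains them from Grounded-SAM + TrackAnything and Video Depth Anything); the function name and array shapes are hypothetical:

```python
import numpy as np

def extract_control_trajectory(masks, depth_maps):
    """Build the 3D control trajectory q = [(x_i, y_i, d_i), ...].

    masks:      (N, H, W) boolean per-frame object masks
    depth_maps: (N, H, W) per-frame relative depth (e.g. in [0, 1])
    """
    q = []
    for mask, depth in zip(masks, depth_maps):
        ys, xs = np.nonzero(mask)                # pixels belonging to the object
        x, y = int(xs.mean()), int(ys.mean())    # object-center pixel coordinates
        d = float(depth[y, x])                   # relative depth at the center
        q.append((x, y, d))
    return q
```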

2. Latent Diffusion Transformer with Multi-Modal Conditioning

The generative core is a latent diffusion model with a DiT denoiser \epsilon_\theta. The VAE encoder \mathcal{E} maps a ground-truth video V \in \mathbb{R}^{3 \times N \times H \times W} to a latent z_0; a forward SDE corrupts it to z_t, and \epsilon_\theta is trained to predict the added noise. The objective is:

L_{diffusion} = \mathbb{E}_{z_0, D, \epsilon \sim \mathcal{N}(0, I), t} \| \epsilon - \epsilon_\theta(z_t, D, t) \|_2^2
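
Under a standard DDPM-style discretization (an assumption on our part; the paper formulates the corruption as a forward SDE), one training step can be sketched as follows, with the denoiser passed in as a stand-in for \epsilon_\theta and the conditioning set D omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(z0, denoiser, alpha_bar, t):
    """Noise-prediction loss for one sample at timestep t.

    z0:        clean video latent, e.g. shape (16, n, h, w)
    denoiser:  stand-in for eps_theta(z_t, D, t); conditioning omitted here
    alpha_bar: cumulative noise schedule, shape (T,)
    """
    eps = rng.standard_normal(z0.shape)                       # Gaussian noise
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = denoiser(z_t, t)                                # predicted noise
    return float(np.mean((eps - eps_hat) ** 2))               # MSE objective
```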

Within each DiT block, conditioning is performed via:

  • Self-attention on z_0^{ref} to propagate depth and motion cues across all time steps.
  • Cross-attention to y_c: queries from the DiT hidden states, keys/values from the text-encoder outputs.
  • Gated residual fusion of y_{dino}: a gating vector G = \sigma(W_g y_{dino} + b_g) modulates the object features and injects them into the hidden states at every block:

h' = h + \text{LayerNorm}(y_{dino} \odot G) \odot (y_{dino} \odot G)

where h is the flattened spatio-temporal hidden state, \odot denotes the elementwise product, and \sigma is the sigmoid.
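
A minimal numpy sketch of this gated residual fusion, following the equation above term by term; the parameter shapes (a per-channel projection W_g, bias b_g) are our assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the feature (last) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_fusion(h, y_dino, W_g, b_g):
    """h' = h + LayerNorm(y ⊙ G) ⊙ (y ⊙ G), with G = sigmoid(W_g y + b_g).

    h, y_dino: (tokens, C) flattened spatio-temporal features
    W_g:       (C, C) gating projection; b_g: (C,) bias
    """
    G = sigmoid(y_dino @ W_g + b_g)   # gating vector in (0, 1)
    u = y_dino * G                    # gated object features
    return h + layer_norm(u) * u      # residual injection into hidden states
```

With the gate driven to zero (G ≈ 0), the update reduces to the identity, so the object features can be injected or suppressed per channel.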

This architecture supports sustained semantic, spatial, and temporal alignment of generated content with the input trajectory and object.

3. Joint RGB and Depth Video Generation

DRAW2ACT enforces geometric consistency by synthesizing RGB and depth videos together in latent space. During training, the latents of the RGB (z_{rgb}) and depth (z_{depth}) videos are concatenated temporally:

[z_{rgb} \parallel z_{depth}] \in \mathbb{R}^{16 \times 2n \times h \times w}

This sequence is processed by the DiT, which applies self- and cross-attention jointly over both modalities. A single loss L_{diffusion} suffices, with no explicit depth-supervision term required. Upon decoding, the VAE decoder \mathcal{E}^{-1} produces temporally and spatially aligned RGB/depth sequences, ensuring spatio-temporal and cross-modal fidelity.
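
The modality concatenation and its inverse are simple shape bookkeeping; a sketch with hypothetical latent shapes:

```python
import numpy as np

def concat_modalities(z_rgb, z_depth):
    """Concatenate RGB and depth latents along the temporal axis.

    z_rgb, z_depth: (16, n, h, w)  ->  joint latent (16, 2n, h, w)
    """
    return np.concatenate([z_rgb, z_depth], axis=1)

def split_modalities(z_joint):
    """Invert the concatenation after denoising, before VAE decoding."""
    n = z_joint.shape[1] // 2
    return z_joint[:, :n], z_joint[:, n:]
```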

4. Multimodal Policy Regression for Robotic Control

The synthesized (RGB, depth) demonstration video is the input to a lightweight policy regressor, predicting joint angles and gripper state:

  • Both video streams are encoded with the same 3D VAE, providing modality-specific latents \ell_{rgb} and \ell_{depth}.
  • Patch embeddings convert these into sequences of tokens.
  • Each modality proceeds through a spatial Transformer (per-frame spatial dependencies) and a temporal Transformer (temporal dynamics).
  • Cross-attention allows, for example, RGB stream queries to attend to depth stream keys/values and vice versa, integrating both modalities.
  • The fused feature representations are summed and decoded via a ResNet-style head, outputting joint angles \hat{a} \in \mathbb{R}^K (where K is the number of joints) plus a gripper open/close probability.

The policy is trained using an \ell_2 regression loss:

L_{policy} = \mathbb{E}_{(V, a^*)} \| \hat{a}(V) - a^* \|_2^2
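
The two central operations of the regressor, cross-modal attention and the \ell_2 loss, can be sketched as follows; this is a single-head, unprojected simplification (no learned Q/K/V matrices), so treat it as illustrative only:

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens):
    """Single-head cross-attention: e.g. RGB queries attend to depth keys/values.

    q_tokens:  (Tq, C) query tokens; kv_tokens: (Tkv, C) key/value tokens
    """
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)               # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over keys
    return weights @ kv_tokens                                 # attention readout

def policy_loss(a_hat, a_star):
    """l2 regression loss between predicted and ground-truth joint angles (K,)."""
    return float(np.sum((a_hat - a_star) ** 2))
```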

5. Training and Inference Pipeline

The end-to-end workflow is divided into training and inference phases.

Training Protocol

  • Input: Dataset \{(V_i, M_i, q_i, c_i, a_i)\}, where V_i is an RGB video, M_i the object mask, q_i the 3D trajectory, c_i the text prompt, and a_i the joint angles.
  • For each sample, q is obtained from mask tracking and depth estimation. I_0^{ref} overlays q on the first frame V_0 and is encoded to z_0^{ref}. Object cropping plus DINOv2 yields y_{dino}; T5 encodes c_i as y_c.
  • The full RGB and depth videos are encoded as z_0; a random timestep t and Gaussian noise \epsilon generate z_t.
  • The diffusion loss L_{diffusion} is backpropagated to train \epsilon_\theta.
  • Once converged, \epsilon_\theta is frozen; it generates synthetic demonstrations \tilde{V}, which train the downstream policy network using L_{policy}.

Inference Protocol

  • Given a new scene (I_0, M, q, c), compute (z_0^{ref}, y_{dino}, y_c). Sample initial noise z_T \sim \mathcal{N}(0, I) and run the reverse diffusion conditioned on \{z_0^{ref}, y_{dino}, y_c\} to obtain z_0.
  • Decode z_0 to generate synchronized RGB and depth demonstration videos \tilde{V}.
  • Feed \tilde{V} to the policy network to infer joint commands \hat{a}.
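
The reverse diffusion can be sketched as a generic ancestral sampling loop. This assumes DDPM-style discrete sampling (the paper does not specify the sampler), with the conditioning set D threaded through a stand-in denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(denoiser, cond, shape, betas):
    """Generic DDPM-style ancestral sampling, conditioned on D = cond.

    denoiser: stand-in for eps_theta(z_t, D, t)
    betas:    noise schedule, shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    z = rng.standard_normal(shape)               # z_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):      # T-1, ..., 0
        eps_hat = denoiser(z, cond, t)
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            z = z + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return z                                      # denoised latent z_0
```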

6. Experimental Evaluation and Ablation

DRAW2ACT was evaluated on BridgeData V2 (WidowX), Berkeley UR5, and MuJoCo-simulated Franka Panda, comprising approximately 50.6K video clips and 100 test tasks per dataset. The model was benchmarked against LeviTor, Tora, MotionCtrl, and DragAnything baselines. Evaluation metrics included:

  • Video quality (VBench-2.0): Motion Smoothness, Background Consistency, Subject Consistency, Temporal Flicker.
  • Trajectory Error: mean L_1 distance between the ground-truth trajectory q and the trajectory extracted from the generated video.
  • Depth-video fidelity: SSIM, PSNR, LPIPS, and FVD, all computed on the depth videos (higher SSIM/PSNR and lower LPIPS/FVD are better).
  • Downstream task success: Fraction of generated videos enabling the policy to complete a successful pick-and-place.
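
The trajectory-error metric can be computed directly from the two point sequences; a sketch assuming both trajectories are given as per-frame (x, y) pixel coordinates:

```python
import numpy as np

def trajectory_error(q_gt, q_gen):
    """Mean L1 distance (in pixels) between ground-truth and extracted trajectories.

    q_gt, q_gen: (N, 2) arrays of per-frame (x, y) object centers
    """
    diffs = np.abs(np.asarray(q_gt, float) - np.asarray(q_gen, float))
    return float(diffs.sum(axis=1).mean())   # L1 per frame, averaged over frames
```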

Key simulator results:

Metric                     DRAW2ACT    Tora
Motion Consistency         0.9865      0.9844
Object Trajectory Error    19.88 px    35.44 px
Downstream Success         65.2%       36.8%

Ablation experiments showed that each module contributes to the final performance: joint depth–RGB generation improves video fidelity; the 3D trajectory reduces average trajectory error from approximately 36 px to 21 px; and DINOv2 object features further decrease the error to 19.9 px while raising manipulation success to 65.2%.

7. Summary and Principal Contributions

DRAW2ACT establishes a new state of the art in controllable, visually consistent, and manipulation-relevant robotic demonstration video synthesis. Its key advances are multi-stream trajectory encoding, gated DINOv2 fusion within a diffusion transformer, and spatio-temporally consistent joint RGB–depth generation. These design choices yield more accurate, stable, and manipulable demonstrations and translate directly to increased downstream robotic task performance compared to leading baselines (Bai et al., 16 Dec 2025).
