Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (2503.19757v1)

Published 25 Mar 2025 in cs.RO and cs.CV

Abstract: While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer's scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness against various variances and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparative performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.

Summary

  • The paper introduces Dita, a scalable Diffusion Transformer that integrates continuous action denoising within the main causal Transformer using in-context conditioning on multimodal inputs.
  • Dita demonstrates strong zero-shot generalization on simulation benchmarks and efficient few-shot adaptation (e.g., 10-shot) for complex real-world manipulation tasks.
  • The unified architecture effectively handles heterogeneous robot action spaces and supports predicting long-horizon action sequences through pretraining on diverse datasets.

Dita introduces a scalable framework for generalist vision-language-action (VLA) policies by integrating a Diffusion Transformer (DiT) architecture for action generation (2503.19757). This approach diverges from prior VLA models that typically employ compact action heads, such as discretization bins or small MLPs/transformers for diffusion denoising, which can limit adaptability across diverse robotic platforms and action spaces. Dita proposes a unified architecture where the action denoising process is directly incorporated into a scalable causal Transformer, enabling the prediction of continuous action sequences through a diffusion process conditioned on multimodal inputs.

Architecture and In-Context Conditioning

The core of Dita is a Transformer architecture designed to perform denoising for a diffusion model operating on continuous action sequences. Unlike architectures like Octo, which condition a separate, smaller diffusion head (e.g., MLP or DiT) on fused multimodal embeddings, Dita employs in-context conditioning. This involves concatenating multiple input modalities into a single sequence processed by the main causal Transformer:

  1. Language Instruction Tokens: Text instructions are tokenized and embedded.
  2. Historical Visual Tokens: Input images are divided into patches and encoded by a frozen DINOv2 ViT; a Q-Former then maps the patch features to a smaller set of visual tokens. Tokens from the historical observations (e.g., the last k frames) are included in the input sequence.
  3. Diffusion Timestep Embedding: The current timestep $t$ of the diffusion process is encoded.
  4. Noised Action Sequence: The action sequence $a_t$ at diffusion step $t$, represented as a sequence of continuous 7D vectors (3D translation, 3D rotation axis-angle, 1D gripper state), is tokenized and appended.

The causal Transformer processes this concatenated sequence. Crucially, the self-attention mechanism allows the prediction of the noise $\epsilon_\theta(a_t, t, c)$ for the action tokens to directly attend to the raw visual tokens from historical observations and language instruction tokens within the same context window. This direct attention facilitates fine-grained alignment between the generated action and subtle visual cues or environmental changes captured in the observation history, bypassing potential information bottlenecks associated with pre-fused embeddings. The model predicts the noise added to the action sequence, allowing iterative refinement towards the final continuous action trajectory using a diffusion sampling process like DDPM or DDIM.
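
To make the in-context conditioning concrete, the sketch below shows one way the concatenated sequence could be assembled and denoised by a single Transformer. The module names, hidden sizes, and omitted causal mask are illustrative assumptions rather than the authors' released implementation; the point is that the noised action tokens sit in the same context as the raw visual and language tokens.

```python
import torch
import torch.nn as nn

class InContextDenoiser(nn.Module):
    """Illustrative denoiser: all modalities share one Transformer context."""

    def __init__(self, d_model=768, n_layers=12, n_heads=12,
                 action_dim=7, horizon=16, n_timesteps=1000):
        super().__init__()
        self.horizon = horizon
        self.action_in = nn.Linear(action_dim, d_model)       # tokenize noised actions
        self.time_emb = nn.Embedding(n_timesteps, d_model)    # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.noise_out = nn.Linear(d_model, action_dim)       # per-token noise prediction

    def forward(self, lang_tokens, vis_tokens, t, noised_actions):
        # lang_tokens: (B, L_lang, d); vis_tokens: (B, L_vis, d) from the frozen ViT + Q-Former
        # t: (B,) long diffusion timesteps; noised_actions: (B, horizon, action_dim)
        a = self.action_in(noised_actions)                    # (B, horizon, d)
        te = self.time_emb(t).unsqueeze(1)                    # (B, 1, d)
        seq = torch.cat([lang_tokens, vis_tokens, te, a], dim=1)
        h = self.transformer(seq)                             # causal masking omitted for brevity
        return self.noise_out(h[:, -self.horizon:])           # noise for the action tokens only
```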

The architecture utilizes a 334M parameter Transformer (with 221M trainable parameters, excluding the frozen vision backbone). The unified nature allows the diffusion denoising component to benefit directly from the scalability inherent in Transformer models.

Handling Heterogeneous Action Spaces and Scalability

A significant challenge in generalist robotics is handling the diverse action spaces of different robots (e.g., varying kinematics, joint limits, gripper types) found in large-scale datasets like Open X-Embodiment (OXE). Discretization methods struggle with defining appropriate bins across embodiments, while continuous action prediction often relies on smaller, potentially less expressive heads.

Dita addresses this by:

  1. Direct Continuous Action Denoising: Modeling actions as sequences of continuous 7D vectors avoids the need for discretization. The diffusion model learns the distribution of these continuous actions directly from the data.
  2. Unified Transformer: The large capacity of the Transformer allows it to implicitly learn the mapping from observations (which visually suggest the embodiment) and language instructions to the appropriate continuous action distribution for that specific context. The model learns to generate suitable action ranges and dynamics across different robots represented in the OXE dataset.
  3. Scalability Synergy: By integrating the denoiser into the main Transformer, Dita allows both the perceptual understanding (via attention to visual/language tokens) and the action generation (via denoising) components to scale together with model size and data.

This unified approach enables Dita to effectively leverage cross-embodiment datasets containing variations in camera perspectives, scenes, tasks, and action spaces, learning a policy robust to these variances.

Training, Adaptation, and Long-Horizon Tasks

Dita is pretrained on the extensive OXE dataset using a standard diffusion model objective, minimizing the difference between the predicted noise and the actual noise added to the action sequence: $L = \mathbb{E}_{a_0,\, \epsilon \sim \mathcal{N}(0, I),\, t \sim \mathcal{U}(1, T)} \big[ \| \epsilon - \epsilon_\theta(a_t, t, c) \|^2 \big]$.
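
A single pretraining step under this objective can be sketched as follows, assuming a DDPM-style cumulative noise schedule (alphas_cumprod) and the hypothetical denoiser interface from the earlier sketch.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, lang_tokens, vis_tokens, actions, alphas_cumprod):
    """One training step of the noise-prediction objective (DDPM-style schedule assumed)."""
    B = actions.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=actions.device)     # t sampled uniformly over steps
    eps = torch.randn_like(actions)                           # eps ~ N(0, I)
    ac = alphas_cumprod[t].view(B, 1, 1)
    a_t = ac.sqrt() * actions + (1 - ac).sqrt() * eps         # forward process q(a_t | a_0)
    eps_pred = model(lang_tokens, vis_tokens, t, a_t)         # epsilon_theta(a_t, t, c)
    return F.mse_loss(eps_pred, eps)                          # || eps - eps_theta ||^2
```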

Key performance aspects include:

  • Zero-Shot Generalization: Pretrained Dita demonstrates strong zero-shot performance on simulation benchmarks like SimplerEnv, achieving state-of-the-art or competitive results compared to prior methods. This indicates robustness learned from the diverse pretraining data.
  • Few-Shot Adaptation: The model exhibits efficient adaptation capabilities. With only 10-shot finetuning on downstream tasks, Dita achieves robust performance in complex real-world scenarios using a Franka robot, even when trained solely on third-person camera views. This includes tasks involving precise manipulation, multi-step sequences (stacking, pick-pour, drawer interaction), and handling environmental variances (lighting, object position). The 10-shot performance often surpasses baselines, highlighting the effectiveness of the pretrained representation and architecture for rapid adaptation.
  • Long-Horizon Capabilities: The architecture naturally supports predicting action sequences or "chunks" (e.g., 16 steps ahead in simulation experiments). The in-context conditioning leverages historical visual information, aiding temporal reasoning. Dita shows strong performance on long-horizon benchmarks like LIBERO-LONG and CALVIN in simulation, as well as successfully executing multi-step tasks in the real world. Ablations suggest that predicting longer action trajectories generally improves performance (a chunk-execution sketch follows this list).
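
One common way to consume predicted chunks is receding-horizon execution: sample a chunk, execute a prefix, then replan from the updated observation history. The sketch below illustrates the pattern; policy.sample_actions, the env interface, and the number of steps executed before replanning are assumptions rather than details reported in the paper.

```python
def rollout(policy, env, horizon=16, execute_steps=8, max_steps=200):
    """Receding-horizon execution: predict a chunk, execute a prefix, replan."""
    obs_history, instruction = env.reset()                    # hypothetical environment API
    for _ in range(0, max_steps, execute_steps):
        chunk = policy.sample_actions(instruction, obs_history, horizon=horizon)  # (horizon, 7)
        for action in chunk[:execute_steps]:                  # execute only the first few steps
            obs, done = env.step(action)
            obs_history.append(obs)
            if done:
                return True
    return False
```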

Implementation and Deployment Considerations

  • Action Representation: Actions are represented as sequences of 7D continuous vectors: [dx, dy, dz, d_roll, d_pitch, d_yaw, gripper_state].
  • Prediction Horizon: The model predicts a sequence of future actions, typically 16 steps in simulation (H=16).
  • Inference: Denoising is performed iteratively. DDIM sampling is used for faster inference (e.g., 20 steps reported) without significant performance degradation compared to DDPM (a sampling sketch follows this list).
  • Visual Backbone: A frozen DINOv2 ViT-L/14 provides visual features, processed by a trainable Q-Former (32 latents).
  • Computational Cost: The model has 334M parameters total (221M trainable). Inference speed depends on the number of diffusion steps but is manageable (e.g., DDIM-20).
  • Open Source: The project code and models are intended to be released, providing a baseline for future research. (robodita.github.io)
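
For reference, a deterministic DDIM sampler (eta = 0) over an action chunk could look like the following minimal sketch; the denoiser interface and noise schedule mirror the earlier sketches and are assumptions, not the released code.

```python
import torch

@torch.no_grad()
def ddim_sample(model, lang_tokens, vis_tokens, alphas_cumprod,
                horizon=16, action_dim=7, steps=20):
    """Deterministic DDIM (eta = 0) sampling of one action chunk from pure noise."""
    B = lang_tokens.shape[0]
    device = lang_tokens.device
    T = alphas_cumprod.shape[0]
    a = torch.randn(B, horizon, action_dim, device=device)    # a_T ~ N(0, I)
    timesteps = torch.linspace(T - 1, 0, steps, device=device).long()
    for i, t in enumerate(timesteps):
        t_batch = torch.full((B,), int(t), device=device, dtype=torch.long)
        eps = model(lang_tokens, vis_tokens, t_batch, a)      # predicted noise
        ac_t = alphas_cumprod[t]
        a0_pred = (a - (1 - ac_t).sqrt() * eps) / ac_t.sqrt() # estimate of the clean chunk
        ac_prev = (alphas_cumprod[timesteps[i + 1]] if i + 1 < steps
                   else torch.tensor(1.0, device=device))
        a = ac_prev.sqrt() * a0_pred + (1 - ac_prev).sqrt() * eps  # deterministic DDIM update
    return a
```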

The use of in-context conditioning requires careful management of sequence length, especially with long observation histories and action prediction horizons. The computational cost scales with the sequence length processed by the Transformer.

Conclusion

Dita presents a unified framework for generalist robot policy learning by employing a Diffusion Transformer that directly denoises continuous action sequences. Its key innovation, in-context conditioning, enables fine-grained attention between action generation and raw multimodal inputs, particularly historical visual tokens. This design, combined with the inherent scalability of Transformers and pretraining on diverse datasets, allows Dita to effectively handle heterogeneous action spaces, demonstrate robust zero-shot and few-shot adaptation capabilities, and successfully execute long-horizon tasks in both simulation and the real world. It offers a versatile and scalable baseline architecture for advancing VLA models.
