Text-to-Action Pretraining

Updated 29 May 2026
  • Text-to-Action Pretraining is a paradigm that combines a vision-language backbone with a DiT-based action decoder to translate natural language into continuous control actions.
  • It uses multi-task objectives including flow-matching action loss and vision-language next-token loss to ensure effective grounding and generalization across diverse robotic tasks.
  • Empirical results demonstrate robust performance in manipulation, navigation, and dynamic tasks, validating T2A as a scalable framework for embodied AI.

Text-to-Action (T2A) Pretraining is a paradigm for building generalist embodied AI agents that map natural language instructions to continuous action trajectories across diverse environments, task families, and robot embodiments. The approach is exemplified by Qwen-VLA, which unifies vision, language, and action modeling in two components: a multimodal vision-language Transformer and an action decoder based on the Diffusion Transformer (DiT). T2A pretraining first establishes a language-indexed action prior, which is then grounded in large-scale multimodal data to support transferable reasoning and control across robotic platforms (Wang et al., 28 May 2026).

1. Pretraining Objectives and Losses

The Qwen-VLA framework implements multi-task pretraining objectives by interleaving samples from manipulation, navigation, and vision-language (VL) tasks in each minibatch. The key losses are:

  • Flow-Matching Action Loss: The action trajectory $Y_0 \in \mathbb{R}^{H \times K}$ is stochastically interpolated with Gaussian noise $Y_1 \sim \mathcal{N}(0, I)$, forming a noisy action chunk $Y_\tau = (1 - \tau) Y_0 + \tau Y_1$, where $\tau \in [0, 1]$. The DiT-based policy head predicts a velocity field $v_\theta(Y_\tau, \tau \mid o_{1:t}, x, e, z)$. A binary mask $M \in \{0,1\}^{H \times K}$ enforces channel- and horizon-specific supervision. The per-channel MSE loss

$$\ell_k = \frac{ \sum_{h=1}^H M_{h,k} \left\| \left( v_\theta(Y_\tau, \tau \mid \cdots) - (Y_1 - Y_0) \right)_{h,k} \right\|_2^2 }{ \sum_{h=1}^H M_{h,k} }$$

is averaged to yield

$$\mathcal{L}_{\rm act} = \mathbb{E}_{\tau, Y_0, Y_1} \left[ \frac{1}{c} \sum_{k=0}^{c-1} \ell_k \right].$$

  • Vision-Language Next-Token Loss: To maintain perception and reasoning capabilities, a cross-entropy loss is applied:

$$\mathcal{L}_{\rm vl} = -\sum_i \log p_\theta(w_i \mid w_{<i}, o_{1:t}),$$

where $w_i$ are tokens from captioning, VQA, and grounding datasets.

  • Combined Multi-Task Objective: Action and VL gradients are balanced by two scalar weights, written here as $\lambda_{\rm act}$ and $\lambda_{\rm vl}$:

$$\mathcal{L} = \lambda_{\rm act} \mathcal{L}_{\rm act} + \lambda_{\rm vl} \mathcal{L}_{\rm vl}$$
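
The masked flow-matching loss and the weighted multi-task objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Qwen-VLA implementation: the function names, the nested-list tensor layout, the default weights, and the reading of $c$ as the number of channels with at least one supervised step are all assumptions made for the example.

```python
def interpolate(y0, y1, tau):
    """Noisy action chunk Y_tau = (1 - tau) * Y_0 + tau * Y_1, elementwise."""
    return [[(1 - tau) * a + tau * b for a, b in zip(r0, r1)]
            for r0, r1 in zip(y0, y1)]


def masked_flow_matching_loss(v_pred, y0, y1, mask):
    """Mean over supervised channels of the masked squared velocity error.

    v_pred, y0, y1, mask are H x K nested lists. The regression target is
    the constant velocity Y_1 - Y_0 of the linear interpolation path.
    """
    H, K = len(y0), len(y0[0])
    channel_losses = []
    for k in range(K):
        valid = sum(mask[h][k] for h in range(H))
        if valid == 0:
            continue  # assumption: fully masked channels are skipped
        err = sum(
            mask[h][k] * (v_pred[h][k] - (y1[h][k] - y0[h][k])) ** 2
            for h in range(H)
        )
        channel_losses.append(err / valid)
    if not channel_losses:
        return 0.0
    return sum(channel_losses) / len(channel_losses)


def combined_loss(loss_act, loss_vl, w_act=1.0, w_vl=1.0):
    """Weighted sum of action and VL losses (weights are placeholders)."""
    return w_act * loss_act + w_vl * loss_vl
```

Normalizing each channel by its own count of valid steps keeps sparsely supervised channels from being drowned out by densely supervised ones, which is one plausible motivation for the per-channel form of the loss.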

2. Pretraining Data and Task Mixture

T2A relies on a heterogeneous data mix, paired at the episodic level to enable transferable mapping. The composition is as follows:

| Data Family | Example Modalities | Mixture Proportion |
|---|---|---|
| Robot manipulation trajectories | Vision, language, actions | 74.2% |
| Egocentric human demonstrations | Egocentric frames, language, wrist/hand pose (eigengrasps + SE(3) deltas) | 6.0% |
| Vision-language navigation | RGB observation, spoken instruction, waypoints | 7.5% |
| Synthetic simulation trajectories | Simulated images, language | 3.7% |
| General vision-language corpora | Images, captions, QA | 3.4% |
| 2D spatial grounding | Images, referring expressions | 2.5% |
| Autonomous-driving VQA | Onboard images, driving questions | 2.4% |
| Embodied action captions | Language only | 0.2% |

This mixture ensures the model is exposed to rich embodiment, sensory, and linguistic variations, constructing a broad action-language manifold.
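
As a rough illustration of how such a mixture could drive minibatch construction, the sketch below draws a data family for each sample with probability proportional to its listed share. Treating the proportions as per-sample sampling weights is an assumption, and the dictionary keys and function name are hypothetical labels chosen for the example.

```python
import random

# Mixture proportions (percent) taken from the table above.
# The short keys are hypothetical labels, not names from the source.
MIXTURE = {
    "robot_manipulation": 74.2,
    "egocentric_human": 6.0,
    "vl_navigation": 7.5,
    "synthetic_simulation": 3.7,
    "general_vl": 3.4,
    "spatial_grounding_2d": 2.5,
    "driving_vqa": 2.4,
    "action_captions": 0.2,
}


def sample_batch_sources(batch_size, rng):
    """Pick a data family for each sample, weighted by mixture proportion."""
    families = list(MIXTURE)
    weights = list(MIXTURE.values())
    # random.Random.choices normalizes the weights internally, so the
    # listed percentages need not sum to exactly 100.
    return rng.choices(families, weights=weights, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    batch = sample_batch_sources(32, rng)
    print({name: batch.count(name) for name in MIXTURE})
```

Sampling per example rather than per batch interleaves manipulation, navigation, and vision-language samples within each minibatch, matching the interleaving described in Section 1.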

3. Model Architecture

The architecture comprises two principal components:

  • Vision-Language Backbone: Based on Qwen3.5, an early-fusion multimodal Transformer. It uses a 2D ViT for visual patch tokenization with spatial merging, integrating image and text tokens as a single sequence. Hybrid attention combines gated linear attention with grouped-query softmax attention for efficient global-local reasoning. Positional information is encoded using 1D RoPE for text and spatial RoPE within the ViT.
  • DiT-Based Action Decoder: A flow-matching policy head with 16 DiT blocks ($\approx$1.13B parameters) is responsible for generative trajectory modeling. The decoder takes as input the concatenation of the VLM hidden states and the noisy action chunk $Y_\tau$, supporting full attention with multi-section RoPE aligned to the backbone. Outputs are instantaneous velocity fields $v_\theta$.

Additional components include an MLP-based action encoder and decoder, VLM-to-DiT linear projections, a timestep embedding, and AdaLN modulations.

4. Embodiment-Aware Prompt Conditioning

To unify policy control over different robot embodiments, samples are prefixed with a textual prompt that characterizes the robot, its configuration, and control regime:

“The robot is {robot_tag} with {single arm / dual arms}[, waist] [, and mobile base]. The control frequency is {FPS} Hz. Please predict the next {chunk_size} control actions to execute the following task: {ori_instruction}.”

This prompt embeds robot tag, modifiers, control rate, and action chunk horizon. Prompt tokens are fed to the VLM, with the resulting hidden vectors concatenated alongside the noisy action chunk in the DiT. The prompt remains the sole embodiment signal, fully informing channel semantics, chunk length, and control conventions.
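
The template above is simple enough to fill in with ordinary string formatting. The helper below is a minimal sketch: the function name, argument names, and the example robot tag are hypothetical, and only the sentence structure comes from the template.

```python
def build_embodiment_prompt(robot_tag, dual_arm, fps, chunk_size, instruction,
                            has_waist=False, has_mobile_base=False):
    """Fill the embodiment prompt template with robot-specific fields."""
    arms = "dual arms" if dual_arm else "single arm"
    body = f"The robot is {robot_tag} with {arms}"
    # Optional modifiers appear only when the embodiment has them.
    if has_waist:
        body += ", waist"
    if has_mobile_base:
        body += ", and mobile base"
    return (
        f"{body}. The control frequency is {fps} Hz. "
        f"Please predict the next {chunk_size} control actions "
        f"to execute the following task: {instruction}."
    )


if __name__ == "__main__":
    # "widowx", 5 Hz, and a chunk of 16 are illustrative values only.
    print(build_embodiment_prompt("widowx", False, 5, 16, "pick up the spoon"))
```

Because the prompt is the only embodiment signal, changing `fps` or `chunk_size` here is how a different control rate or action horizon would be communicated to the model.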

5. Multi-Stage Pretraining Recipe

Qwen-VLA employs a sequential training regimen, each stage consolidating language-to-action grounding:

  1. Stage I (Text-to-Action DiT Pretraining): The VLM is frozen; images are withheld. The DiT is trained exclusively on language + embodiment prompt $\rightarrow$ action trajectory pairs (combining synthetic and real robot data). This phase builds a compact, language-indexed action prior.
  2. Stage II (Continued Pretraining): Both VLM and DiT are unfrozen and trained on the joint multimodal task mixture, grounding the previously learned action prior in visual context and adapting the VLM to embodied perception.
  3. Stage III (Supervised Fine-Tuning): The model is fine-tuned from the previous checkpoint with a mixture of curated vision-language (VL), vision-language-action (VLA), and vision-language navigation (VLN) tasks, as well as real-robot teleoperation data. The training loss is a weighted combination of its component objectives.
  4. Stage IV (Reinforcement Learning): On-policy PPO is deployed on SimplerEnv using sparse rewards. The policy surrogate takes the standard clipped PPO form:

$$\mathcal{L}_{\rm PPO}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right]$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\rm old}}(a_t \mid s_t)$. The value head is attached to the VLM, and off-policy density estimation is realized by converting the flow-matching ODE into an SDE, resulting in explicit Gaussian log-probabilities per denoising step.

Throughout all stages, careful task mixing and dynamic up-weighting of vision-language examples prevent drift in the VLM’s semantic grounding.
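
To make the Stage IV ingredients concrete, the sketch below computes a diagonal-Gaussian log-probability, of the kind an SDE denoising step would provide, and plugs log-probabilities into the standard clipped PPO surrogate. It assumes the textbook PPO formulation, which may differ in detail from the objective used for Qwen-VLA; the function names and the default `clip_eps=0.2` are illustrative choices.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of an isotropic Gaussian, summed over dimensions."""
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), r = exp(logp_new - logp_old)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio
        # outside the clipping interval.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With per-step Gaussian log-probabilities available, the log-probability of a full denoising trajectory is the sum over steps, which is what makes the probability ratio computable for a flow-matching policy.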

6. Text-to-Action Mapping and Manifold Structure

T2A’s initial phase, in which the DiT is trained solely on the language and embodiment prompt, serves as a compression exercise. The network learns to decompress a sentence-plus-embodiment description into a plausible, high-dimensional action trajectory. The channel-masked action tensor $Y_0 \in \mathbb{R}^{H \times K}$, accompanied by the validity mask $M \in \{0,1\}^{H \times K}$, provides a scalable representation across navigation (3-DoF waypoints), manipulation (6–30 DoF joint-space), and fine-grained grasping (eigengrasps + SE(3) deltas). Masking ensures only valid trajectory elements are supervised. The flow-matching loss formally anchors the denoising process, enabling the DiT to serve as a continuous trajectory sampler at inference, integrated from $\tau = 1$ to $\tau = 0$ via Euler steps.

This suggests that T2A builds a coherently structured, language-indexed manifold over trajectory space, facilitating generalization across domains and robot morphologies.
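
The Euler sampling procedure mentioned above can be sketched as follows. Under the interpolation $Y_\tau = (1 - \tau) Y_0 + \tau Y_1$, the velocity is $Y_1 - Y_0$, so sampling starts from noise at $\tau = 1$ and steps down to $\tau = 0$. The toy velocity function stands in for the trained DiT and is purely hypothetical; it returns the exact velocity toward a fixed target so that the integration can be checked.

```python
def euler_sample(velocity_fn, y1, num_steps):
    """Integrate dY/dtau = v(Y, tau) from tau = 1 (noise) down to tau = 0."""
    y = [row[:] for row in y1]  # copy so the caller's noise is untouched
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = 1.0 - i * dt  # visits 1, 1 - dt, ..., dt; never reaches 0
        v = velocity_fn(y, tau)
        # Moving toward tau = 0 means stepping against the velocity.
        y = [[a - dt * b for a, b in zip(ry, rv)] for ry, rv in zip(y, v)]
    return y


def make_toy_velocity(target):
    """Hypothetical stand-in for the DiT: exact velocity toward a fixed target.

    On the linear path, Y_1 - Y_0 = (Y_tau - Y_0) / tau for tau > 0.
    """
    def velocity_fn(y, tau):
        return [[(a - t) / tau for a, t in zip(ry, rt)]
                for ry, rt in zip(y, target)]
    return velocity_fn
```

With the exact velocity the path is a straight line, so even a handful of Euler steps lands on the target; a learned velocity field is only approximate, and the step count then trades sampling cost against accuracy.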

7. Significance and Empirical Outcomes

Qwen-VLA's instantiation of T2A pretraining demonstrates consistent multi-task and out-of-distribution performance across manipulation, navigation, and trajectory-centric benchmarks under variations in scene layout, object configuration, lighting, and embodiment. Notable results include 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation (Wang et al., 28 May 2026). This unified approach obviates the need for separate architectures per task or robot, validating T2A as a scalable paradigm for generalist embodied intelligence.
