Text-to-Action Pretraining
- Text-to-Action Pretraining is a paradigm that combines a vision-language backbone with a DiT-based action decoder to translate natural language into continuous control actions.
- It uses multi-task objectives including flow-matching action loss and vision-language next-token loss to ensure effective grounding and generalization across diverse robotic tasks.
- Empirical results demonstrate robust performance in manipulation, navigation, and dynamic tasks, validating T2A as a scalable framework for embodied AI.
Text-to-Action (T2A) Pretraining is a paradigm for building generalist embodied AI agents that map natural language instructions to continuous action trajectories across diverse environments, task families, and robot embodiments. The approach is exemplified in Qwen-VLA, which unifies vision, language, and action modeling in two components: a vision-language multimodal Transformer and a DiT-based action decoder. T2A pretraining first establishes a language-indexed action prior, which is then grounded with large-scale multimodal datasets to support transferable reasoning and control across robotic platforms (Wang et al., 28 May 2026).
1. Pretraining Objectives and Losses
The Qwen-VLA framework implements multi-task pretraining objectives by interleaving samples from manipulation, navigation, and vision-language (VL) tasks in each minibatch. The key losses are:
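- Flow-Matching Action Loss: The action trajectory $A$ is stochastically interpolated with Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, forming a noisy action chunk $A_\tau$, where $\tau \in [0, 1]$ is the flow time. The DiT-based policy head predicts a velocity field $v_\theta$. A binary mask $M$ enforces channel- and horizon-specific supervision. The per-channel MSE loss
$$\ell_{k,c} = M_{k,c}\,\big(v_{\theta,k,c} - u_{k,c}\big)^2,$$
where $u$ is the target velocity, $k$ indexes the horizon step, and $c$ indexes the action channel, is averaged to yield the action loss $\mathcal{L}_{\text{action}}$.
- Vision-Language Next-Token Loss: To maintain perception and reasoning capabilities, a cross-entropy loss is applied:
$$\mathcal{L}_{\text{VL}} = -\sum_{i} \log p_\theta\big(y_i \mid y_{<i}, x\big),$$
where $y_i$ are tokens from captioning, VQA, and grounding datasets and $x$ is the multimodal context.
- Combined Multi-Task Objective: Balancing action and VL gradients using weights $\lambda_{\text{action}}$ and $\lambda_{\text{VL}}$:
$$\mathcal{L} = \lambda_{\text{action}}\,\mathcal{L}_{\text{action}} + \lambda_{\text{VL}}\,\mathcal{L}_{\text{VL}}$$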
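The masked action loss and the weighted multi-task sum can be illustrated with a minimal pure-Python sketch. Several details are assumptions rather than facts from the source: the linear interpolation path `(1 - tau) * noise + tau * action` with target velocity `action - noise` (a common rectified-flow convention), normalization by the number of valid entries, and all function names and default weights, which are hypothetical.

```python
import random

def interpolate(action, noise, tau):
    """Noisy chunk under an ASSUMED linear path: (1 - tau) * noise + tau * action."""
    return [[(1 - tau) * n + tau * a for a, n in zip(a_row, n_row)]
            for a_row, n_row in zip(action, noise)]

def target_velocity(action, noise):
    """Time derivative of the assumed linear path: action - noise."""
    return [[a - n for a, n in zip(a_row, n_row)]
            for a_row, n_row in zip(action, noise)]

def masked_fm_loss(pred_v, target_v, mask):
    """Squared error on entries where mask == 1, averaged over the valid entries
    (the normalization is an assumption)."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred_v, target_v, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0

def combined_loss(action_loss, vl_loss, w_action=1.0, w_vl=1.0):
    """Weighted sum of the two objectives; the default weights are hypothetical."""
    return w_action * action_loss + w_vl * vl_loss

# Toy chunk: 4 horizon steps, 3 channels, with the third channel unused.
rng = random.Random(0)
horizon, channels = 4, 3
action = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(horizon)]
noise = [[rng.gauss(0, 1) for _ in range(channels)] for _ in range(horizon)]
mask = [[1, 1, 0] for _ in range(horizon)]

noisy_chunk = interpolate(action, noise, tau=0.3)  # what the policy head would see
target = target_velocity(action, noise)
```

Because the mask zeroes out unused channels, an error in a masked entry contributes nothing to the loss, which is what lets embodiments with different channel counts share one action tensor.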
2. Pretraining Data and Task Mixture
T2A relies on a heterogeneous data mix, paired at the episodic level to enable transferable mapping. The composition is as follows:
| Data Family | Example Modalities | Mixture Proportion |
|---|---|---|
| Robot manipulation trajectories | Vision, language, actions | 74.2% |
| Egocentric human demonstrations | Egocentric frames, language, wrist/hand pose (eigengrasps + SE(3) deltas) | 6.0% |
| Vision-language navigation | RGB observation, spoken instruction, 3-DoF waypoints | 7.5% |
| Synthetic simulation trajectories | Simulated images, language | 3.7% |
| General vision-language corpora | Images, captions, QA | 3.4% |
| 2D spatial grounding | Images, referring expressions | 2.5% |
| Autonomous-driving VQA | Onboard images, driving questions | 2.4% |
| Embodied action captions | Language only | 0.2% |
This mixture ensures the model is exposed to rich embodiment, sensory, and linguistic variations, constructing a broad action-language manifold.
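As a rough illustration of how such a mixture could drive minibatch composition, the sketch below draws data families in proportion to the table. The source does not describe the sampling mechanism, so per-sample weighted sampling, the dictionary keys, and the function name are all assumptions. The listed percentages sum to 99.9, presumably from rounding; `random.choices` accepts unnormalized weights, so no rescaling is needed.

```python
import random

# Mixture proportions copied from the table; the keys are hypothetical labels.
MIXTURE = {
    "robot_manipulation": 74.2,
    "egocentric_human": 6.0,
    "vl_navigation": 7.5,
    "synthetic_simulation": 3.7,
    "general_vl": 3.4,
    "spatial_grounding_2d": 2.5,
    "driving_vqa": 2.4,
    "action_captions": 0.2,
}

def sample_families(batch_size, rng):
    """Draw one data family per sample, weighted by the mixture proportions."""
    names = list(MIXTURE)
    weights = list(MIXTURE.values())  # random.choices normalizes internally
    return rng.choices(names, weights=weights, k=batch_size)

rng = random.Random(0)
batch = sample_families(10_000, rng)
manipulation_share = batch.count("robot_manipulation") / len(batch)
```

With a large enough batch, the empirical share of each family approaches its listed proportion, so `manipulation_share` lands near 0.74.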
3. Model Architecture
The architecture comprises two principal components:
- Vision-Language Backbone: Based on Qwen3.5, an early-fusion multimodal Transformer. It uses a 2D ViT for visual patch tokenization with spatial merging, integrating image and text tokens as a single sequence. Hybrid attention combines gated linear and grouped-query Softmax for efficient global-local reasoning. Positional information is encoded using 1D RoPE for text and spatial RoPE within the ViT.
- DiT-Based Action Decoder: A flow-matching policy head with 16 DiT blocks (approximately 1.13B parameters) performs generative trajectory modeling. The decoder takes as input the concatenation of VLM hidden states and the noisy action chunk, and applies full attention with multi-section RoPE aligned to the backbone. Its outputs are instantaneous velocity fields.
Additional components include MLP-based action encoder/decoder, VLM-to-DiT linear projections, timestep embedding, and AdaLN modulations.
4. Embodiment-Aware Prompt Conditioning
To unify policy control over different robot embodiments, samples are prefixed with a textual prompt that characterizes the robot, its configuration, and control regime:
“The robot is {robot_tag} with {single arm / dual arms}[, waist] [, and mobile base]. The control frequency is {FPS} Hz. Please predict the next {chunk_size} control actions to execute the following task: {ori_instruction}.”
This prompt embeds robot tag, modifiers, control rate, and action chunk horizon. Prompt tokens are fed to the VLM, with the resulting hidden vectors concatenated alongside the noisy action chunk in the DiT. The prompt remains the sole embodiment signal, fully informing channel semantics, chunk length, and control conventions.
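The template above can be filled in programmatically. The following is a minimal sketch: the function name, argument names, and the example values (`franka`, 20 Hz, 16 actions) are hypothetical and chosen only for illustration.

```python
def build_prompt(robot_tag, dual_arm, fps, chunk_size, instruction,
                 has_waist=False, has_mobile_base=False):
    """Fill the embodiment prompt template, adding the optional modifiers."""
    arms = "dual arms" if dual_arm else "single arm"
    body = f"The robot is {robot_tag} with {arms}"
    if has_waist:
        body += ", waist"
    if has_mobile_base:
        body += ", and mobile base"
    return (f"{body}. The control frequency is {fps} Hz. "
            f"Please predict the next {chunk_size} control actions "
            f"to execute the following task: {instruction}.")

# Hypothetical single-arm example.
prompt = build_prompt("franka", dual_arm=False, fps=20, chunk_size=16,
                      instruction="pick up the red block")
```

Since the prompt is the only embodiment signal, changing `chunk_size` or the modifiers in the text is how a caller would tell the model which action layout to produce.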
5. Multi-Stage Pretraining Recipe
Qwen-VLA employs a sequential training regimen, each stage consolidating language-to-action grounding:
- Stage I (Text-to-Action DiT Pretraining): The VLM is frozen; images are withheld. The DiT is trained exclusively on language + embodiment prompt → action trajectories (combining synthetic and real robot data). This phase builds a compact, language-indexed action prior.
- Stage II (Continued Pretraining): Both VLM and DiT are unfrozen and trained on the joint multimodal task mixture, grounding the previously learned action prior in visual context and adapting the VLM to embodied perception.
- Stage III (Supervised Fine-Tuning): The model is fine-tuned from the previous checkpoint with a mixture of curated vision-language (VL), vision-language-action (VLA), and vision-language navigation (VLN) tasks, as well as real-robot teleoperation data. The loss is a weighted combination of the action and vision-language terms.
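- Stage IV (Reinforcement Learning): On-policy PPO is deployed on SimplerEnv using sparse rewards. The policy surrogate objective is:
$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon_{\text{clip}},\, 1+\epsilon_{\text{clip}}\big)\,\hat{A}_t\right)\right]$$
with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$, advantage estimate $\hat{A}_t$, and clipping range $\epsilon_{\text{clip}}$. The value head is attached to the VLM, and off-policy density estimation is realized by converting the flow-matching ODE into an SDE, resulting in explicit Gaussian log-probabilities per denoising step.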
Throughout all stages, careful task mixing and dynamic up-weighting of vision-language examples prevent drift in the VLM’s semantic grounding.
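For Stage IV, the role of the per-step Gaussian log-probabilities can be sketched as follows. This assumes the standard PPO clipped surrogate and an isotropic Gaussian for each SDE denoising step. The clipping value of 0.2 is a common default rather than a figure from the source, and all function names are hypothetical.

```python
import math

def gaussian_log_prob(x, mean, std):
    """Log-density of an isotropic Gaussian, summed over dimensions."""
    return sum(-0.5 * ((xi - mi) / std) ** 2 - math.log(std)
               - 0.5 * math.log(2 * math.pi)
               for xi, mi in zip(x, mean))

def chain_log_prob(steps):
    """Log-probability of a denoising chain: sum of the per-step Gaussian terms.
    Each step is a (sample, mean, std) triple."""
    return sum(gaussian_log_prob(x, mean, std) for x, mean, std in steps)

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over samples (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio new / old policy
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The point of the ODE-to-SDE conversion is that a deterministic ODE sampler assigns no usable density to its output, whereas each stochastic step has a closed-form Gaussian likelihood that can feed the ratio in `ppo_clip_objective`.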
6. Text-to-Action Mapping and Manifold Structure
T2A's initial phase, training the DiT solely on language and the embodiment prompt, serves as a compression exercise: the network learns to decompress a sentence-plus-embodiment description into a plausible, high-dimensional action trajectory. The channel-masked action tensor, accompanied by a validity mask, provides a scalable representation across navigation (3-DoF waypoints), manipulation (6–30 DoF joint-space), and fine-grained grasping (eigengrasps + SE(3) deltas). Masking ensures only valid trajectory elements are supervised. The flow-matching loss anchors the denoising process, so that at inference the DiT serves as a continuous trajectory sampler whose velocity field is integrated from noise to an action chunk via Euler steps.
This suggests that T2A builds a coherently structured, language-indexed manifold over trajectory space, facilitating generalization across domains and robot morphologies.
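The Euler-step sampling described in this section can be sketched with a toy velocity field standing in for the DiT. The integration direction (noise at flow time 0, actions at flow time 1), the step count, zeroing of invalid channels at output, and every name below are assumptions made for illustration. The toy field is the exact velocity of a linear path toward a fixed target, so the sampler should land on that target.

```python
import random

def euler_sample(velocity_fn, horizon, channels, mask, num_steps=10, rng=None):
    """Integrate a velocity field from Gaussian noise to an action chunk."""
    rng = rng or random.Random()
    chunk = [[rng.gauss(0, 1) for _ in range(channels)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt  # assumed flow time, running from 0 (noise) toward 1
        velocity = velocity_fn(chunk, tau)
        chunk = [[x + dt * v for x, v in zip(row, v_row)]
                 for row, v_row in zip(chunk, velocity)]
    # Zero out channels the embodiment does not use (an assumed output convention).
    return [[x * m for x, m in zip(row, m_row)] for row, m_row in zip(chunk, mask)]

# Hypothetical per-channel target that the toy field transports every row toward.
TARGET = [0.5, -0.5, 0.25]

def toy_velocity(chunk, tau):
    """Velocity of a linear path to TARGET; a stand-in for the learned DiT."""
    return [[(t - x) / (1.0 - tau) for x, t in zip(row, TARGET)] for row in chunk]

mask = [[1, 1, 0] for _ in range(4)]
sample = euler_sample(toy_velocity, horizon=4, channels=3, mask=mask,
                      num_steps=10, rng=random.Random(0))
```

In a real system `velocity_fn` would also receive the VLM hidden states for the prompt. Here that conditioning is omitted to keep the example self-contained.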
7. Significance and Empirical Outcomes
Qwen-VLA's instantiation of T2A pretraining demonstrates consistent multi-task and out-of-distribution performance across manipulation, navigation, and trajectory-centric benchmarks under variations in scene layout, object configuration, lighting, and embodiment. Notable results include 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation (Wang et al., 28 May 2026). This unified approach obviates the need for separate architectures per task or robot, validating T2A as a scalable paradigm for generalist embodied intelligence.