Text-to-Action Pretraining

Updated 29 May 2026

Text-to-Action Pretraining is a paradigm that combines a vision-language backbone with a DiT-based action decoder to translate natural language into continuous control actions.
It uses multi-task objectives including flow-matching action loss and vision-language next-token loss to ensure effective grounding and generalization across diverse robotic tasks.
Empirical results demonstrate robust performance in manipulation, navigation, and dynamic tasks, validating T2A as a scalable framework for embodied AI.

Text-to-Action (T2A) Pretraining is a paradigm for building generalist embodied AI agents that map natural language instructions to continuous action trajectories across diverse environments, task families, and robot embodiments. The approach is exemplified in Qwen-VLA, which unifies vision, language, and action modeling using a two-part foundation: a vision-language multimodal Transformer and a DiT-based action decoder. T2A pretraining first establishes a language-indexed action prior, then grounds that prior in large-scale multimodal data, with the aim of reasoning and control that transfer across robotic platforms (Wang et al., 28 May 2026).

1. Pretraining Objectives and Losses

The Qwen-VLA framework implements multi-task pretraining objectives by interleaving samples from manipulation, navigation, and vision-language (VL) tasks in each minibatch. The key losses are:

Flow-Matching Action Loss: The action trajectory $Y_0 \in \mathbb{R}^{H \times K}$ is stochastically interpolated with Gaussian noise $Y_1 \sim \mathcal{N}(0, I)$, forming a noisy action chunk $Y_\tau = (1 - \tau) Y_0 + \tau Y_1$, where $\tau \in [0, 1]$. The DiT-based policy head predicts a velocity field $v_\theta(Y_\tau, \tau \mid o_{1:t}, x, e, z)$. A binary mask $M \in \{0,1\}^{H \times K}$ enforces channel- and horizon-specific supervision. The per-channel MSE loss

$\ell_k = \frac{ \sum_{h=1}^H M_{h,k} \| (v_\theta(Y_\tau, \tau|\cdots) - (Y_1 - Y_0))_{h,k} \|_2^2 }{ \sum_{h=1}^H M_{h,k} }$

is averaged to yield

$\mathcal{L}_{\rm act} = \mathbb{E}_{\tau, Y_0, Y_1} \left[ \frac{1}{c} \sum_{k=0}^{c-1} \ell_k \right].$
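
The masked per-channel loss above can be illustrated with a minimal pure-Python sketch. The function names `interpolate` and `masked_flow_matching_loss` are hypothetical, tensors are plain nested lists of shape H x K, and averaging only over channels that have at least one supervised step is an assumption about how the channel average is taken, not a detail confirmed by the source.

```python
def interpolate(y0, y1, tau):
    """Noisy action chunk Y_tau = (1 - tau) * Y_0 + tau * Y_1 (H x K nested lists)."""
    return [[(1 - tau) * a + tau * b for a, b in zip(r0, r1)]
            for r0, r1 in zip(y0, y1)]


def masked_flow_matching_loss(v_pred, y0, y1, mask):
    """Masked MSE between the predicted velocity and the target Y_1 - Y_0.

    Each channel k is normalised by its number of supervised steps.
    Assumption: channels with no supervised step are left out of the average.
    """
    horizon, channels = len(y0), len(y0[0])
    per_channel = []
    for k in range(channels):
        denom = sum(mask[h][k] for h in range(horizon))
        if denom == 0:
            continue  # nothing to supervise on this channel
        num = sum(
            mask[h][k] * (v_pred[h][k] - (y1[h][k] - y0[h][k])) ** 2
            for h in range(horizon)
        )
        per_channel.append(num / denom)
    return sum(per_channel) / len(per_channel) if per_channel else 0.0
```

In a training loop, `v_pred` would come from the policy head evaluated on `interpolate(y0, y1, tau)` for a sampled `tau`; here it is simply passed in so the loss arithmetic can be checked in isolation.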

Vision-Language Next-Token Loss: To maintain perception and reasoning capabilities, a cross-entropy loss is applied:

$\mathcal{L}_{\rm vl} = -\sum_i \log p_\theta(w_i|w_{<i}, o_{1:t}),$

where $w_i$ are tokens from captioning, VQA, and grounding datasets.

Combined Multi-Task Objective: Balancing action and VL gradients using weights $\lambda_{\rm act}$ and $\lambda_{\rm vl}$:

$\mathcal{L} = \lambda_{\rm act} \mathcal{L}_{\rm act} + \lambda_{\rm vl} \mathcal{L}_{\rm vl}$

2. Pretraining Data and Task Mixture

T2A relies on a heterogeneous data mix, paired at the episodic level to enable transferable mapping. The composition is as follows:

Data Family	Example Modalities	Mixture Proportion
Robot manipulation trajectories	Vision, language, actions	74.2%
Egocentric human demonstrations	Egocentric frames, language, wrist/hand pose (eigengrasps + SE(3) deltas)	6.0%
Vision-language navigation	RGB observation, spoken instruction, waypoints	7.5%
Synthetic simulation trajectories	Simulated images, language	3.7%
General vision-language corpora	Images, captions, QA	3.4%
2D spatial grounding	Images, referring expressions	2.5%
Autonomous-driving VQA	Onboard images, driving questions	2.4%
Embodied action captions	Language only	0.2%

This mixture ensures the model is exposed to rich embodiment, sensory, and linguistic variations, constructing a broad action-language manifold.
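
One way to picture how such a mixture could drive minibatch construction is weighted sampling over data families. The sketch below is illustrative only: the dictionary keys and the function `sample_batch_families` are hypothetical names, and drawing the family independently for each sample is an assumption rather than the documented sampling scheme. The proportions are those listed in the table.

```python
import random

# Mixture proportions (percent) taken from the table; keys are hypothetical names.
MIXTURE = {
    "robot_manipulation": 74.2,
    "egocentric_human": 6.0,
    "vl_navigation": 7.5,
    "synthetic_simulation": 3.7,
    "general_vl": 3.4,
    "spatial_grounding_2d": 2.5,
    "driving_vqa": 2.4,
    "embodied_action_captions": 0.2,
}


def sample_batch_families(batch_size, rng):
    """Draw one data family per sample, with probability proportional to its share."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    # random.Random.choices normalises the weights internally.
    return rng.choices(names, weights=weights, k=batch_size)


if __name__ == "__main__":
    batch = sample_batch_families(8, random.Random(0))
    print(batch)
```

Because the listed percentages sum to 99.9 rather than 100, relying on relative weights avoids having to renormalise them by hand.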

3. Model Architecture

The architecture comprises two principal components:

Vision-Language Backbone: Based on Qwen3.5, an early-fusion multimodal Transformer. It uses a 2D ViT for visual patch tokenization with spatial merging, integrating image and text tokens as a single sequence. Hybrid attention combines gated linear and grouped-query Softmax for efficient global-local reasoning. Positional information is encoded using 1D RoPE for text and spatial RoPE within the ViT.
DiT-Based Action Decoder: A flow-matching policy head with 16 DiT blocks (approximately 1.13B parameters) is responsible for generative trajectory modeling. The decoder takes as input the concatenation of VLM hidden states and the noisy action chunk $Y_\tau$, supporting full attention with multi-section RoPE aligned to the backbone. Outputs are instantaneous velocity fields $v_\theta$.

Additional components include MLP-based action encoder/decoder, VLM-to-DiT linear projections, timestep embedding, and AdaLN modulations.

4. Embodiment-Aware Prompt Conditioning

To unify policy control over different robot embodiments, samples are prefixed with a textual prompt that characterizes the robot, its configuration, and control regime:

“The robot is {robot_tag} with {single arm / dual arms}[, waist] [, and mobile base]. The control frequency is {FPS} Hz. Please predict the next {chunk_size} control actions to execute the following task: {ori_instruction}.”

This prompt embeds robot tag, modifiers, control rate, and action chunk horizon. Prompt tokens are fed to the VLM, with the resulting hidden vectors concatenated alongside the noisy action chunk in the DiT. The prompt remains the sole embodiment signal, fully informing channel semantics, chunk length, and control conventions.
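
The template can be filled in programmatically. The following is a small sketch under stated assumptions: the function name `build_embodiment_prompt`, its argument names, and the example values (robot tag, frequency, chunk size, instruction) are all hypothetical; only the sentence structure comes from the template quoted above.

```python
def build_embodiment_prompt(robot_tag, dual_arm, fps, chunk_size, instruction,
                            has_waist=False, has_mobile_base=False):
    """Fill the embodiment-aware prompt template with one robot's settings."""
    arms = "dual arms" if dual_arm else "single arm"
    body = f"The robot is {robot_tag} with {arms}"
    if has_waist:
        body += ", waist"            # optional "[, waist]" modifier
    if has_mobile_base:
        body += ", and mobile base"  # optional "[, and mobile base]" modifier
    return (
        f"{body}. The control frequency is {fps} Hz. "
        f"Please predict the next {chunk_size} control actions "
        f"to execute the following task: {instruction}."
    )


if __name__ == "__main__":
    # Hypothetical example values, not taken from the source.
    print(build_embodiment_prompt("widowx", False, 5, 16, "pick up the spoon"))
```

Keeping every embodiment detail in the text prompt means a new robot needs only a new string, not a new input head, which matches the role the prompt plays as the sole embodiment signal.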

5. Multi-Stage Pretraining Recipe

Qwen-VLA employs a sequential training regimen, each stage consolidating language-to-action grounding:

Stage I (Text-to-Action DiT Pretraining): The VLM is frozen; images are withheld. The DiT is trained exclusively on language + embodiment prompt $\rightarrow$ action trajectories (combining synthetic and real robot data). This phase builds a compact, language-indexed action prior.
Stage II (Continued Pretraining): Both VLM and DiT are unfrozen and trained on the joint multimodal task mixture, grounding the previously learned action prior in visual context and adapting the VLM to embodied perception.
Stage III (Supervised Fine-Tuning): The model is fine-tuned from the previous checkpoint with a mixture of curated vision-language (VL), vision-language-action (VLA), and vision-language navigation (VLN) tasks, as well as real-robot teleoperation data. The training loss is a weighted combination of the action and VL objectives.
Stage IV (Reinforcement Learning): On-policy PPO is deployed on SimplerEnv using sparse rewards. The policy surrogate objective is:

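$\mathcal{L}_{\rm PPO} = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\rm old}}(a_t \mid s_t)$. The value head is attached to the VLM, and off-policy density estimation is realized by converting the flow-matching ODE into an SDE, resulting in explicit Gaussian log-probabilities per denoising step.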
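
To make the role of the Gaussian log-probabilities concrete, here is a minimal sketch of a PPO-style clipped surrogate built from them. It assumes the standard clipped form of PPO, a single shared standard deviation per denoising step, and a clipping range `eps=0.2`; the source states only that PPO with per-step Gaussian log-probabilities is used, so these specifics and the function names are hypothetical.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of an isotropic Gaussian, summed over dimensions."""
    return sum(
        -0.5 * ((xi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate over samples (to be maximised).

    logp_new / logp_old: log-probabilities of the same sampled actions under
    the current and the behaviour policy; advantages: advantage estimates.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                    # probability ratio r_t
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)     # pessimistic bound
    return total / len(advantages)
```

Under this reading, the log-probability of a full sampled action chunk would be the sum of `gaussian_log_prob` over its denoising steps, which is what makes the ratio tractable for a flow-matching policy.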

Throughout all stages, careful task mixing and dynamic up-weighting of vision-language examples prevent drift in the VLM’s semantic grounding.

6. Text-to-Action Mapping and Manifold Structure

T2A's initial phase, in which the DiT is trained solely on language and the embodiment prompt, serves as a compression exercise. The network learns to decompress a sentence-plus-embodiment description into a plausible, high-dimensional action trajectory. The channel-masked action tensor $Y_0 \in \mathbb{R}^{H \times K}$, accompanied by validity mask $M \in \{0,1\}^{H \times K}$, provides a scalable representation across navigation (3-DoF waypoints), manipulation (6–30 DoF joint-space), and fine-grained grasping (eigengrasps + SE(3) deltas). Masking ensures only valid trajectory elements are supervised. Flow-matching loss formally anchors the denoising process, enabling the DiT to serve as a continuous trajectory sampler at inference, integrated from $\tau = 1$ to $\tau = 0$ via Euler steps.
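
Since the interpolation $Y_\tau = (1 - \tau) Y_0 + \tau Y_1$ places data at $\tau = 0$ and noise at $\tau = 1$, and the regression target is $Y_1 - Y_0$, inference can be sketched as Euler integration from noise toward data. The sketch below is illustrative: `euler_sample`, `velocity_fn`, and the uniform step schedule are hypothetical, and `velocity_fn` stands in for the trained DiT with its conditioning already bound.

```python
import random


def euler_sample(velocity_fn, horizon, channels, num_steps, rng):
    """Integrate dY/dtau = v(Y, tau) from tau = 1 (noise) down to tau = 0 (actions)."""
    # Start from Gaussian noise Y_1 ~ N(0, I).
    y = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = 1.0 - i * dt
        v = velocity_fn(y, tau)
        # The velocity points from data toward noise, so stepping toward
        # tau = 0 subtracts v * dt.
        y = [[yi - dt * vi for yi, vi in zip(y_row, v_row)]
             for y_row, v_row in zip(y, v)]
    return y


if __name__ == "__main__":
    target = [[0.5, -1.0], [2.0, 0.0]]

    def oracle(y, tau):
        # Exact velocity of the straight path that ends at `target`.
        return [[(yi - ti) / tau for yi, ti in zip(y_row, t_row)]
                for y_row, t_row in zip(y, target)]

    print(euler_sample(oracle, 2, 2, 10, random.Random(0)))
```

With the oracle velocity field used in the demo, the final Euler step lands on the target trajectory; a learned velocity field would only approximate this, and the number of steps trades accuracy against inference cost.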

This suggests that T2A builds a coherently structured, language-indexed manifold over trajectory space, facilitating generalization across domains and robot morphologies.

7. Significance and Empirical Outcomes

Qwen-VLA's instantiation of T2A pretraining demonstrates consistent multi-task and out-of-distribution performance across manipulation, navigation, and trajectory-centric benchmarks under variations in scene layout, object configuration, lighting, and embodiment. Notable results include 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation (Wang et al., 28 May 2026). This unified approach obviates the need for separate architectures per task or robot, validating T2A as a scalable paradigm for generalist embodied intelligence.

References (1)

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (2026)
