RT-2: Vision-Language-Action Transformer

Updated 3 March 2026

RT-2 is a large-scale vision-language-action model that transforms visual inputs and instructions into robotic control commands using text tokenization.
The architecture fuses multimodal data with ViT encoders and language backbones to generate discrete or vectorized action tokens with strong generalization.
By integrating SARA-RT’s linear attention mechanism, RT-2 achieves real-time performance and improved accuracy on diverse robotic manipulation tasks.

The RT-2 model refers to a class of large-scale vision-language-action (VLA) transformer architectures designed for robotic control, particularly those enabling direct transfer of web-scale vision and language knowledge into closed-loop robotic policies. RT-2's main innovation is casting robotic actions as discrete or vectorized text tokens, allowing the model to leverage advances in vision-language modeling and web-scale pretraining for generalization, reasoning, and semantic manipulation tasks in robotics. The model has significant relevance in both robotics and scalable neural sequence modeling.

1. Architectural Foundations

RT-2 builds upon state-of-the-art vision-language models (VLMs), notably the PaLI-X encoder–decoder (with up to 55B parameters) and PaLM-E (decoder-only, up to 12B parameters) (Brohan et al., 2023). The model's input pipeline comprises: (a) visual observations (one or more RGB frames, tokenized into patch embeddings), (b) a natural language instruction, and (c) an optional sequence of historical observations. These sequences are concatenated and passed through a fuser block, enabling multimodal self-attention across both image and text modalities. Each token sequence is processed using deep ViT-style encoders and language backbones with standard quadratic-time self-attention.

At output, RT-2 produces a sequence of discrete or vectorized "action tokens" representing low-level robotic commands, including 6-DoF end-effector deltas, gripper commands, and a special "stop" symbol. Action discretization employs 256 uniformly quantized bins per degree of freedom, mapped into the model's text token space, or directly serialized vector tokens (floating-point values rounded to 4 decimal places and tokenized).
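
The two action encodings described above can be sketched as follows. This is a minimal illustration, not the RT-2 implementation: the function names, the normalized action range of [-1, 1], and the choice of decoding to bin centres are assumptions made for the example. Only the 256 uniform bins and the 4-decimal rounding come from the text.

```python
import numpy as np

NUM_BINS = 256  # uniform bins per degree of freedom, as stated in the text


def discretize(action, low, high, num_bins=NUM_BINS):
    """Map continuous values in [low, high] to integer bin ids in [0, num_bins - 1]."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    scaled = (action - low) / (high - low)  # now in [0, 1]
    # The upper edge (scaled == 1.0) would land in bin `num_bins`, so cap it.
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)


def undiscretize(bins, low, high, num_bins=NUM_BINS):
    """Map bin ids back to continuous values (bin centres, an assumed convention)."""
    bins = np.asarray(bins, dtype=np.float64)
    return low + (bins + 0.5) / num_bins * (high - low)


def bins_to_text(bins):
    """Serialize bin ids as a space-separated string of integers."""
    return " ".join(str(int(b)) for b in bins)


def vector_to_text(action):
    """Serialize raw floats rounded to 4 decimal places (the 'vectorized' variant)."""
    return " ".join(f"{float(x):.4f}" for x in action)


# Hypothetical 7-D action: 6-DoF end-effector delta plus a gripper command.
example_action = [0.25, -0.5, 0.0, 1.0, -1.0, 0.75, 0.5]
example_bins = discretize(example_action, low=-1.0, high=1.0)
print(bins_to_text(example_bins))
print(vector_to_text(example_action))
```

With bin-centre decoding, the round-trip error of the discrete encoding is bounded by half a bin width, which is (high - low) / 512 for 256 bins.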

2. Training Regimen and Data

RT-2 is co-fine-tuned on a hybrid mixture of internet-scale VQA/captioning datasets and robot trajectory data (notably, the RT-1 dataset, with O(10⁴–10⁵) demonstrations from 13 robots). In each training batch, robot data are up-sampled—comprising 50–66% of the minibatch—to prevent catastrophic forgetting of low-level control. The unified autoregressive next-token loss is applied to both web and robotics tokens:

$\mathcal{L}(\theta)= \mathbb{E}_{(I,Q,y)\sim D_{VL}} \left[-\sum_{t=1}^{T_{VL}} \log p_\theta(y_t | y_{<t}, I, Q)\right] + \mathbb{E}_{(I,Q,a)\sim D_A} \left[-\sum_{t=1}^8 \log p_\theta(\tilde a_t|\tilde a_{<t},I,Q)\right]$

where $D_{VL}$ is web-scale vision–language data and $D_A$ is robot trajectory/action data (Brohan et al., 2023).
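
A small NumPy sketch of the two ingredients above: a summed next-token negative log-likelihood for one sequence, and a batch sampler that fixes the share of robot examples. The helper names, the sampling-with-replacement scheme, and the default `robot_fraction` are illustrative assumptions. The text only states that robot data make up 50–66% of each minibatch and that the same loss is applied to both data sources.

```python
import numpy as np


def sequence_nll(logits, targets):
    """Sum over t of -log p(target_t), given per-step logits of shape (T, V).

    The logits are assumed to come from a teacher-forced forward pass, so
    row t is already conditioned on the image, the instruction, and tokens < t.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.int64)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()


def sample_mixed_batch(rng, web_examples, robot_examples, batch_size,
                       robot_fraction=0.5):
    """Draw a batch in which a fixed fraction of the items are robot examples."""
    n_robot = int(round(batch_size * robot_fraction))
    n_web = batch_size - n_robot
    robot_idx = rng.integers(0, len(robot_examples), size=n_robot)
    web_idx = rng.integers(0, len(web_examples), size=n_web)
    return ([robot_examples[i] for i in robot_idx]
            + [web_examples[i] for i in web_idx])


rng = np.random.default_rng(0)
batch = sample_mixed_batch(rng, ["web"] * 100, ["robot"] * 10,
                           batch_size=8, robot_fraction=0.5)

# Sanity check: uniform logits over V tokens give a loss of T * log(V).
uniform_loss = sequence_nll(np.zeros((8, 256)), np.zeros(8, dtype=np.int64))
```

Because the action targets $\tilde a_t$ are ordinary tokens, `sequence_nll` is the same function for a caption and for an 8-token action string. Only the source of the batch item differs.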

3. Emergent Generalization and Semantic Capabilities

RT-2 exhibits strong emergent abilities not present in the underlying robot training data, attributable to its web-scale VLM pretraining and unified token space for actions:

Generalization to novel objects and environments: RT-2 achieves out-of-distribution "unseen" success rates of ∼62% on hard tasks (vs. 32–35% for RT-1/MOO baselines) (Brohan et al., 2023).
Semantic reasoning: The model is capable of symbol grounding, compositional reasoning ("move apple to 3", "pick up the smallest object"), and even referencing specific persons by name or attribute ("move coke can to person with glasses").
Chain-of-thought (CoT) reasoning: By prepending explicit "Plan:" natural language steps before the action tokens, RT-2 can perform multi-stage semantic reasoning (e.g., identifying a rock as an improvised hammer or choosing an energy drink for a tired user).
Language–action commingling: Actions from a continuous control space are output verbatim as text tokens or vector strings, allowing the model to learn in the same format as web LLMs.

4. Inference Efficiency and the Quadratic Attention Bottleneck

The primary implementation bottleneck in RT-2 is its reliance on standard $O(N^2)$ softmax attention in the ViT encoder, where $N$ is the token sequence length ($N \sim 200$ to $1000+$, depending on history, image resolution, and multimodal fusion). This quadratic complexity is prohibitive for on-robot deployment under hard real-time constraints (control rates of at least 10 Hz, i.e. at most 100 ms per control step), especially at higher image resolutions or multi-frame histories (Leal et al., 2023).
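
To make the scaling concrete, the following back-of-the-envelope sketch counts query-key pairs per attention head and converts a control rate into a per-step time budget. It counts only the $N \times N$ score matrix and ignores constant factors, head count, and layer count, so it illustrates the growth rate and is not a latency model. The function names are made up for the example.

```python
def attention_score_entries(n_tokens):
    """Query-key pairs evaluated by one softmax attention head: N * N."""
    return n_tokens * n_tokens


def step_budget_ms(control_hz):
    """Maximum time per control step, in milliseconds, at a given control rate."""
    return 1000.0 / control_hz


baseline = attention_score_entries(200)
for n in (200, 500, 1000):
    growth = attention_score_entries(n) / baseline
    print(f"N={n}: {attention_score_entries(n)} score entries, {growth:.2f}x the N=200 cost")

print(f"Budget at 10 Hz: {step_budget_ms(10):.0f} ms per step")
```

Going from 200 to 1000 tokens multiplies the sequence length by 5 but the score matrix by 25, while the 10 Hz budget stays fixed at 100 ms.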

To address this, the SARA-RT methodology enables the replacement of standard softmax attention with a provably accurate linear attention mechanism, without loss in downstream manipulation quality.

5. SARA-RT: Self-Adaptive Robust Attention for RT-2

SARA-RT applies "up-training" to transform pre-trained RT-2 models into efficient, linear-time variants suitable for real-time robotic inference. The approach replaces each softmax attention block with a self-adaptive, trainable kernel approximation:

Feature map factorization: The softmax attention kernel $K(x,y)=\exp(x^\top y)$ is approximated by a learned low-dimensional feature mapping $\varphi$: $K(x,y)\approx\varphi_1(x)^\top\varphi_2(y)$, where $\varphi_1(z)=v\odot f(G_Q z)$ and $\varphi_2(z)=v\odot f(G_K z)$ for learned matrices $G_Q, G_K$, a learned vector $v$, and a nonlinearity $f$.
Linear complexity: Attention aggregation and output for a sequence of queries/keys are computed in $O(Nm+md_{QK})$ with $m\ll N$, where $m$ is the dimension of the feature map, so the cost scales linearly with sequence length.
Up-training: Only a small number of epochs (typically 0.5–2 on downstream data) are required to retrain the new attention maps $G,v$ for the target task.
Empirical results: SARA-RT-2 preserves mean manipulation accuracy relative to RT-2 (65.8% for RT-2 vs. 65.1% for SARA-RT-2), although per-task results move in both directions, and reaches 76.4% with enriched context and vector action tokenization. Forward-pass latency drops by approximately 14% (53.2 ms $\to$ 45.7 ms), and higher image resolutions (up to 512×512) become feasible within real-time constraints (Leal et al., 2023).
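
The factorization above can be illustrated with a short NumPy sketch of kernelized attention using ReLU features. This is a sketch under stated assumptions, not the SARA-RT code: `G_Q`, `G_K`, and `v` are random here, whereas in SARA-RT they are learned during up-training, and the row-sum normalization is the usual linear-attention form, which the text does not spell out. The point of the example is that reordering the matrix products gives the same output without ever forming an $N \times N$ matrix.

```python
import numpy as np


def feature_map(X, G, v):
    """phi(z) = v * relu(G z), applied to each row of X. Shapes: X (N, d), G (m, d), v (m,)."""
    return v * np.maximum(X @ G.T, 0.0)


def linear_attention(Q, K, V, G_Q, G_K, v, eps=1e-6):
    """Attention via phi_q @ (phi_k.T @ V); cost grows linearly in N."""
    phi_q = feature_map(Q, G_Q, v)       # (N, m)
    phi_k = feature_map(K, G_K, v)       # (N, m)
    kv = phi_k.T @ V                     # (m, d_v), summed over keys once
    k_sum = phi_k.sum(axis=0)            # (m,), used for normalization
    numerator = phi_q @ kv               # (N, d_v)
    denominator = phi_q @ k_sum + eps    # (N,)
    return numerator / denominator[:, None]


def quadratic_reference(Q, K, V, G_Q, G_K, v, eps=1e-6):
    """Same kernel, but materializing the full (N, N) score matrix for comparison."""
    scores = feature_map(Q, G_Q, v) @ feature_map(K, G_K, v).T  # (N, N)
    return (scores @ V) / (scores.sum(axis=1) + eps)[:, None]


# Hypothetical sizes: N tokens, head dimension d, m random features.
rng = np.random.default_rng(0)
N, d, m = 64, 16, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
G_Q = rng.standard_normal((m, d))  # stand-in for a learned matrix
G_K = rng.standard_normal((m, d))  # stand-in for a learned matrix
v = rng.standard_normal(m)         # stand-in for a learned vector

fast = linear_attention(Q, K, V, G_Q, G_K, v)
slow = quadratic_reference(Q, K, V, G_Q, G_K, v)
print(np.allclose(fast, slow, atol=1e-6))
```

Since $v$ multiplies both feature maps, each kernel value is a sum of terms $v_i^2 f(\cdot) f(\cdot)$, which is non-negative when $f$ is ReLU, so the normalizer cannot change sign.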

RT-2 and SARA-RT-2: Manipulation Task Accuracy

Variant	pick	knock	open drawer	place drawer	upright	move	diverse-pick	mean
RT-2 (baseline)	81%	86%	67%	39%	57%	98%	33%	65.8%
SARA-RT-2 (linear attention)	83%	91%	78%	31%	46%	79%	48%	65.1%
SARA-RT-2 (3-frame, vector-act.)	100%	91%	89%	56%	51%	81%	67%	76.4%

Adding a 3-frame history and vectorized action tokens, made possible only by linear attention, yields a substantial mean performance increase.
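
The summary figures can be recomputed directly from the table. The short script below does so, using only the per-task percentages and the two latency values quoted in the text. The dictionary keys are just labels copied from the table rows.

```python
# Per-task success rates (%), in the column order of the table:
# pick, knock, open drawer, place drawer, upright, move, diverse-pick
results = {
    "RT-2 (baseline)": [81, 86, 67, 39, 57, 98, 33],
    "SARA-RT-2 (linear attention)": [83, 91, 78, 31, 46, 79, 48],
    "SARA-RT-2 (3-frame, vector-act.)": [100, 91, 89, 56, 51, 81, 67],
}

# Unweighted mean over the seven tasks for each variant.
means = {name: sum(vals) / len(vals) for name, vals in results.items()}
for name, mean in means.items():
    print(f"{name}: {mean:.2f}%")

# Gain from adding a 3-frame history and vector action tokens.
history_gain = (means["SARA-RT-2 (3-frame, vector-act.)"]
                - means["SARA-RT-2 (linear attention)"])
print(f"3-frame + vector actions vs. linear attention alone: +{history_gain:.1f} points")

# Relative latency reduction from the quoted forward-pass timings.
latency_before_ms, latency_after_ms = 53.2, 45.7
latency_reduction = (latency_before_ms - latency_after_ms) / latency_before_ms * 100
print(f"Latency reduction: {latency_reduction:.1f}%")
```

The recomputed means agree with the table's mean column to within 0.1 percentage points, and the latency ratio matches the quoted reduction of roughly 14%.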

6. Trade-Offs, Deployment, and Open Challenges

Key trade-offs and deployment strategies for RT-2/SARA-RT-2 include:

Accuracy/speed trade-off: SARA-RT-2 maintains mean manipulation accuracy (65.1% vs. 65.8% for RT-2) while cutting forward-pass latency by roughly 14%. The use of vector action string tokenizers further enhances success rates.
Contextual window and history: Enabling histories of $H>1$ frames is computationally infeasible with quadratic attention but efficient in SARA-RT-2, improving context-awareness and mean accuracy by $10+$ percentage points.
Feature nonlinearity: In practice, ReLU is preferred over exponential or square-root nonlinearities in attention maps, due to variance and stability constraints.
On-robot configuration: For real-time control at $\geq 10$ Hz, single-frame inputs with SARA-ViT, ReLU features, and vector action tokens are suggested. Richer histories and higher image resolutions are now accessible without violating latency constraints.

RT-2's current limitations include the inability to acquire novel low-level physical skills outside the robot training distribution, the need for extremely large models and hardware (up to 55B parameters, 1–3 Hz rates on large models), and restricted availability of open-source VLMs suitable for such tokenization/fine-tuning paradigms (Brohan et al., 2023, Leal et al., 2023).

7. Significance and Future Directions

RT-2 exemplifies a major unification of web-scale semantic modeling and closed-loop robotic control. By recasting robot policy generation as VLA sequence modeling in a shared token space, RT-2 enables transfer of emergent reasoning and generalization abilities from VLMs to robots. The integration of SARA-RT further permits scaling input windows and real-time feasible deployment without loss of control fidelity.

Planned future directions include scaling robot data to encompass new classes of motion (potentially from human video), pushing deployment speeds to $\geq 10$ Hz through quantization/distillation, and expanding the ecosystem of open-source VLMs and fine-tuning APIs for democratized research (Brohan et al., 2023, Leal et al., 2023).

A plausible implication is that, as model and dataset scale increase and inference bottlenecks are further mitigated, VLA models in the RT-2 family will enable more robust robotic generalization, richer reasoning, and wider baseline adoption across real-world domains.


References

Brohan et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.

Leal et al. (2023). SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention.
