RT-1-X: 35M-Param Multi-Robot Transformer

Updated 20 March 2026
  • RT-1-X is a 35M-parameter transformer model designed for multi-embodiment robotic manipulation with unified visual and language inputs.
  • It employs multi-modal fusion via FiLM to process standardized vision-language tokens, enabling effective cross-robot generalization.
  • The model predicts discretized 7-DoF action commands and demonstrates positive transfer across 22 robot embodiments with no explicit adaptation layers.

RT-1-X is a 35 million-parameter transformer-based model for robotic manipulation that enables multi-embodiment policy learning across a diverse set of robots and tasks. As part of the RT-X family, RT-1-X is designed to ingest visual and linguistic information from a unified data format, generating discretized end-effector actions with the goal of positive transfer across distinct robot morphologies. It builds upon the RT-1 architecture and is contrasted with the much larger RT-2-X model, which utilizes vision-language backbones. The core innovation of RT-1-X lies in its capability to be co-trained on “X-embodiment”—large-scale datasets containing demonstrations from multiple robots—with no explicit per-robot adaptation layers, leveraging multi-modal fusion and action prediction to generalize across platforms (Collaboration et al., 2023).

1. Design Objectives and Data Formulation

RT-1-X targets generalization of robot manipulation skills by consolidating policy representations into a single high-capacity transformer model. The central objective is to learn a shared policy capable of operating a spectrum of robots by exploiting positive transfer from cross-robot demonstration data. The data assembled for RT-1-X comprise demonstrations from 22 robot embodiments contributed by 21 collaborating institutions, with inputs and outputs standardized to facilitate consistent training and evaluation. This approach contrasts with earlier methods that rely on task- or robot-specific models.

Key data curation features include:

  • A unified “robotics mixture” dataset representing 9 robot manipulators and 527 unique skills (160,266 tasks), aligned to a coarse 7-DoF action space.
  • Standardization of camera placement, visual history, and language instructions to reduce cross-embodiment variance.
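To make the standardization concrete, the following is a minimal sketch of what one time step of such a unified sample could look like; the field names, image resolution, and ordering of the 7 action dimensions are illustrative assumptions, not the actual Open X-Embodiment schema.

```python
# Illustrative sketch of a standardized cross-embodiment sample (one time step).
# Field names and shapes are assumptions for exposition, not the real schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class XEmbodimentStep:
    image: np.ndarray        # (256, 256, 3) uint8 frame from the canonical RGB camera
    instruction: str         # natural-language task string
    action: np.ndarray       # (7,) float32: dx, dy, dz, droll, dpitch, dyaw, gripper
    is_terminal: bool        # episode "done" flag

def to_unified_step(image, instruction, action, is_terminal=False):
    """Coerce a robot-specific log entry into the shared coarse 7-DoF format."""
    assert image.shape == (256, 256, 3) and action.shape == (7,)
    return XEmbodimentStep(image=image.astype(np.uint8),
                           instruction=str(instruction),
                           action=action.astype(np.float32),
                           is_terminal=bool(is_terminal))
```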

2. Input Representations and Multi-Modal Encoding

RT-1-X employs a fixed input protocol that synthesizes multi-modal data into sequence tokens suitable for transformer processing.

Visual Input:

  • Each robot is instrumented with a single canonical RGB camera, with all frames resized uniformly (e.g., 256×256).
  • The model operates over the last H = 15 images from the trajectory's history.
  • Each image is processed by an ImageNet-pretrained EfficientNet backbone; resulting feature maps are flattened into tokens.

Language Input:

  • Task instructions are given as natural language and converted to embeddings using a pretrained Universal Sentence Encoder (USE).

Multi-Modal Fusion (FiLM):

  • Fusion of visual and linguistic information is achieved via Feature-Wise Linear Modulation (FiLM). The USE embedding generates per-feature scale and shift parameters modulating EfficientNet feature maps pre-tokenization, resulting in approximately 81 vision-language tokens per sample.
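As a concrete illustration of this fusion step, the sketch below conditions a 9×9×512 feature map (which flattens to the 81 tokens mentioned above) on a 512-dimensional sentence embedding. The layer sizes and the identity-initialised (1 + γ) scaling are assumptions; the exact FiLM formulation used in RT-1-X may differ.

```python
# Minimal FiLM-conditioning sketch: a text embedding produces per-channel
# scale (gamma) and shift (beta) that modulate visual feature maps.
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    def __init__(self, text_dim=512, num_channels=512):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, feat_map, text_emb):
        # feat_map: (B, C, H, W) visual features; text_emb: (B, text_dim) sentence embedding
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        return (1.0 + gamma) * feat_map + beta   # identity-initialised modulation (assumed)

film = FiLMLayer()
feats = torch.randn(1, 512, 9, 9)                       # stand-in for an EfficientNet feature map
text = torch.randn(1, 512)                               # stand-in for a USE embedding
tokens = film(feats, text).flatten(2).transpose(1, 2)    # (1, 81, 512) vision-language tokens
```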

No explicit proprioceptive or force-sensor inputs are described, with all relevant state information represented implicitly through images and action outputs. Embodiment-specific differences are handled exclusively through variation in visual inputs.

3. Transformer Architecture and Mathematical Formalism

RT-1-X adopts a causal decoder-only transformer architecture with autoregressive action prediction. Exact architectural hyperparameters such as the number of layers L, hidden dimension d_model, number of attention heads h, and feed-forward width d_ff are not specified.

Positional Encoding

  • Standard 1D positional embeddings are used for both the concatenated vision-language token sequence and the step position in the output action sequence.
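A minimal sketch of such additive 1-D positional embeddings follows; the sequence length, width, and the choice of a learned (rather than fixed sinusoidal) table are placeholder assumptions.

```python
# Additive learned 1-D positional embeddings over a token sequence.
import torch
import torch.nn as nn

seq_len, d_model = 96, 512                  # placeholder sequence length and model width
pos_table = nn.Embedding(seq_len, d_model)  # one learned vector per position

tokens = torch.randn(1, seq_len, d_model)       # (batch, tokens, d_model)
positions = torch.arange(seq_len).unsqueeze(0)  # (1, seq_len)
tokens = tokens + pos_table(positions)          # inject position information
```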

Core Operations

  • Multi-head self-attention:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \quad \mathrm{head}_i = \mathrm{Attention}(X W^Q_i,\, X W^K_i,\, X W^V_i)

  • Feed-forward block:

\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2

  • Decoder block:

\begin{aligned} x' &= x + \mathrm{MultiHead}(\mathrm{LayerNorm}(x)), \\ x'' &= x' + \mathrm{FFN}(\mathrm{LayerNorm}(x')), \\ \mathrm{output} &= \mathrm{LayerNorm}(x'') \end{aligned}

The decoder acts autoregressively over action tokens at each time step.
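The block below is a direct PyTorch transcription of the three equations above; because the source does not specify d_model, h, or d_ff, the values used here are placeholders rather than RT-1-X's actual hyperparameters.

```python
# Pre-LayerNorm decoder block implementing the equations in this section.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):   # placeholder sizes
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln_out = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask=None):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out                      # x'  = x  + MultiHead(LayerNorm(x))
        x = x + self.ffn(self.ln2(x))         # x'' = x' + FFN(LayerNorm(x'))
        return self.ln_out(x)                 # output = LayerNorm(x'')

# Causal mask so each position attends only to itself and earlier tokens.
seq = torch.randn(2, 96, 512)                 # (batch, tokens, d_model)
mask = nn.Transformer.generate_square_subsequent_mask(96)
out = DecoderBlock()(seq, causal_mask=mask)
```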

4. Output Space and Action Decoding

RT-1-X learns to predict a discretized 7-DoF end-effector command along with an episodic termination signal.

Action Space Structure:

Dimension               Discretization    Description
Δx, Δy, Δz              256 bins each     Translational DoF
Δroll, Δpitch, Δyaw     256 bins each     Rotational DoF
Gripper (open/close)    256 bins          Gripper state
“Done” token            1 token           Episode termination

This yields a total output vocabulary size of 7 × 256 + 1 = 1793 tokens. For each time step, the transformer decodes all 8 tokens (7 actions plus “done”) sequentially via a linear projection followed by softmax.
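One way to realize this tokenization, consistent with the 1,793-token vocabulary above, is to give each of the 7 dimensions its own contiguous range of 256 ids and reserve a final id for “done”; the per-dimension value bounds below are illustrative assumptions.

```python
# Sketch of 256-bin action discretization and its inverse (bounds are assumed).
import numpy as np

NUM_BINS = 256
ACTION_DIMS = 7                       # dx, dy, dz, droll, dpitch, dyaw, gripper
DONE_TOKEN = ACTION_DIMS * NUM_BINS   # id 1792; total vocabulary size 1793

def discretize(action, low=-1.0, high=1.0):
    """Map a continuous 7-DoF action to 7 token ids (one 256-way bin per dimension)."""
    norm = (np.clip(action, low, high) - low) / (high - low)        # -> [0, 1]
    bins = np.minimum((norm * NUM_BINS).astype(int), NUM_BINS - 1)  # -> {0, ..., 255}
    return bins + np.arange(ACTION_DIMS) * NUM_BINS                 # disjoint id range per dim

def undiscretize(token_ids, low=-1.0, high=1.0):
    """Invert the mapping, returning the bin centers."""
    bins = token_ids - np.arange(ACTION_DIMS) * NUM_BINS
    return low + (bins + 0.5) / NUM_BINS * (high - low)

action = np.array([0.02, -0.10, 0.05, 0.0, 0.0, 0.30, 1.0])
tokens = discretize(action)        # 7 integer ids in [0, 1791]
approx = undiscretize(tokens)      # recovered action, quantized to bin centers
```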

5. Training Protocol and Cross-Embodiment Generalization

RT-1-X is trained from scratch on the coalesced robotics mixture. The loss is the cross-entropy between predicted and target action tokens for the 8-token output per time step. There is no per-robot weight sharing, parameter partitioning, robot-identity embedding, or adapter structure; the full network must generalize context and policy across all robot instances.
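A minimal sketch of that objective follows, assuming the decoder emits one logit vector over the 1,793-token vocabulary for each of the 8 output positions per time step.

```python
# Per-timestep cross-entropy over the 8 decoded tokens (7 action dims + "done").
import torch
import torch.nn.functional as F

VOCAB = 7 * 256 + 1   # 1793 output tokens

def action_token_loss(logits, target_tokens):
    """logits: (B, 8, VOCAB) decoder outputs; target_tokens: (B, 8) integer ids."""
    return F.cross_entropy(logits.reshape(-1, VOCAB), target_tokens.reshape(-1))

logits = torch.randn(4, 8, VOCAB)            # dummy batch of decoder outputs
targets = torch.randint(0, VOCAB, (4, 8))    # dummy target action tokens
loss = action_token_loss(logits, targets)
```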

Positive transfer across robot platforms is an explicit observed property: performance improves on individual robots when the model is jointly trained on multi-embodiment data. Inference runs at 3–10 Hz on physical robots.

RT-1-X’s principal architectural characteristic is its strict identity with RT-1; the “-X” designation denotes its cross-embodiment training regimen rather than any structural change. Unlike RT-2-X, which scales to 5–55B parameters and utilizes pre-trained vision-language backbones that emit actions as text, RT-1-X remains in the 35M-parameter regime and retains structured action tokenization. No per-robot modules or explicit adaptation layers are introduced; all adaptation arises from visual input variance and cross-domain training.

This model should be distinguished from prior single-embodiment architectures by its use of multi-embodiment training and tight data curation standards, which facilitate effective transfer and scaling across the robotics domain (Collaboration et al., 2023).

References

  • Open X-Embodiment Collaboration et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv:2310.08864.
