RT-1-X: 35M-Param Multi-Robot Transformer

Updated 20 March 2026
  • RT-1-X is a 35M-parameter transformer model designed for multi-embodiment robotic manipulation with unified visual and language inputs.
  • It employs multi-modal fusion via FiLM to process standardized vision-language tokens, enabling effective cross-robot generalization.
  • The model predicts discretized 7-DoF action commands and demonstrates positive transfer across 22 robot embodiments with no explicit adaptation layers.

RT-1-X is a 35 million-parameter transformer-based model for robotic manipulation that enables multi-embodiment policy learning across a diverse set of robots and tasks. As part of the RT-X family, RT-1-X is designed to ingest visual and linguistic information from a unified data format, generating discretized end-effector actions with the goal of positive transfer across distinct robot morphologies. It builds upon the RT-1 architecture and is contrasted with the much larger RT-2-X model, which utilizes vision-language backbones. The core innovation of RT-1-X lies in its capability to be co-trained on “X-embodiment”—large-scale datasets containing demonstrations from multiple robots—with no explicit per-robot adaptation layers, leveraging multi-modal fusion and action prediction to generalize across platforms (Collaboration et al., 2023).

1. Design Objectives and Data Formulation

RT-1-X targets generalization of robot manipulation skills by consolidating policy representations into a single high-capacity transformer model. The central objective is to learn a shared policy capable of operating a spectrum of robots by exploiting positive transfer from cross-robot demonstration data. The data assembled for RT-1-X comprise demonstrations from 22 robot embodiments contributed by 21 collaborating institutions, with inputs and outputs standardized to facilitate consistent training and evaluation. This approach contrasts with earlier methods that rely on task- or robot-specific models.

Key data curation features include:

  • A unified “robotics mixture” dataset representing 9 robot manipulators and 527 unique skills (160,266 tasks), aligned to a coarse 7-DoF action space.
  • Standardization of camera placement, visual history, and language instructions to reduce cross-embodiment variance.
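To make the standardization concrete, the following is a minimal sketch of what one time step of such a unified sample could look like; the field names, image resolution, and ordering of the 7 action dimensions are illustrative assumptions, not the actual Open X-Embodiment schema.

```python
# Illustrative sketch of a standardized cross-embodiment sample (one time step).
# Field names and shapes are assumptions for exposition, not the real schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class XEmbodimentStep:
    image: np.ndarray        # (256, 256, 3) uint8 frame from the canonical RGB camera
    instruction: str         # natural-language task string
    action: np.ndarray       # (7,) float32: dx, dy, dz, droll, dpitch, dyaw, gripper
    is_terminal: bool        # episode "done" flag

def to_unified_step(image, instruction, action, is_terminal=False):
    """Coerce a robot-specific log entry into the shared coarse 7-DoF format."""
    assert image.shape == (256, 256, 3) and action.shape == (7,)
    return XEmbodimentStep(image=image.astype(np.uint8),
                           instruction=str(instruction),
                           action=action.astype(np.float32),
                           is_terminal=bool(is_terminal))
```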

2. Input Representations and Multi-Modal Encoding

RT-1-X employs a fixed input protocol that synthesizes multi-modal data into sequence tokens suitable for transformer processing.

Visual Input:

  • Each robot is instrumented with a single canonical RGB camera, with all frames resized uniformly (e.g., 256×256).
  • The model operates over the last H = 15 images from the trajectory's history.
  • Each image is processed by an ImageNet-pretrained EfficientNet backbone; resulting feature maps are flattened into tokens.

Language Input:

  • Task instructions are given as natural language and converted to embeddings using a pretrained Universal Sentence Encoder (USE).

Multi-Modal Fusion (FiLM):

  • Fusion of visual and linguistic information is achieved via Feature-Wise Linear Modulation (FiLM). The USE embedding generates per-feature scale and shift parameters modulating EfficientNet feature maps pre-tokenization, resulting in approximately 81 vision-language tokens per sample.
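As a concrete illustration of this fusion step, the sketch below conditions a 9×9×512 feature map (which flattens to the 81 tokens mentioned above) on a 512-dimensional sentence embedding. The layer sizes and the identity-initialised (1 + γ) scaling are assumptions; the exact FiLM formulation used in RT-1-X may differ.

```python
# Minimal FiLM-conditioning sketch: a text embedding produces per-channel
# scale (gamma) and shift (beta) that modulate visual feature maps.
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    def __init__(self, text_dim=512, num_channels=512):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, num_channels)
        self.to_beta = nn.Linear(text_dim, num_channels)

    def forward(self, feat_map, text_emb):
        # feat_map: (B, C, H, W) visual features; text_emb: (B, text_dim) sentence embedding
        gamma = self.to_gamma(text_emb)[:, :, None, None]
        beta = self.to_beta(text_emb)[:, :, None, None]
        return (1.0 + gamma) * feat_map + beta   # identity-initialised modulation (assumed)

film = FiLMLayer()
feats = torch.randn(1, 512, 9, 9)                       # stand-in for an EfficientNet feature map
text = torch.randn(1, 512)                               # stand-in for a USE embedding
tokens = film(feats, text).flatten(2).transpose(1, 2)    # (1, 81, 512) vision-language tokens
```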

No explicit proprioceptive or force-sensor inputs are described, with all relevant state information represented implicitly through images and action outputs. Embodiment-specific differences are handled exclusively through variation in visual inputs.

3. Transformer Architecture and Mathematical Formalism

RT-1-X adopts a causal decoder-only transformer architecture with autoregressive action prediction. Exact architectural hyperparameters such as the number of layers L, hidden dimension d_model, number of attention heads h, and feed-forward width d_ff are not specified.

Positional Encoding

  • Standard 1D positional embeddings are used for both the concatenated vision-language token sequence and the step position in the output action sequence.
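A minimal sketch of such additive 1-D positional embeddings follows; the sequence length, width, and the choice of a learned (rather than fixed sinusoidal) table are placeholder assumptions.

```python
# Additive learned 1-D positional embeddings over a token sequence.
import torch
import torch.nn as nn

seq_len, d_model = 96, 512                  # placeholder sequence length and model width
pos_table = nn.Embedding(seq_len, d_model)  # one learned vector per position

tokens = torch.randn(1, seq_len, d_model)       # (batch, tokens, d_model)
positions = torch.arange(seq_len).unsqueeze(0)  # (1, seq_len)
tokens = tokens + pos_table(positions)          # inject position information
```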

Core Operations

  • Multi-head self-attention:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \quad \mathrm{head}_i = \mathrm{Attention}(X W^Q_i,\, X W^K_i,\, X W^V_i)

  • Feed-forward block:

\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2

  • Decoder block:

\begin{aligned} x' &= x + \mathrm{MultiHead}(\mathrm{LayerNorm}(x)), \\ x'' &= x' + \mathrm{FFN}(\mathrm{LayerNorm}(x')), \\ \mathrm{output} &= \mathrm{LayerNorm}(x'') \end{aligned}

The decoder acts autoregressively over action tokens at each time step.
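The block below is a direct PyTorch transcription of the three equations above; because the source does not specify d_model, h, or d_ff, the values used here are placeholders rather than RT-1-X's actual hyperparameters.

```python
# Pre-LayerNorm decoder block implementing the equations in this section.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):   # placeholder sizes
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln_out = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask=None):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out                      # x'  = x  + MultiHead(LayerNorm(x))
        x = x + self.ffn(self.ln2(x))         # x'' = x' + FFN(LayerNorm(x'))
        return self.ln_out(x)                 # output = LayerNorm(x'')

# Causal mask so each position attends only to itself and earlier tokens.
seq = torch.randn(2, 96, 512)                 # (batch, tokens, d_model)
mask = nn.Transformer.generate_square_subsequent_mask(96)
out = DecoderBlock()(seq, causal_mask=mask)
```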

4. Output Space and Action Decoding

RT-1-X learns to predict a discretized 7-DoF end-effector command along with an episodic termination signal.

Action Space Structure:

Dimension               Discretization    Description
Δx, Δy, Δz              256 bins each     Translational DoF
Δroll, Δpitch, Δyaw     256 bins each     Rotational DoF
Gripper (open/close)    256 bins          Gripper state
“Done” token            1 token           Episode termination

This yields a total output vocabulary size of 7 × 256 + 1 = 1793 tokens. For each time step, the transformer decodes all 8 tokens (7 actions plus “done”) sequentially via a linear projection followed by softmax.
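One way to realize this tokenization, consistent with the 1,793-token vocabulary above, is to give each of the 7 dimensions its own contiguous range of 256 ids and reserve a final id for “done”; the per-dimension value bounds below are illustrative assumptions.

```python
# Sketch of 256-bin action discretization and its inverse (bounds are assumed).
import numpy as np

NUM_BINS = 256
ACTION_DIMS = 7                       # dx, dy, dz, droll, dpitch, dyaw, gripper
DONE_TOKEN = ACTION_DIMS * NUM_BINS   # id 1792; total vocabulary size 1793

def discretize(action, low=-1.0, high=1.0):
    """Map a continuous 7-DoF action to 7 token ids (one 256-way bin per dimension)."""
    norm = (np.clip(action, low, high) - low) / (high - low)        # -> [0, 1]
    bins = np.minimum((norm * NUM_BINS).astype(int), NUM_BINS - 1)  # -> {0, ..., 255}
    return bins + np.arange(ACTION_DIMS) * NUM_BINS                 # disjoint id range per dim

def undiscretize(token_ids, low=-1.0, high=1.0):
    """Invert the mapping, returning the bin centers."""
    bins = token_ids - np.arange(ACTION_DIMS) * NUM_BINS
    return low + (bins + 0.5) / NUM_BINS * (high - low)

action = np.array([0.02, -0.10, 0.05, 0.0, 0.0, 0.30, 1.0])
tokens = discretize(action)        # 7 integer ids in [0, 1791]
approx = undiscretize(tokens)      # recovered action, quantized to bin centers
```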

5. Training Protocol and Cross-Embodiment Generalization

RT-1-X is trained from scratch on the coalesced robotics mixture. The loss is the cross-entropy between predicted and target action tokens for the 8-token output per time step. There is no per-robot weight sharing, parameter partitioning, robot-identity embedding, or adapter structure; the full network must generalize context and policy across all robot instances.
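A minimal sketch of that objective follows, assuming the decoder emits one logit vector over the 1,793-token vocabulary for each of the 8 output positions per time step.

```python
# Per-timestep cross-entropy over the 8 decoded tokens (7 action dims + "done").
import torch
import torch.nn.functional as F

VOCAB = 7 * 256 + 1   # 1793 output tokens

def action_token_loss(logits, target_tokens):
    """logits: (B, 8, VOCAB) decoder outputs; target_tokens: (B, 8) integer ids."""
    return F.cross_entropy(logits.reshape(-1, VOCAB), target_tokens.reshape(-1))

logits = torch.randn(4, 8, VOCAB)            # dummy batch of decoder outputs
targets = torch.randint(0, VOCAB, (4, 8))    # dummy target action tokens
loss = action_token_loss(logits, targets)
```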

Positive transfer across robot platforms is an explicit observed property: performance improves on individual robots when the model is jointly trained on multi-embodiment data. Inference runs at 3–10 Hz on physical robots.

RT-1-X’s principal architectural characteristic is its strict identity with RT-1; the “-X” designation denotes its cross-embodiment training regimen rather than any structural change. Unlike RT-2-X, which scales to 5–55B parameters and utilizes pre-trained vision-language backbones that emit actions as text, RT-1-X remains in the 35M-parameter regime and retains structured action tokenization. No per-robot modules or explicit adaptation layers are introduced; all adaptation arises from visual input variance and cross-domain training.

This model should be distinguished from prior single-embodiment architectures by its use of multi-embodiment training and tight data curation standards, which facilitate effective transfer and scaling across the robotics domain (Collaboration et al., 2023).

References

  • Open X-Embodiment Collaboration et al. (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv:2310.08864.
