RT-1-X: 35M-Param Multi-Robot Transformer
- RT-1-X is a 35M-parameter transformer model designed for multi-embodiment robotic manipulation with unified visual and language inputs.
- It employs multi-modal fusion via FiLM to process standardized vision-language tokens, enabling effective cross-robot generalization.
- The model predicts discretized 7-DoF action commands and demonstrates positive transfer across 22 robot embodiments with no explicit adaptation layers.
RT-1-X is a 35 million-parameter transformer-based model for robotic manipulation that enables multi-embodiment policy learning across a diverse set of robots and tasks. As part of the RT-X family, RT-1-X is designed to ingest visual and linguistic information from a unified data format, generating discretized end-effector actions with the goal of positive transfer across distinct robot morphologies. It builds upon the RT-1 architecture and is contrasted with the much larger RT-2-X model, which utilizes vision-language backbones. The core innovation of RT-1-X lies in its capability to be co-trained on “X-embodiment”—large-scale datasets containing demonstrations from multiple robots—with no explicit per-robot adaptation layers, leveraging multi-modal fusion and action prediction to generalize across platforms (Collaboration et al., 2023).
1. Design Objectives and Data Formulation
RT-1-X targets generalization of robot manipulation skills by consolidating policy representations into a single high-capacity transformer model. The central objective is to learn a shared policy capable of operating a spectrum of robots by exploiting positive transfer from cross-robot demonstration data. The data assembled for this effort span 22 robot embodiments contributed by 21 collaborating institutions, with inputs and outputs standardized to facilitate consistent training and evaluation. This approach contrasts with earlier methods that rely on task- or robot-specific models.
Key data curation features include:
- A unified “robotics mixture” dataset representing 9 robot manipulators and 527 unique skills (160,266 tasks), aligned to a coarse 7-DoF action space.
- Standardization of camera placement, visual history, and language instructions to reduce cross-embodiment variance (a hypothetical per-timestep record schema is sketched below).
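To make the unified data format concrete, here is a minimal sketch of what one standardized timestep in such a mixture could look like; the `UnifiedStep` type, field names, and shapes are illustrative assumptions rather than the published schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UnifiedStep:
    """One standardized timestep in a cross-embodiment mixture (hypothetical schema)."""
    image: np.ndarray        # canonical RGB frame, e.g. (256, 256, 3), uint8
    instruction: str         # natural-language task description
    action_xyz: np.ndarray   # end-effector translation command, shape (3,)
    action_rpy: np.ndarray   # end-effector rotation command (roll, pitch, yaw), shape (3,)
    gripper: float           # gripper open/close command
    is_terminal: bool        # episode termination ("done") flag
    embodiment_id: str       # provenance metadata; not an input to the model
```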
2. Input Representations and Multi-Modal Encoding
RT-1-X employs a fixed input protocol that synthesizes multi-modal data into sequence tokens suitable for transformer processing.
Visual Input:
- Each robot is instrumented with a single canonical RGB camera, with all frames resized uniformly (e.g., 256×256).
- The model operates over a short window of the most recent frames in the trajectory's image history.
- Each image is processed by an ImageNet-pretrained EfficientNet backbone; the resulting feature maps are flattened into tokens (a minimal sketch follows this list).
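A minimal sketch of this image-to-token step, assuming a torchvision EfficientNet-B3 backbone as a stand-in for the ImageNet-pretrained EfficientNet; the exact variant, input resolution, and any token-compression stage are assumptions here.

```python
import torch
import torchvision

# Pretrained EfficientNet-B3 feature extractor (illustrative stand-in for the paper's backbone).
backbone = torchvision.models.efficientnet_b3(weights="IMAGENET1K_V1").features
backbone.eval()

frame = torch.rand(1, 3, 256, 256)        # one resized RGB frame (batch of 1)
with torch.no_grad():
    fmap = backbone(frame)                # (1, C, H', W') spatial feature map
tokens = fmap.flatten(2).transpose(1, 2)  # (1, H'*W', C): one token per spatial cell
print(tokens.shape)
```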
Language Input:
- Task instructions are given as natural language and converted to embeddings using a pretrained Universal Sentence Encoder (USE).
Multi-Modal Fusion (FiLM):
- Fusion of visual and linguistic information is achieved via Feature-Wise Linear Modulation (FiLM): the USE embedding generates per-channel scale and shift parameters that modulate the EfficientNet feature maps prior to tokenization, resulting in approximately 81 vision-language tokens per sample (see the sketch below).
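A minimal FiLM sketch under these assumptions: a 512-dimensional USE-style sentence embedding conditions a convolutional feature map via learned per-channel scale and shift. The `FiLM` module name, channel count, and 9×9 spatial size are illustrative.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Language-conditioned per-channel scale and shift applied to a visual feature map."""
    def __init__(self, lang_dim: int = 512, num_channels: int = 1536):
        super().__init__()
        self.to_scale = nn.Linear(lang_dim, num_channels)
        self.to_shift = nn.Linear(lang_dim, num_channels)

    def forward(self, fmap: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) visual features; lang: (B, lang_dim) sentence embedding
        scale = self.to_scale(lang)[:, :, None, None]  # (B, C, 1, 1)
        shift = self.to_shift(lang)[:, :, None, None]
        return (1 + scale) * fmap + shift              # modulated feature map

film = FiLM()
fused = film(torch.rand(1, 1536, 9, 9), torch.rand(1, 512))
tokens = fused.flatten(2).transpose(1, 2)              # (1, 81, 1536): ~81 vision-language tokens
```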
No explicit proprioceptive or force-sensor inputs are described, with all relevant state information represented implicitly through images and action outputs. Embodiment-specific differences are handled exclusively through variation in visual inputs.
3. Transformer Architecture and Mathematical Formalism
RT-1-X adopts a causal decoder-only transformer architecture with autoregressive action prediction. Exact architectural hyperparameters, such as the number of layers, hidden dimension, number of attention heads, and feedforward width, are not specified.
- Standard 1D positional embeddings are used for both the concatenated vision-language token sequence and the step position in the output action sequence.
Core Operations
- Multi-head self-attention: each head computes scaled dot-product attention,
  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$
  with $Q = XW^Q$, $K = XW^K$, $V = XW^V$; head outputs are concatenated and linearly projected.
- Feed-forward block: $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$.
- Decoder block: a causally masked self-attention sublayer followed by the feed-forward sublayer, each with a residual connection and layer normalization, e.g. $x' = x + \mathrm{MHSA}(\mathrm{LN}(x))$ and $y = x' + \mathrm{FFN}(\mathrm{LN}(x'))$.
The decoder acts autoregressively over action tokens at each time step.
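A minimal sketch of one such causal decoder block; the layer sizes below (a 512-dimensional model width, 8 heads, 2048-wide feed-forward) are illustrative placeholders, since the source does not report RT-1-X's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder block with causal self-attention (illustrative sizes)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        return x + self.ffn(self.ln2(x))

block = DecoderBlock()
out = block(torch.rand(2, 81 + 8, 512))   # vision-language tokens followed by action tokens
```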
4. Output Space and Action Decoding
RT-1-X learns to predict a discretized 7-DoF end-effector command along with an episodic termination signal.
Action Space Structure:
| Dimension | Discretization | Description |
|---|---|---|
| x, y, z | 256 bins each | Translational DoF |
| roll, pitch, yaw | 256 bins each | Rotational DoF |
| Gripper (open/close) | 256 bins | Gripper state |
| “Done” token | 1 token | Episode termination |
Together, these per-dimension discretizations define the model's output vocabulary. For each time step, the transformer decodes all 8 tokens (7 action dimensions plus “done”) sequentially via a linear projection followed by a softmax (a discretization sketch follows).
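A minimal sketch of the per-dimension discretization and its inverse; the 256-bin count follows the table above, while the per-dimension [low, high] ranges here are illustrative assumptions.

```python
import numpy as np

NUM_BINS = 256

def discretize(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous action dimension into one of `num_bins` uniform bins."""
    value = np.clip(value, low, high)
    return int((value - low) / (high - low) * (num_bins - 1) + 0.5)

def undiscretize(token: int, low: float, high: float, num_bins: int = NUM_BINS) -> float:
    """Recover the bin-center continuous value from a predicted action token."""
    return low + token / (num_bins - 1) * (high - low)

# Example: a translation delta of +0.03 m on a hypothetical [-0.1, 0.1] m range.
tok = discretize(0.03, -0.1, 0.1)
print(tok, undiscretize(tok, -0.1, 0.1))
```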
5. Training Protocol and Cross-Embodiment Generalization
RT-1-X is trained from scratch on the coalesced robotics mixture. The loss is the cross-entropy between predicted and target action tokens for the 8-token output at each time step. There are no robot-specific parameter partitions, robot-identity embeddings, or adapter structures; the single shared network must infer embodiment context from its inputs and generalize the policy across all robot instances.
Positive transfer across robot platforms is an explicit observed property: performance improves on individual robots when the model is jointly trained on multi-embodiment data. Inference runs at 3–10 Hz on physical robots.
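A minimal sketch of this per-timestep objective, assuming a 256-way vocabulary shared across the 8 decoded outputs; batch size and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

batch, num_outputs, vocab = 4, 8, 256                                 # illustrative shapes

logits = torch.randn(batch, num_outputs, vocab, requires_grad=True)   # model predictions
targets = torch.randint(0, vocab, (batch, num_outputs))               # discretized target actions

# Cross-entropy over all 8 decoded tokens (7 action dimensions plus "done") per time step.
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
print(loss.item())
```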
6. Innovation and Comparison to Related Models
RT-1-X’s principal distinction is not a new architecture: it is architecturally identical to RT-1, with the “-X” designation denoting its cross-embodiment training regimen. Unlike RT-2-X, which scales to 5–55B parameters and utilizes pre-trained vision-language backbones that emit actions as text, RT-1-X remains in the 35M-parameter regime and retains structured action tokenization. No per-robot modules or explicit adaptation layers are introduced; all adaptation arises from variation in visual inputs and cross-domain training.
This model should be distinguished from prior single-embodiment architectures by its use of multi-embodiment training and tight data curation standards, which facilitate effective transfer and scaling across the robotics domain (Collaboration et al., 2023).