Modularized Cross-embodiment Transformer (MXT)
- The paper introduces MXT as an embodiment-aware transformer that explicitly encodes robot morphology via kinematic tokenization, topology-aware attention, and FiLM-based joint conditioning.
- MXT improves cross-robot policy learning, delivering up to 2–3× higher success rates than vanilla VLA transformer approaches across single- and multi-embodiment scenarios.
- The architecture modularly retains standard vision and language encoders while adding specialized modules to enable scalable, morphology-driven policy transfer.
The Modularized Cross-embodiment Transformer (MXT) is an embodiment-aware transformer-based policy architecture for cross-robot policy learning, designed to address the inherent limitations of traditional vision-language-action (VLA) transformer models in generalizing across heterogeneous robot embodiments. By explicitly integrating robot morphology within the action-policy module, MXT enables robust policy transfer across disparate robotic platforms, achieving significant empirical improvements over vanilla approaches in both single- and multi-embodiment settings (Suzuki et al., 26 Feb 2026).
1. High-Level Architectural Framework
MXT retains the core structure of a standard VLA policy, specifically the π₀.₅ model, which utilizes separate image and language-prompt encoders to produce sequential “context” tokens. An autoregressive, diffusion-based action expert further generates a sequence of future joint-space commands.
MXT departs from baseline approaches by explicitly introducing robot morphology at the action-policy level via three specialized modules:
- Kinematic Tokens (KT): Factorize and compress the joint-action space into per-joint, temporally chunked tokens.
- Topology-Aware Attention Bias: Incorporate robot kinematic topology as an inductive bias in joint-related self-attention blocks.
- Joint-Attribute Conditioning (FiLM): Modulate joint token embeddings with affine transforms parameterized by each joint’s local physical descriptors.
Critically, all cross-modal attention mechanisms between vision, language, and action in the VLA backbone are preserved; morphology-specific customization is restricted to the joint-to-joint attention sub-block.
2. Kinematic Tokenization and Temporal Chunking
Given $N$ actuated joints, prediction horizon $H$, and $G$ temporal chunks of size $C = H/G$, MXT forms $N \cdot G$ per-joint, temporally compressed kinematic tokens. For each chunk $k \in \{0, \dots, G-1\}$, the temporal index set $T_k = \{kC, \dots, (k+1)C - 1\}$ partitions the future time steps.
For joint $j$, the sequence of raw future actions in chunk $k$ is:
$$a_{j,T_k} = (a_{t,j})_{t \in T_k} \in \mathbb{R}^{C},$$
where $a_{t,j}$ denotes the action for joint $j$ at time $t$.
The kinematic token embedding $z_{j,T_k} \in \mathbb{R}^{d}$ is produced via a lightweight MLP encoder (“Enc₀”):
$$z_{j,T_k} = \mathrm{Enc}_0(a_{j,T_k}).$$
The set of $N \cdot G$ kinematic tokens is inserted into the transformer’s overall token sequence. Optionally, additional capacity can be obtained by employing a separate auxiliary encoder $\mathrm{Enc}_k$ per chunk.
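The chunking and per-chunk encoding described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the two-layer ReLU encoder standing in for Enc₀ and all shapes and names are assumptions.

```python
import numpy as np

def chunk_indices(H, G):
    """Partition horizon H into G contiguous index sets T_0..T_{G-1} (assumes G divides H)."""
    C = H // G
    return [list(range(k * C, (k + 1) * C)) for k in range(G)]

def kinematic_tokens(actions, G, W1, b1, W2, b2):
    """actions: (H, N) future joint commands -> (N, G, d) per-joint chunk tokens.

    A tiny two-layer MLP (a stand-in for Enc0) maps each joint's C-step chunk
    to a d-dimensional kinematic token.
    """
    H, N = actions.shape
    chunks = chunk_indices(H, G)
    tokens = []
    for j in range(N):                      # one token stream per joint
        per_chunk = []
        for T_k in chunks:
            a = actions[T_k, j]             # raw chunk a_{j,T_k} in R^C
            h = np.maximum(W1 @ a + b1, 0)  # hidden layer with ReLU
            per_chunk.append(W2 @ h + b2)   # token z_{j,T_k} in R^d
        tokens.append(np.stack(per_chunk))
    return np.stack(tokens)                 # (N, G, d)
```

With a horizon of 8 steps and 4 chunks, each joint contributes 4 tokens summarizing 2 future steps each.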
3. Topology-Aware Joint Attention Bias
The interaction across joint tokens is explicitly modulated by the robot’s kinematic topology, formalized as an undirected graph $\mathcal{G} = (V, E)$ with $|V| = N$ joints. Several attention biasing mechanisms are implemented:
- Adjacency Indicator: $A_{ij} = 1$ if $(i,j) \in E$ or $i = j$, else $A_{ij} = 0$.
- Hard-Mask (Full-Mask): At each transformer layer $\ell$, set $B^{(\ell)}_{ij} = 0$ if $A_{ij} = 1$, otherwise $B^{(\ell)}_{ij} = -\infty$ (restricts message passing to 1-hop neighbors or the node itself).
- Mix-Mask: Alternates between Full-Mask (even $\ell$) and unrestricted attention (odd $\ell$).
- Soft-Mask (SPD-Bias): Utilizes the shortest-path distance $\mathrm{SPD}(i,j)$ in $\mathcal{G}$ to select from a learnable bias table $b[\cdot]$, resulting in $B^{(\ell)}_{ij} = b[\mathrm{SPD}(i,j)]$.
The modified self-attention for joint tokens at layer $\ell$ is given by:
$$\mathrm{Attn}^{(\ell)}_{ij} = \operatorname{softmax}_j\!\left(\frac{q_i^{\top} k_j}{\sqrt{d}} + B^{(\ell)}_{ij}\right),$$
where $q_i$ and $k_j$ are the query and key projections, respectively.
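The bias constructions above can be sketched concretely. The following NumPy snippet builds the hard mask and the SPD bias from a joint adjacency structure and applies an additive bias inside softmax attention; the function names and the BFS-based distance computation are illustrative assumptions, not the paper's code.

```python
import numpy as np
from collections import deque

def shortest_path_dists(adj):
    """All-pairs shortest-path distances on the unweighted joint graph via BFS."""
    N = len(adj)
    D = np.full((N, N), N, dtype=int)  # N acts as an "unreachable" sentinel
    for s in range(N):
        D[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(N):
                if adj[u][v] and D[s, v] > D[s, u] + 1:
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D

def full_mask_bias(adj):
    """Hard mask: bias 0 for self or 1-hop neighbors, -inf elsewhere."""
    A = np.array(adj, dtype=bool) | np.eye(len(adj), dtype=bool)
    return np.where(A, 0.0, -np.inf)

def spd_bias(adj, table):
    """Soft mask: learnable scalar per shortest-path distance, B_ij = table[SPD(i,j)]."""
    D = shortest_path_dists(adj)
    return table[np.minimum(D, len(table) - 1)]  # clip distances to table length

def biased_attention(Q, K, B):
    """Row-wise softmax of (Q K^T)/sqrt(d) + B; -inf bias zeroes an attention weight."""
    logits = Q @ K.T / np.sqrt(Q.shape[1]) + B
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

On a three-joint kinematic chain 0–1–2, the hard mask forbids joint 0 from attending to joint 2, while the SPD bias merely down-weights that pair.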
4. Joint-Attribute Conditioning via FiLM
Each joint $j$ is represented by a feature vector $f_j$, which may include categorical joint types, axis orientation, limit parameters, and physical attributes such as (log-)damping, friction, and (log-)stiffness. A FiLM (Feature-wise Linear Modulation) network computes a pair of affine parameters per joint:
$$(\gamma_j, \beta_j) = \mathrm{FiLM}(f_j),$$
which are applied to the per-joint token embedding:
$$\tilde{z}_{j,T_k} = \gamma_j \odot z_{j,T_k} + \beta_j,$$
where $\odot$ denotes element-wise multiplication.
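This conditioning step is a one-liner in practice. The sketch below uses a single linear layer to produce the scale and shift from joint descriptors (the linear FiLM generator and all shapes are assumptions; the paper's network may be deeper):

```python
import numpy as np

def film_condition(z, f, Wg, bg, Wb, bb):
    """FiLM: per-joint affine modulation of kinematic token embeddings.

    z: (N, G, d) kinematic tokens; f: (N, p) joint descriptors
    (e.g. joint-type one-hot, axis, limits, log-damping/stiffness).
    """
    gamma = f @ Wg.T + bg  # (N, d) per-joint scale
    beta = f @ Wb.T + bb   # (N, d) per-joint shift
    # Broadcast the same (gamma_j, beta_j) across all G chunk tokens of joint j.
    return gamma[:, None, :] * z + beta[:, None, :]
```

Because the same affine pair is broadcast across a joint's chunk tokens, the modulation encodes the joint's physical identity rather than its temporal content.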
5. Action Head and Joint-Factorized Decoding
After $L$ transformer layers, MXT maintains two sets of updated embeddings: time-coupled action tokens and joint-factorized kinematic tokens. The original diffusion-expert head from π₀.₅ is preserved, but its output is now conditioned on these morphology-aware representations via cross-attention or MLP fusion.
Decoding per joint proceeds as follows:
```
for each joint j:
    gather its G conditioned tokens {tilde_z_{j,T_0}, ..., tilde_z_{j,T_{G-1}}}
    stack into Z_j ∈ ℝ^{G×d}
    pooled_j = Pool(Z_j)                 # average or attentional pooling
    μ_j, σ_j = HeadMLP_j(pooled_j)
    predict vector hat_b_{j,T_k} ~ 𝒩(μ_j, σ_j)
    assemble full action predictions hat_a_{t,j} from chunks
```
By this joint-factorized scheme, the MXT action head efficiently reconstructs per-joint, per-chunk predictions, repacking these into the final output sequence.
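The repacking step can be sketched as follows. This is a deliberately deterministic stand-in for the diffusion head: a single shared linear head (an assumption; the paper uses per-joint heads feeding the diffusion expert) maps each conditioned chunk token back to its C future steps, and the chunks are concatenated along time.

```python
import numpy as np

def decode_actions(z_tilde, W, b):
    """(N, G, d) conditioned tokens -> (H, N) action predictions, H = G*C.

    Each chunk token tilde_z_{j,T_k} is decoded to the C actions of chunk k
    for joint j; per-joint chunks are tiled along time to rebuild the horizon.
    """
    N, G, d = z_tilde.shape
    C = W.shape[0]                        # steps recovered per chunk token
    out = np.empty((G * C, N))
    for j in range(N):
        for k in range(G):
            out[k * C:(k + 1) * C, j] = W @ z_tilde[j, k] + b
    return out
```

The double loop makes the joint-factorized structure explicit; in a real implementation it would be a single batched matrix multiply.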
6. Empirical Evaluation and Performance Gains
MXT demonstrates substantial quantitative improvements over the vanilla VLA baseline across both single- and multi-embodiment language-conditioned pick-and-place simulation tasks:
| Configuration | Success Rate (SR%) | 95% CI |
|---|---|---|
| π₀.₅ (Panda, DROID subset) | 19.7 | ±4.5 |
| + Kinematic Tokens | 36.0 | ±5.4 |
| + Mix-Mask Topology | 36.9 | ±5.4 |
| + FiLM (no mask) | 37.7 | ±5.5 |
| Full MXT (KT + Mix-Mask + FiLM) | 47.4 | ±5.6 |
| π₀.₅ (Unitree G1 Dex1, 16-DoF) | 24.7 | ±4.9 |
| Full MXT (Dex1) | 28.0 | ±5.0 |
| π₀.₅ (Panda + SO101, 50k steps) | 5.0 | — |
| MXT (Panda + SO101, 50k) | 15.5 | — |
| π₀.₅ (Panda + SO101, 125k) | 17.5 | — |
| MXT (Panda + SO101, 125k) | 20.7 | — |
MXT improves success rate over the vanilla transformer in every tested regime, with relative gains reaching 2–3× in the Panda and multi-embodiment settings, indicating improved within- and cross-embodiment generalization (Suzuki et al., 26 Feb 2026).
7. Significance, Modularity, and Relation to Prior Art
MXT provides a modular “morphology module” for any VLA transformer policy without altering upstream vision or language encodings. By introducing per-joint tokenization, explicit kinematic graph-based attention biasing, and FiLM-based joint attribute conditioning at the policy level, MXT enables robust transfer across diverse robotic embodiments.
Compared to approaches such as X-VLA, which employs embodiment-specific soft prompts for cross-embodiment adaptation in a parameter-efficient manner (Zheng et al., 11 Oct 2025), MXT instead structurally encodes kinematic and physical properties into the policy transformer itself. A plausible implication is that explicit graph-based and descriptor-based wiring—in contrast to soft prompt conditioning—may provide superior adaptation in domains where morphology strongly structures feasible action spaces.
MXT thereby constitutes a principled advance in the architecture of generalist robotic policies, systematically leveraging morphology both as inductive bias and as a guide for scalable transfer in cross-embodiment scenarios.