
Modularized Cross-embodiment Transformer (MXT)

Updated 20 March 2026
  • The paper introduces MXT as an embodiment-aware transformer that explicitly encodes robot morphology via kinematic tokenization, topology-aware attention, and FiLM-based joint conditioning.
  • MXT improves cross-robot policy learning, delivering up to 2–3× higher success rates than the vanilla VLA transformer baseline across single- and multi-embodiment scenarios.
  • The architecture modularly retains standard vision and language encoders while adding specialized modules to enable scalable, morphology-driven policy transfer.

The Modularized Cross-embodiment Transformer (MXT) is an embodiment-aware transformer-based policy architecture for cross-robot policy learning, designed to address the inherent limitations of traditional vision-language-action (VLA) transformer models in generalizing across heterogeneous robot embodiments. By explicitly integrating robot morphology within the action-policy module, MXT enables robust policy transfer across disparate robotic platforms, achieving significant empirical improvements over vanilla approaches in both single- and multi-embodiment settings (Suzuki et al., 26 Feb 2026).

1. High-Level Architectural Framework

MXT retains the core structure of a standard VLA policy, specifically the π₀.₅ model, which utilizes separate image and language-prompt encoders to produce sequential “context” tokens. An autoregressive, diffusion-based action expert further generates a sequence of future joint-space commands.

MXT departs from baseline approaches by explicitly introducing robot morphology at the action-policy level via three specialized modules:

  • Kinematic Tokens (KT): Factorize and compress the joint-action space into per-joint, temporally chunked tokens.
  • Topology-Aware Attention Bias: Incorporate robot kinematic topology as an inductive bias in joint-related self-attention blocks.
  • Joint-Attribute Conditioning (FiLM): Modulate joint token embeddings with affine transforms parameterized by each joint’s local physical descriptors.

Critically, all cross-modal attention mechanisms between vision, language, and action in the VLA backbone are preserved; morphology-specific customization is restricted to the joint-to-joint attention sub-block.

2. Kinematic Tokenization and Temporal Chunking

Given $J$ actuated joints, prediction horizon $H$, and $G$ temporal chunks with chunk size $g = H/G$, MXT forms per-joint, temporally compressed kinematic tokens. For each chunk $k \in \{0, \ldots, G-1\}$, the temporal index set $T_k = \{ k \cdot g, \ldots, (k+1) \cdot g - 1 \}$ partitions the future time steps.

For joint $j$, the sequence of raw future actions in chunk $k$ is:

$$b_{j, T_k} := [a_{t, j}]_{t \in T_k} \in \mathbb{R}^g,$$

where $a_{t,j}$ denotes the action for joint $j$ at time $t$.

The kinematic token embedding $z_{j,T_k}$ is produced via a lightweight MLP encoder ("Enc₀"):

$$z_{j,T_k} = \mathrm{Enc}_0(b_{j,T_k}) \in \mathbb{R}^d.$$

The set of $J \cdot G$ kinematic tokens is inserted into the transformer's overall token sequence. Optionally, additional capacity can be obtained by employing $M$ auxiliary encoders per chunk.
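A minimal sketch of the chunking and tokenization step, using NumPy with a hypothetical single-hidden-layer `enc0` standing in for Enc₀ (all sizes and weights are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

J, H, G = 7, 16, 4          # joints, prediction horizon, temporal chunks
g = H // G                  # chunk size g = H / G
d = 32                      # token embedding dimension

# Raw future actions: one scalar command per joint per time step, a_{t,j}.
actions = rng.normal(size=(H, J))

# Hypothetical Enc_0: a single-hidden-layer MLP mapping R^g -> R^d.
W1 = rng.normal(size=(g, 64)) * 0.1
W2 = rng.normal(size=(64, d)) * 0.1
def enc0(b):                 # b: (g,) chunked actions b_{j,T_k}
    return np.tanh(b @ W1) @ W2

# Build the J*G kinematic tokens, one per (joint, chunk) pair.
tokens = np.stack([
    enc0(actions[k * g:(k + 1) * g, j])   # b_{j,T_k} -> z_{j,T_k}
    for j in range(J) for k in range(G)
])
print(tokens.shape)          # (J*G, d) = (28, 32)
```

Each row of `tokens` is one $z_{j,T_k}$; in MXT these rows join the vision/language context tokens in the transformer's input sequence.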

3. Topology-Aware Joint Attention Bias

The interaction across joint tokens is explicitly modulated by the robot's kinematic topology, formalized as an undirected graph $\mathcal{G} = (V, E)$ with $J = |V|$ joints. Several attention biasing mechanisms are implemented:

  • Adjacency Indicator: $M_{ij} = 1$ if $i = j$ or $(i, j) \in E$, else $0$.
  • Hard-Mask (Full-Mask): At each transformer layer $\ell$, set $B_{ij}^{(\ell)} = 0$ if $M_{ij} = 1$, $-\infty$ otherwise (restricts message passing to 1-hop neighbors or the node itself).
  • Mix-Mask: Alternates between Full-Mask (even $\ell$) and unrestricted attention (odd $\ell$).
  • Soft-Mask (SPD-Bias): Utilizes the shortest-path distance $d(i,j)$ in $\mathcal{G}$ to index a learnable table $\theta^{(\ell)}[0 \ldots D_{\max}]$, resulting in $B_{ij}^{(\ell)} = \theta^{(\ell)}[d(i,j)]$.

The modified self-attention for joint tokens at layer \ell is given by:

$$\alpha_{ij}^{(\ell)} = \mathrm{softmax}_j \left( \frac{Q_i^{(\ell)} \cdot K_j^{(\ell)\top}}{\sqrt{d}} + B_{ij}^{(\ell)} \right),$$

where $Q_i^{(\ell)}$ and $K_j^{(\ell)}$ are the query and key projections, respectively.
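The Soft-Mask (SPD-Bias) variant can be sketched as follows, with BFS-computed shortest-path distances on a toy serial-chain graph; random values stand in for the learned bias table and the query/key projections:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-joint serial chain: edges connect consecutive joints.
J = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

adj = [[] for _ in range(J)]
for i, j in edges:
    adj[i].append(j); adj[j].append(i)

def bfs_dist(src):
    """All shortest-path distances from `src` on the kinematic graph."""
    dist = [-1] * J; dist[src] = 0; queue = [src]
    for u in queue:
        for v in adj[u]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

D = np.array([bfs_dist(i) for i in range(J)])   # d(i, j) matrix

# Soft-Mask: per-distance bias table theta[0..D_max] (stand-in for learned).
theta = rng.normal(size=D.max() + 1)
B = theta[D]                                    # B_ij = theta[d(i, j)]

# Biased attention for one layer (random Q, K stand in for projections).
d_model = 8
Q = rng.normal(size=(J, d_model))
K = rng.normal(size=(J, d_model))
logits = Q @ K.T / np.sqrt(d_model) + B
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(alpha.sum(axis=1))                        # each row sums to 1
```

The Hard-Mask variant would instead set `B = np.where(D <= 1, 0.0, -np.inf)`, restricting each token to itself and its 1-hop neighbors.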

4. Joint-Attribute Conditioning via FiLM

Each joint $j$ is represented by a feature vector $s_j \in \mathbb{R}^F$, which may include categorical joint types, axis orientation, limit parameters, and physical attributes such as (log-)damping, friction, and (log-)stiffness. A FiLM (Feature-wise Linear Modulation) network computes a pair of affine parameters $(\gamma_j, \beta_j) \in \mathbb{R}^d \times \mathbb{R}^d$ per joint:

$$(\gamma_j, \beta_j) = \mathrm{FiLM}(s_j),$$

which are applied to the per-joint token embedding:

$$\tilde{z}_{j,T_k} = (1 + \gamma_j) \odot z_{j,T_k} + \beta_j,$$

where $\odot$ denotes element-wise multiplication.
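A minimal sketch of the FiLM conditioning step, assuming a single linear layer as the FiLM network (the paper's actual network architecture and descriptor layout may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

F, d = 6, 32   # joint-descriptor dim and token dim (illustrative)

# Hypothetical FiLM net: one linear layer emitting (gamma_j, beta_j).
W = rng.normal(size=(F, 2 * d)) * 0.1
def film(s_j):
    out = s_j @ W
    return out[:d], out[d:]                 # gamma_j, beta_j

# Example joint descriptor s_j: joint type, axis, log-damping, etc.
s_j = rng.normal(size=F)
z = rng.normal(size=d)                      # kinematic token z_{j,T_k}

gamma, beta = film(s_j)
z_tilde = (1 + gamma) * z + beta            # element-wise FiLM modulation
print(z_tilde.shape)                        # (32,)
```

The `1 + gamma` form means a zero-initialized FiLM net starts as the identity, so conditioning perturbs rather than replaces the token embedding.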

5. Action Head and Joint-Factorized Decoding

After $L$ transformer layers, MXT maintains two sets of updated embeddings: time-coupled action tokens and joint-factorized kinematic tokens. The original diffusion-expert head from π₀.₅ is preserved, but its output is now conditioned on these morphology-aware representations via cross-attention or MLP fusion.

Decoding per joint proceeds as follows:

for each joint j:
    gather its G conditioned tokens {tilde_z_{j,T_0}, ..., tilde_z_{j,T_{G-1}}}
    stack into Z_j ∈ ℝ^{G×d}
    pooled_j = Pool(Z_j)  # average or attentional pooling
    μ_j, σ_j = HeadMLP_j(pooled_j)
    predict vector hat_b_{j,T_k} ~ 𝒩(μ_j, σ_j)
assemble full action predictions hat_a_{t,j} from chunks

By this joint-factorized scheme, the MXT action head efficiently reconstructs per-joint, per-chunk predictions, repacking these into the final output sequence.
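Under the assumption of average pooling and a per-joint linear head (both stand-ins, since the paper leaves the head architecture open), the decoding loop can be made runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

J, G, g, d = 7, 4, 4, 32   # joints, chunks, chunk size, token dim

# Conditioned kinematic tokens tilde_z_{j,T_k} after L transformer layers.
Z = rng.normal(size=(J, G, d))

# Hypothetical per-joint head: linear map from pooled token to (mu, log_sigma).
W_head = rng.normal(size=(J, d, 2 * g)) * 0.1

preds = np.empty((G * g, J))               # hat_a_{t,j}, horizon H = G*g
for j in range(J):
    pooled = Z[j].mean(axis=0)             # average pooling over chunks
    out = pooled @ W_head[j]
    mu, log_sigma = out[:g], out[g:]
    for k in range(G):                     # sample hat_b_{j,T_k}
        b_hat = mu + np.exp(log_sigma) * rng.normal(size=g)
        preds[k * g:(k + 1) * g, j] = b_hat
print(preds.shape)                         # (16, 7) = (H, J)
```

Repacking the per-joint, per-chunk samples column-by-column reconstructs the full H×J action sequence the diffusion expert emits.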

6. Empirical Evaluation and Performance Gains

MXT demonstrates substantial quantitative improvements over the vanilla π₀.₅ VLA baseline across both single- and multi-embodiment language-conditioned pick-and-place simulation tasks:

| Configuration | Success Rate (SR%) | 95% CI |
|---|---|---|
| π₀.₅ (Panda, DROID subset) | 19.7 | ±4.5 |
| + Kinematic Tokens | 36.0 | ±5.4 |
| + Mix-Mask Topology | 36.9 | ±5.4 |
| + FiLM (no mask) | 37.7 | ±5.5 |
| Full MXT (KT + Mix-Mask + FiLM) | 47.4 | ±5.6 |
| π₀.₅ (Unitree G1 Dex1, 16-DoF) | 24.7 | ±4.9 |
| Full MXT (Dex1) | 28.0 | ±5.0 |
| π₀.₅ (Panda + SO101, 50k steps) | 5.0 | — |
| MXT (Panda + SO101, 50k steps) | 15.5 | — |
| π₀.₅ (Panda + SO101, 125k steps) | 17.5 | — |
| MXT (Panda + SO101, 125k steps) | 20.7 | — |

Across the tested regimes, MXT yields consistent improvements in success rate over the vanilla transformer baseline, reaching 2–3× relative gains in the single-arm and multi-embodiment settings, indicating improved within- and cross-embodiment generalization (Suzuki et al., 26 Feb 2026).

7. Significance, Modularity, and Relation to Prior Art

MXT provides a modular “morphology module” for any VLA transformer policy without altering upstream vision or language encodings. By introducing per-joint tokenization, explicit kinematic graph-based attention biasing, and FiLM-based joint attribute conditioning at the policy level, MXT enables robust transfer across diverse robotic embodiments.

Compared to approaches such as X-VLA, which employs embodiment-specific soft prompts for cross-embodiment adaptation in a parameter-efficient manner (Zheng et al., 11 Oct 2025), MXT instead structurally encodes kinematic and physical properties into the policy transformer itself. A plausible implication is that explicit graph-based and descriptor-based wiring—in contrast to soft prompt conditioning—may provide superior adaptation in domains where morphology strongly structures feasible action spaces.

MXT thereby constitutes a principled advance in the architecture of generalist robotic policies, systematically leveraging morphology both as inductive bias and as a guide for scalable transfer in cross-embodiment scenarios.
