Modularized Cross-embodiment Transformer (MXT)
- The paper introduces MXT as an embodiment-aware transformer that explicitly encodes robot morphology via kinematic tokenization, topology-aware attention, and FiLM-based joint conditioning.
- MXT improves cross-robot policy learning, delivering up to 2–3× higher success rates than vanilla VLA transformer approaches across single- and multi-embodiment scenarios.
- The architecture modularly retains standard vision and language encoders while adding specialized modules to enable scalable, morphology-driven policy transfer.
The Modularized Cross-embodiment Transformer (MXT) is an embodiment-aware transformer-based policy architecture for cross-robot policy learning, designed to address the inherent limitations of traditional vision-language-action (VLA) transformer models in generalizing across heterogeneous robot embodiments. By explicitly integrating robot morphology within the action-policy module, MXT enables robust policy transfer across disparate robotic platforms, achieving significant empirical improvements over vanilla approaches in both single- and multi-embodiment settings (Suzuki et al., 26 Feb 2026).
1. High-Level Architectural Framework
MXT retains the core structure of a standard VLA policy, specifically the π₀.₅ model, which utilizes separate image and language-prompt encoders to produce sequential “context” tokens. An autoregressive, diffusion-based action expert further generates a sequence of future joint-space commands.
MXT departs from baseline approaches by explicitly introducing robot morphology at the action-policy level via three specialized modules:
- Kinematic Tokens (KT): Factorize and compress the joint-action space into per-joint, temporally chunked tokens.
- Topology-Aware Attention Bias: Incorporate robot kinematic topology as an inductive bias in joint-related self-attention blocks.
- Joint-Attribute Conditioning (FiLM): Modulate joint token embeddings with affine transforms parameterized by each joint’s local physical descriptors.
Critically, all cross-modal attention mechanisms between vision, language, and action in the VLA backbone are preserved; morphology-specific customization is restricted to the joint-to-joint attention sub-block.
2. Kinematic Tokenization and Temporal Chunking
Given $N$ actuated joints, prediction horizon $H$, and $G$ temporal chunks of size $C = H/G$, MXT forms $N \cdot G$ per-joint, temporally compressed kinematic tokens. For each chunk $k \in \{0, \dots, G-1\}$, the temporal index set $T_k = \{kC, \dots, (k+1)C - 1\}$ partitions the future time steps.
For joint $j$, the sequence of raw future actions in chunk $k$ is:
$$a_{j,T_k} = (a_{t,j})_{t \in T_k} \in \mathbb{R}^{C},$$
where $a_{t,j}$ denotes the action for joint $j$ at time $t$.
The kinematic token embedding $z_{j,T_k} \in \mathbb{R}^{d}$ is produced via a lightweight MLP encoder (“Enc₀”):
$$z_{j,T_k} = \mathrm{Enc}_0(a_{j,T_k}).$$
The set of $N \cdot G$ kinematic tokens is inserted into the transformer’s overall token sequence. Optionally, additional capacity can be obtained by employing a separate auxiliary encoder $\mathrm{Enc}_k$ per chunk.
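The chunking and per-chunk encoding described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the two-layer ReLU encoder standing in for Enc₀ and all shapes and names are assumptions.

```python
import numpy as np

def chunk_indices(H, G):
    """Partition horizon H into G contiguous index sets T_0..T_{G-1} (assumes G divides H)."""
    C = H // G
    return [list(range(k * C, (k + 1) * C)) for k in range(G)]

def kinematic_tokens(actions, G, W1, b1, W2, b2):
    """actions: (H, N) future joint commands -> (N, G, d) per-joint chunk tokens.

    A tiny two-layer MLP (a stand-in for Enc0) maps each joint's C-step chunk
    to a d-dimensional kinematic token.
    """
    H, N = actions.shape
    chunks = chunk_indices(H, G)
    tokens = []
    for j in range(N):                      # one token stream per joint
        per_chunk = []
        for T_k in chunks:
            a = actions[T_k, j]             # raw chunk a_{j,T_k} in R^C
            h = np.maximum(W1 @ a + b1, 0)  # hidden layer with ReLU
            per_chunk.append(W2 @ h + b2)   # token z_{j,T_k} in R^d
        tokens.append(np.stack(per_chunk))
    return np.stack(tokens)                 # (N, G, d)
```

With a horizon of 8 steps and 4 chunks, each joint contributes 4 tokens summarizing 2 future steps each.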
3. Topology-Aware Joint Attention Bias
The interaction across joint tokens is explicitly modulated by the robot’s kinematic topology, formalized as an undirected graph $\mathcal{G} = (V, E)$ with $|V| = N$ joints. Several attention biasing mechanisms are implemented:
- Adjacency Indicator: $A_{ij} = 1$ if $(i,j) \in E$ or $i = j$, else $A_{ij} = 0$.
- Hard-Mask (Full-Mask): At each transformer layer $\ell$, set $B^{(\ell)}_{ij} = 0$ if $A_{ij} = 1$, otherwise $B^{(\ell)}_{ij} = -\infty$ (restricts message passing to 1-hop neighbors or the node itself).
- Mix-Mask: Alternates between Full-Mask (even $\ell$) and unrestricted attention (odd $\ell$).
- Soft-Mask (SPD-Bias): Utilizes the shortest-path distance $\mathrm{SPD}(i,j)$ in $\mathcal{G}$ to select from a learnable bias table $b[\cdot]$, resulting in $B^{(\ell)}_{ij} = b[\mathrm{SPD}(i,j)]$.
The modified self-attention for joint tokens at layer $\ell$ is given by:
$$\mathrm{Attn}^{(\ell)}_{ij} = \operatorname{softmax}_j\!\left(\frac{q_i^{\top} k_j}{\sqrt{d}} + B^{(\ell)}_{ij}\right),$$
where $q_i$ and $k_j$ are the query and key projections, respectively.
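The bias constructions above can be sketched concretely. The following NumPy snippet builds the hard mask and the SPD bias from a joint adjacency structure and applies an additive bias inside softmax attention; the function names and the BFS-based distance computation are illustrative assumptions, not the paper's code.

```python
import numpy as np
from collections import deque

def shortest_path_dists(adj):
    """All-pairs shortest-path distances on the unweighted joint graph via BFS."""
    N = len(adj)
    D = np.full((N, N), N, dtype=int)  # N acts as an "unreachable" sentinel
    for s in range(N):
        D[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(N):
                if adj[u][v] and D[s, v] > D[s, u] + 1:
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D

def full_mask_bias(adj):
    """Hard mask: bias 0 for self or 1-hop neighbors, -inf elsewhere."""
    A = np.array(adj, dtype=bool) | np.eye(len(adj), dtype=bool)
    return np.where(A, 0.0, -np.inf)

def spd_bias(adj, table):
    """Soft mask: learnable scalar per shortest-path distance, B_ij = table[SPD(i,j)]."""
    D = shortest_path_dists(adj)
    return table[np.minimum(D, len(table) - 1)]  # clip distances to table length

def biased_attention(Q, K, B):
    """Row-wise softmax of (Q K^T)/sqrt(d) + B; -inf bias zeroes an attention weight."""
    logits = Q @ K.T / np.sqrt(Q.shape[1]) + B
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

On a three-joint kinematic chain 0–1–2, the hard mask forbids joint 0 from attending to joint 2, while the SPD bias merely down-weights that pair.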
4. Joint-Attribute Conditioning via FiLM
Each joint $j$ is represented by a feature vector $f_j$, which may include categorical joint types, axis orientation, limit parameters, and physical attributes such as (log-)damping, friction, and (log-)stiffness. A FiLM (Feature-wise Linear Modulation) network computes a pair of affine parameters per joint:
$$(\gamma_j, \beta_j) = \mathrm{FiLM}(f_j),$$
which are applied to the per-joint token embedding:
$$\tilde{z}_{j,T_k} = \gamma_j \odot z_{j,T_k} + \beta_j,$$
where $\odot$ denotes element-wise multiplication.
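This conditioning step is a one-liner in practice. The sketch below uses a single linear layer to produce the scale and shift from joint descriptors (the linear FiLM generator and all shapes are assumptions; the paper's network may be deeper):

```python
import numpy as np

def film_condition(z, f, Wg, bg, Wb, bb):
    """FiLM: per-joint affine modulation of kinematic token embeddings.

    z: (N, G, d) kinematic tokens; f: (N, p) joint descriptors
    (e.g. joint-type one-hot, axis, limits, log-damping/stiffness).
    """
    gamma = f @ Wg.T + bg  # (N, d) per-joint scale
    beta = f @ Wb.T + bb   # (N, d) per-joint shift
    # Broadcast the same (gamma_j, beta_j) across all G chunk tokens of joint j.
    return gamma[:, None, :] * z + beta[:, None, :]
```

Because the same affine pair is broadcast across a joint's chunk tokens, the modulation encodes the joint's physical identity rather than its temporal content.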
5. Action Head and Joint-Factorized Decoding
After $L$ transformer layers, MXT maintains two sets of updated embeddings: time-coupled action tokens and joint-factorized kinematic tokens. The original diffusion-expert head from π₀.₅ is preserved, but its output is now conditioned on these morphology-aware representations via cross-attention or MLP fusion.
Decoding per joint proceeds as follows:
```
for each joint j:
    gather its G conditioned tokens {tilde_z_{j,T_0}, ..., tilde_z_{j,T_{G-1}}}
    stack into Z_j ∈ ℝ^{G×d}
    pooled_j = Pool(Z_j)                 # average or attentional pooling
    μ_j, σ_j = HeadMLP_j(pooled_j)
    predict vector hat_b_{j,T_k} ~ 𝒩(μ_j, σ_j)
    assemble full action predictions hat_a_{t,j} from chunks
```
By this joint-factorized scheme, the MXT action head efficiently reconstructs per-joint, per-chunk predictions, repacking these into the final output sequence.
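The repacking step can be sketched as follows. This is a deliberately deterministic stand-in for the diffusion head: a single shared linear head (an assumption; the paper uses per-joint heads feeding the diffusion expert) maps each conditioned chunk token back to its C future steps, and the chunks are concatenated along time.

```python
import numpy as np

def decode_actions(z_tilde, W, b):
    """(N, G, d) conditioned tokens -> (H, N) action predictions, H = G*C.

    Each chunk token tilde_z_{j,T_k} is decoded to the C actions of chunk k
    for joint j; per-joint chunks are tiled along time to rebuild the horizon.
    """
    N, G, d = z_tilde.shape
    C = W.shape[0]                        # steps recovered per chunk token
    out = np.empty((G * C, N))
    for j in range(N):
        for k in range(G):
            out[k * C:(k + 1) * C, j] = W @ z_tilde[j, k] + b
    return out
```

The double loop makes the joint-factorized structure explicit; in a real implementation it would be a single batched matrix multiply.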
6. Empirical Evaluation and Performance Gains
MXT demonstrates substantial quantitative improvements over the vanilla VLA baseline across both single- and multi-embodiment language-conditioned pick-and-place simulation tasks:
| Configuration | Success Rate (SR%) | 95% CI |
|---|---|---|
| π₀.₅ (Panda, DROID subset) | 19.7 | ±4.5 |
| + Kinematic Tokens | 36.0 | ±5.4 |
| + Mix-Mask Topology | 36.9 | ±5.4 |
| + FiLM (no mask) | 37.7 | ±5.5 |
| Full MXT (KT + Mix-Mask + FiLM) | 47.4 | ±5.6 |
| π₀.₅ (Unitree G1 Dex1, 16-DoF) | 24.7 | ±4.9 |
| Full MXT (Dex1) | 28.0 | ±5.0 |
| π₀.₅ (Panda + SO101, 50k steps) | 5.0 | — |
| MXT (Panda + SO101, 50k) | 15.5 | — |
| π₀.₅ (Panda + SO101, 125k) | 17.5 | — |
| MXT (Panda + SO101, 125k) | 20.7 | — |
MXT improves success rate over the vanilla transformer in every tested regime, with relative gains reaching 2–3× in the Panda and multi-embodiment settings, indicating improved within- and cross-embodiment generalization (Suzuki et al., 26 Feb 2026).
7. Significance, Modularity, and Relation to Prior Art
MXT provides a modular “morphology module” for any VLA transformer policy without altering upstream vision or language encodings. By introducing per-joint tokenization, explicit kinematic graph-based attention biasing, and FiLM-based joint attribute conditioning at the policy level, MXT enables robust transfer across diverse robotic embodiments.
Compared to approaches such as X-VLA, which employs embodiment-specific soft prompts for cross-embodiment adaptation in a parameter-efficient manner (Zheng et al., 11 Oct 2025), MXT instead structurally encodes kinematic and physical properties into the policy transformer itself. A plausible implication is that explicit graph-based and descriptor-based wiring—in contrast to soft prompt conditioning—may provide superior adaptation in domains where morphology strongly structures feasible action spaces.
MXT thereby constitutes a principled advance in the architecture of generalist robotic policies, systematically leveraging morphology both as inductive bias and as a guide for scalable transfer in cross-embodiment scenarios.