
Unified Vision-Motion Codes (UVMC)

Updated 9 November 2025
  • Unified Vision-Motion Codes (UVMC) are discrete, multimodal representations that bridge high-dimensional visual inputs and robotic actions in VLA systems.
  • They employ a dual-branch VQ-VAE with a shared codebook to simultaneously encode visual dynamics and robotic motion for robust policy learning.
  • The approach leverages a three-stage training paradigm, enhancing cross-embodiment transfer and few-shot adaptation in diverse robotic tasks.

Unified Vision-Motion Codes (UVMC) are a class of discrete, embodiment-agnostic latent representations designed to serve as an intermediate information bottleneck between perceptual observations and low-level motor actions, particularly in vision-language-action (VLA) robotic systems. UVMCs, as introduced in the XR-1 framework, are learned with a dual-branch Vector-Quantized Variational Autoencoder (VQ-VAE) and act as a modality-bridging substrate accommodating both visual dynamics and robotic motion within a shared discretized codebook. This mechanism addresses major challenges in robotics—namely, translating high-dimensional sensory data into precise actions and enabling cross-domain learning from heterogeneous multimodal datasets that include diverse robot architectures and human demonstrations (Fan et al., 4 Nov 2025).

1. Dual-Branch VQ-VAE Architecture

The UVMC is realized via a two-headed VQ-VAE comprising parallel encoders and decoders, both governed by a shared codebook $E \in \mathbb{R}^{256 \times 256}$. The architectural split is as follows:

  • Visual Dynamics Branch:
    • Encoder $E_{\mathrm{vis}}(o_t, o_{t+h})$: Processes paired RGB frames through a SigLIP backbone (~400M parameters) followed by a 4-layer ViT “dynamics” transformer (~32M parameters), resulting in 13 latent vectors $z_{\mathrm{vis}} \in \mathbb{R}^{13 \times 256}$.
    • Decoder $D_{\mathrm{vis}}(o_t, z_{\mathrm{vis}}^q)$: A 12-layer ViT (~94M parameters) reconstructs the future frame $\hat{o}_{t+h}$.
  • Robotic Motion Branch:
    • Encoder $E_{\mathrm{mo}}(a_{t:t+h}, m_{t:t+h})$: Utilizes cascaded causal 1D strided convolutions (akin to WaveNet) with an 8-layer transformer (~34M parameters) on a sequence of $h = 50$ steps of actions and proprioception, yielding $z_{\mathrm{mo}} \in \mathbb{R}^{13 \times 256}$.
    • Decoder $D_{\mathrm{mo}}(z_{\mathrm{mo}}^q, \ell, \mathrm{obs})$: A 300M-parameter autoregressive Gemma transformer, conditioned on quantized motion codes, language tokens, and the current observation, reconstructs the action sequence $\hat{a}_{t:t+h}$.

Both encoder streams output discretized latents via nearest-neighbor lookups in the shared codebook. Specifically, for each $z_e$ ($e \in \{\mathrm{vis}, \mathrm{mo}\}$), the quantized code is $z_e^q = S(z_e) = E_j$, where $j = \arg\min_i \|z_e - E_i\|_2$. The concatenation of $z_{\mathrm{vis}}^q$ and $z_{\mathrm{mo}}^q$ forms the UVMC tokens $z_{\mathrm{uvmc}}^q \in \mathbb{R}^{26 \times 256}$.
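
A minimal PyTorch sketch of this shared-codebook quantization step follows; the tensor shapes mirror the dimensions above, but the helper name `quantize` and the straight-through gradient trick are illustrative assumptions rather than details taken from the paper.

```python
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbor lookup in a shared codebook with a straight-through gradient.

    z_e:      (num_tokens, dim) continuous encoder outputs, e.g. (13, 256)
    codebook: (codebook_size, dim) shared embedding table E, e.g. (256, 256)
    """
    dists = torch.cdist(z_e, codebook)   # pairwise L2 distances, (num_tokens, codebook_size)
    idx = dists.argmin(dim=-1)           # j = argmin_i ||z_e - E_i||_2
    z_q = codebook[idx]                  # quantized codes E_j
    return z_e + (z_q - z_e).detach()    # straight-through estimator for encoder gradients

# Toy usage: both branches share one codebook; concatenation gives the UVMC tokens.
E = torch.randn(256, 256)                # shared codebook
z_vis = torch.randn(13, 256)             # visual-dynamics encoder output
z_mo = torch.randn(13, 256)              # motion encoder output
z_uvmc = torch.cat([quantize(z_vis, E), quantize(z_mo, E)], dim=0)  # (26, 256)
```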

2. VQ-VAE Training Objectives and Alignment

The loss structure for each branch mirrors the standard VQ-VAE objective, comprising reconstruction, codebook, and commitment losses (all using the $L_2$ norm), with commitment weight $\beta = 0.25$. For $e \in \{\mathrm{vis}, \mathrm{mo}\}$:

$$L_{\text{VQ-VAE}}^{(e)} = \|x^{(e)} - \hat{x}^{(e)}\|_2^2 + \|\mathrm{sg}[z_e] - z_e^q\|_2^2 + \beta\,\|z_e - \mathrm{sg}[z_e^q]\|_2^2$$

where $x^{(\mathrm{vis})} = o_{t+h}$ and $x^{(\mathrm{mo})} = a_{t:t+h}$.

To promote semantic alignment, a cross-modality KL divergence loss ties the posterior distributions over visual and motion codes:

$$L_{\text{align}} = D_{KL}\big(q(z_{\mathrm{mo}})\,\|\,q(z_{\mathrm{vis}})\big)$$

The combined loss on mixed (robot) data is $L_{\text{stage1}}^{\text{robot}} = L_{\text{VQ-VAE}}^{(\mathrm{vis})} + L_{\text{VQ-VAE}}^{(\mathrm{mo})} + L_{\text{align}}$; for human-only videos the objective omits the motion and alignment terms.
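
A minimal sketch of these objectives in PyTorch is given below; parameterizing the code posteriors as softmax distributions over per-token codebook scores is an assumption of this sketch, not a detail specified by the paper.

```python
import torch
import torch.nn.functional as F

def vqvae_loss(x, x_hat, z_e, z_q, beta: float = 0.25):
    """Standard VQ-VAE terms for one branch: reconstruction + codebook + commitment."""
    recon = F.mse_loss(x_hat, x)                    # ||x - x_hat||_2^2
    codebook_term = F.mse_loss(z_q, z_e.detach())   # ||sg[z_e] - z_q||_2^2
    commitment = F.mse_loss(z_e, z_q.detach())      # ||z_e - sg[z_q]||_2^2
    return recon + codebook_term + beta * commitment

def alignment_loss(logits_mo, logits_vis):
    """Cross-modality term D_KL(q(z_mo) || q(z_vis)) over codebook indices.

    logits_*: (num_tokens, codebook_size) scores over codebook entries,
    e.g. negative distances to each codebook vector (an assumption here).
    """
    log_q_vis = F.log_softmax(logits_vis, dim=-1)
    q_mo = F.softmax(logits_mo, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) with `input` in log space.
    return F.kl_div(log_q_vis, q_mo, reduction="batchmean")
```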

3. UVMC as a Multimodal Policy Bottleneck

UVMC tokens serve as a discrete, temporally coherent intermediary unifying high-dimensional observation and low-level action spaces. Each time step combines 13 visual-dynamics codes and 13 motion codes, yielding $z_{\mathrm{uvmc}}^q$. This structure:

  • Summarizes pixel-level dynamic changes and the agent’s internal intent.
  • Binds video-only (human demonstration) data and robotic action trajectories to a single representational space.
  • Functions as a shared latent that links vision, language, and action for downstream policy learning.

During policy training, UVMC tokens are injected into the input stream, training the policy to predict or consume these codes as an embodiment-agnostic state descriptor.

4. Three-Stage Training Paradigm

XR-1’s utilization of UVMC follows a structured three-stage paradigm:

  1. Self-Supervised UVMC Learning:
    • Scale: 1.26M episodes (~110M frames) from Open-X (40%), RoboMIND (15%), XR-D (35%), Ego4D (10%), and other sources.
    • Joint VQ-VAE/codebook learning using both robot and human video data.
    • ~0.9B parameters, batch size 960, 275K steps, learning rate $1 \times 10^{-4}$, 38.4K GPU-hours on 80 A100s.
  2. UVMC-Guided Pretraining on Cross-Embodiment Data:
    • Uses VLA policy backbone $F$ (e.g., Gemma 2.6B for XR-1, Florence-2 230M for XR-1-Light).
    • Inserts [ZVIS] and [ZMO] tokens; $F$ predicts $z_{\mathrm{uvmc}}^q$ from the language instruction $\ell$ and observations.
    • Joint loss: $L_{\mathrm{uvmc}} = \|F(\ell, \mathrm{obs}; [\mathrm{ZVIS}], [\mathrm{ZMO}]) - z_{\mathrm{uvmc}}^q\|_2^2$, plus the generative action-head loss $L_{\mathrm{act}}$ (see the sketch below).
    • ~4B parameters, batch size 640, 300K steps, 38.4K GPU-hours.
  3. Task-Specific Post-Training:
    • Fine-tunes the policy for each robot and task on expert rollouts (20 tasks × 20 rollouts/task, 576 GPU-hours on 8 A100s).
    • The loss focuses solely on $L_{\mathrm{act}}$ during fine-tuning.

This progression enables scalable learning across tasks, robots, and data modalities while allowing specialization with minimal additional data for new embodiments.
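
The stage-2 objective can be sketched as follows; the `policy` interface returning both predicted UVMC tokens and actions, and the MSE placeholder standing in for the generative action-head loss, are assumptions of this sketch rather than the paper's actual implementation.

```python
import torch.nn.functional as F

def stage2_loss(policy, language, obs, z_uvmc_q, expert_actions):
    """UVMC-guided pretraining: predict quantized UVMC tokens and expert actions.

    `policy` is assumed to consume language tokens, observations, and the learned
    [ZVIS]/[ZMO] query tokens, and to return (uvmc_pred, action_pred).
    """
    uvmc_pred, action_pred = policy(language, obs)
    l_uvmc = F.mse_loss(uvmc_pred, z_uvmc_q)         # regress the quantized UVMC tokens
    l_act = F.mse_loss(action_pred, expert_actions)  # placeholder for the generative action-head loss
    return l_uvmc + l_act
```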

5. Critical Hyperparameters and Design Considerations

Key implementation details for UVMC and XR-1:

| Parameter | Value | Notes |
|---|---|---|
| Codebook size ($d$) | 256 | Shared across modalities |
| Embedding dim ($f$) | 256 | Dimensionality of each code |
| Tokens per branch | 13 | Total: 26 UVMC tokens |
| Sequence length ($h$) | 50 | For motion encoding |
| Commitment weight ($\beta$) | 0.25 | VQ-VAE loss parameter |
| Vision encoder | SigLIP + 4-layer ViT | ~400M parameters |
| Vision decoder | 12-layer ViT | ~94M parameters |
| Motion encoder | Strided conv + 8-layer transformer | ~34M parameters |
| Motion decoder | Autoregressive Gemma | ~300M parameters |
| Policy backbone | PaliGemma (XR-1), Florence-2 (XR-1-Light) | Model-agnostic design |

Design decisions, such as discrete codebook size and shared latent vocabulary, are specifically aimed at reducing overfitting to appearance or embodiment particulars and facilitating cross-domain transfer.
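
For reference, these hyperparameters can be collected into a single configuration object; the field names below are illustrative and not taken from the paper's codebase.

```python
from dataclasses import dataclass

@dataclass
class UVMCConfig:
    codebook_size: int = 256       # shared codebook entries across both branches
    embedding_dim: int = 256       # dimensionality of each code vector
    tokens_per_branch: int = 13    # 13 visual + 13 motion = 26 UVMC tokens
    horizon: int = 50              # action/proprioception sequence length h
    commitment_beta: float = 0.25  # VQ-VAE commitment weight
```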

6. Empirical Performance and Validation

XR-1, utilizing UVMC, yields improvements across all major evaluation axes in real-world robotic settings. Metrics are reported over more than 14,000 rollouts, six robot embodiments, and more than 120 manipulation tasks.

  • Action Precision & Multi-Task Success:
    • Dual-Arm UR-5e: 72.0% (XR-1 avg) vs 62.0% ($\pi_{0.5}$), 42.8% ($\pi_0$)
    • Tien Kung 2.0 (unseen): 72.0% vs 41.0% ($\pi_{0.5}$), 40.8% ($\pi_0$)
    • Tien Kung 1.0: 68.0% vs 45.5% ($\pi_{0.5}$)
    • Dual-Arm Franka: 73.5% vs 41.0% ($\pi_{0.5}$)
    • AgileX Cobot Magic V2.0: 60.0% vs 41.3% ($\pi_{0.5}$)
    • Single-Arm UR-5e: 75.3% vs 67.5% ($\pi_{0.5}$)
  • Cross-Embodiment Transfer:
    • Out-of-the-box on 7 held-out tasks (≈0.9% of XR-D): XR-1-oob achieves 50–80% success, comparable to GR00T-N1.5 and surpassing UniVLA and RDT.
  • Few-Shot Adaptation:
    • 15 novel tasks × 20 trajectories: XR-1 achieves ~70–80% success (vs ACT ~45% and DiffusionPolicy ~40%).
  • Robustness:
    • DFR-SweepTrash (novel rubbish): 65% (XR-1) vs 15% ($\pi_0$)
    • DFR-SweepTrash (dynamic distractors): 55% vs 5% ($\pi_0$)
    • DFR-HangCup (background shifts): 55% vs 30% ($\pi_0$)
  • Lightweight Variant (XR-1-Light):
    • w/o UVMC finetune: 42.5% avg; with UVMC finetune: 57.5% (↑ 15pp)

These results indicate that UVMC acts as a stabilizing, generalizable bottleneck, promoting robust multi-task learning, resilience to visual and operational domain shifts, and data-efficient fine-tuning on new robotic platforms.

7. Context, Significance, and Limitations

UVMCs, as instantiated in XR-1, systematically address long-standing VLA bottlenecks by providing a discrete, temporally resolved, codebook-aligned state abstraction spanning both vision and motion modalities. Their capacity to handle domain heterogeneity and to support both generalist and specialist policy learning is corroborated by large-scale, multi-embodiment evaluation.

A plausible implication is that the UVMC approach could offer superior architectural inductive bias for generalizable robot learning in settings with limited or imbalanced action-annotated data, especially for few-shot adaptation. Additionally, the ability to align human-only and robot-embodied trajectories in a joint embedding space significantly broadens the practical reach of large-scale, multimodal datasets for robotics.

A limitation, as suggested by the reported model scales and GPU-hour requirements, is the substantial upfront computational cost for full-stack UVMC learning, although this is partially addressed by the availability of lightweight variants (e.g., XR-1-Light with Florence-2). The explicit dependence on carefully balanced and aligned dual-branch objectives further indicates a need for continued ablation and sensitivity analysis.

The introduction of UVMC thus marks a significant advance toward scalable, generalizable, and data-efficient learning across diverse VLA domains, and a robust step forward for unified robotics representation learning (Fan et al., 4 Nov 2025).
