Unified Vision-Motion Codes (UVMC)
- Unified Vision-Motion Codes (UVMC) are discrete, multimodal representations that bridge high-dimensional visual inputs and robotic actions in VLA systems.
- They employ a dual-branch VQ-VAE with a shared codebook to simultaneously encode visual dynamics and robotic motion for robust policy learning.
- The approach leverages a three-stage training paradigm, enhancing cross-embodiment transfer and few-shot adaptation in diverse robotic tasks.
Unified Vision-Motion Codes (UVMC) are a class of discrete, embodiment-agnostic latent representations designed to serve as an intermediate information bottleneck between perceptual observations and low-level motor actions, particularly in vision-language-action (VLA) robotic systems. UVMCs, as introduced in the XR-1 framework, are learned with a dual-branch Vector-Quantized Variational Autoencoder (VQ-VAE) and act as a modality-bridging substrate that encodes both visual dynamics and robotic motion in a shared discrete codebook. This mechanism addresses two major challenges in robotics: translating high-dimensional sensory data into precise actions, and enabling cross-domain learning from heterogeneous multimodal datasets that span diverse robot architectures and human demonstrations (Fan et al., 4 Nov 2025).
1. Dual-Branch VQ-VAE Architecture
The UVMC is realized via a two-headed VQ-VAE comprising parallel encoders and decoders, both governed by a shared codebook. The architectural split is as follows:
- Visual Dynamics Branch:
- Encoder: Processes paired RGB frames through a SigLIP backbone (~400M parameters) followed by a 4-layer ViT “dynamics” transformer (~32M parameters), producing 13 latent vectors.
- Decoder: A 12-layer ViT (~94M parameters) reconstructs the future frame.
- Robotic Motion Branch:
- Encoder: Applies cascaded causal 1D strided convolutions (akin to WaveNet) followed by an 8-layer transformer (~34M parameters) to a sequence of 50 steps of actions and proprioception, yielding 13 motion latent vectors.
- Decoder: A 300M-parameter autoregressive Gemma transformer, conditioned on the quantized motion codes, language tokens, and current observation, reconstructs the action sequence.
Both encoder streams output discretized latents via nearest-neighbor lookups in the shared codebook $\mathcal{C} = \{e_1, \dots, e_K\}$. Specifically, for each continuous latent $z_i$, the quantized code is $\hat{z}_i = e_{k_i}$, where $k_i = \arg\min_j \lVert z_i - e_j \rVert_2$. The concatenation of the quantized visual-dynamics codes and motion codes forms the UVMC tokens.
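The shared-codebook lookup can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released XR-1 code: class and variable names are invented, and the straight-through gradient estimator is the standard VQ-VAE convention assumed here.

```python
import torch
import torch.nn as nn


class SharedCodebookQuantizer(nn.Module):
    """Nearest-neighbor quantization against one codebook shared by both branches."""

    def __init__(self, codebook_size: int = 256, embed_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)

    def forward(self, z):
        # z: (batch, num_tokens, embed_dim) continuous encoder outputs.
        flat = z.reshape(-1, z.size(-1))
        # Squared L2 distance from every latent to every codebook entry.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2.0 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        indices = dists.argmin(dim=-1).view(z.shape[:-1])   # nearest entry per token
        z_q = self.codebook(indices)                        # quantized codes
        # Straight-through estimator: gradients flow back to the encoder output.
        z_q = z + (z_q - z).detach()
        return z_q, indices


quantizer = SharedCodebookQuantizer()
z_vis = torch.randn(2, 13, 256)   # 13 visual-dynamics latents per time step
z_mot = torch.randn(2, 13, 256)   # 13 motion latents per time step
zq_vis, _ = quantizer(z_vis)
zq_mot, _ = quantizer(z_mot)
uvmc_tokens = torch.cat([zq_vis, zq_mot], dim=1)  # 26 UVMC tokens per step
```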
2. VQ-VAE Training Objectives and Alignment
The loss structure for each branch mirrors the standard VQ-VAE objective, comprising reconstruction, codebook, and commitment losses (all using the $\ell_2$ norm), weighted by a commitment coefficient $\beta$. For a branch with input $x$, reconstruction $\hat{x}$, continuous encoder output $z$, and quantized code $\hat{z}$:

$$\mathcal{L}_{\text{branch}} = \lVert x - \hat{x} \rVert_2^2 + \lVert \mathrm{sg}[z] - \hat{z} \rVert_2^2 + \beta\,\lVert z - \mathrm{sg}[\hat{z}] \rVert_2^2,$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta = 0.25$.
To promote semantic alignment, a cross-modality KL divergence loss ties the posterior distributions over visual and motion codes:

$$\mathcal{L}_{\text{align}} = D_{\mathrm{KL}}\!\left(q(\hat{z}^{\text{vis}}) \,\Vert\, q(\hat{z}^{\text{mot}})\right).$$

The combined loss on mixed (robot) data is $\mathcal{L}_{\text{UVMC}} = \mathcal{L}_{\text{vis}} + \mathcal{L}_{\text{mot}} + \mathcal{L}_{\text{align}}$; for human-only videos the objective omits the motion and alignment terms.
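A compact sketch of how these objectives could be computed is given below. The per-branch terms follow the standard VQ-VAE formulation quoted above; the soft code-assignment parameterization and the KL direction in `alignment_loss` are assumptions, since the exact form used in XR-1 is not reproduced here.

```python
import torch.nn.functional as F


def vq_branch_loss(x, x_recon, z_e, z_q, beta: float = 0.25):
    """Reconstruction + codebook + commitment terms for one VQ-VAE branch."""
    recon = F.mse_loss(x_recon, x)              # pixel or action reconstruction
    codebook = F.mse_loss(z_q, z_e.detach())    # pull codebook entries toward encoder outputs
    commit = F.mse_loss(z_e, z_q.detach())      # keep encoder outputs near their chosen codes
    return recon + codebook + beta * commit


def alignment_loss(code_logits_vis, code_logits_mot):
    """KL divergence tying the soft code assignments of the two branches (direction assumed)."""
    log_p_vis = F.log_softmax(code_logits_vis, dim=-1)
    p_mot = F.softmax(code_logits_mot, dim=-1)
    return F.kl_div(log_p_vis, p_mot, reduction="batchmean")


# On robot data: total = L_vis + L_mot + L_align; on human-only video: total = L_vis.
```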
3. UVMC as a Multimodal Policy Bottleneck
UVMC tokens serve as a discrete, temporally coherent intermediary unifying high-dimensional observation and low-level action spaces. Each time step combines 13 visual-dynamics codes and 13 motion codes, resulting in 26 discrete UVMC tokens per step. This structure:
- Summarizes pixel-level dynamic changes and the agent’s internal intent.
- Binds video-only (human demonstration) data and robotic action trajectories to a single representational space.
- Functions as a shared latent that links vision, language, and action for downstream policy learning.
During policy training, UVMC tokens are injected into the input stream, aligning the policy to predict or utilize these codes as an embodiment-agnostic state descriptor.
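A hedged sketch of this injection step appears below: quantized codes are projected into the backbone's token space and spliced into the language/observation sequence behind learned delimiter embeddings. The projection layer, the `model_dim` value, and the exact placement of the [ZVIS]/[ZMO] markers are assumptions for illustration, not confirmed implementation details.

```python
import torch
import torch.nn as nn


class UVMCPolicyInput(nn.Module):
    """Splices quantized UVMC codes into a policy backbone's input token stream."""

    def __init__(self, code_dim: int = 256, model_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(code_dim, model_dim)               # map codes into the token space
        self.zvis = nn.Parameter(torch.zeros(1, 1, model_dim))   # learned [ZVIS] delimiter
        self.zmo = nn.Parameter(torch.zeros(1, 1, model_dim))    # learned [ZMO] delimiter

    def forward(self, lang_obs_tokens, zq_vis, zq_mot):
        # lang_obs_tokens: (batch, seq, model_dim) language + observation embeddings
        # zq_vis, zq_mot:  (batch, 13, code_dim) quantized UVMC codes
        b = lang_obs_tokens.size(0)
        return torch.cat([
            lang_obs_tokens,
            self.zvis.expand(b, -1, -1), self.proj(zq_vis),
            self.zmo.expand(b, -1, -1), self.proj(zq_mot),
        ], dim=1)
```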
4. Three-Stage Training Paradigm
XR-1’s utilization of UVMC follows a structured three-stage paradigm:
- Self-Supervised UVMC Learning:
- Scale: 1.26M episodes (~110M frames) from Open-X (40%), RoboMIND (15%), XR-D (35%), Ego4D (10%), and other sources.
- Joint VQ-VAE/codebook learning using both robot and human video data.
- ~0.9B parameters, batch size 960, 275K steps, 38.4K GPU-hours on 80 A100s.
- UVMC-Guided Pretraining on Cross-Embodiment Data:
- Uses a VLA policy backbone (e.g., Gemma 2.6B for XR-1, Florence-2 230M for XR-1-Light).
- Inserts [ZVIS] and [ZMO] tokens; the policy predicts the UVMC tokens from the language instruction and current observations.
- Joint loss: the UVMC-prediction objective plus a generative action-head loss.
- ~4B parameters, batch size 640, 300K steps, 38.4K GPU-hours.
- Task-Specific Post-Training:
- Fine-tunes the policy for each robot and task on expert rollouts (20 tasks × 20 rollouts per task, 576 GPU-hours on 8 A100s).
- The fine-tuning loss focuses solely on the action-head objective.
This progression enables scalable learning across tasks, robots, and data modalities while allowing specialization with minimal additional data for new embodiments.
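For quick reference, the schedule above can be summarized as structured data; only figures quoted in this section are included, and the field names are illustrative rather than taken from the XR-1 codebase.

```python
XR1_TRAINING_STAGES = [
    {
        "name": "self_supervised_uvmc_learning",
        "data": ["Open-X", "RoboMIND", "XR-D", "Ego4D"],   # robot + human video
        "params": "~0.9B", "batch_size": 960, "steps": 275_000,
        "objectives": ["vis_vqvae", "motion_vqvae", "cross_modal_alignment"],
    },
    {
        "name": "uvmc_guided_pretraining",
        "backbone": "Gemma 2.6B (XR-1) / Florence-2 230M (XR-1-Light)",
        "params": "~4B", "batch_size": 640, "steps": 300_000,
        "objectives": ["uvmc_prediction", "action_head"],
    },
    {
        "name": "task_specific_post_training",
        "data": "expert rollouts (20 tasks x 20 rollouts per task)",
        "objectives": ["action_head"],
    },
]
```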
5. Critical Hyperparameters and Design Considerations
Key implementation details for UVMC and XR-1:
| Parameter | Value | Notes |
|---|---|---|
| Codebook size | 256 | Shared across modalities |
| Embedding dim | 256 | Per code |
| Tokens per branch | 13 | Total: 26 UVMC tokens |
| Motion sequence length | 50 | Action/proprioception steps |
| Commitment weight ($\beta$) | 0.25 | VQ-VAE loss parameter |
| Vision encoder | SigLIP + 4-layer ViT | ~400M (SigLIP) + ~32M (ViT) parameters |
| Vision decoder | 12-layer ViT | ~94M parameters |
| Motion encoder | Strided causal conv + 8-layer transformer | ~34M parameters |
| Motion decoder | Autoregressive Gemma | 300M parameters |
| Policy backbone | PaliGemma (XR-1), Florence-2 (XR-1-Light) | Model-agnostic design |
Design decisions, such as discrete codebook size and shared latent vocabulary, are specifically aimed at reducing overfitting to appearance or embodiment particulars and facilitating cross-domain transfer.
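The UVMC-specific values from the table can be gathered into a single configuration object for convenience; this is a reference sketch, and the field names are not taken from the XR-1 codebase.

```python
from dataclasses import dataclass


@dataclass
class UVMCConfig:
    codebook_size: int = 256        # shared across modalities
    code_dim: int = 256             # embedding dimension per code
    tokens_per_branch: int = 13     # 26 UVMC tokens per time step in total
    motion_seq_len: int = 50        # action/proprioception steps per encoded chunk
    commitment_beta: float = 0.25   # VQ-VAE commitment weight
```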
6. Empirical Performance and Validation
XR-1, utilizing UVMC, yields improvements across all major evaluation axes in real-world robotic settings. Metrics are reported over 14,000 rollouts, six robot embodiments, and more than 120 manipulation tasks.
- Action Precision & Multi-Task Success:
- Dual-Arm UR-5e: 72.0% average success (XR-1) vs. 62.0% and 42.8% for baseline policies
- Tien Kung 2.0 (unseen): 72.0% vs. 41.0% and 40.8% for baselines
- Tien Kung 1.0: 68.0% vs. 45.5% (baseline)
- Dual-Arm Franka: 73.5% vs. 41.0% (baseline)
- AgileX Cobot Magic V2.0: 60.0% vs. 41.3% (baseline)
- Single-Arm UR-5e: 75.3% vs. 67.5% (baseline)
- Cross-Embodiment Transfer:
- Out-of-the-box evaluation on 7 held-out tasks (0.9% of XR-D): XR-1-oob achieves 50–80% success, comparable to GR00T-N1.5 and surpassing UniVLA and RDT.
- Few-Shot Adaptation:
- 15 novel tasks × 20 trajectories each: XR-1 achieves 70–80% success (vs. 45% for ACT and 40% for DiffusionPolicy).
- Robustness:
- DFR-SweepTrash (novel rubbish): 65% (XR-1) vs. 15% (baseline)
- DFR-SweepTrash (dynamic distractors): 55% vs. 5% (baseline)
- DFR-HangCup (background shifts): 55% vs. 30% (baseline)
- Lightweight Variant (XR-1-Light):
- Without UVMC fine-tuning: 42.5% average; with UVMC fine-tuning: 57.5% (+15 pp)
These results indicate that UVMC acts as a stabilizing, generalizable bottleneck, supporting robust multi-task learning, resilience to visual and operational domain shifts, and data-efficient fine-tuning on new robot platforms.
7. Context, Significance, and Limitations
UVMCs, as instantiated in XR-1, systematically address long-standing VLA bottlenecks by providing a discrete, temporally resolved, codebook-aligned state abstraction spanning both vision and motion modalities. Their ability to handle domain heterogeneity and to support both generalist and specialist policy learning is corroborated by large-scale, multi-embodiment evaluation.
A plausible implication is that the UVMC approach could offer superior architectural inductive bias for generalizable robot learning in settings with limited or imbalanced action-annotated data, especially for few-shot adaptation. Additionally, the ability to align human-only and robot-embodied trajectories in a joint embedding space significantly broadens the practical reach of large-scale, multimodal datasets for robotics.
A limitation, as suggested by the reported model scales and GPU-hour requirements, is the substantial upfront computational cost for full-stack UVMC learning, although this is partially addressed by the availability of lightweight variants (e.g., XR-1-Light with Florence-2). The explicit dependence on carefully balanced and aligned dual-branch objectives further indicates a need for continued ablation and sensitivity analysis.
The introduction of UVMC thus marks a significant advance toward scalable, generalizable, and data-efficient learning across diverse VLA domains, and a concrete step forward for unified robotics representation learning (Fan et al., 4 Nov 2025).