
Unified Vision-Motion Codes (UVMC)

Updated 9 November 2025
  • Unified Vision-Motion Codes (UVMC) are discrete, multimodal representations that bridge high-dimensional visual inputs and robotic actions in VLA systems.
  • They employ a dual-branch VQ-VAE with a shared codebook to simultaneously encode visual dynamics and robotic motion for robust policy learning.
  • The approach leverages a three-stage training paradigm, enhancing cross-embodiment transfer and few-shot adaptation in diverse robotic tasks.

Unified Vision-Motion Codes (UVMC) are a class of discrete, embodiment-agnostic latent representations designed to serve as an intermediate information bottleneck between perceptual observations and low-level motor actions, particularly in vision-language-action (VLA) robotic systems. UVMCs, as introduced in the XR-1 framework, are learned with a dual-branch Vector-Quantized Variational Autoencoder (VQ-VAE) and act as a modality-bridging substrate accommodating both visual dynamics and robotic motion within a shared discretized codebook. This mechanism addresses major challenges in robotics—namely, translating high-dimensional sensory data into precise actions and enabling cross-domain learning from heterogeneous multimodal datasets that include diverse robot architectures and human demonstrations (Fan et al., 4 Nov 2025).

1. Dual-Branch VQ-VAE Architecture

The UVMC is realized via a two-headed VQ-VAE comprising parallel encoders and decoders, both governed by a shared codebook $E \in \mathbb{R}^{256 \times 256}$. The architectural split is as follows:

  • Visual Dynamics Branch:
    • Encoder $E_{\mathrm{vis}}(o_t, o_{t+h})$: Processes paired RGB frames through a SigLIP backbone (~400M parameters) followed by a 4-layer ViT “dynamics” transformer (~32M parameters), resulting in 13 latent vectors $z_{\mathrm{vis}} \in \mathbb{R}^{13 \times 256}$.
    • Decoder $D_{\mathrm{vis}}(o_t, z_{\mathrm{vis}}^q)$: A 12-layer ViT (~94M parameters) reconstructs the future frame $\hat{o}_{t+h}$.
  • Robotic Motion Branch:
    • Encoder $E_{\mathrm{mo}}(a_{t:t+h}, m_{t:t+h})$: Utilizes cascaded causal 1D strided convolutions (akin to WaveNet) with an 8-layer transformer (~34M parameters) on a sequence of $h = 50$ steps of actions and proprioception, yielding $z_{\mathrm{mo}} \in \mathbb{R}^{13 \times 256}$.
    • Decoder $D_{\mathrm{mo}}(z_{\mathrm{mo}}^q, \ell, \mathrm{obs})$: A 300M-parameter autoregressive Gemma transformer, conditioned on quantized motion codes, language tokens, and the current observation, reconstructs the action sequence $\hat{a}_{t:t+h}$.

Both encoder streams output discretized latents via nearest-neighbor lookups in the shared codebook. Specifically, for each $z_e$ ($e \in \{\mathrm{vis}, \mathrm{mo}\}$), the quantized code is $z_e^q = S(z_e) = E_j$, where $j = \arg\min_i \|z_e - E_i\|_2$. The concatenation of $z_{\mathrm{vis}}^q$ and $z_{\mathrm{mo}}^q$ forms the UVMC tokens $z_{\mathrm{uvmc}}^q \in \mathbb{R}^{26 \times 256}$.
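
A minimal PyTorch sketch of this shared-codebook quantization step follows; the tensor shapes mirror the dimensions above, but the helper name `quantize` and the straight-through gradient trick are illustrative assumptions rather than details taken from the paper.

```python
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbor lookup in a shared codebook with a straight-through gradient.

    z_e:      (num_tokens, dim) continuous encoder outputs, e.g. (13, 256)
    codebook: (codebook_size, dim) shared embedding table E, e.g. (256, 256)
    """
    dists = torch.cdist(z_e, codebook)   # pairwise L2 distances, (num_tokens, codebook_size)
    idx = dists.argmin(dim=-1)           # j = argmin_i ||z_e - E_i||_2
    z_q = codebook[idx]                  # quantized codes E_j
    return z_e + (z_q - z_e).detach()    # straight-through estimator for encoder gradients

# Toy usage: both branches share one codebook; concatenation gives the UVMC tokens.
E = torch.randn(256, 256)                # shared codebook
z_vis = torch.randn(13, 256)             # visual-dynamics encoder output
z_mo = torch.randn(13, 256)              # motion encoder output
z_uvmc = torch.cat([quantize(z_vis, E), quantize(z_mo, E)], dim=0)  # (26, 256)
```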

2. VQ-VAE Training Objectives and Alignment

The loss structure for each branch mirrors the standard VQ-VAE objective, comprising reconstruction, codebook, and commitment losses (all using the $L_2$ norm), with commitment weight $\beta = 0.25$. For $e \in \{\mathrm{vis}, \mathrm{mo}\}$:

$$L_{\text{VQ-VAE}}^{(e)} = \|x^{(e)} - \hat{x}^{(e)}\|_2^2 + \|\mathrm{sg}[z_e] - z_e^q\|_2^2 + \beta\,\|z_e - \mathrm{sg}[z_e^q]\|_2^2$$

where $x^{(\mathrm{vis})} = o_{t+h}$ and $x^{(\mathrm{mo})} = a_{t:t+h}$.

To promote semantic alignment, a cross-modality KL divergence loss ties the posterior distributions over visual and motion codes:

$$L_{\text{align}} = D_{KL}\big(q(z_{\mathrm{mo}})\,\|\,q(z_{\mathrm{vis}})\big)$$

The combined loss on mixed (robot) data is $L_{\text{stage1}}^{\text{robot}} = L_{\text{VQ-VAE}}^{(\mathrm{vis})} + L_{\text{VQ-VAE}}^{(\mathrm{mo})} + L_{\text{align}}$; for human-only videos the objective omits the motion and alignment terms.
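
A minimal sketch of these objectives in PyTorch is given below; parameterizing the code posteriors as softmax distributions over per-token codebook scores is an assumption of this sketch, not a detail specified by the paper.

```python
import torch
import torch.nn.functional as F

def vqvae_loss(x, x_hat, z_e, z_q, beta: float = 0.25):
    """Standard VQ-VAE terms for one branch: reconstruction + codebook + commitment."""
    recon = F.mse_loss(x_hat, x)                    # ||x - x_hat||_2^2
    codebook_term = F.mse_loss(z_q, z_e.detach())   # ||sg[z_e] - z_q||_2^2
    commitment = F.mse_loss(z_e, z_q.detach())      # ||z_e - sg[z_q]||_2^2
    return recon + codebook_term + beta * commitment

def alignment_loss(logits_mo, logits_vis):
    """Cross-modality term D_KL(q(z_mo) || q(z_vis)) over codebook indices.

    logits_*: (num_tokens, codebook_size) scores over codebook entries,
    e.g. negative distances to each codebook vector (an assumption here).
    """
    log_q_vis = F.log_softmax(logits_vis, dim=-1)
    q_mo = F.softmax(logits_mo, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) with `input` in log space.
    return F.kl_div(log_q_vis, q_mo, reduction="batchmean")
```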

3. UVMC as a Multimodal Policy Bottleneck

UVMC tokens serve as a discrete, temporally coherent intermediary unifying high-dimensional observation and low-level action spaces. Each time step combines 13 visual-dynamics codes and 13 motion codes, yielding $z_{\mathrm{uvmc}}^q$. This structure:

  • Summarizes pixel-level dynamic changes and the agent’s internal intent.
  • Binds video-only (human demonstration) data and robotic action trajectories to a single representational space.
  • Functions as a shared latent that links vision, language, and action for downstream policy learning.

During policy training, UVMC tokens are injected into the input stream, training the policy to predict or consume these codes as an embodiment-agnostic state descriptor.

4. Three-Stage Training Paradigm

XR-1’s utilization of UVMC follows a structured three-stage paradigm:

  1. Self-Supervised UVMC Learning:
    • Scale: 1.26M episodes (~110M frames) from Open-X (40%), RoboMIND (15%), XR-D (35%), Ego4D (10%), and other sources.
    • Joint VQ-VAE/codebook learning using both robot and human video data.
    • ~0.9B parameters, batch size 960, 275K steps, learning rate $1 \times 10^{-4}$, 38.4K GPU-hours on 80 A100s.
  2. UVMC-Guided Pretraining on Cross-Embodiment Data:
    • Uses VLA policy backbone $F$ (e.g., Gemma 2.6B for XR-1, Florence-2 230M for XR-1-Light).
    • Inserts [ZVIS] and [ZMO] tokens; $F$ predicts $z_{\mathrm{uvmc}}^q$ from the language instruction $\ell$ and observations.
    • Joint loss: $L_{\mathrm{uvmc}} = \|F(\ell, \mathrm{obs}; [\mathrm{ZVIS}], [\mathrm{ZMO}]) - z_{\mathrm{uvmc}}^q\|_2^2$, plus the generative action-head loss $L_{\mathrm{act}}$ (see the sketch below).
    • ~4B parameters, batch size 640, 300K steps, 38.4K GPU-hours.
  3. Task-Specific Post-Training:
    • Fine-tunes the policy for each robot and task on expert rollouts (20 tasks × 20 rollouts/task, 576 GPU-hours on 8 A100s).
    • The loss focuses solely on $L_{\mathrm{act}}$ during fine-tuning.

This progression enables scalable learning across tasks, robots, and data modalities while allowing specialization with minimal additional data for new embodiments.
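
The stage-2 objective can be sketched as follows; the `policy` interface returning both predicted UVMC tokens and actions, and the MSE placeholder standing in for the generative action-head loss, are assumptions of this sketch rather than the paper's actual implementation.

```python
import torch.nn.functional as F

def stage2_loss(policy, language, obs, z_uvmc_q, expert_actions):
    """UVMC-guided pretraining: predict quantized UVMC tokens and expert actions.

    `policy` is assumed to consume language tokens, observations, and the learned
    [ZVIS]/[ZMO] query tokens, and to return (uvmc_pred, action_pred).
    """
    uvmc_pred, action_pred = policy(language, obs)
    l_uvmc = F.mse_loss(uvmc_pred, z_uvmc_q)         # regress the quantized UVMC tokens
    l_act = F.mse_loss(action_pred, expert_actions)  # placeholder for the generative action-head loss
    return l_uvmc + l_act
```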

5. Critical Hyperparameters and Design Considerations

Key implementation details for UVMC and XR-1:

| Parameter | Value | Notes |
|---|---|---|
| Codebook size ($d$) | 256 | Shared across modalities |
| Embedding dim ($f$) | 256 | Dimensionality of each code |
| Tokens per branch | 13 | Total: 26 UVMC tokens |
| Sequence length ($h$) | 50 | For motion encoding |
| Commitment weight ($\beta$) | 0.25 | VQ-VAE loss parameter |
| Vision encoder | SigLIP + 4-layer ViT | ~400M parameters |
| Vision decoder | 12-layer ViT | ~94M parameters |
| Motion encoder | Strided conv + 8-layer transformer | ~34M parameters |
| Motion decoder | Autoregressive Gemma | ~300M parameters |
| Policy backbone | PaliGemma (XR-1), Florence-2 (XR-1-Light) | Model-agnostic design |

Design decisions, such as discrete codebook size and shared latent vocabulary, are specifically aimed at reducing overfitting to appearance or embodiment particulars and facilitating cross-domain transfer.
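
For reference, these hyperparameters can be collected into a single configuration object; the field names below are illustrative and not taken from the paper's codebase.

```python
from dataclasses import dataclass

@dataclass
class UVMCConfig:
    codebook_size: int = 256       # shared codebook entries across both branches
    embedding_dim: int = 256       # dimensionality of each code vector
    tokens_per_branch: int = 13    # 13 visual + 13 motion = 26 UVMC tokens
    horizon: int = 50              # action/proprioception sequence length h
    commitment_beta: float = 0.25  # VQ-VAE commitment weight
```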

6. Empirical Performance and Validation

XR-1, utilizing UVMC, yields improvements across all major evaluation axes in real-world robotic settings. Metrics are reported over more than 14,000 rollouts, six robot embodiments, and more than 120 manipulation tasks.

  • Action Precision & Multi-Task Success:
    • Dual-Arm UR-5e: 72.0% (XR-1 avg) vs 62.0% ($\pi_{0.5}$), 42.8% ($\pi_0$)
    • Tien Kung 2.0 (unseen): 72.0% vs 41.0% ($\pi_{0.5}$), 40.8% ($\pi_0$)
    • Tien Kung 1.0: 68.0% vs 45.5% ($\pi_{0.5}$)
    • Dual-Arm Franka: 73.5% vs 41.0% ($\pi_{0.5}$)
    • AgileX Cobot Magic V2.0: 60.0% vs 41.3% ($\pi_{0.5}$)
    • Single-Arm UR-5e: 75.3% vs 67.5% ($\pi_{0.5}$)
  • Cross-Embodiment Transfer:
    • Out-of-the-box on 7 held-out tasks (≈0.9% of XR-D): XR-1-oob achieves 50–80% success, comparable to GR00T-N1.5 and surpassing UniVLA and RDT.
  • Few-Shot Adaptation:
    • 15 novel tasks × 20 trajectories: XR-1 achieves ~70–80% success (vs ACT ~45% and DiffusionPolicy ~40%).
  • Robustness:
    • DFR-SweepTrash (novel rubbish): 65% (XR-1) vs 15% ($\pi_0$)
    • DFR-SweepTrash (dynamic distractors): 55% vs 5% ($\pi_0$)
    • DFR-HangCup (background shifts): 55% vs 30% ($\pi_0$)
  • Lightweight Variant (XR-1-Light):
    • w/o UVMC finetune: 42.5% avg; with UVMC finetune: 57.5% (↑ 15pp)

These results indicate that UVMC acts as a stabilizing, generalizable bottleneck, promoting robust multi-task learning, resilience to visual and operational domain shifts, and data-efficient fine-tuning on new robotic platforms.

7. Context, Significance, and Limitations

UVMCs, as instantiated in XR-1, systematically address long-standing VLA bottlenecks by providing a discrete, temporally resolved, codebook-aligned state abstraction spanning both vision and motion modalities. Their capacity to handle domain heterogeneity and to support both generalist and specialist policy learning is corroborated by large-scale, multi-embodiment evaluation.

A plausible implication is that the UVMC approach could offer superior architectural inductive bias for generalizable robot learning in settings with limited or imbalanced action-annotated data, especially for few-shot adaptation. Additionally, the ability to align human-only and robot-embodied trajectories in a joint embedding space significantly broadens the practical reach of large-scale, multimodal datasets for robotics.

A limitation, as suggested by the reported model scales and GPU-hour requirements, is the substantial upfront computational cost for full-stack UVMC learning, although this is partially addressed by the availability of lightweight variants (e.g., XR-1-Light with Florence-2). The explicit dependence on carefully balanced and aligned dual-branch objectives further indicates a need for continued ablation and sensitivity analysis.

The introduction of UVMC thus marks a significant advance toward scalable, generalizable, and data-efficient learning across diverse VLA domains, and a robust step forward for unified robotics representation learning (Fan et al., 4 Nov 2025).
