UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Published 21 Apr 2026 in cs.RO and cs.AI | (2604.19734v1)

Abstract: Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.


Authors (6)

Summary

The paper introduces a unified latent action tokenizer that integrates vision and action data to bridge human and humanoid motion gaps.
It employs a tri-branch encoder with bidirectional cross-reconstruction to generate discrete, embodiment-agnostic tokens for policy learning and world modeling.
The approach achieves robust out-of-distribution generalization and zero-shot transfer, outperforming baselines on RoboCasa GR1 with significantly less data.

UniT: Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Motivation and Background

The scarcity of high-quality robotic data is a critical bottleneck for scaling foundation models in humanoid policy learning and world modeling. By contrast, large-scale egocentric human motion datasets capture diverse physical interaction priors. However, kinematic and embodiment gaps (distributional mismatch, differing degrees of freedom (DoFs), and different control paradigms) impede cross-embodiment transfer. Traditional motion retargeting pipelines (e.g., inverse kinematics) are labor-intensive, domain-specific, and scale poorly across heterogeneous morphologies.

Recent approaches that exploit latent action representations are limited by modality-specific design flaws: action-only methods lack external grounding, vision-only methods entangle appearance confounders and miss structural pose detail, and independent vision/action tokenizers fail to produce a unified control vocabulary. Empirical validation on dexterous humanoid tasks also remains sparse.

UniT Framework: Architecture and Principles

UniT introduces a unified latent action tokenizer with a visual anchoring mechanism, implemented as a tri-branch cross-reconstruction architecture. The design rests on the premise that visual outcomes encode universal physical intent, irrespective of embodiment-specific kinematics.

Three encoding branches are deployed:

Visual branch: Summarizes temporal transitions between consecutive observations using frozen DINOv2 features.
Action branch: Encodes state-action sequences from heterogeneous morphologies, normalizing through per-embodiment MLPs.
Fusion branch: Integrates vision and action latents for compact visuo-motor representations.
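
The per-embodiment normalization in the action branch can be illustrated with a minimal sketch. It assumes each embodiment gets its own small MLP that maps its state-action vector to a common latent width. The dimensions (48 for the human, 29 for the humanoid, 8 for the latent), the two-layer tanh MLP, and all function names are hypothetical choices for illustration, not values or code from the paper.

```python
import math
import random

LATENT_DIM = 8  # assumed shared latent width (not from the paper)

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialised two-layer MLP stored as plain weight matrices."""
    def layer(rows, cols):
        return [[rng.gauss(0.0, 1.0 / math.sqrt(cols)) for _ in range(cols)]
                for _ in range(rows)]
    return {"w1": layer(hidden_dim, in_dim), "w2": layer(out_dim, hidden_dim)}

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def mlp_forward(mlp, vec):
    """Linear -> tanh -> linear."""
    hidden = [math.tanh(h) for h in matvec(mlp["w1"], vec)]
    return matvec(mlp["w2"], hidden)

rng = random.Random(0)

# Hypothetical state-action widths for two embodiments.
EMBODIMENT_DIMS = {"human": 48, "humanoid": 29}

# One projector per embodiment; every projector ends at LATENT_DIM.
projectors = {
    name: make_mlp(dim, 16, LATENT_DIM, rng)
    for name, dim in EMBODIMENT_DIMS.items()
}

def encode_action(embodiment, action):
    """Map an embodiment-specific action vector into the shared latent width."""
    return mlp_forward(projectors[embodiment], action)

human_latent = encode_action("human", [0.1] * 48)
humanoid_latent = encode_action("humanoid", [0.1] * 29)
print(len(human_latent), len(humanoid_latent))
```

Inputs of different widths end up as vectors of the same size, which is what lets the later stages treat both embodiments uniformly.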

All branches are quantized via a shared RQ-VAE codebook, yielding discrete Unified Latent Action (UniT) tokens. Cross-reconstruction is enforced: each token is decoded by both the visual and the action decoder, so it must reconstruct both the physical outcome and the kinematic detail. This bidirectional constraint retains the information shared by the two modalities and discards unaligned noise, so the resulting tokens capture embodiment-agnostic physical intent.
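
Residual quantization with a shared codebook can be sketched as follows. This is a generic illustration of the residual-quantization step used in RQ-VAE-style tokenizers, not the paper's implementation: the latent size, number of levels, codebook size, and random codebook entries are all assumptions, and the branch latents are random stand-ins for encoder outputs.

```python
import random

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def residual_quantize(latent, codebooks):
    """Quantize `latent` level by level; each level encodes the leftover residual.

    Returns (token indices, reconstruction as the sum of the chosen codes).
    """
    residual = list(latent)
    recon = [0.0] * len(latent)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return tokens, recon

rng = random.Random(0)
DIM, LEVELS, CODES = 4, 3, 8  # illustrative sizes only

# One codebook per residual level; deeper levels use smaller-scale codes.
codebooks = [
    [[rng.gauss(0.0, 0.5 ** level) for _ in range(DIM)] for _ in range(CODES)]
    for level in range(LEVELS)
]

# Random stand-ins for the three branch encoder outputs.
branch_latents = {
    name: [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    for name in ("vision", "action", "fusion")
}

# The same codebooks quantize every branch, so all tokens share one vocabulary.
branch_tokens = {
    name: residual_quantize(latent, codebooks)[0]
    for name, latent in branch_latents.items()
}
print(branch_tokens)
```

Because every branch indexes into the same codebooks, a token id means the same thing whether it came from vision, action, or the fused representation.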

Figure 1: Overview of the UniT Framework with unified tokenization and downstream policy/world modeling.

Figure 2: Comparative analysis showing the benefits of UniT's cross-modal alignment over action-only, vision-only, and decoupled paradigms.

Figure 3: UniT architecture demonstrating tri-branch encoding and quantization.

Policy Learning with VLA-UniT

Instead of regressing raw actions directly, VLA-UniT uses UniT tokens as prediction targets in Vision-Language-Action (VLA) architectures. The policy is decomposed into two stages: UniT tokens are predicted from the vision-language context (using Qwen2.5-VL), and a lightweight flow-matching expert then generates embodiment-specific actions.
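
The two-stage decomposition can be sketched in a toy form. Everything here is an assumption made for illustration: `predict_token` is a stub standing in for the vision-language backbone, `TOKEN_TARGETS` is a made-up lookup standing in for what a trained expert would have learned, and the velocity field is the closed-form one for a straight-line path to a single target rather than a learned network. Only the sampling loop (Euler integration of a velocity field from noise at t=0 toward an action at t=1) reflects how flow-matching samplers generally work.

```python
import random

# Hypothetical lookup standing in for a trained expert: the same token maps
# to action vectors of different widths depending on the embodiment.
TOKEN_TARGETS = {
    "human": {0: [0.2, -0.1, 0.4], 1: [0.0, 0.3, -0.2]},
    "humanoid": {0: [0.5, 0.1], 1: [-0.3, 0.2]},
}

def predict_token(instruction):
    """Stub for the vision-language backbone: returns a discrete token id."""
    return 0 if "pick" in instruction else 1

def velocity(x, t, target):
    """Closed-form velocity of the straight-line path from x toward target."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def generate_action(embodiment, token, steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise (t=0) to t=1."""
    rng = random.Random(seed)
    target = TOKEN_TARGETS[embodiment][token]
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # always < 1, so the division in velocity() is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

token = predict_token("pick up the cup")
human_action = generate_action("human", token)
humanoid_action = generate_action("humanoid", token)
print(token, len(human_action), len(humanoid_action))
```

The token is predicted once, independent of embodiment; only the final generation stage knows the action width of the body being controlled.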

VLA-UniT demonstrates robust OOD generalization, improved sample efficiency, and zero-shot task transfer both in simulation (RoboCasa GR1) and in real-world humanoid settings. Incorporating large-scale human demonstrations (EgoDex) translates directly into better policy performance and generalization along multiple axes: geometry, distractor, target, and visual background.

Figure 4: Downstream deployment of UniT tokens in VLA policy learning and WM world modeling.

Figure 5: Example of real-world in-domain tasks designed for humanoid benchmarking.

Figure 6: OOD evaluation scenarios leveraging human data to fill unexplored variation gaps.

Unified World Modeling with WM-UniT

WM-UniT employs UniT tokens as universal action conditions in action-conditioned world models, replacing embodiment-specific raw actions. The action-branch tokens inject visuo-motor priors learned from human data and support fine-grained, controllable autoregressive video generation, which is validated empirically on the DROID and RoboCasa benchmarks. Pretraining on human data improves downstream humanoid controllability, which supports the claim that latent actions transfer across embodiments.

Direct cross-embodiment conditioning is demonstrated: feeding UniT tokens derived from one morphology (human/robot) as conditions produces faithful control and video generation in the other, preserving semantic, temporal, and geometric consistency of action dynamics.
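
The idea of conditioning a world model on a shared token rather than a raw action can be illustrated with a deliberately simplified toy. It is not the paper's model: the tokens here are hand-made displacement bins rather than learned codes, the "world model" is a one-line position update rather than a video generator, and the per-embodiment forward models, field names, and numbers are all hypothetical.

```python
# Bin centres for visible end-effector motion in metres (assumed values).
BIN_CENTERS = [-0.10, -0.05, 0.0, 0.05, 0.10]

def visual_displacement(embodiment, action):
    """Hypothetical forward models mapping raw actions to visible motion."""
    if embodiment == "human":
        return action["wrist_delta_m"]
    if embodiment == "humanoid":
        # Small-angle arc length: joint delta (rad) times link length (m).
        return action["joint_delta_rad"] * action["link_length_m"]
    raise ValueError(f"unknown embodiment: {embodiment}")

def tokenize(embodiment, action):
    """Map a raw action to the index of the nearest displacement bin."""
    d = visual_displacement(embodiment, action)
    return min(range(len(BIN_CENTERS)), key=lambda i: abs(BIN_CENTERS[i] - d))

def world_model_step(position, token):
    """Toy dynamics: the model sees only the token, never the raw action."""
    return position + BIN_CENTERS[token]

human_token = tokenize("human", {"wrist_delta_m": 0.05})
robot_token = tokenize(
    "humanoid", {"joint_delta_rad": 0.125, "link_length_m": 0.4}
)

next_from_human = world_model_step(0.0, human_token)
next_from_robot = world_model_step(0.0, robot_token)
print(human_token, robot_token, next_from_human, next_from_robot)
```

Two raw actions with different parameterizations but the same visible effect map to the same token, so the toy model predicts the same next state for both. That is the property the cross-embodiment conditioning experiments rely on, here reduced to its simplest form.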

Figure 7: t-SNE analysis showing alignment between human and humanoid token embeddings at various representational levels.

Figure 8: Robustness to injected action noise—visual anchoring filters uncorrelated artifacts, preserving trajectory fidelity.

Figure 9: Human-to-robot conditioning yields accurate cross-embodiment video generation compared to raw action conditioning.

Figure 10: Robot-to-human conditioning illustrating preservation of fine-grained pose and action semantics.

Empirical Evaluation and Ablation

VLA-UniT achieves higher policy success rates than the baselines on RoboCasa GR1 by a clear margin (66.7% overall, +18.9% over GR00T) and requires ∼10× less data for competitive performance. Human demonstrations boost both in-domain and OOD success, enabling zero-shot transfer and emergent upper-body coordination in unseen tasks.

Ablations validate UniT's architectural claims:

Vision-action synergy is necessary for transfer; single-modality tokenizers underperform in challenging OOD scenarios.
Explicit cross-reconstruction yields aligned token distributions and downstream internal representations, critical for generalizable transfer.
Bidirectional reconstruction outperforms unidirectional vision-to-action models (VLA-Villa), emphasizing deep cross-modal integration.
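
The cross-reconstruction objective that these ablations probe can be sketched as a loss in which every branch's code must be decodable into both modalities. This is a minimal illustration under stated assumptions: the terms are weighted equally, mean squared error is used for both modalities, the decoders are trivial slicing functions, and any additional terms a real RQ-VAE objective would include (for example a commitment loss) are omitted. None of these choices is confirmed by the source.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_reconstruction_loss(branch_codes, decode_vision, decode_action,
                              vision_target, action_target):
    """Sum of vision and action reconstruction errors over every branch code."""
    total = 0.0
    for code in branch_codes.values():
        total += mse(decode_vision(code), vision_target)
        total += mse(decode_action(code), action_target)
    return total

# Toy decoders (hypothetical): the first half of a code is read out as the
# visual outcome, the second half as the action.
decode_vision = lambda code: code[:2]
decode_action = lambda code: code[2:]

vision_target = [1.0, 0.0]
action_target = [0.5, -0.5]

# Case 1: every branch produces a code carrying both modalities.
aligned = {name: [1.0, 0.0, 0.5, -0.5] for name in ("vision", "action", "fusion")}
aligned_loss = cross_reconstruction_loss(
    aligned, decode_vision, decode_action, vision_target, action_target)

# Case 2: the action branch's code carries no visual information.
misaligned = dict(aligned)
misaligned["action"] = [0.0, 0.0, 0.5, -0.5]
misaligned_loss = cross_reconstruction_loss(
    misaligned, decode_vision, decode_action, vision_target, action_target)

print(aligned_loss, misaligned_loss)
```

A code that ignores one modality is penalized even if it reconstructs its own modality perfectly, which is the pressure toward a shared representation that the single-modality and unidirectional variants lack.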

Figure 11: Performance comparison: VLA-UniT dominates state-of-the-art baselines on RoboCasa GR1.

Figure 12: Sample efficiency and impact of human co-training—substantial gains in few-shot scenarios.

Figure 13: Real-world deployment: boosting both execution and OOD robustness via human demonstration.

Figure 14: Zero-shot transfer on unseen stacking task—uniquely enabled by UniT's shared latent space.

Figure 15: Tokenizer ablation demonstrating effectiveness of vision-action cross-reconstruction in policy generalization.

Theoretical and Practical Implications

UniT formalizes a scalable, data-driven unified physical language, circumventing manual retargeting and domain-specific solvers. This framework offers a universal tokenization interface for both policy and world modeling, facilitating co-evolution of imagined rollouts, test-time planning, and reinforcement learning within a single latent action space.

The visual branch's ability to encode physical transitions without paired action labels suggests that internet-scale human video datasets could be harnessed to augment physical priors and broaden the coverage of embodied intelligence. The authors present this as a route to dexterous coordination, robust generalization, and compositional reasoning learned directly from multimodal human demonstrations.

Conclusion

UniT advances the state of the art in cross-embodiment policy learning and world modeling for humanoids, backed by strong empirical results and architectural ablations. Visual-anchored latent tokenization and bidirectional cross-reconstruction are fundamental for producing robust, transferable action representations. Future work should scale UniT to internet-scale video, explore joint policy/world-model planning, and further exploit compositional generalization by leveraging diverse and unstructured human demonstrations.
