$μ_0$: A Scalable 3D Interaction-Trace World Model

Published 11 Jun 2026 in cs.RO, cs.CV, and cs.LG | (2606.13769v2)

Abstract: World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $μ_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $μ_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $μ_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $μ_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $μ_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $π_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.


Authors (9)

Summary

The paper introduces a novel 3D trace-based world model that predicts smooth, event-aligned trajectories from diverse manipulation videos.
It employs a hierarchical TraceExtract pipeline with DINOv2 features and B-spline parameterization to robustly track semantic keypoints.
The model outperforms baselines in trace prediction and robot policy transfer, achieving high success rates in both simulation and real-world tasks.

$μ_0$ : A Scalable 3D Interaction-Trace World Model — Authoritative Analysis

Motivation and Problem Formulation

This paper introduces $μ_0$, a scalable world model that uses 3D interaction traces as its core representation, enabling learning from diverse manipulation videos without embodiment-specific action labels. Conventional approaches face three limitations: pixel-space models spend capacity on appearance and backgrounds and fail to capture manipulation-relevant geometry and occlusion; action-labeled models are constrained by the scarcity, cost, and embodiment specificity of labeled robot data; and prior motion-centric methods under-sample task-critical regions, conflate camera motion with scene motion, and lack event-level semantic alignment.

$μ_0$ addresses these limitations with a middle ground: it predicts smooth 3D trajectories for semantically salient interaction points (objects, tools, hands, contact regions), offering a compact, transferable, and embodiment-agnostic motion interface for downstream control policies.

TraceExtract Pipeline: Scaling and Structuring Supervision

The paper introduces TraceExtract, a data engine that transforms heterogeneous manipulation videos into event-captioned 3D trace tuples. TraceExtract achieves this via three major stages:

Semantic keypoint selection: Leveraging DINOv2 patch cluster features, TraceExtract samples action-centric entities, maintaining spatial diversity and minimizing background/static bias.
Global-local 3D trace construction: Employing hybrid VGGT-based reconstruction and progressive tracking, traces are globally aligned and identities preserved across long, egocentric, and dynamic videos.
Event-centric hierarchical captioning: Savitzky–Golay filtering and motion-peak chunking segment traces into action events, each paired with multi-resolution language captions generated by VLMs and merged by an LLM.

This pipeline yields tuples containing observations, language, query keypoints, and event-aligned 3D traces, substantially scaling trace curation ($8\times$ prior datasets) and delivering supervision robust to context and embodiment.
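The event-segmentation stage can be illustrated with a minimal sketch. It assumes a fixed 5-point Savitzky–Golay kernel, clamped edges, a speed threshold `min_peak`, and a "cut at the slowest step between two speed peaks" rule; all of these, and the function names, are illustrative choices rather than TraceExtract's actual settings.

```python
import math

# Standard 5-point quadratic/cubic Savitzky-Golay smoothing weights.
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)

def savgol5(values):
    """Smooth a 1D sequence; edges are handled by clamping indices (assumption)."""
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(SG5):
            j = min(max(i + k - 2, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out

def speeds(trace):
    """Per-step speed of a 3D trace given as a list of (x, y, z) points."""
    return [math.dist(a, b) for a, b in zip(trace, trace[1:])]

def chunk_by_motion_peaks(trace, min_peak=0.05):
    """Split a trace into (start, end) point-index chunks, one per speed peak."""
    v = savgol5(speeds(trace))
    peaks = [i for i in range(1, len(v) - 1)
             if v[i] > v[i - 1] and v[i] >= v[i + 1] and v[i] >= min_peak]
    cuts = [0]
    for p, q in zip(peaks, peaks[1:]):
        valley = min(range(p, q + 1), key=lambda i: v[i])
        # Speed index i is the step from point i to point i + 1; cut at its end point.
        cuts.append(valley + 1)
    cuts.append(len(trace) - 1)
    return list(zip(cuts, cuts[1:]))

# Hypothetical trace: two bursts of motion along x separated by a pause.
steps = [0, 0, .1, .3, .5, .3, .1, 0, 0, 0, .1, .3, .5, .3, .1, 0, 0]
x, demo_trace = 0.0, [(0.0, 0.0, 0.0)]
for s in steps:
    x += s
    demo_trace.append((x, 0.0, 0.0))
print(chunk_by_motion_peaks(demo_trace))  # [(0, 9), (9, 17)]
```

Each returned chunk would then be paired with a caption; the captioning step itself depends on the VLM and LLM used and is not sketched here.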

$μ_0$ Architecture: Query-Conditioned Trace World Model

$μ_0$ is trained to forecast future 3D traces for arbitrary query keypoints, conditioned on vision, language, and optionally history/depth inputs. The architecture comprises:

VLM-conditioned multimodal backbone: SmolVLM2-2.2B serves as the frozen prefix, encoding RGB, text instructions, and optional metric depth via a dedicated patch stem.
Permutation-equivariant Trace Expert: Each query keypoint is processed exchangeably by a stack that predicts anchor-relative cubic B-spline targets, injects DINOv2 local features for entity semantics, and employs a flat tokenization that splits history and future.
Conditional flow matching objective: The model is trained not as a regressor but via flow-matching (semantic denoising) over B-spline control points, yielding multimodal futures robust to uncertainty, occlusion, and partial trajectories. Auxiliary heads predict trace validity and enforce rigidity within DINO clusters.
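To make the B-spline parameterization concrete, the sketch below decodes anchor-relative control points into a dense 3D trace. It assumes a uniform cubic B-spline (standard blending functions, no clamped knots) and treats the anchor as the query keypoint's current 3D position; the source does not specify the knot vector or the number of control points, so `decode_trace` and its arguments are hypothetical.

```python
def bspline_basis(u):
    """Uniform cubic B-spline blending weights for local parameter u in [0, 1]."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u**3 - 6 * u**2 + 4) / 6,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6,
        u**3 / 6,
    )

def decode_trace(anchor, rel_ctrl, samples_per_segment=4):
    """Turn anchor-relative control points into a dense list of (x, y, z) points."""
    # Shift the relative control points into the anchor's coordinate frame.
    ctrl = [tuple(a + d for a, d in zip(anchor, c)) for c in rel_ctrl]
    trace = []
    n_segments = len(ctrl) - 3  # every window of 4 control points is one segment
    for s in range(n_segments):
        last = s == n_segments - 1
        # Sample u = 1 only on the final segment to avoid duplicate points.
        n_samples = samples_per_segment + (1 if last else 0)
        window = ctrl[s:s + 4]
        for k in range(n_samples):
            w = bspline_basis(k / samples_per_segment)
            trace.append(tuple(
                sum(wi * p[d] for wi, p in zip(w, window)) for d in range(3)
            ))
    return trace

# Four collinear control points produce one smooth segment along x.
line = decode_trace((0.0, 0.0, 0.0),
                    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)])
print(len(line), line[0][0], line[-1][0])  # 5 points, x running from 1 to 2
```

The weights always sum to one, so a trace whose relative control points are all zero stays at the anchor. A uniform B-spline does not pass through its first and last control points; a clamped knot vector would, which is one reason the real parameterization may differ from this sketch.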

This yields a reusable, embodiment-agnostic motion prior, capable of generalizing across variable query sets and event contexts.
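The flow-matching objective over control points can be sketched as follows. The sketch assumes a straight-line probability path from Gaussian noise to the target control points and plain Euler integration at inference; the paper's actual path, noise schedule, and sampler are not stated here, and the velocity network is replaced by an arbitrary callable.

```python
import random

def cfm_training_pair(x1, rng):
    """Build one training example for target control points x1 (flat list of floats).

    Returns (t, x_t, v_target) for the straight path x_t = (1 - t) * x0 + t * x1.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]       # noise sample
    t = rng.random()                             # time in [0, 1)
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]   # constant velocity along the path
    return t, x_t, v_target

def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)

def sample(velocity_fn, dim, rng, steps=10):
    """Euler-integrate dx/dt = velocity_fn(x, t) from noise at t = 0 to t = 1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Stand-in for a trained model: the exact velocity field when there is one target.
target = [0.5, -1.0, 2.0]
oracle = lambda x, t: [(b - a) / (1 - t) for a, b in zip(x, target)]
print(sample(oracle, dim=3, rng=random.Random(0)))  # approximately the target
```

Because sampling starts from fresh noise each time, repeated calls with a learned (non-oracle) velocity field can land on different plausible futures, which is the property the text attributes to training by flow matching rather than regression.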

Trace-Conditioned Action Adaptation

For robot policy learning, $μ_0$ is frozen after pretraining. A downstream action expert (policy head) predicts executable action chunks; it consumes intermediate trace-denoising features, fused with VLM embeddings via gated cross-attention, together with robot observations and proprioceptive signals. This interface supports transfer to diverse robot embodiments by decoupling motion priors from control interfaces.
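A minimal sketch of gated cross-attention fusion is given below. It assumes a single head, no learned projections, a scalar tanh gate, and that VLM embedding tokens act as queries while trace-denoising features act as keys and values; the direction of attention and the gating form are assumptions, since the text above only names the mechanism.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def gated_cross_attention(queries, keys, values, gate):
    """Return query + tanh(gate) * attention(query, keys, values) for each query.

    queries: list of token vectors (assumed here: VLM embeddings)
    keys, values: list of token vectors (assumed here: trace-denoising features)
    gate: scalar; gate = 0 leaves the queries unchanged.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    g = math.tanh(gate)
    dim = len(values[0])
    fused = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        fused.append([qi + g * ai for qi, ai in zip(q, attended)])
    return fused

vlm_tokens = [[1.0, 0.0], [0.0, 1.0]]
trace_feats = [[0.3, 0.2], [0.1, 0.9]]
print(gated_cross_attention(vlm_tokens, trace_feats, trace_feats, gate=0.0))
# With gate = 0 the VLM tokens pass through unchanged.
```

A gate initialized near zero lets a new policy head start from the unmodified embeddings and learn how much trace information to admit, which is one common motivation for this kind of fusion.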

Empirical Evaluation: Trace Prediction and Policy Transfer

Numerical results are strong and consistent:

Trace prediction metrics: Across ADE, FDE, DTW, and Fréchet distances, $μ_0$ significantly outperforms generic VLMs (Gemini, GPT-5.5), fixed-grid trace models (Track2Act, TraceGen), and 3D flow baselines, with the most accurate top-5 predictions and the lowest inference latency (0.29s, $2.9\times$ faster than the nearest 2D baseline).
Simulation action generation (RoboCasa365): $μ_0$ with an action expert achieves a 30.25% average success rate, surpassing a compared baseline (25.25%) and prior video-only trace models, despite no action supervision in pretraining. A second baseline pretrained with action-labeled supervision scores higher (42%), but the comparison is not data-matched.
Real-world robot tasks (UR3 arm): Across three tasks, $μ_0$ achieves a 91.7% average success rate, outperforming both action-labeled VLA models and prior video-only methods; the effect is pronounced when action head capacity is limited, demonstrating the value of structured trace priors.
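For reference, the four trace metrics named above can be computed as in the sketch below. It uses the textbook definitions on traces given as lists of 3D points; whether the paper normalizes DTW by path length, or how it aggregates over keypoints and top-5 samples, is not stated here, so treat those details as open.

```python
import math

def ade(pred, gt):
    """Average displacement error: mean pointwise distance (equal-length traces)."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final displacement error: distance between the last points."""
    return math.dist(pred[-1], gt[-1])

def dtw(pred, gt):
    """Dynamic time warping cost (unnormalized sum of matched distances)."""
    n, m = len(pred), len(gt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], gt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def discrete_frechet(pred, gt):
    """Discrete Fréchet distance: smallest achievable maximum gap along a coupling."""
    n, m = len(pred), len(gt)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(pred[i], gt[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1]), d)
    return ca[n - 1][m - 1]

# Toy check: a prediction offset by one unit in y from a straight ground-truth trace.
gt_trace = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred_trace = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
print(ade(pred_trace, gt_trace), fde(pred_trace, gt_trace),
      dtw(pred_trace, gt_trace), discrete_frechet(pred_trace, gt_trace))
# 1.0 1.0 3.0 1.0
```

ADE and FDE compare points at matching time steps, while DTW and Fréchet tolerate timing differences, so the two groups can rank predictions differently when a trace has the right shape but the wrong speed.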

Ablation studies validate the necessity of B-spline parameterization, DINOv2 feature injection, and rigidity regularization, and confirm robustness to input modality. Scaling experiments show monotonic improvements with model size and pretraining data volume.

In contrast to pixel-space or action-labeled world models, $μ_0$ offers a compact intermediate representation focused solely on actionable motion. Unlike 2D/3D flow models or fixed-grid trace methods (e.g., TraceGen), $μ_0$ delivers event-relevant, query-conditioned trajectories that are globally aligned and associated with hierarchical intent via language. Its action adaptation via trace features sets it apart from post-hoc video-generation strategies, enabling cross-embodiment manipulation at scale.

Implications and Future Directions

Theoretically, $μ_0$ advances a world-model paradigm that privileges compact, actionable geometric representations decoupled from robot-specific kinematics. Practically, this approach enables scaling robot policy learning from abundant unlabeled video, reducing reliance on scarce action-labeled datasets and facilitating transfer across platforms.

Future developments should address perception-stack errors (clustering, reconstruction, tracking, captioning), extend trace representations to include force/tactile modalities, and validate applicability to mobile/dexterous manipulators and longer-horizon tasks.

Conclusion

$μ_0$ demonstrates the viability and transferability of 3D trace-based world modeling for robot manipulation, outperforming trace-prediction and flow-based baselines in forecasting and achieving policy performance competitive with action-labeled baselines. The modular TraceExtract supervision pipeline and query-conditioned flow-matching architecture establish a scalable interface for downstream robotics, paving the way for video-pretrained models that support generalist agents across diverse embodiments (2606.13769).