TraceGen: A 3D Trace-Space World Model
- TraceGen is a world model paradigm that represents short-term robotic motion through a compact 3D trace-space, focusing solely on geometric trajectories.
- It employs a conditional flow/ODE architecture with a multi-encoder backbone to rapidly forecast 3D trajectories, bypassing the inefficiencies of pixel-space models.
- Supported by the scalable TraceForge pipeline, TraceGen enables sample-efficient, cross-embodiment adaptation and real-time robotic manipulation in diverse environments.
TraceGen is a world model paradigm for robotic learning that introduces a 3D “trace-space” as a unified, symbolic representation for scene-level manipulation. It frames future motion prediction as forecasting the geometric evolution of compact 3D trajectories, enabling efficient adaptation and transfer across embodiments, camera perspectives, and environments. In contrast to traditional pixel-space or token-based models, TraceGen’s abstraction removes dependence on texture, appearance, and embodiment, retaining only the geometric structures necessary for manipulation. This approach substantially increases inference speed and drastically reduces the requirement for per-platform annotated demonstrations. TraceGen is supported by TraceForge, a scalable pipeline for deriving dense 3D traces from large, heterogeneous human and robot video datasets, and positions trace-space as a core intermediate in cross-embodiment robot learning (Lee et al., 26 Nov 2025).
1. 3D Trace-Space Representation
Trace-space models a short temporal horizon as a set of spatial keypoints, each tracked over timesteps in a reference camera frame. The construction process is as follows:
- Reference Frame Keypoints: Place a uniform grid of $K$ keypoints on the RGB reference frame at time $t_0$.
- Trajectory Tracking: For each keypoint $k$, record the image-plane coordinates $(u, v)$ and depth $z$ at every step $l = 1, \dots, L$.
- Camera Motion Compensation: Points are transformed from world coordinates to the reference camera using extrinsic and intrinsic calibration. The transformation equations are:
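- $\mathbf{X}_c = R\,\mathbf{X}_w + \mathbf{t}$ (world to camera frame, with extrinsic rotation $R$ and translation $\mathbf{t}$)
- Projection to pixel coordinates $(u, v)$ via the intrinsics, recording $(u, v, z)$ with $z$ the depth in the reference camera.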
- Speed Retargeting: Each 3D point sequence is reparameterized by arc length $s$ and resampled at uniformly spaced values of $s$, normalizing for velocity across embodiments.
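The complete trace tensor holds one $(u, v, z)$ triple per keypoint and timestep, i.e., it has shape $K \times L \times 3$ for $K$ keypoints and $L$ steps. It abstracts away texture, shape, and object identity while retaining the geometric substrate needed for manipulation planning (Lee et al., 26 Nov 2025).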
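The camera-compensation and projection steps above can be sketched in a few lines of Python. This is a minimal illustration, not the TraceForge implementation: the names `world_to_trace` and `build_trace`, the pinhole intrinsics `fx, fy, cx, cy`, and the row-major rotation `R` with translation `t` are all assumptions introduced for the example.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def world_to_trace(
    p_world: Point3,
    R: Sequence[Sequence[float]],  # assumed 3x3 world-to-camera rotation (row-major)
    t: Sequence[float],            # assumed world-to-camera translation
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
) -> Point3:
    """Map one world point to (u, v, z) in the reference camera."""
    # World -> camera frame: X_c = R @ X_w + t
    xc, yc, zc = (
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)
    )
    if zc <= 0:
        raise ValueError("point is behind the reference camera")
    # Pinhole projection; depth z is kept alongside the pixel coordinates.
    u = fx * xc / zc + cx
    v = fy * yc / zc + cy
    return (u, v, zc)


def build_trace(tracks_world, R, t, fx, fy, cx, cy) -> List[List[Point3]]:
    """tracks_world[k][l] is keypoint k at step l; the output keeps that layout in (u, v, z)."""
    return [
        [world_to_trace(p, R, t, fx, fy, cx, cy) for p in track]
        for track in tracks_world
    ]


# Toy usage: one keypoint tracked over two steps, camera at the world origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tracks = [[(0.0, 0.0, 2.0), (0.1, 0.0, 2.0)]]
print(build_trace(tracks, identity, [0.0, 0.0, 0.0], 500.0, 500.0, 320.0, 240.0))
```

The nested-list output mirrors the keypoint-by-timestep layout of the trace tensor; an array library would normally replace the lists in practice.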
2. Model Architecture and Training Objectives
TraceGen employs a conditional flow/ODE model to forecast future trace increments, structured as follows:
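- Multi-Encoder Backbone: Separate encoders embed the conditioning inputs; they are kept frozen during training.
- Fusion and Conditioning:
- Visual features are concatenated and linearly projected into a fused representation.
- All conditioning tokens are then gathered into a single conditioning sequence for the decoder.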
- Flow-Based Decoder:
- The spatial grid is patchified into a sequence of tokens per frame, repeated for each predicted timestep.
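- The stochastic interpolant framework interpolates between pure noise $\epsilon$ and the ground-truth increments $x_1$. With a linear schedule, and taking $\tau = 0$ as the noise endpoint by convention, $x_\tau = (1 - \tau)\,\epsilon + \tau\,x_1$. The model $v_\theta$ is trained to match the time derivative $\dot{x}_\tau = x_1 - \epsilon$, minimizing the mean squared residual, where $c$ denotes the conditioning tokens:
$$\mathcal{L}(\theta) = \mathbb{E}_{x_1,\,\epsilon,\,\tau}\,\big\| v_\theta(x_\tau, \tau, c) - (x_1 - \epsilon) \big\|^2$$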
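To make the training objective concrete, the sketch below builds one training pair for a linear stochastic interpolant. It assumes the convention that $\tau = 0$ is pure noise and $\tau = 1$ is the ground-truth increment; `interpolant_pair` and `mse` are illustrative names, and the zero-valued prediction is only a stand-in for the flow decoder.

```python
import random
from typing import List, Tuple


def interpolant_pair(
    x1: List[float], tau: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Return (x_tau, target) for the linear path x_tau = (1 - tau) * eps + tau * x1."""
    eps = [rng.gauss(0.0, 1.0) for _ in x1]            # pure-noise endpoint
    x_tau = [(1.0 - tau) * e + tau * x for e, x in zip(eps, x1)]
    target = [x - e for e, x in zip(eps, x1)]          # d x_tau / d tau
    return x_tau, target


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared residual between predicted and target velocities."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
x1 = [0.02, -0.01, 0.005]      # toy flattened trace increments
tau = rng.random()             # interpolation time sampled in [0, 1)
x_tau, target = interpolant_pair(x1, tau, rng)

# Stand-in prediction; a trained model would compute v_theta(x_tau, tau, c).
pred = [0.0 for _ in x1]
print(mse(pred, target))
```

Because the path is linear, the target velocity does not depend on $\tau$, so following it from any intermediate point $x_\tau$ for the remaining time $1 - \tau$ lands exactly on $x_1$.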
Training Regime:
- Encoders are frozen; only the visual fusion and flow decoder are trained.
- Inference consists of 100-step ODE integration to reconstruct the full predicted trace.
This architecture enables rapid conditional generation of 3D trajectories while being lightweight (0.67B parameters) and suitable for real-time deployment (Lee et al., 26 Nov 2025).
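Inference as described above amounts to numerically integrating the learned velocity field. The sketch below uses fixed-step forward Euler with 100 steps on a toy field; the choice of Euler, the name `integrate`, and the toy field are assumptions for illustration, since the text states only that integration uses 100 steps.

```python
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def integrate(v: Velocity, x0: List[float], steps: int = 100) -> List[float]:
    """Forward-Euler integration of dx/dtau = v(x, tau) from tau = 0 to tau = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        tau = i * dt
        dx = v(x, tau)
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


# Toy field with a known answer: a constant velocity (goal - start)
# carries the sample from start to goal in unit time.
start = [0.5, -1.0, 0.25]
goal = [0.02, -0.01, 0.005]


def toy_velocity(x: List[float], tau: float) -> List[float]:
    return [g - s for s, g in zip(start, goal)]


print(integrate(toy_velocity, start))  # approximately equal to goal
```

In an actual deployment the velocity function would wrap the trained decoder and its conditioning tokens, and the integrated increments would then be accumulated into the full trace.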
3. Data Pipeline and Pretraining Corpus
TraceForge is the dedicated data pipeline for trace extraction and pretraining dataset assembly:
- Event Chunking: Automatic segmentation of scenes by motion magnitude or existing annotations.
- Pose, Depth, and Tracking: Fast depth & pose estimation (VGGT, adapted from SpatialTrackerV2); 2D tracking with CoTracker3 and 3D with TAPIP3D.
- World-to-Camera Alignment: Tracked 3D points are aligned to a consistent reference camera, then projected to screen coordinates.
- Speed Retargeting: Arc-length parametrization and uniform resampling preserve temporal consistency.
- Language Instruction Synthesis: For each chunk, a VLM generates (i) concise imperative, (ii) stepwise, and (iii) natural-language instructions.
The pipeline compiles over 123K video episodes spanning eight large-scale datasets, yielding more than 1.8M (RGB, depth, language) → 3D trace triplets, more than an order of magnitude beyond prior 3D-flow work. Pretraining runs for 200K steps with batch size 64, using a cosine-decay schedule and weight decay. The result is a generic 3D motion prior suited to rapid adaptation (Lee et al., 26 Nov 2025).
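The speed-retargeting step listed above can be illustrated with a small arc-length resampler. It is a sketch under assumptions: linear interpolation between tracked samples, a hypothetical function name `resample_by_arc_length`, and plain tuples for 3D points; the actual pipeline may differ.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def resample_by_arc_length(points: List[Point3], n_out: int) -> List[Point3]:
    """Resample a 3D polyline at n_out points equally spaced in arc length."""
    if len(points) < 2 or n_out < 2:
        raise ValueError("need at least two input points and two output points")
    # Cumulative arc length s at each input sample.
    s = [0.0]
    for a, b in zip(points, points[1:]):
        s.append(s[-1] + math.dist(a, b))
    total = s[-1]
    if total == 0.0:
        return [points[0]] * n_out  # stationary keypoint: nothing to retarget
    out: List[Point3] = []
    seg = 0
    for k in range(n_out):
        target = total * k / (n_out - 1)
        # Advance to the segment [s[seg], s[seg + 1]] that contains target.
        while seg < len(points) - 2 and s[seg + 1] < target:
            seg += 1
        span = s[seg + 1] - s[seg]
        w = 0.0 if span == 0.0 else (target - s[seg]) / span
        a, b = points[seg], points[seg + 1]
        out.append(tuple(ai + w * (bi - ai) for ai, bi in zip(a, b)))
    return out


# A fast first step (length 3) followed by a slow one (length 1) becomes evenly paced.
path = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(resample_by_arc_length(path, 5))
```

After resampling, consecutive samples are the same distance apart along the path, so a quick human reach and a slower robot motion along the same curve produce the same sequence of points.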
4. Few-Shot Adaptation and Embodiment Transfer
TraceGen realizes rapid adaptation through a few-shot “warm-up” procedure, leveraging its embodiment-agnostic trajectory outputs:
- Robot→Robot (in-domain): Five target robot videos are used for fine-tuning (500 gradient steps, batch size 4). No data or layout filtering is enforced, supporting open-world variability.
- Human→Robot (cross-embodiment): Five handheld, uncalibrated human demonstration videos of 3–4 seconds each are sufficient. No ground-truth extrinsics or object detectors are needed; the process is detector-free and works with in-the-wild footage.
- Controller: The predicted 3D trace grid serves as end-effector goals, tracked via a simple inverse-kinematics controller.
This process enables cross-embodiment learning: models can transition seamlessly between human demonstrations and robot deployment (Lee et al., 26 Nov 2025).
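Because a trace stores pixel coordinates plus depth, one plausible way to hand it to a controller is to back-project each predicted point into metric camera-frame coordinates. The sketch below assumes pinhole intrinsics `fx, fy, cx, cy` and uses hypothetical names `backproject` and `trace_to_waypoints`; the text does not specify how goals are selected from the grid, and the inverse-kinematics step itself is omitted.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def backproject(
    u: float, v: float, z: float,
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
) -> Point3:
    """Invert the pinhole projection: (u, v, z) -> camera-frame (x, y, z)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


def trace_to_waypoints(
    trace_uvz: List[Point3], fx: float, fy: float, cx: float, cy: float
) -> List[Point3]:
    """Convert one keypoint's predicted (u, v, z) sequence into metric waypoints."""
    return [backproject(u, v, z, fx, fy, cx, cy) for u, v, z in trace_uvz]


# Toy usage: a keypoint predicted to move 25 pixels to the right at 2 m depth.
predicted = [(320.0, 240.0, 2.0), (345.0, 240.0, 2.0)]
print(trace_to_waypoints(predicted, 500.0, 500.0, 320.0, 240.0))
```

The resulting waypoints are expressed in the reference camera frame, so a camera-to-robot transform would still be needed before passing them to an inverse-kinematics solver.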
5. Empirical Results and Performance Characteristics
TraceGen has been quantitatively evaluated in real-world tabletop manipulation with the Franka Research 3 robot:
- Zero-shot: All methods except proprietary Veo3.1 collapsed to 0% success.
- After 5-video warm-up:
- TraceGen: 80.0% mean success (32/40 trials across four skills)
- From-scratch: 25.0% under identical conditions
- Inference speed: TraceGen executes 380 predictions/minute (A5000 GPU); NovaFlow runs 9600× slower
- Human→Robot adaptation: TraceGen attains 67.5% (27/40) with only five uncalibrated human phone videos. From-scratch baseline: 0%.
- Scaling: Tripling the number of warm-up videos (from 5 to 15) produces only marginal gains (80.0% → 82.5%).
- Pretraining composition ablation: Cross-embodiment, heterogeneous data is essential. SSV2-only: 25%; Agibot-only: 45%; full diverse corpus: 70%.
- Long-horizon stability: TraceGen maintains ≈80% per-subtask success across four-step “Sorting,” with no cumulative error growth, whereas from-scratch models degrade to 40%.
A summary of results is given below:
| Setting | Success Rate | Inference Speed |
|---|---|---|
| 5 video warm-up (robot) | 80.0% | 380 pred/min (TraceGen) |
| 5 video warm-up (human) | 67.5% | 380 pred/min (TraceGen) |
| From-scratch | 25% (robot warm-up); 0% (human warm-up) | – |
| NovaFlow (API) | 15–20% | <1 pred/min |
| 3DFlowAction | 0% | 100 pred/min |
These results indicate that trace-space models are substantially more sample-efficient and computationally tractable than pixel-space video generators (Lee et al., 26 Nov 2025).
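The headline figures can be checked with a few lines of arithmetic. The trial counts and throughput come from the results above; the per-prediction latency is derived here and is not a number reported in the source.

```python
def success_rate(successes: int, trials: int) -> float:
    """Success rate as a percentage."""
    return 100.0 * successes / trials


robot_warmup = success_rate(32, 40)   # robot-video warm-up
human_warmup = success_rate(27, 40)   # human-video warm-up
latency_s = 60.0 / 380                # seconds per prediction at 380 predictions/minute

print(f"{robot_warmup:.1f}% {human_warmup:.1f}% {latency_s:.3f} s")
# prints "80.0% 67.5% 0.158 s"
```

At roughly 0.16 seconds per prediction, a new trace can be generated several times per second on the stated GPU.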
6. Advantages, Limitations, and Context
Trace-Space Advantages:
- Purely geometric prediction focuses model capacity on actionable structure, ignoring irrelevant appearance.
- Cross-embodiment and viewpoint invariance arise from abstracting trajectories into a keypoint grid expressed in a single reference camera frame, with camera motion compensated.
- The lightweight architecture enables real-time inference and few-shot transfer.
- Model outputs are directly usable as metric-space goals for low-level controllers.
Comparative Analysis:
- Pixel-space models must reconstruct dense visual data, resulting in high latency and inefficiency.
- Pixel-based systems lack direct 3D outputs, often requiring additional post-processing.
- Over long horizons, trace-space prediction yields better fine-grained accuracy and stability, owing to arc-length normalization and controller tracking.
Limitations and Future Directions:
- The linear stochastic interpolant training schedule exhibits limited mode control; diffusive alternatives may improve robustness.
- Noisy or corrective demonstration segments in pretraining data can degrade the prior; enhanced filtering is an open direction.
- Zero-shot generalization remains limited when exposed to novel objects or very different bodies.
- High-precision manipulation tasks may require denser grids or hybrid action modules.
- Non-anthropomorphic embodiments (e.g., legged systems) are untested within TraceGen's current abstraction and may reveal the limits of fixed-grid representations.
A subsequent advance replaces TraceGen’s fixed-grid methodology with B-spline-based semantic keypoints, globally aligned 3D curves, and hierarchical event-centric chunking, yielding further improvements in both trace forecasting and downstream policy conditioning (Lee et al., 11 Jun 2026).
7. Significance and Extensions
TraceGen constitutes a shift in robotic world modeling by making geometric trace-space the core scaffolding for learning manipulation from large, diverse, cross-embodiment video corpora. Its data pipeline and generative design open up robot-relevant priors from ordinary video, removing dependence on hand-labeled actions, object detectors, or physically calibrated demonstration data. TraceGen’s influence is evident in subsequent work that extends its architectural and representational framework for scalable, compositional, and cross-domain robot learning (Lee et al., 26 Nov 2025, Lee et al., 11 Jun 2026).