
TraceGen: A 3D Trace-Space World Model

Updated 30 June 2026
  • TraceGen is a world model paradigm that represents short-term robotic motion through a compact 3D trace-space, focusing solely on geometric trajectories.
  • It employs a conditional flow/ODE architecture with a multi-encoder backbone to rapidly forecast 3D trajectories, bypassing the inefficiencies of pixel-space models.
  • Supported by the scalable TraceForge pipeline, TraceGen enables sample-efficient, cross-embodiment adaptation and real-time robotic manipulation in diverse environments.

TraceGen is a world model paradigm for robotic learning that introduces a 3D “trace-space” as a unified, symbolic representation for scene-level manipulation. It frames future motion prediction as forecasting the geometric evolution of compact 3D trajectories, enabling efficient adaptation and transfer across embodiments, camera perspectives, and environments. In contrast to traditional pixel-space or token-based models, TraceGen’s abstraction removes dependence on texture, appearance, and embodiment, retaining only the geometric structures necessary for manipulation. This approach substantially increases inference speed and drastically reduces the requirement for per-platform annotated demonstrations. TraceGen is supported by TraceForge, a scalable pipeline for deriving dense 3D traces from large, heterogeneous human and robot video datasets, and positions trace-space as a core intermediate in cross-embodiment robot learning (Lee et al., 26 Nov 2025).

1. 3D Trace-Space Representation

Trace-space models a short temporal horizon as a set of $K=400$ spatial keypoints, each tracked over $L$ timesteps in a reference camera frame. The construction process is as follows:

  1. Reference Frame Keypoints: Place a uniform $20\times20$ grid of pixels on the RGB reference at time $t$.
  2. Trajectory Tracking: For each keypoint $i$, record the image-plane coordinates $(x_{i}^{t+\ell}, y_{i}^{t+\ell})$ and depth $z_{i}^{t+\ell}$ at every step $\ell=0,\dots,L-1$.
  3. Camera Motion Compensation: Points are transformed from world coordinates to the reference camera using the extrinsic $[R_t \mid t_t]$ and intrinsic $K$ calibration. The transformation equations are:
    • $X_c = R_t X_w + t_t$ (world to camera frame)
    • Projection to pixel coordinates $(x, y)$ via the intrinsics $K$, recording $(x, y, z)$.
  4. Speed Retargeting: Each 3D point sequence is reparameterized by arc length $s$ and resampled at uniform $s$ values, normalizing for velocity across embodiments.

The complete trace tensor is $\mathcal{T} \in \mathbb{R}^{K \times L \times 3}$, abstracting away texture, shape, and object identity while maintaining a geometric substrate for effective manipulation planning (Lee et al., 26 Nov 2025).
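The construction steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy camera parameters, the horizon of 8 steps, the linear interpolation used for resampling, and the choice to resample in the camera frame before projecting are all hypothetical.

```python
import numpy as np

def world_to_camera(points_w, R, t):
    """Map (N, 3) world points into the reference camera frame: X_c = R X_w + t."""
    return points_w @ R.T + t

def project(points_c, K_intr):
    """Pinhole projection; returns (N, 3) rows of (x, y, z) with x, y in pixels."""
    z = points_c[:, 2]
    uv = (points_c @ K_intr.T)[:, :2] / z[:, None]
    return np.column_stack([uv, z])

def resample_by_arc_length(track, num_samples):
    """Resample one (T, 3) track at uniformly spaced arc-length values."""
    seg = np.linalg.norm(np.diff(track, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # cumulative arc length
    if s[-1] == 0.0:                              # stationary point
        return np.repeat(track[:1], num_samples, axis=0)
    s_uniform = np.linspace(0.0, s[-1], num_samples)
    return np.column_stack(
        [np.interp(s_uniform, s, track[:, d]) for d in range(3)]
    )

# Toy data: 400 keypoints (a 20x20 grid) tracked over 20 raw frames.
K_pts, L = 400, 8                                  # L = 8 is an assumed horizon
rng = np.random.default_rng(0)
steps = rng.normal(scale=0.01, size=(K_pts, 20, 3))
tracks_w = np.array([0.0, 0.0, 1.0]) + np.cumsum(steps, axis=1)

R, t = np.eye(3), np.zeros(3)                      # assumed extrinsics
K_intr = np.array([[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]])               # assumed intrinsics

trace = np.stack([
    project(resample_by_arc_length(world_to_camera(tr, R, t), L), K_intr)
    for tr in tracks_w
])
print(trace.shape)
```

Each keypoint contributes one row of $L$ resampled $(x, y, z)$ samples, so a point that moves quickly and one that moves slowly along the same path yield the same geometric trace.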

2. Model Architecture and Training Objectives

TraceGen employs a conditional flow/ODE model to forecast future trace increments, structured as follows:

  • Multi-Encoder Backbone:
    • RGB features: DINOv3 (ViT-L/16)
    • Semantic features: SigLIP
    • Depth features: SigLIP (with 1x1 conv stem)
    • Text instruction: frozen T5-Base
  • Fusion and Conditioning:
    • Visual features are concatenated and linearly projected.
    • The fused visual tokens and the text tokens together form the conditioning input.
  • Flow-Based Decoder:
    • The spatial grid is patchified into tokens per frame across the predicted timesteps.
    • The stochastic interpolant framework interpolates between pure noise and ground-truth increments. Writing $X_\tau$ for the interpolant at interpolation time $\tau$ and $c$ for the conditioning tokens, the model $v_\theta$ is trained to match the time derivative, minimizing the mean squared residual:

    $\mathcal{L}(\theta) = \mathbb{E}\left[\left\| v_\theta(X_\tau, \tau, c) - \dot{X}_\tau \right\|_2^2\right]$

  • Training Regime:

    • Encoders are frozen; only the visual fusion and flow decoder are trained.
    • Inference consists of 100-step ODE integration to reconstruct the full predicted trace.

This architecture enables rapid conditional generation of 3D trajectories while being lightweight (0.67B parameters) and suitable for real-time deployment (Lee et al., 26 Nov 2025).
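The training objective and the 100-step inference loop can be illustrated with a toy NumPy sketch. It assumes a linear interpolant between noise and target, a forward Euler integrator, and an analytic "oracle" velocity field standing in for the trained network; none of these details are confirmed by the source beyond the linear schedule and the step count, and all names are hypothetical.

```python
import numpy as np

def interpolant(noise, target, tau):
    """Linear interpolant x_tau and its derivative with respect to tau."""
    x_tau = (1.0 - tau) * noise + tau * target
    return x_tau, target - noise

def flow_loss(velocity_fn, noise, target, tau):
    """Mean squared residual between predicted and true time derivative."""
    x_tau, dx_dtau = interpolant(noise, target, tau)
    return float(np.mean((velocity_fn(x_tau, tau) - dx_dtau) ** 2))

def integrate(velocity_fn, x0, num_steps=100):
    """Forward Euler integration of dx/dtau = v(x, tau) from tau=0 to tau=1."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
target = rng.normal(size=(400, 8, 3))   # stand-in for ground-truth increments
noise = rng.normal(size=target.shape)

def oracle(x, tau):
    """Exact conditional velocity for a single known target (toy stand-in)."""
    return (target - x) / (1.0 - tau)

loss = flow_loss(oracle, noise, target, tau=0.3)
sample = integrate(oracle, noise, num_steps=100)
print(loss, float(np.abs(sample - target).max()))
```

With the oracle field the loss is numerically zero and integration recovers the target; a learned $v_\theta$ would approximate this behavior in expectation over targets and conditioning.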

3. Data Pipeline and Pretraining Corpus

TraceForge is the dedicated data pipeline for trace extraction and pretraining dataset assembly:

  • Event Chunking: Automatic segmentation of scenes by motion magnitude or existing annotations.
  • Pose, Depth, and Tracking: Fast depth & pose estimation (VGGT, adapted from SpatialTrackerV2); 2D tracking with CoTracker3 and 3D with TAPIP3D.
  • World-to-Camera Alignment: Tracked 3D points are aligned to a consistent reference camera, then projected to screen coordinates.
  • Speed Retargeting: Arc-length parametrization and uniform resampling preserve temporal consistency.
  • Language Instruction Synthesis: For each chunk, a VLM generates (i) concise imperative, (ii) stepwise, and (iii) natural-language instructions.

The pipeline compiles over 123K video episodes, spanning eight large-scale datasets and more than 1.8M (RGB, depth, language) → 3D trace triplets, exceeding prior 3D-flow work by more than an order of magnitude. Pretraining runs for 200K steps with batch size 64, using a cosine-decay learning-rate schedule and weight decay. This yields a generic 3D motion prior suited to rapid adaptation (Lee et al., 26 Nov 2025).
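As one illustration of the event-chunking step, segmentation by motion magnitude could look like the sketch below. The threshold, the minimum chunk length, and the per-frame motion score are hypothetical choices; the source does not specify how TraceForge computes or thresholds motion.

```python
import numpy as np

def chunk_by_motion(motion, threshold, min_len=2):
    """Return (start, end) pairs, end exclusive, of runs where motion > threshold."""
    active = np.asarray(motion) > threshold
    chunks, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                         # a new moving segment begins
        elif not is_active and start is not None:
            if i - start >= min_len:          # drop very short bursts
                chunks.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        chunks.append((start, len(active)))   # segment runs to the last frame
    return chunks

# Hypothetical per-frame motion scores, e.g. mean keypoint displacement.
motion = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
chunks = chunk_by_motion(motion, threshold=0.5, min_len=2)
print(chunks)
```

Each returned index pair would then be passed to the depth, pose, and tracking stages as one episode chunk.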

4. Few-Shot Adaptation and Embodiment Transfer

TraceGen realizes rapid adaptation through a few-shot “warm-up” procedure, leveraging its embodiment-agnostic trajectory outputs:

  • Robot→Robot (in-domain): Five target robot videos are used for fine-tuning (500 gradient steps, batch size 4). No data or layout filtering is enforced, supporting open-world variability.
  • Human→Robot (cross-embodiment): Five handheld, uncalibrated human demonstration videos of 3–4 seconds each are sufficient. No ground-truth extrinsics or object detectors are needed; the process is detector-free and works with in-the-wild footage.
  • Controller: The predicted trace grid serves as end-effector goals, tracked via a simple inverse-kinematics controller.

This process enables cross-embodiment learning: models can transition seamlessly between human demonstrations and robot deployment (Lee et al., 26 Nov 2025).

5. Empirical Results and Performance Characteristics

TraceGen has been quantitatively evaluated in real-world tabletop manipulation with the Franka Research 3 robot:

  • Zero-shot: All methods except proprietary Veo3.1 collapsed to 0% success.
  • After 5-video warm-up:
    • TraceGen: 80.0% mean success (32/40 trials across four skills)
    • From-scratch: 25.0% under identical conditions
    • Inference speed: TraceGen executes 380 predictions/minute (A5000 GPU); NovaFlow runs on the order of 600× slower
  • Human→Robot adaptation: TraceGen attains 67.5% (27/40) with only five uncalibrated human phone videos. From-scratch baseline: 0%.
  • Scaling: Tripling the number of warm-up videos (from 5 to 15) produces only marginal gains (80.0% → 82.5%).
  • Pretraining composition ablation: Cross-embodiment, heterogeneous data is essential. SSV2-only: 25%; Agibot-only: 45%; full diverse corpus: 70%.
  • Long-horizon stability: TraceGen maintains ≈80% per-subtask success across four-step “Sorting,” with no cumulative error growth, whereas from-scratch models degrade to 40%.

A summary of results is given below:

Setting | Success Rate | Inference Speed
5-video warm-up (robot) | 80.0% | 380 pred/min (TraceGen)
5-video warm-up (human) | 67.5% | Same
From-scratch | 25% (robot), 0% (human) |
NovaFlow (API) | 15–20% | <1 pred/min
3DFlowAction | 0% | 100 pred/min

These results indicate that trace-space models are substantially more sample-efficient and computationally tractable than pixel-space video generators (Lee et al., 26 Nov 2025).
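The headline figures follow directly from the reported trial counts and throughput, as this short check shows. The helper name is hypothetical; the inputs are the numbers quoted above.

```python
def success_rate(successes, trials):
    """Success rate as a percentage."""
    return 100.0 * successes / trials

robot_rate = success_rate(32, 40)        # robot warm-up: 32 of 40 trials
human_rate = success_rate(27, 40)        # human warm-up: 27 of 40 trials
seconds_per_prediction = 60.0 / 380      # 380 predictions per minute

print(f"robot: {robot_rate:.1f}%  human: {human_rate:.1f}%")
print(f"latency: {seconds_per_prediction:.3f} s per prediction")
```

At 380 predictions per minute, one forecast takes roughly 0.16 seconds, compared with more than a minute per prediction for a method running below 1 pred/min.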

6. Advantages, Limitations, and Context

Trace-Space Advantages:

  • Purely geometric prediction focuses model capacity on actionable structure, ignoring irrelevant appearance.
  • Cross-embodiment and viewpoint invariance arise from the abstraction of trajectories into a camera-motion-compensated grid in the reference frame.
  • The lightweight architecture enables real-time inference and few-shot transfer.
  • Model outputs are directly usable as metric-space goals for low-level controllers.

Comparative Analysis:

  • Pixel-space models must reconstruct dense visual data, resulting in high latency and inefficiency.
  • Pixel-based systems lack direct 3D outputs, often requiring additional post-processing.
  • Fine-grained accuracy and stability are superior in trace-space over long horizons due to normalized arc-length and controller tracking.

Limitations and Future Directions:

  • The linear stochastic interpolant training schedule exhibits limited mode control; diffusive alternatives may improve robustness.
  • Noisy or corrective demonstration segments in pretraining data can degrade the prior; enhanced filtering is an open direction.
  • Zero-shot generalization remains limited when exposed to novel objects or very different bodies.
  • High-precision manipulation tasks may require denser grids or hybrid action modules.
  • Non-anthropomorphic embodiments (e.g., legged systems) are untested within TraceGen's current abstraction and may reveal the limits of fixed-grid representations.

A subsequent advance (Lee et al., 11 Jun 2026) replaces TraceGen’s fixed-grid methodology with B-spline-based semantic keypoints, globally aligned 3D curves, and hierarchical event-centric chunking, yielding further improvements in both trace forecasting and downstream policy conditioning.

7. Significance and Extensions

TraceGen constitutes a shift in robotic world modeling by situating geometric trace-space as the core scaffolding for learning manipulation from large, diverse, cross-embodiment video corpora. Its data pipeline and generative design democratize access to robot-relevant priors from open video, removing dependence on hand-labeled actions, object detectors, or physically calibrated demonstration data. TraceGen’s influence is evident in subsequent work that extends its architectural and representational framework for scalable, compositional, and cross-domain robot learning (Lee et al., 26 Nov 2025, Lee et al., 11 Jun 2026).

