
TraceGen: Unified Trace-Space World Model

Updated 28 November 2025
  • TraceGen is a world model architecture that uses a trace-space representation—dense 3D keypoint trajectories—to abstract scene motion and support cross-domain learning.
  • It employs a multi-stream encoding approach combining RGB, depth, and language inputs to condition a flow-based decoder that generates velocity increments.
  • The model adapts rapidly and efficiently, reaching 80% task success after warm-up on only a handful of robot demonstrations and running 50–600× faster than pixel-space world models.

TraceGen denotes a class of approaches and a specific world model architecture for representing, predicting, and manipulating traces—compact, symbolic records of sequential structure—across multiple domains. The term has been instantiated in several applied contexts: as a symbolic generator for code-tracing exercises (Eisenhofer et al., 2022), as a probabilistic engine for infinite concurrent traces (Abbes et al., 2 Oct 2024), in digital content provenance (Gan et al., 28 Apr 2025), and, most recently, as a unified framework for large-scale cross-embodiment video learning in robotics using a 3D scene-centric "trace space" (Lee et al., 26 Nov 2025). This article focuses on the methodological innovations, computational primitives, and empirical results associated with the TraceGen world model (Lee et al., 26 Nov 2025), while situating its place among related trace-generating paradigms.

1. Motivation for Trace-Space World Modeling

A persistent obstacle in robot learning is the inability to leverage vast pools of heterogeneous demonstration videos—human and robotic—due to embodiment, camera, and environmental mismatches. Standard video world models expend capacity modeling appearance and lighting, and are costly to adapt across domains. TraceGen introduces a unifying "trace-space" representation: a dense, scene-centric array of 3D keypoint trajectories abstracted from appearance but retaining the geometric precision required for manipulation. Each trace is parameterized by $K$ tracked scene keypoints (e.g., $K = 400$) over $L$ timesteps in a canonical reference frame:

$$T_{\mathrm{ref}}^{t:t+L} = \{(x_i^t, y_i^t, z_i^t)\}_{i=1:K,\; t=1:L}$$

This abstraction yields embodiment and camera invariance, densely summarizes scene motion, and enables learning from human, robot, and environmental demonstrations in a shared space, reducing the data and compute burden relative to pixel-level models (Lee et al., 26 Nov 2025).
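Concretely, a trace can be stored as an array of shape $(L, K, 3)$ holding the canonical-frame coordinates of each keypoint at each timestep. The following minimal sketch (array layout and variable names are illustrative, not taken from the released code) shows this representation and its relation to the increment targets used later by the world model:

```python
import numpy as np

L, K = 16, 400                               # horizon length and number of tracked keypoints
# Trace tensor: canonical-frame (x, y, z) of keypoint i at timestep t (synthetic values here)
T_ref = np.random.rand(L, K, 3).astype(np.float32)

# A flow-based world model predicts increments rather than absolute positions:
delta_T = T_ref[1:] - T_ref[:-1]             # shape (L-1, K, 3): the ΔT^t targets

# Given predicted increments and the observed first frame, the trajectory is
# recovered by cumulative summation:
T_recon = np.concatenate([T_ref[:1], T_ref[:1] + np.cumsum(delta_T, axis=0)])
assert np.allclose(T_recon, T_ref, atol=1e-5)
```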

2. TraceForge: Cross-Embodiment 3D Trace Dataset Construction

TraceForge is the data engine that transforms raw human and robot demonstration videos into usable trace-space representations for TraceGen pretraining. The pipeline consists of four automated stages:

  1. Event Chunking & Instruction Generation: Short task-centric clips are extracted from raw videos, and for each clip large vision-language models generate multiple instruction phrasings (imperative, sequential, and natural).
  2. 3D Point Tracking with Depth and Camera Pose: Each frame is annotated with a fixed grid of 2D keypoints. Depth maps and camera extrinsics are predicted via spatial transformers (e.g., VGGT from SpatialTrackerV2). CoTracker3 tracks 2D keypoints, which are "lifted" into 3D world coordinates and aligned to the reference camera via extrinsic transformations.
  3. World-to-Camera Alignment: Each 3D point is projected onto the image plane, yielding (x, y, z) coordinates per keypoint and timestep.
  4. Speed Retargeting: Trajectories are parameterized by normalized arc-length and resampled to uniform temporal grids, aligning actions across varying speeds and embodiments (a sketch of the lifting and resampling steps follows this list).
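The geometric core of stages 2 and 4 can be sketched compactly. The snippet below is an illustrative approximation under standard pinhole-camera assumptions (the intrinsics/extrinsics conventions and function names are assumptions, not TraceForge internals): it lifts tracked 2D keypoints into 3D world coordinates using per-frame depth and camera pose, then resamples a trajectory onto a uniform normalized arc-length grid.

```python
import numpy as np

def lift_to_world(uv, depth, K_mat, T_cam2world):
    """Back-project 2D pixel tracks (N, 2) with per-point depth (N,) into world
    coordinates using a pinhole model and a 4x4 camera-to-world transform."""
    ones = np.ones((uv.shape[0], 1))
    pix = np.concatenate([uv, ones], axis=1)          # (N, 3) homogeneous pixels
    rays = np.linalg.inv(K_mat) @ pix.T               # (3, N) camera-frame rays
    cam_pts = rays * depth                            # scale each ray by its depth
    cam_h = np.vstack([cam_pts, np.ones((1, uv.shape[0]))])
    return (T_cam2world @ cam_h)[:3].T                # (N, 3) world coordinates

def retarget_speed(traj, num_steps):
    """Resample a (T, 3) trajectory to num_steps points uniformly spaced in
    normalized arc length, so fast and slow demonstrations align temporally."""
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s = s / max(s[-1], 1e-8)                          # normalized arc length in [0, 1]
    s_uniform = np.linspace(0.0, 1.0, num_steps)
    return np.stack([np.interp(s_uniform, s, traj[:, d]) for d in range(3)], axis=1)
```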

TraceForge was applied across eight video sources (robot laboratory recordings and diverse in-the-wild human recordings), yielding 123K clips and 1.8M (observation, trace, language) triplets that support scalable, cross-domain world modeling for TraceGen (Lee et al., 26 Nov 2025).

3. TraceGen World Model Architecture

TraceGen is a conditional flow-based generative model predicting future 3D trace increments from current RGB-D observations and textual task context.

  • Multi-Stream Encoding: Input RGB is encoded via frozen DINOv3 (ViT-L/16) and SigLIP, while depth is passed through a learnable adapter plus SigLIP. Text instructions are encoded via frozen T5-Base. Features are concatenated and projected to obtain a conditioning tensor $F_{\mathrm{cond}}$.
  • Flow-Based Trace Decoding: The model predicts velocity increments $\Delta T^t = T^{t+1} - T^t$ instead of absolute positions, facilitating stable learning and efficient integration. Spatially, the 400 keypoints are patchified and processed by an adapted CogVideoX 3D transformer decoder, conditioned via adaptive LayerNorm injected with $F_{\mathrm{cond}}$.
  • Stochastic Interpolant Training: Given ground-truth increments $\mathbf{X}^1 = \Delta T$ and noise $\mathbf{X}^0 \sim \mathcal{N}(0, I)$, the model is trained via flow matching: $\mathcal{L}_{\mathrm{SI}} = \mathbb{E}_{\tau, \mathbf{X}^0, \mathbf{X}^1} \bigl\| v_\theta(\mathbf{X}^\tau, \tau, F_{\mathrm{cond}}) - (\mathbf{X}^1 - \mathbf{X}^0) \bigr\|_2^2$, where $\mathbf{X}^\tau = (1-\tau)\mathbf{X}^0 + \tau \mathbf{X}^1$ and $\tau \in [0,1]$.
  • Inference Protocol: At test time, the ODE $\dot{\mathbf{X}}^\tau = v_\theta(\mathbf{X}^\tau, \tau, F_{\mathrm{cond}})$ is integrated (100 steps) to generate $\Delta T$, which is cumulatively summed to obtain full keypoint trajectories that robot controllers execute via inverse kinematics (see the sketch after this list).
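The training objective and sampling loop can be summarized in a short sketch. The code below is a minimal PyTorch illustration assuming a generic velocity network `v_theta(x, tau, f_cond)` and a plain Euler integrator; tensor shapes and names are assumptions rather than details of the released implementation.

```python
import torch

def flow_matching_loss(v_theta, x1, f_cond):
    """Stochastic-interpolant / flow-matching loss for trace increments.
    x1: ground-truth increments ΔT with shape (B, L, K, 3)."""
    x0 = torch.randn_like(x1)                                   # Gaussian noise sample X^0
    tau = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)    # interpolation time τ ∈ [0, 1]
    x_tau = (1.0 - tau) * x0 + tau * x1                         # linear interpolant X^τ
    target = x1 - x0                                            # constant velocity target
    pred = v_theta(x_tau, tau.flatten(), f_cond)
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def sample_increments(v_theta, f_cond, shape, steps=100):
    """Euler integration of dX/dτ = v_θ(X, τ, F_cond) from τ = 0 to 1."""
    x = torch.randn(shape)                                      # start from noise X^0
    dt = 1.0 / steps
    for i in range(steps):
        tau = torch.full((shape[0],), i * dt)
        x = x + dt * v_theta(x, tau, f_cond)
    return x    # predicted ΔT; cumulative-sum over time to recover the trajectory
```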

All encoder weights are frozen during adaptation; only the flow decoder is fine-tuned for target domains (Lee et al., 26 Nov 2025).

4. Adaptation Protocols and Transfer

TraceGen achieves rapid adaptation to new scenes, embodiments, and tasks using only a small number of target videos:

  • Robot→Robot Adaptation: Finetuning is performed on as few as 5 in-domain robot demonstrations (differing in objects and poses) for 5–10 epochs, aligning the pretrained motion prior with new kinematics. No architecture or learning rate changes are required.
  • Human→Robot Transfer: Human demonstrations, uncalibrated and recorded via handheld devices, are processed through TraceForge for 3D trace extraction. The decoder is then fine-tuned in the same manner, allowing domain transfer without the need for object detectors or dense pixel supervision.

Notably, only about 3–4 minutes of footage (5 short videos) are required for effective transfer, substantially reducing sample complexity compared to conventional world models (Lee et al., 26 Nov 2025).
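A minimal sketch of this adaptation recipe, assuming a model with separate `encoders` and `decoder` submodules and reusing the flow-matching loss sketched above (module names, optimizer, and learning rate are illustrative, not the authors' settings):

```python
import torch

def adapt_to_target_domain(model, target_loader, epochs=10, lr=1e-4):
    # Freeze all encoder weights; only the flow decoder is updated.
    for p in model.encoders.parameters():
        p.requires_grad = False

    opt = torch.optim.AdamW(model.decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for rgbd, text, delta_t in target_loader:      # ~5 short target-domain videos
            f_cond = model.encoders(rgbd, text)        # frozen conditioning features
            loss = flow_matching_loss(model.decoder, delta_t, f_cond)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```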

5. Quantitative Results and Empirical Benchmarks

TraceGen's empirical performance demonstrates its practical strengths:

  • Task Generalization and Efficiency: On the Franka R3 robot, TraceGen pretrained on the TraceForge corpus and warmed up with 5 in-domain videos achieves 80% success across four manipulation tasks after adaptation. With only 5 uncalibrated human demonstration videos, success remains at 67.5%, compared to 0% for models trained from scratch.
  • Inference Speed: TraceGen runs at ~200 predictions/minute (RTX A5000), compared to 3DFlowAction’s ~50/min and NovaFlow/Veo’s <4/min, realizing a 50–600× speedup over pixel-space models.
  • Ablations: Performance gains are attributable to both embodiment alignment (robot data) and linguistic/motion diversity (human and robot) in pretraining. Long-horizon tasks sustain high stepwise success rates, whereas pixel-space models degrade significantly (Lee et al., 26 Nov 2025).

A plausible implication is that the trace-space paradigm offers a compelling trade-off for robot learning: high sample efficiency, fast inference, and genuinely cross-embodiment generality without heavy reliance on object-centric or camera-calibration priors.

6. Limitations and Extension Opportunities

TraceGen presents several notable limitations and directions for further work:

  • ODE Schedule: TraceGen currently employs a linear interpolant in flow-matching. Richer schedules may improve distributional expressivity.
  • Demonstration Noise: Noisy or corrective demonstration motions can introduce suboptimal supervision. Improved data filtering is needed for optimal policy learning.
  • Zero-Shot Feasibility: Without warm-up, physically infeasible traces can be generated for novel embodiments. This suggests further modeling constraints or embodiment cues may be needed.
  • Precision and Density: Very fine manipulation tasks or complex, non-humanlike platforms (e.g., quadrupeds) may require higher spatial resolution or additional adaptation modules beyond current 20×20 grids.
  • Downstream Policies: Scene-centric trace predictions must be converted to platform-specific joint commands via separate layers (e.g., inverse kinematics), constituting an additional adaptation step (Lee et al., 26 Nov 2025).

7. Connections to Broader Trace-Generation Paradigms

The TraceGen terminology encompasses a spectrum of generative trace methodologies:

  • Symbolic Control Flow Trace Generation: Tatsu instantiates TraceGen as bounded unwinding plus SMT-based instance generation over control-flow skeletons for programming exercises. Here, traces are the execution paths induced by program semantics, realized via quantifier-free SSA translation and Z3 model extraction. The approach yields rich pools of semantically grounded trace instances with formal guarantees on path coverage, though the tooling is limited around floating-point arithmetic, objects, and quantifier-rich invariants (Eisenhofer et al., 2022). A schematic illustration of the SMT step follows this list.
  • Uniform Infinite Trace Generation: Techniques for on-the-fly uniform sampling of infinite process traces leverage trace monoids, Möbius polynomials, and randomized recursive decomposition. These algorithms provide correctness and distributional guarantees, with rejection-based and rejection-less variants trading memory and speed, and explicit analysis for chordal/low-treewidth commutation graphs (Abbes et al., 2 Oct 2024).
  • Trace-Based Provenance in Content Generation: In digital watermarking and tamper localization, traces refer to embedded provenance signals extracted via frequency-coordinated decoders, structured over DCT bands to support robust provenance and manipulation detection in generative image pipelines (Gan et al., 28 Apr 2025).
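As a schematic illustration of the SMT step (not the Tatsu tool itself, and using a hypothetical path condition), the following snippet asks Z3 for a concrete model of a quantifier-free path constraint; the returned model serves as one trace instance for the corresponding control-flow path.

```python
# Hypothetical illustration of SMT-based trace-instance generation: solve a
# quantifier-free path condition to obtain concrete inputs that drive
# execution down a chosen control-flow path.
from z3 import Int, Solver, And, sat

x, y = Int("x"), Int("y")
# Path condition for a skeleton like: if x > 0: if y < x: <target block>
path_condition = And(x > 0, y < x)

solver = Solver()
solver.add(path_condition)
if solver.check() == sat:
    model = solver.model()
    print({"x": model[x], "y": model[y]})   # one concrete instance for this path
```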

Across these instantiations, TraceGen denotes the generative modeling of trace-structured data—sequences encoding program execution, process concurrency, manipulation trajectories, or content provenance—realized via rigorous algorithmic, formal, and learning-based frameworks.
