TraceForge Data Engine
- TraceForge Data Engine is a modular pipeline that converts heterogeneous video streams into uniform 3D trace datasets through event chunking, instruction generation, and geometric alignment.
- It employs advanced techniques such as 3D point tracking, camera pose and depth estimation, and speed retargeting to ensure cross-embodiment consistency with centimeter-level accuracy.
- The engine streamlines large-scale world-model pretraining by integrating robust data augmentation, quality filtering, and efficient processing across diverse robotic and human demonstrations.
The TraceForge Data Engine is a modular pipeline for processing heterogeneous video sources—including robot, egocentric human, and in-the-wild footage—into unified 3D trajectory ("trace") datasets suitable for cross-embodiment world-model pretraining. Developed to support the TraceGen world model, TraceForge enables learning from observations that transcend embodiment, camera, and environment, producing a large-scale corpus of 123,000 videos and 1.8 million observation–trace–language triplets with centimeter-level endpoint accuracy (Lee et al., 26 Nov 2025).
1. System Architecture and Data Flow
TraceForge operates as a linear data pipeline with five core stages, each performing a transformation to normalize, abstract, and align raw video streams for downstream learning tasks. The key modules are:
- Video Ingestion and Event Chunking: Inputs are raw RGB or RGB-D videos from mixed human and robot sources. Videos are segmented into “event chunks,” each representing a single manipulation attempt. When frame-level labels are absent, chunk boundaries are established via a motion-magnitude filter on 2D keypoints.
- Instruction Generation: Each chunk is represented by sampled frames (beginning, middle, and end), and a vision–language model (VLM) generates up to four textual instructions in varying styles: concise imperatives, stepwise commands, and natural requests.
- 3D Point Tracking with Pose & Depth: This stage estimates camera pose/extrinsics and depth per frame via a pretrained network (VGGT). A uniform image grid defines 400 keypoints, which CoTracker3 tracks across frames. Each 2D keypoint $(u, v)$ with estimated depth $d$ is lifted to 3D via

  $$\mathbf{p} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},$$

  where $K$ is the intrinsic matrix.
- World-to-Camera Alignment: All 3D traces are transformed into a unified reference frame (the “screen-aligned” camera system at a reference time $t_0$):

  $$\mathbf{p}^{(t_0)} = R_{t_0}\,\mathbf{p}^{w} + \mathbf{t}_{t_0},$$

  where $[R_{t_0} \mid \mathbf{t}_{t_0}]$ are the extrinsics of the reference frame. Points are then projected back to screen coordinates with $K$, forming per-keypoint traces $\tau = \{(u_t, v_t, d_t)\}_{t=1}^{T}$.
- Speed Retargeting: The per-point arc length

  $$s(t) = \sum_{k=1}^{t} \left\lVert \mathbf{p}_k - \mathbf{p}_{k-1} \right\rVert_2$$

  normalizes execution rates, and trajectories are resampled to a fixed length $L = 32$ at uniform arc-length intervals.
The output of TraceForge is a set of observation–trace–language triplets $(o, \tau, \ell)$, handed off for model pretraining.
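One plausible in-memory schema for such a triplet is sketched below; the field names and shapes are illustrative assumptions, since the paper does not prescribe a storage format:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TraceTriplet:
    """One observation-trace-language sample (field names/shapes are illustrative)."""
    frames: np.ndarray       # (F, H, W, 3) uint8 sampled observation frames
    instructions: list[str]  # up to four VLM-generated instruction strings
    trace: np.ndarray        # (32, 400, 3) float32: L=32 resampled steps,
                             # N=400 keypoints, screen-aligned (u, v, d) coordinates
```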
Pseudocode Summary:
```python
# High-level sketch of the TraceForge pipeline (names follow the stage descriptions above)
for V in videos:
    chunks = event_chunking(V)                                # segment into single manipulation attempts
    for chunk in chunks:
        frames = sample_frames(chunk)                         # beginning / middle / end frames
        instrs = generate_instructions(frames)                # VLM-generated language instructions
        poses, depths = VGGT(frames)                          # per-frame camera pose and depth
        ks = init_grid(20, 20)                                # 400 uniformly spaced keypoints
        tracks2D = CoTracker3.track(frames, ks)               # 2D point tracks across frames
        tracks3D = lift_to_3D(tracks2D, depths, K)            # back-project with intrinsics K
        aligned = world_to_cam(tracks3D, poses, ref=chunk.start)  # screen-aligned reference frame
        normed = speed_retarget(aligned, L=32)                # arc-length resampling to fixed length
        save_triplet(frames, instrs, normed)
```
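As a concrete illustration, the geometric helpers used in the pseudocode might look as follows. This is a minimal sketch, assuming dense per-frame depth maps, known intrinsics $K$, and world-to-camera poses $(R_t, \mathbf{t}_t)$; the function names mirror the pseudocode and are not an official TraceForge API.

```python
import numpy as np

def lift_to_3D(tracks2D, depths, K):
    """Back-project 2D tracks (T, N, 2) into per-frame camera-space 3D points (T, N, 3).

    tracks2D: pixel coordinates (u, v) for each frame and keypoint
    depths:   sequence of per-frame dense depth maps (H, W)
    K:        3x3 camera intrinsic matrix
    (Out-of-bounds and occluded points are ignored here for brevity.)
    """
    T, N, _ = tracks2D.shape
    K_inv = np.linalg.inv(K)
    pts = np.empty((T, N, 3))
    for t in range(T):
        u, v = tracks2D[t, :, 0], tracks2D[t, :, 1]
        d = depths[t][v.astype(int), u.astype(int)]           # sample depth at tracked pixels
        homog = np.stack([u, v, np.ones_like(u)], axis=-1)    # (N, 3) homogeneous pixels
        pts[t] = d[:, None] * (homog @ K_inv.T)               # p = d * K^{-1} [u, v, 1]^T
    return pts

def world_to_cam(tracks3D, poses, ref):
    """Re-express per-frame camera-space points in the camera frame at time `ref`.

    poses[t] = (R_t, t_t): world-to-camera extrinsics of frame t (p_cam = R_t p_world + t_t).
    """
    R_ref, t_ref = poses[ref]
    aligned = np.empty_like(tracks3D)
    for t in range(tracks3D.shape[0]):
        R_t, t_t = poses[t]
        p_world = (tracks3D[t] - t_t) @ R_t        # invert frame-t extrinsics: R_t^T (p - t_t)
        aligned[t] = p_world @ R_ref.T + t_ref     # map world points into the reference camera
    return aligned
```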
2. Core Algorithms and Mathematical Formulations
TraceForge incorporates several computer vision and geometry algorithms:
- Camera Pose & Depth Estimation: Fast single-frame predictions via VGGT, supporting refinement through bundle adjustment:

  $$\min_{\{R_t,\,\mathbf{t}_t\},\,\{\mathbf{p}_i\}} \; \sum_{t,i} \left\lVert \pi\!\left(K, R_t, \mathbf{t}_t, \mathbf{p}_i\right) - \mathbf{x}_{t,i} \right\rVert^2,$$

  where $\pi$ is the camera projection and $\mathbf{x}_{t,i}$ are the observed image keypoints.
- 2D–3D Point Tracking: CoTracker3 maximizes the match between feature descriptors across frames:

  $$\hat{\mathbf{x}}_{t,i} = \arg\max_{\mathbf{x}} \left\langle \phi_t(\mathbf{x}),\, \phi_{t_0}(\mathbf{x}_{t_0,i}) \right\rangle,$$

  where $\phi_t$ is the feature map of frame $t$ and $\mathbf{x}_{t_0,i}$ is the query location of keypoint $i$.
- World-to-Camera Transform: All traces are centered and aligned using a reference frame’s pose, simplifying downstream learning.
- Arc-Length Speed Normalization: Trajectories are uniformly resampled via their arc length, standardizing the temporal representation:

  $$s(t) = \sum_{k=1}^{t} \left\lVert \mathbf{p}_k - \mathbf{p}_{k-1} \right\rVert_2,$$

  and resampled at $s_j = \tfrac{j}{L-1}\, s(T)$ for $j = 0, \dots, L-1$.
- Regularization & Filtering: Outputs are discarded if total motion falls below 5 cm or jitter exceeds a threshold, with optional additional trace-quality constraints; a trace is retained only when

  $$s(T) \geq 5\ \text{cm} \quad \text{and} \quad \operatorname{std}_t\!\left(\left\lVert \mathbf{p}_t - \mathbf{p}_{t-1} \right\rVert_2\right) \leq 5\ \text{cm}.$$
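A minimal sketch of the speed-retargeting and filtering steps above, assuming each keypoint trajectory is a (T, 3) array in meters and using the 5 cm thresholds quoted in this section (function names are illustrative):

```python
import numpy as np

def speed_retarget(trace, L=32):
    """Resample one keypoint trajectory (T, 3) to L points at uniform arc-length intervals."""
    deltas = np.linalg.norm(np.diff(trace, axis=0), axis=-1)   # per-step displacement
    s = np.concatenate([[0.0], np.cumsum(deltas)])             # cumulative arc length s(t)
    if s[-1] < 1e-8:                                           # degenerate (static) trajectory
        return np.repeat(trace[:1], L, axis=0)
    targets = np.linspace(0.0, s[-1], L)                       # uniform arc-length samples
    return np.stack([np.interp(targets, s, trace[:, k]) for k in range(3)], axis=-1)

def passes_filters(trace, min_motion=0.05, max_jitter=0.05):
    """Keep a trajectory only if it moves >= 5 cm and its step jitter stays <= 5 cm std."""
    deltas = np.linalg.norm(np.diff(trace, axis=0), axis=-1)
    return deltas.sum() >= min_motion and deltas.std() <= max_jitter
```

Applying speed_retarget independently to each of the 400 keypoint trajectories matches the per-point arc-length formulation above.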
3. Cross-Embodiment and Cross-Environment Normalization
TraceForge enforces invariance across embodiments (human demonstrators and various robots) and environmental factors:
- Unified Reference Frame: All data is re-expressed in a camera-independent coordinate system, eliminating explicit dependence on intrinsics/extrinsics during model learning.
- Speed Normalization: Trajectories are time-warped to eliminate variation arising from embodiment-dependent speeds, e.g., slow robotic arms vs. rapid human hand movements.
- Robustness via Data Augmentation: Mixing 80% 3D tracks with 20% pure 2D (for low-depth scenarios) increases coverage. During pretraining, random perturbations in extrinsics and depth-scaling further regularize the dataset, improving tolerance to unseen camera setups.
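As an illustration of the extrinsic and depth-scale perturbations, the sketch below operates on 3D points in the reference camera frame (before any projection back to screen coordinates); the perturbation magnitudes are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def augment_trace(trace, rng, rot_deg=5.0, trans_m=0.02, depth_scale=0.10):
    """Apply a random extrinsic perturbation and global depth rescaling to one trace.

    trace: (L, N, 3) 3D points in the reference camera frame, in meters.
    Perturbation magnitudes are illustrative assumptions, not the paper's settings.
    """
    # Depth scaling: since p = d * K^{-1} [u, v, 1]^T, rescaling depth by a factor
    # rescales the whole 3D point, so a global scale emulates +/-10% depth noise.
    scale = 1.0 + rng.uniform(-depth_scale, depth_scale)
    # Small random rotation about the camera's vertical (y) axis plus a translation,
    # emulating a perturbed extrinsic calibration.
    theta = np.deg2rad(rng.uniform(-rot_deg, rot_deg))
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[ c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    t = rng.uniform(-trans_m, trans_m, size=3)
    return (scale * trace) @ R.T + t

# Example: perturb a dummy batch of traces during pretraining.
rng = np.random.default_rng(0)
traces = rng.normal(size=(8, 32, 400, 3))
augmented = np.stack([augment_trace(tr, rng) for tr in traces])
```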
A plausible implication is that this geometric and temporal standardization facilitates robust cross-embodiment knowledge transfer for world-model learning.
4. Implementation Details and Performance
The TraceForge pipeline employs a Python/PyTorch software stack with the following dependencies:
- Point tracking: TAPIP3D/CoTracker3
- Depth and pose estimation: VGGT (from SpatialTrackerV2)
- Language generation: vision–language model (Flamingo-style)
- Hardware: NVIDIA A5000 GPUs
Throughput and Efficiency
- 3D tracking: ~0.1 s/frame
- Chunking + Language: ~0.02 s/chunk
- World-alignment + resampling: ~0.01 s/chunk
- End-to-end: ~4 min per 100 chunks on 4 GPUs
- Total compute: ~400 GPU-hours for 123,000 videos
Corpus Statistics
| Metric | Value |
|---|---|
| # Videos | 123,000 |
| # Triplets | 1.8 million |
| Avg. chunk length | ~5 seconds @ 10 fps |
Quality controls include filtering traces for minimal motion/jitter and manual spot-checks of 1% of the data, yielding a sub-2.3 cm endpoint error with respect to robot ground truth.
5. Dataset Properties and Preprocessing
TraceForge produces a large-scale, highly curated, and automatically quality-assured corpus of task-centric manipulation trajectories:
- Corpus Scale: 123,000 video chunks, 1.8M observation–trace–language triplets
- Preprocessing: Filtering (traces with total motion < 5 cm or jitter > 5 cm standard deviation are discarded), random cropping, symmetric horizontal flips (sketched after this list), synthetic depth noise (±10%)
- Quality Validation: Manual spot checks confirm low endpoint error and labeling fidelity
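The horizontal-flip augmentation has one subtlety worth making explicit: flipping the image must be mirrored in the trace's horizontal coordinate to keep observation and trace geometrically consistent. A minimal sketch, assuming the trace's first channel is a pixel-space x-coordinate (function and argument names are illustrative):

```python
import numpy as np

def hflip_sample(frames, trace):
    """Horizontally flip an observation-trace pair consistently.

    frames: (F, H, W, 3) uint8 frames; trace: (L, N, 3) with x stored in pixels.
    """
    width = frames.shape[2]
    flipped_frames = frames[:, :, ::-1]                     # mirror images left-right
    flipped_trace = trace.copy()
    flipped_trace[..., 0] = (width - 1) - trace[..., 0]     # mirror x about the image centre
    return flipped_frames, flipped_trace
```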
This systematic preprocessing ensures data suitability for generalizable world-model pretraining across robotic and human embodiments.
6. Insights, Limitations, and Future Directions
Strengths
- Abstraction into 3D trace space strips away appearance, preserving geometric structure for manipulation modeling.
- Robust transfer across human and robot embodiments.
- Low-data adaptation: Only five demonstrations suffice for warmup fine-tuning.
- Scalable, with sub-second per-chunk throughput on commodity GPUs.
Limitations
- Reliance on depth-pose networks (e.g., VGGT) may result in degraded performance on reflective/translucent surfaces.
- Human demonstration noise (such as exploratory gestures) can propagate into priors.
- Fine-grained manipulations (e.g., threading a needle) may demand denser keypoint sampling or object-centric refinement.
Prospective Enhancements
- Integration of multi-view bundle adjustment / ICP for stronger geometric consistency.
- Automated trace-quality scoring to prune suboptimal human demonstrations.
- Extension to non-articulated robot morphologies via adaptive keypoint templates.
- Scaling to real-time, internet-scale video curation with automatic filtering and event chunking.
TraceForge exemplifies a robust, efficient, and scalable engine for constructing 3D trace datasets from heterogeneous, cross-embodiment videos, directly supporting the learning of highly generalizable manipulation priors (Lee et al., 26 Nov 2025).