
TraceForge Data Engine

Updated 1 December 2025
  • TraceForge Data Engine is a modular pipeline that converts heterogeneous video streams into uniform 3D trace datasets through event chunking, instruction generation, and geometric alignment.
  • It employs advanced techniques such as 3D point tracking, camera pose and depth estimation, and speed retargeting to ensure cross-embodiment consistency with centimeter-level accuracy.
  • The engine streamlines large-scale world-model pretraining by integrating robust data augmentation, quality filtering, and efficient processing across diverse robotic and human demonstrations.

The TraceForge Data Engine is a modular pipeline for processing heterogeneous video sources—including robot, egocentric human, and in-the-wild footage—into unified 3D trajectory ("trace") datasets suitable for cross-embodiment world-model pretraining. Developed to support the TraceGen world model, TraceForge enables learning from observations that transcend embodiment, camera, and environment, producing a large-scale corpus of 123,000 videos and 1.8 million observation–trace–language triplets with centimeter-level endpoint accuracy (Lee et al., 26 Nov 2025).

1. System Architecture and Data Flow

TraceForge operates as a linear data pipeline with five core stages, each performing a transformation to normalize, abstract, and align raw video streams for downstream learning tasks. The key modules are:

  1. Video Ingestion and Event Chunking: Inputs are raw RGB or RGB-D videos from mixed human and robot sources. Videos are segmented into “event chunks,” each representing a single manipulation attempt. When frame-level labels are absent, chunk boundaries are established via a motion-magnitude filter on 2D keypoints.
  2. Instruction Generation: Each chunk is represented by sampled frames (begin/middle/end), and a vision–LLM (VLM) generates up to four textual instructions in varying styles—concise imperatives, stepwise commands, and natural requests.
  3. 3D Point Tracking with Pose & Depth: This stage estimates camera pose/extrinsics $(R_t, t_t)$ and depth $D_t(u,v)$ per frame via a pretrained network (VGGT). A uniform $20\times20$ image grid defines 400 keypoints, which CoTracker3 tracks across frames. Each 2D keypoint is lifted to 3D via:

$$X^c_{i,t} = D_t(u_{i,t}, v_{i,t}) \cdot K^{-1}\,[u_{i,t},\ v_{i,t},\ 1]^\top$$

where $K$ is the intrinsic matrix.

  4. World-to-Camera Alignment: All 3D traces are transformed into a unified reference frame (the “screen-aligned” system at a reference time $t_{\mathrm{ref}}$):

$$X^{\mathrm{ref}}_{i,t} = R_{\mathrm{ref}}^\top\left(R_t X^c_{i,t} + t_t - t_{\mathrm{ref}}\right)$$

These points are projected back into screen coordinates, forming traces $T_{i,t} = (u_{i,t}, v_{i,t}, z_{i,t})$.

  5. Speed Retargeting: The per-point arc length

$$s_{i,t} = \sum_{\tau=0}^{t-1} \|T_{i,\tau+1} - T_{i,\tau}\|$$

is used to normalize speed and resample each trajectory to a fixed length $L$ at uniform arc-length intervals.

The output of TraceForge is a set of (RGB/D observation, language instruction, 3D trace) triplets, handed off for model pretraining.
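One way to picture each triplet is as a simple per-chunk container; the field names and array shapes in this sketch are illustrative assumptions rather than the released data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TraceTriplet:
    """One observation-trace-language training example.

    Field names and shapes are assumptions, not the released format:
      frames:      sampled RGB(-D) observation frames, e.g. (F, H, W, 3) uint8
      instruction: one of the VLM-generated language instructions for the chunk
      trace:       (400, L, 3) screen-aligned 3D trace for the 400 grid keypoints,
                   resampled to the fixed length L by speed retargeting
    """
    frames: np.ndarray
    instruction: str
    trace: np.ndarray
```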

Pseudocode Summary:

for each video V:
  chunks = event_chunking(V)                        # segment into single manipulation attempts
  for chunk in chunks:
    frames = sample_frames(chunk)                   # begin / middle / end frames
    instrs = generate_instructions(frames)          # up to four VLM-generated instructions
    poses, depths = VGGT(frames)                    # camera pose and depth per frame
    ks = init_grid(20, 20)                          # 400 uniformly spaced keypoints
    tracks2D = CoTracker3.track(frames, ks)
    tracks3D = lift_to_3D(tracks2D, depths, K)      # back-project with depth and K^-1
    aligned = world_to_cam(tracks3D, poses, ref=chunk.start)   # screen-aligned reference frame
    normed = speed_retarget(aligned, L=32)          # uniform arc-length resampling
    save_triplet(frames, instrs, normed)
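The geometric helpers `lift_to_3D` and `world_to_cam` in the pseudocode correspond to the lifting and alignment equations above. The following NumPy sketch shows one plausible implementation; array shapes, pose conventions, and function signatures are assumptions rather than the authors' code.

```python
import numpy as np

def lift_to_3D(tracks2d, depths, K):
    """Lift 2D keypoint tracks to camera-frame 3D points X^c_{i,t}.

    tracks2d: (T, N, 2) pixel coordinates (u, v) per frame
    depths:   (T, H, W) per-frame depth maps D_t(u, v)
    K:        (3, 3) camera intrinsic matrix
    """
    T, N, _ = tracks2d.shape
    K_inv = np.linalg.inv(K)
    pts3d = np.zeros((T, N, 3))
    for t in range(T):
        u, v = tracks2d[t, :, 0], tracks2d[t, :, 1]
        d = depths[t, v.round().astype(int), u.round().astype(int)]  # D_t(u, v)
        homog = np.stack([u, v, np.ones(N)], axis=0)                 # [u, v, 1]^T, shape (3, N)
        pts3d[t] = (d * (K_inv @ homog)).T                           # D_t * K^{-1} [u, v, 1]^T
    return pts3d

def world_to_cam(pts3d, poses, ref=0):
    """Re-express camera-frame points in the screen-aligned frame at time `ref`.

    pts3d: (T, N, 3) camera-frame points
    poses: list of (R_t, t_t) camera-to-world rotation/translation pairs
    """
    R_ref, t_ref = poses[ref]
    aligned = np.zeros_like(pts3d)
    for t in range(len(pts3d)):
        R_t, t_t = poses[t]
        world = pts3d[t] @ R_t.T + t_t         # R_t X^c_{i,t} + t_t
        aligned[t] = (world - t_ref) @ R_ref   # R_ref^T (R_t X^c + t_t - t_ref), row-vector form
    return aligned
```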

2. Core Algorithms and Mathematical Formulations

TraceForge incorporates several computer vision and geometry algorithms:

  • Camera Pose & Depth Estimation: Fast single-frame predictions via VGGT, supporting refinement through bundle adjustment:

$$\min_{\{R_t, t_t\},\, X_i} \sum_{i,t} \big\| \pi(R_t X_i + t_t) - k_{i,t} \big\|^2 + \lambda \sum_i \| X_i \|^2$$

where $\pi$ is the projection operator and $k_{i,t}$ are image keypoints.
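A minimal sketch of how this refinement could be set up with SciPy; the parameter packing, the axis-angle pose parameterization, and the regularizer weight are assumptions for illustration, not the pipeline's actual solver.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def ba_residuals(params, kpts, K, T, N, lam=1e-3):
    """Residual vector whose sum of squares is the bundle-adjustment objective above.

    params packs T axis-angle rotations, T translations, and N 3D points (an assumed layout).
    kpts: (T, N, 2) observed image keypoints k_{i,t}.
    """
    rvecs = params[:3 * T].reshape(T, 3)
    tvecs = params[3 * T:6 * T].reshape(T, 3)
    X = params[6 * T:].reshape(N, 3)
    res = []
    for t in range(T):
        R = Rotation.from_rotvec(rvecs[t]).as_matrix()
        cam = X @ R.T + tvecs[t]                # R_t X_i + t_t
        proj = cam @ K.T
        proj = proj[:, :2] / proj[:, 2:3]       # pinhole projection pi(.)
        res.append((proj - kpts[t]).ravel())    # reprojection error against k_{i,t}
    res.append(np.sqrt(lam) * X.ravel())        # lambda * sum_i ||X_i||^2 regularizer
    return np.concatenate(res)

# result = least_squares(ba_residuals, x0, args=(kpts, K, T, N))  # x0: VGGT poses + lifted points
```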

  • 2D–3D Point Tracking: CoTracker3 maximizes the match between feature descriptors $\phi$ across frames:

$$\hat k_{i,t+1} = \arg\max_{u', v'}\, \big\langle \phi(I_{t+1}, u', v'),\ \phi(I_t, u_{i,t}, v_{i,t}) \big\rangle$$
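As a toy illustration of this matching rule (CoTracker3 itself uses learned iterative refinement rather than a single argmax), a hypothetical nearest-descriptor lookup might read:

```python
import numpy as np

def match_keypoint(feat_next, query_desc):
    """Locate the pixel in frame t+1 whose descriptor best matches the query from frame t.

    feat_next:  (H, W, C) dense feature map phi(I_{t+1}, ., .)
    query_desc: (C,) descriptor phi(I_t, u_{i,t}, v_{i,t})
    Returns the (u, v) maximizing the inner product, as in the equation above.
    """
    scores = feat_next @ query_desc                            # inner product at every pixel
    v_hat, u_hat = np.unravel_index(np.argmax(scores), scores.shape)
    return u_hat, v_hat
```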

  • World-to-Camera Transform: All traces are centered and aligned using a reference frame’s pose, simplifying downstream learning.
  • Arc-Length Speed Normalization: Trajectories are uniformly resampled via their arc-length, standardizing representation:

$$S_{i,t} = \sum_{\tau=0}^{t-1} \big\| X^{\mathrm{ref}}_{i,\tau+1} - X^{\mathrm{ref}}_{i,\tau} \big\|$$

and resampled at $S'_\ell = S_{\max} \cdot \ell / L$ for $\ell = 0, \ldots, L$.

  • Regularization & Filtering: Traces are discarded if their total motion is below 5 cm or their jitter exceeds a threshold, with an optional trace-quality constraint (both the resampling and filtering steps are sketched in code after this list):

$$\mathrm{var}_t\, \|T_{i,t+1} - T_{i,t}\| < \epsilon_{\mathrm{min}}$$
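A minimal NumPy sketch of the arc-length retargeting and the motion/jitter filter described above; the 5 cm minimum-motion threshold comes from the text, while the jitter threshold and function names are illustrative assumptions.

```python
import numpy as np

def speed_retarget(trace, L=32):
    """Resample one keypoint's (T, 3) trace to L + 1 points at uniform arc-length intervals."""
    deltas = np.linalg.norm(np.diff(trace, axis=0), axis=1)   # ||T_{i,tau+1} - T_{i,tau}||
    s = np.concatenate([[0.0], np.cumsum(deltas)])            # cumulative arc length S_{i,t}
    if s[-1] == 0.0:                                          # static point: just repeat it
        return np.repeat(trace[:1], L + 1, axis=0)
    targets = s[-1] * np.arange(L + 1) / L                    # S'_l = S_max * l / L
    return np.stack([np.interp(targets, s, trace[:, d]) for d in range(3)], axis=1)

def passes_quality_filter(trace, min_motion=0.05, max_jitter_var=1e-3):
    """Keep a trace only if it moves at least 5 cm and its step lengths are not too noisy."""
    deltas = np.linalg.norm(np.diff(trace, axis=0), axis=1)
    return deltas.sum() >= min_motion and deltas.var() <= max_jitter_var
```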

3. Cross-Embodiment and Cross-Environment Normalization

TraceForge enforces invariance across embodiments (human and various robots) and environmental factors:

  • Unified Reference Frame: All data is re-expressed in a camera-independent coordinate system, eliminating explicit dependence on intrinsics/extrinsics during model learning.
  • Speed Normalization: Trajectories are time-warped to eliminate variation arising from embodiment-dependent speeds, e.g., slow robotic arms vs. rapid human hand movements.
  • Robustness via Data Augmentation: Mixing 80% 3D tracks with 20% pure 2D (for low-depth scenarios) increases coverage. During pretraining, random perturbations in extrinsics and depth-scaling further regularize the dataset, improving tolerance to unseen camera setups.

A plausible implication is that this geometric and temporal standardization facilitates robust cross-embodiment knowledge transfer for world-model learning.
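A hedged sketch of the augmentation step mentioned above, combining a small random extrinsic perturbation with multiplicative depth scaling; the rotation and translation magnitudes are assumptions, while the ±10% depth-scale range follows the synthetic depth noise mentioned in the preprocessing description below.

```python
import numpy as np

def augment_trace(trace, rng, max_rot_deg=5.0, max_trans=0.02, depth_scale=0.10):
    """Randomly perturb extrinsics and rescale depth for one (L, 3) reference-frame trace.

    Rotation/translation magnitudes are illustrative assumptions; the +/-10% depth
    scaling follows the stated synthetic depth noise range.
    """
    # Small random rotation about a random axis (Rodrigues' formula).
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    Kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)

    t = rng.uniform(-max_trans, max_trans, size=3)            # small translation jitter (metres)
    scale = 1.0 + rng.uniform(-depth_scale, depth_scale)      # multiplicative depth noise

    perturbed = trace @ R.T + t                               # simulated extrinsic perturbation
    perturbed[:, 2] *= scale                                  # rescale the depth (z) channel
    return perturbed

# Example: augment_trace(trace, np.random.default_rng(0))
```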

4. Implementation Details and Performance

The TraceForge pipeline employs a Python/PyTorch software stack with the following dependencies:

  • Point tracking: TAPIP3D/CoTracker3
  • Depth and pose estimation: VGGT (from SpatialTrackerV2)
  • Language generation: vision–LLM (Flamingo-style)
  • Hardware: NVIDIA A5000 GPUs

Throughput and Efficiency

  • 3D tracking: ~0.1 s/frame
  • Chunking + Language: ~0.02 s/chunk
  • World-alignment + resampling: ~0.01 s/chunk
  • End-to-end: ~4 min per 100 chunks on 4 GPUs
  • Total compute: ~400 GPU-hours for 123,000 videos

Corpus Statistics

  • # Videos: 123,000
  • # Triplets: 1.8 million
  • Avg. chunk length: ~5 seconds @ 10 fps

Quality controls include filtering out traces with too little motion or excessive jitter and manual spot-checks of 1% of the data, yielding endpoint errors below 2.3 cm with respect to robot ground truth.

5. Dataset Properties and Preprocessing

TraceForge produces a large-scale, highly curated, and automatically quality-assured corpus of task-centric manipulation trajectories:

  • Corpus Scale: 123,000 video chunks, 1.8M observation–trace–language triplets
  • Preprocessing: Filtering (motion < 5 cm; jitter > 5 cm standard deviation), random cropping, horizontal flips, synthetic depth noise (±10%)
  • Quality Validation: Manual spot checks confirm low endpoint error and labeling fidelity

This systematic preprocessing ensures data suitability for generalizable world-model pretraining across robotic and human embodiments.

6. Insights, Limitations, and Future Directions

Strengths

  • Abstraction into 3D trace space strips away appearance while preserving geometric structure for manipulation modeling.
  • Robust transfer across human and robot embodiments.
  • Low-data adaptation: only five demonstrations suffice for warmup fine-tuning.
  • Scalable, with sub-second per-chunk throughput on commodity GPUs.

Limitations

  • Reliance on depth-pose networks (e.g., VGGT) may result in degraded performance on reflective/translucent surfaces.
  • Human demonstration noise (such as exploratory gestures) can propagate into priors.
  • Fine-grained manipulations (e.g., threading a needle) may demand denser keypoint sampling or object-centric refinement.

Prospective Enhancements

  • Integration of multi-view bundle adjustment / ICP for stronger geometric consistency.
  • Automated trace-quality scoring to prune suboptimal human demonstrations.
  • Extension to non-articulated robot morphologies via adaptive keypoint templates.
  • Scaling to real-time, internet-scale video curation with automatic filtering and event chunking.

TraceForge exemplifies a robust, efficient, and scalable engine for constructing 3D trace datasets from heterogeneous, cross-embodiment videos, directly supporting the learning of highly generalizable manipulation priors (Lee et al., 26 Nov 2025).
