TraceForge Data Engine
- TraceForge Data Engine is a modular pipeline that converts heterogeneous video streams into uniform 3D trace datasets through event chunking, instruction generation, and geometric alignment.
- It employs advanced techniques such as 3D point tracking, camera pose and depth estimation, and speed retargeting to ensure cross-embodiment consistency with centimeter-level accuracy.
- The engine streamlines large-scale world-model pretraining by integrating robust data augmentation, quality filtering, and efficient processing across diverse robotic and human demonstrations.
The TraceForge Data Engine is a modular pipeline for processing heterogeneous video sources—including robot, egocentric human, and in-the-wild footage—into unified 3D trajectory ("trace") datasets suitable for cross-embodiment world-model pretraining. Developed to support the TraceGen world model, TraceForge enables learning from observations that transcend embodiment, camera, and environment, producing a large-scale corpus of 123,000 videos and 1.8 million observation–trace–language triplets with centimeter-level endpoint accuracy (Lee et al., 26 Nov 2025).
1. System Architecture and Data Flow
TraceForge operates as a linear data pipeline with five core stages, each performing a transformation to normalize, abstract, and align raw video streams for downstream learning tasks. The key modules are:
- Video Ingestion and Event Chunking: Inputs are raw RGB or RGB-D videos from mixed human and robot sources. Videos are segmented into “event chunks,” each representing a single manipulation attempt. When frame-level labels are absent, chunk boundaries are established via a motion-magnitude filter on 2D keypoints.
- Instruction Generation: Each chunk is represented by sampled frames (beginning, middle, and end), and a vision–language model (VLM) generates up to four textual instructions in varying styles: concise imperatives, stepwise commands, and natural requests.
- 3D Point Tracking with Pose & Depth: This stage estimates camera pose/extrinsics and depth per frame via a pretrained network (VGGT). A uniform image grid defines 400 keypoints, which CoTracker3 tracks across frames. Each 2D keypoint $(u, v)$ with estimated depth $d$ is lifted to 3D via

  $$\mathbf{p} = d \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},$$

  where $K$ is the intrinsic matrix.
- World-to-Camera Alignment: All 3D traces are transformed into a unified reference frame (the “screen-aligned” camera system at a reference time $t_0$):

  $$\mathbf{p}^{(t_0)} = R_{t_0}\,\mathbf{p}^{w} + \mathbf{t}_{t_0},$$

  where $[R_{t_0} \mid \mathbf{t}_{t_0}]$ are the extrinsics of the reference frame. Points are then projected back to screen coordinates with $K$, forming per-keypoint traces $\tau = \{(u_t, v_t, d_t)\}_{t=1}^{T}$.
- Speed Retargeting: The per-point arc length

  $$s(t) = \sum_{k=1}^{t} \left\lVert \mathbf{p}_k - \mathbf{p}_{k-1} \right\rVert_2$$

  normalizes execution rates, and trajectories are resampled to a fixed length $L = 32$ at uniform arc-length intervals.
The output of TraceForge is a set of observation–trace–language triplets $(o, \tau, \ell)$, handed off for model pretraining.
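One plausible in-memory schema for such a triplet is sketched below; the field names and shapes are illustrative assumptions, since the paper does not prescribe a storage format:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TraceTriplet:
    """One observation-trace-language sample (field names/shapes are illustrative)."""
    frames: np.ndarray       # (F, H, W, 3) uint8 sampled observation frames
    instructions: list[str]  # up to four VLM-generated instruction strings
    trace: np.ndarray        # (32, 400, 3) float32: L=32 resampled steps,
                             # N=400 keypoints, screen-aligned (u, v, d) coordinates
```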
Pseudocode Summary:
```python
# High-level sketch of the TraceForge pipeline (names follow the stage descriptions above)
for V in videos:
    chunks = event_chunking(V)                                # segment into single manipulation attempts
    for chunk in chunks:
        frames = sample_frames(chunk)                         # beginning / middle / end frames
        instrs = generate_instructions(frames)                # VLM-generated language instructions
        poses, depths = VGGT(frames)                          # per-frame camera pose and depth
        ks = init_grid(20, 20)                                # 400 uniformly spaced keypoints
        tracks2D = CoTracker3.track(frames, ks)               # 2D point tracks across frames
        tracks3D = lift_to_3D(tracks2D, depths, K)            # back-project with intrinsics K
        aligned = world_to_cam(tracks3D, poses, ref=chunk.start)  # screen-aligned reference frame
        normed = speed_retarget(aligned, L=32)                # arc-length resampling to fixed length
        save_triplet(frames, instrs, normed)
```
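As a concrete illustration, the geometric helpers used in the pseudocode might look as follows. This is a minimal sketch, assuming dense per-frame depth maps, known intrinsics $K$, and world-to-camera poses $(R_t, \mathbf{t}_t)$; the function names mirror the pseudocode and are not an official TraceForge API.

```python
import numpy as np

def lift_to_3D(tracks2D, depths, K):
    """Back-project 2D tracks (T, N, 2) into per-frame camera-space 3D points (T, N, 3).

    tracks2D: pixel coordinates (u, v) for each frame and keypoint
    depths:   sequence of per-frame dense depth maps (H, W)
    K:        3x3 camera intrinsic matrix
    (Out-of-bounds and occluded points are ignored here for brevity.)
    """
    T, N, _ = tracks2D.shape
    K_inv = np.linalg.inv(K)
    pts = np.empty((T, N, 3))
    for t in range(T):
        u, v = tracks2D[t, :, 0], tracks2D[t, :, 1]
        d = depths[t][v.astype(int), u.astype(int)]           # sample depth at tracked pixels
        homog = np.stack([u, v, np.ones_like(u)], axis=-1)    # (N, 3) homogeneous pixels
        pts[t] = d[:, None] * (homog @ K_inv.T)               # p = d * K^{-1} [u, v, 1]^T
    return pts

def world_to_cam(tracks3D, poses, ref):
    """Re-express per-frame camera-space points in the camera frame at time `ref`.

    poses[t] = (R_t, t_t): world-to-camera extrinsics of frame t (p_cam = R_t p_world + t_t).
    """
    R_ref, t_ref = poses[ref]
    aligned = np.empty_like(tracks3D)
    for t in range(tracks3D.shape[0]):
        R_t, t_t = poses[t]
        p_world = (tracks3D[t] - t_t) @ R_t        # invert frame-t extrinsics: R_t^T (p - t_t)
        aligned[t] = p_world @ R_ref.T + t_ref     # map world points into the reference camera
    return aligned
```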
2. Core Algorithms and Mathematical Formulations
TraceForge incorporates several computer vision and geometry algorithms:
- Camera Pose & Depth Estimation: Fast single-frame predictions via VGGT, supporting refinement through bundle adjustment:

  $$\min_{\{R_t,\,\mathbf{t}_t\},\,\{\mathbf{p}_i\}} \; \sum_{t,i} \left\lVert \pi\!\left(K, R_t, \mathbf{t}_t, \mathbf{p}_i\right) - \mathbf{x}_{t,i} \right\rVert^2,$$

  where $\pi$ is the camera projection and $\mathbf{x}_{t,i}$ are the observed image keypoints.
- 2D–3D Point Tracking: CoTracker3 maximizes the match between feature descriptors across frames:

  $$\hat{\mathbf{x}}_{t,i} = \arg\max_{\mathbf{x}} \left\langle \phi_t(\mathbf{x}),\, \phi_{t_0}(\mathbf{x}_{t_0,i}) \right\rangle,$$

  where $\phi_t$ is the feature map of frame $t$ and $\mathbf{x}_{t_0,i}$ is the query location of keypoint $i$.
- World-to-Camera Transform: All traces are centered and aligned using a reference frame’s pose, simplifying downstream learning.
- Arc-Length Speed Normalization: Trajectories are uniformly resampled via their arc length, standardizing the temporal representation:

  $$s(t) = \sum_{k=1}^{t} \left\lVert \mathbf{p}_k - \mathbf{p}_{k-1} \right\rVert_2,$$

  and resampled at $s_j = \tfrac{j}{L-1}\, s(T)$ for $j = 0, \dots, L-1$.
- Regularization & Filtering: Outputs are discarded if total motion falls below 5 cm or jitter exceeds a threshold, with optional additional trace-quality constraints; a trace is retained only when

  $$s(T) \geq 5\ \text{cm} \quad \text{and} \quad \operatorname{std}_t\!\left(\left\lVert \mathbf{p}_t - \mathbf{p}_{t-1} \right\rVert_2\right) \leq 5\ \text{cm}.$$
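A minimal sketch of the speed-retargeting and filtering steps above, assuming each keypoint trajectory is a (T, 3) array in meters and using the 5 cm thresholds quoted in this section (function names are illustrative):

```python
import numpy as np

def speed_retarget(trace, L=32):
    """Resample one keypoint trajectory (T, 3) to L points at uniform arc-length intervals."""
    deltas = np.linalg.norm(np.diff(trace, axis=0), axis=-1)   # per-step displacement
    s = np.concatenate([[0.0], np.cumsum(deltas)])             # cumulative arc length s(t)
    if s[-1] < 1e-8:                                           # degenerate (static) trajectory
        return np.repeat(trace[:1], L, axis=0)
    targets = np.linspace(0.0, s[-1], L)                       # uniform arc-length samples
    return np.stack([np.interp(targets, s, trace[:, k]) for k in range(3)], axis=-1)

def passes_filters(trace, min_motion=0.05, max_jitter=0.05):
    """Keep a trajectory only if it moves >= 5 cm and its step jitter stays <= 5 cm std."""
    deltas = np.linalg.norm(np.diff(trace, axis=0), axis=-1)
    return deltas.sum() >= min_motion and deltas.std() <= max_jitter
```

Applying speed_retarget independently to each of the 400 keypoint trajectories matches the per-point arc-length formulation above.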
3. Cross-Embodiment and Cross-Environment Normalization
TraceForge enforces invariance across embodiments (human demonstrators and various robots) and environmental factors:
- Unified Reference Frame: All data is re-expressed in a camera-independent coordinate system, eliminating explicit dependence on intrinsics/extrinsics during model learning.
- Speed Normalization: Trajectories are time-warped to eliminate variation arising from embodiment-dependent speeds, e.g., slow robotic arms vs. rapid human hand movements.
- Robustness via Data Augmentation: Mixing 80% 3D tracks with 20% pure 2D (for low-depth scenarios) increases coverage. During pretraining, random perturbations in extrinsics and depth-scaling further regularize the dataset, improving tolerance to unseen camera setups.
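As an illustration of the extrinsic and depth-scale perturbations, the sketch below operates on 3D points in the reference camera frame (before any projection back to screen coordinates); the perturbation magnitudes are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def augment_trace(trace, rng, rot_deg=5.0, trans_m=0.02, depth_scale=0.10):
    """Apply a random extrinsic perturbation and global depth rescaling to one trace.

    trace: (L, N, 3) 3D points in the reference camera frame, in meters.
    Perturbation magnitudes are illustrative assumptions, not the paper's settings.
    """
    # Depth scaling: since p = d * K^{-1} [u, v, 1]^T, rescaling depth by a factor
    # rescales the whole 3D point, so a global scale emulates +/-10% depth noise.
    scale = 1.0 + rng.uniform(-depth_scale, depth_scale)
    # Small random rotation about the camera's vertical (y) axis plus a translation,
    # emulating a perturbed extrinsic calibration.
    theta = np.deg2rad(rng.uniform(-rot_deg, rot_deg))
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[ c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    t = rng.uniform(-trans_m, trans_m, size=3)
    return (scale * trace) @ R.T + t

# Example: perturb a dummy batch of traces during pretraining.
rng = np.random.default_rng(0)
traces = rng.normal(size=(8, 32, 400, 3))
augmented = np.stack([augment_trace(tr, rng) for tr in traces])
```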
A plausible implication is that this geometric and temporal standardization facilitates robust cross-embodiment knowledge transfer for world-model learning.
4. Implementation Details and Performance
The TraceForge pipeline employs a Python/PyTorch software stack with the following dependencies:
- Point tracking: TAPIP3D/CoTracker3
- Depth and pose estimation: VGGT (from SpatialTrackerV2)
- Language generation: vision–language model (Flamingo-style)
- Hardware: NVIDIA A5000 GPUs
Throughput and Efficiency
- 3D tracking: ~0.1 s/frame
- Chunking + Language: ~0.02 s/chunk
- World-alignment + resampling: ~0.01 s/chunk
- End-to-end: ~4 min per 100 chunks on 4 GPUs
- Total compute: ~400 GPU-hours for 123,000 videos
Corpus Statistics
| Metric | Value |
|---|---|
| # Videos | 123,000 |
| # Triplets | 1.8 million |
| Avg. chunk length | ~5 seconds @ 10 fps |
Quality controls include filtering traces for minimal motion/jitter and manual spot-checks of 1% of the data, yielding a sub-2.3 cm endpoint error with respect to robot ground truth.
5. Dataset Properties and Preprocessing
TraceForge produces a large-scale, highly curated, and automatically quality-assured corpus of task-centric manipulation trajectories:
- Corpus Scale: 123,000 video chunks, 1.8M observation–trace–language triplets
- Preprocessing: Filtering (traces with total motion < 5 cm or jitter > 5 cm standard deviation are discarded), random cropping, symmetric horizontal flips (sketched after this list), synthetic depth noise (±10%)
- Quality Validation: Manual spot checks confirm low endpoint error and labeling fidelity
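The horizontal-flip augmentation has one subtlety worth making explicit: flipping the image must be mirrored in the trace's horizontal coordinate to keep observation and trace geometrically consistent. A minimal sketch, assuming the trace's first channel is a pixel-space x-coordinate (function and argument names are illustrative):

```python
import numpy as np

def hflip_sample(frames, trace):
    """Horizontally flip an observation-trace pair consistently.

    frames: (F, H, W, 3) uint8 frames; trace: (L, N, 3) with x stored in pixels.
    """
    width = frames.shape[2]
    flipped_frames = frames[:, :, ::-1]                     # mirror images left-right
    flipped_trace = trace.copy()
    flipped_trace[..., 0] = (width - 1) - trace[..., 0]     # mirror x about the image centre
    return flipped_frames, flipped_trace
```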
This systematic preprocessing ensures data suitability for generalizable world-model pretraining across robotic and human embodiments.
6. Insights, Limitations, and Future Directions
Strengths
- Abstraction into 3D trace space strips away appearance, preserving geometric structure for manipulation modeling.
- Robust transfer across human and robot embodiments.
- Low-data adaptation: Only five demonstrations suffice for warmup fine-tuning.
- Scalable, with sub-second per-chunk throughput on commodity GPUs.
Limitations
- Reliance on depth-pose networks (e.g., VGGT) may result in degraded performance on reflective/translucent surfaces.
- Human demonstration noise (such as exploratory gestures) can propagate into priors.
- Fine-grained manipulations (e.g., threading a needle) may demand denser keypoint sampling or object-centric refinement.
Prospective Enhancements
- Integration of multi-view bundle adjustment / ICP for stronger geometric consistency.
- Automated trace-quality scoring to prune suboptimal human demonstrations.
- Extension to non-articulated robot morphologies via adaptive keypoint templates.
- Scaling to real-time, internet-scale video curation with automatic filtering and event chunking.
TraceForge exemplifies a robust, efficient, and scalable engine for constructing 3D trace datasets from heterogeneous, cross-embodiment videos, directly supporting the learning of highly generalizable manipulation priors (Lee et al., 26 Nov 2025).