TraceForge-123K Dataset
- TraceForge-123K is a comprehensive multimodal dataset comprising 123K clips and 1.8M synchronized triplets of video, 3D traces, and language instructions, tailored for world modeling and motion learning.
- The dataset employs a rigorous four-stage pipeline—including event chunking, 3D point tracking, world-to-camera alignment, and speed retargeting—to generate uniform, geometry-centric demonstrations.
- Benchmark evaluations show high data efficiency with rapid trace generation (~1 second for 32-step traces) and up to 80% success rates in robotic tasks, facilitating rapid cross-embodiment adaptation.
TraceForge-123K is a large-scale dataset comprising 123,000 video clips and 1.8 million synchronized multimodal triplets of observation, 3D trace, and natural-language instruction. Developed as part of TraceGen’s world modeling framework, TraceForge-123K provides the foundation for learning transferable 3D motion priors from cross-embodiment human and robot demonstrations. Its consistent, geometry-centric trace format and rich annotation schema facilitate research in world model pretraining, representation learning, and rapid task adaptation in robotics (Lee et al., 26 Nov 2025).
1. Construction Pipeline
TraceForge employs a four-stage pipeline to process raw video from heterogeneous sources—including in-lab robot, bimanual robot, egocentric human, and internet videos—into standardized demonstration episodes:
- Event Chunking & Instruction Generation: Each input video $V$, regardless of embodiment, is segmented into task-relevant clips $\{c_j\}$. Chunking uses start–end markers when available and otherwise discards intervals with negligible motion, as measured by point-tracking magnitude. For each segment, a vision-LLM generates up to four instructions in JSON: a concise imperative, a multi-step directive, a human-like request, and, if available, a human-annotated caption.
- 3D Point Tracking with Pose & Depth Estimation: Each clip’s reference frame is overlaid with a uniform grid of $K$ keypoints. The TAPIP3D+CoTracker3 tracker infers per-frame camera extrinsics $\{\mathbf{T}_t\}$ and depth maps $\{D_t\}$, reconstructing each keypoint’s 3D trajectory $\{\mathbf{p}_{k,t}\}_{t=1}^{T}$. For approximately 20% of human-demo clips lacking reliable depth, only 2D traces are produced.
- World-to-Camera Alignment: World-frame points are transformed into the frame of the reference camera,
$$\mathbf{p}^{\mathrm{ref}}_{k,t} = \mathbf{R}_{\mathrm{ref}}\, \mathbf{p}^{\mathrm{world}}_{k,t} + \mathbf{t}_{\mathrm{ref}},$$
where $[\mathbf{R}_{\mathrm{ref}} \mid \mathbf{t}_{\mathrm{ref}}]$ are the reference-view extrinsics, yielding screen-aligned traces that compensate for camera motion.
- Speed Retargeting: Each raw trace is parameterized by its cumulative arc length
$$s_t = \sum_{\tau=1}^{t} \lVert \mathbf{p}_{\tau} - \mathbf{p}_{\tau-1} \rVert$$
and resampled at $L$ uniformly spaced arc-length values $s_\ell = \tfrac{\ell}{L-1}\, s_T$, $\ell = 0, \dots, L-1$. This ensures fixed-length, embodiment-agnostic traces across samples while preserving velocity profiles (a code sketch of the alignment and retargeting steps follows this list).
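Below is a minimal NumPy sketch of the world-to-camera alignment and speed-retargeting steps. The 4×4 world-to-camera extrinsic convention, per-keypoint linear interpolation, and all function names are illustrative assumptions, not the released pipeline code.

```python
import numpy as np

def align_to_reference(points_world, T_ref):
    """Transform world-frame keypoint trajectories into the reference camera frame.

    points_world: (K, L, 3) 3D keypoint trajectories in world coordinates.
    T_ref:        (4, 4) world-to-camera extrinsic of the reference view
                  (assumed convention; the dataset stores per-frame extrinsics).
    """
    K, L, _ = points_world.shape
    homo = np.concatenate([points_world, np.ones((K, L, 1))], axis=-1)  # (K, L, 4)
    aligned = homo @ T_ref.T                                            # apply [R | t]
    return aligned[..., :3]

def retarget_speed(trace, num_steps=32):
    """Resample one keypoint trajectory to num_steps points uniform in arc length.

    trace: (T, 3) raw 3D trajectory of a single keypoint.
    Returns a (num_steps, 3) fixed-length trace.
    """
    deltas = np.linalg.norm(np.diff(trace, axis=0), axis=-1)   # per-step displacement
    s = np.concatenate([[0.0], np.cumsum(deltas)])             # cumulative arc length s_t
    if s[-1] < 1e-8:                                           # static trace: repeat the point
        return np.repeat(trace[:1], num_steps, axis=0)
    s_uniform = np.linspace(0.0, s[-1], num_steps)             # uniform arc-length samples
    # Linearly interpolate each coordinate at the uniform arc-length values.
    return np.stack([np.interp(s_uniform, s, trace[:, d]) for d in range(3)], axis=-1)

# Example: 400 keypoints tracked over 90 frames, retargeted to 32-step traces.
points_world = np.random.rand(400, 90, 3)
aligned = align_to_reference(points_world, np.eye(4))
traces = np.stack([retarget_speed(aligned[k], 32) for k in range(aligned.shape[0])])
print(traces.shape)  # (400, 32, 3)
```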
2. 3D Trace and Multimodal Triplet Representation
Each datum in TraceForge-123K is a (visual observation, 3D trace, language instruction) triplet:
- Trace:
$$\tau = \big\{\, \mathbf{p}_{k,\ell} = (x_{k,\ell},\, y_{k,\ell},\, z_{k,\ell}) \,\big\}_{k=1,\dots,K;\ \ell=1,\dots,L},$$
where $K = 400$ (keypoints on a uniform grid) and $L$ is typically 32 uniformly sampled timesteps, representing 1–2 seconds of future motion. The $(x, y)$ coordinates are pixel-aligned in the reference view; $z$ is depth in camera units.
- Observation and Language:
Each trace is synchronized with the corresponding visual frames (raw RGB) and up to four textual instructions. Language annotations include concise commands, multi-step directives, and natural human requests.
- Modeling Protocol:
TraceGen models the temporal differences
$$\Delta \mathbf{p}_{k,\ell} = \mathbf{p}_{k,\ell+1} - \mathbf{p}_{k,\ell}$$
and employs a flow-based decoder to integrate these via an ODE (a sketch follows this list). Pointwise mean error after decoding averages approximately 2 cm.
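The sketch below illustrates the temporal-difference parameterization and a generic Euler integration of a flow-matching ODE. The `velocity_fn` argument is a placeholder for a learned decoder; the actual TraceGen architecture is not reproduced here.

```python
import numpy as np

def to_differences(trace):
    """Convert an absolute trace (K, L, 3) into temporal differences (K, L-1, 3)."""
    return np.diff(trace, axis=1)

def integrate_differences(start_points, deltas):
    """Recover absolute positions from a start frame and predicted differences.

    start_points: (K, 3) keypoint positions in the reference view.
    deltas:       (K, L-1, 3) predicted per-step displacements.
    """
    return start_points[:, None, :] + np.cumsum(deltas, axis=1)

def euler_decode(velocity_fn, noise, num_steps=10):
    """Toy Euler integration of a flow-matching ODE from noise to a trace sample.

    velocity_fn(x, t) stands in for the learned decoder; any callable returning an
    array shaped like x works. This shows only the integration scheme.
    """
    x, dt = noise, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Example with random data: 400 keypoints, 32 timesteps.
trace = np.random.rand(400, 32, 3)
deltas = to_differences(trace)
recon = integrate_differences(trace[:, 0, :], deltas)
assert np.allclose(recon, trace[:, 1:, :])
sample = euler_decode(lambda x, t: -x, np.random.rand(400, 31, 3))  # toy velocity field
```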
3. Annotation Schema and Data Format
Each clip is represented as a single JSON object with fields:
| Field | Description | Example/Format |
|---|---|---|
| video_id, chunk_id | Unique identifiers for source video and segment | "XYZ123", 5 |
| frames | List of frame indices included in the chunk | [10, 11, ..., 42] |
| cam_intrinsics | Camera calibration for reference view | {...} |
| cam_extrinsics | Per-frame world-to-camera transforms | [...] |
| depth_maps | Per-frame depth arrays; missing if 2D-only | [...] or null |
| trace | [K, L, 3] array: 400 keypoints, 32 timesteps, (x, y, z) per point | [[[x₁¹, y₁¹, z₁¹], ..., [x₁ᴸ, y₁ᴸ, z₁ᴸ]], …] |
| instructions | Object with up to four textual instructions (including human-authored if present) | See below |
Minimal annotation example:
```json
{
"video_id": "XYZ123",
"chunk_id": 5,
"frames": [10, 11, ..., 42],
"cam_intrinsics": {...},
"cam_extrinsics": [...],
"trace": [[[x₁¹, y₁¹, z₁¹], …, [x₁ᴸ, y₁ᴸ, z₁ᴸ]], ...],
"instructions": {
"instruction_1": "Pick up the block.",
"instruction_2": "Move gripper above block, lower and grasp.",
"instruction_3": "Could you pick up that block for me?"
}
}
```
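A minimal loading sketch, assuming one JSON annotation per clip with the field names from the schema above; the on-disk layout and file naming are illustrative.

```python
import json
import numpy as np

def load_annotation(path):
    """Load one TraceForge-123K clip annotation and return the trace as a NumPy array.

    Field names follow the schema above; storing one JSON file per clip is an
    assumption made for illustration.
    """
    with open(path) as f:
        ann = json.load(f)
    trace = np.asarray(ann["trace"], dtype=np.float32)   # (K, L, 3): 400 keypoints, 32 steps
    instructions = list(ann["instructions"].values())    # up to four language variants
    return ann["video_id"], ann["chunk_id"], trace, instructions

# Hypothetical usage:
# video_id, chunk_id, trace, instructions = load_annotation("XYZ123_chunk5.json")
# print(trace.shape)   # expected (400, 32, 3)
```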
4. Dataset Statistics
TraceForge-123K’s key attributes are summarized below:
| Statistic | Value / Mode |
|---|---|
| Total clips | 123,000 |
| Total triplets | 1,800,000 |
| Human-demo clips | ~60,000 (20% 2D-only) |
| Robot-demo clips | ~63,000 |
| Distinct manipulation tasks | ~50 |
| Avg. clip length | 15–20 s (~450–600 frames) |
| Trace horizon | 32 timesteps (1–2 s) |
| Avg. spatial displacement | ~0.7 m |
| Language: unique instruction tokens | ~5,000 |
| Avg. instruction length | 8–12 words |
This suggests extensive coverage of manipulation actions and broad linguistic diversity, enhancing generalization for cross-task and cross-domain learning.
5. Data Splits, Benchmarks, and Model Performance
The dataset is partitioned as follows:
- Training: 100,000 clips (1.5 million triplets)
- Validation: 11,500 clips
- Test: 11,500 clips
TraceForge-123K supports benchmarking on four real-robot manipulation tasks (Clothes, Ball, Brush, Block) using the Franka R3 arm. Benchmark results for the TraceGen world model are:
- Robot→Robot (“warmup,” 5 demonstrations): 80% success rate (improving to 82.5% with 15 demos; 0% in zero-shot).
- Human→Robot (5 uncalibrated phone videos): 67.5% success on a real robot.
- Baselines: NovaFlow and AVDC world models achieve 0–30% success and are >50× slower; prior 3D flow methods (3DFlowAction) require masks and give 0% zero-shot success.
- Inference Speed: TraceGen produces 32-step traces in ~1 second, roughly 50–600× faster than video-based methods.
This suggests that TraceForge-123K enables high data efficiency and rapid adaptation, including in challenging cross-embodiment scenarios.
6. Usage, Access, and Licensing
Recommended uses include:
- Pretraining 3D motion priors for rapid adaptation between embodiments (e.g., human to robot).
- World-model finetuning in trace space for new robots, tools, or environments.
- Self-supervised representation learning combining geometric, visual, and linguistic cues.
Access:
Download is available at https://tracegen.github.io under an MIT-style research license. Users are required to cite “S. Lee et al., TraceGen: World Modeling in 3D Trace-Space… CVPR (2025)” and to include the dataset version and DOI in publications.
This ensures reproducibility and encourages consistent benchmarking.
7. Significance and Research Context
TraceForge-123K establishes a large-scale, multimodal corpus for representation learning, cross-embodiment skill transfer, and data-efficient policy adaptation in robotic manipulation. Its centimeter-accurate, speed-retargeted 3D traces and comprehensive language instructions abstract from domain- and embodiment-specific cues, providing a unifying geometric substrate for model-driven prediction, control, and imitation learning. Its adoption in TraceGen demonstrates substantial improvements in generalization and inference speed over prior video-based and 3D flow models, without requiring object segmentation or dense pixel-space generation (Lee et al., 26 Nov 2025).