
TraceForge-123K Dataset

Updated 1 December 2025
  • TraceForge-123K is a comprehensive multimodal dataset comprising 123K clips and 1.8M synchronized triplets of video, 3D traces, and language instructions, tailored for world modeling and motion learning.
  • The dataset employs a rigorous four-stage pipeline—including event chunking, 3D point tracking, world-to-camera alignment, and speed retargeting—to generate uniform, geometry-centric demonstrations.
  • Benchmark evaluations show high data efficiency, with trace generation taking ~1 second for a 32-step trace and success rates of up to 80% in robotic tasks, enabling rapid cross-embodiment adaptation.

TraceForge-123K is a large-scale dataset comprising 123,000 video clips and 1.8 million synchronized multimodal triplets of observation, 3D trace, and natural-language instruction. Developed as part of TraceGen’s world modeling framework, TraceForge-123K provides the foundation for learning transferable 3D motion priors from cross-embodiment human and robot demonstrations. Its consistent, geometry-centric trace format and rich annotation schema facilitate research in world model pretraining, representation learning, and rapid task adaptation in robotics (Lee et al., 26 Nov 2025).

1. Construction Pipeline

TraceForge employs a four-stage pipeline to process raw video from heterogeneous sources—including in-lab robot, bimanual robot, egocentric human, and internet videos—into standardized demonstration episodes:

  1. Event Chunking & Instruction Generation: Each input video $V_\mathrm{in}$, regardless of embodiment, is segmented into task-relevant clips $\{V_i\}$. Chunking uses available start–end markers or discards intervals with negligible motion detected by point-tracking magnitude. For each segment, a vision-LLM generates up to four instructions in JSON: a concise imperative, a multi-step directive, a human-like request, and, if available, a human-annotated caption.
  2. 3D Point Tracking with Pose & Depth Estimation: Each clip’s reference frame is overlaid with a uniform $20 \times 20$ grid of keypoints. The TAPIP3D+CoTracker3 tracker infers per-frame camera extrinsics $\{T_{\mathrm{world}\rightarrow\mathrm{cam}_t}\}$ and depth maps $D_t$, reconstructing each keypoint’s trajectory $\mathbf{T}_{\mathrm{ref}}^{t:t+L} = [(x_i^t, y_i^t, z_i^t)]$. For approximately 20% of human-demo clips lacking reliable depth, only 2D traces are produced.
  3. World-to-Camera Alignment: World-frame points $\mathbf{X}_{\mathrm{world}}$ are transformed into the reference camera frame:

$$[X^c, Y^c, Z^c]^\top = T_{\mathrm{ref}}\,\mathbf{X}_{\mathrm{world}}$$

which yields screen-aligned traces that compensate for camera motion.

  4. Speed Retargeting: Each raw trace is parameterized by cumulative arc length $s \in [0,1]$ and resampled at $L$ uniform values:

$$s_i^t = \frac{\sum_{u=1}^{t} \|\mathbf{p}_i^u - \mathbf{p}_i^{u-1}\|}{\sum_{u=1}^{L} \|\mathbf{p}_i^u - \mathbf{p}_i^{u-1}\|}$$

This ensures fixed-length, embodiment-agnostic traces across samples while preserving velocity profiles (a code sketch of steps 3–4 follows this list).
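
Steps 3 and 4 reduce to a homogeneous transform followed by an arc-length reparameterization. The NumPy sketch below illustrates both under simple assumptions (per-keypoint processing, a 4×4 extrinsic matrix); the function names and array shapes are illustrative and are not the released pipeline code.

```python
import numpy as np

def world_to_camera(points_world, T_ref):
    """Map world-frame 3D points into the reference camera frame.

    points_world : (N, 3) array of world coordinates X_world.
    T_ref        : (4, 4) homogeneous world-to-camera transform of the
                   reference view (one entry of cam_extrinsics).
    Returns an (N, 3) array of camera-frame points [X^c, Y^c, Z^c].
    """
    ones = np.ones((points_world.shape[0], 1))
    homogeneous = np.hstack([points_world, ones])          # (N, 4)
    camera = (T_ref @ homogeneous.T).T                     # apply T_ref · X_world
    return camera[:, :3]

def retarget_speed(trajectory, L=32):
    """Resample one keypoint trajectory to L steps via its arc-length profile.

    trajectory : (T, 3) raw positions p^1 ... p^T of a single keypoint.
    Returns an (L, 3) fixed-length trace, uniform in normalized arc length s.
    """
    deltas = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)   # ||p^u - p^{u-1}||
    s = np.concatenate([[0.0], np.cumsum(deltas)])                 # cumulative arc length
    if s[-1] < 1e-8:                                               # static keypoint: repeat last pose
        return np.repeat(trajectory[-1:], L, axis=0)
    s /= s[-1]                                                     # normalize so that s is in [0, 1]
    s_uniform = np.linspace(0.0, 1.0, L)
    # Linearly interpolate each coordinate at the uniform arc-length samples.
    return np.stack([np.interp(s_uniform, s, trajectory[:, d]) for d in range(3)], axis=1)
```

Applying retarget_speed to every keypoint of a clip yields the fixed-length traces described in Section 2.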

2. 3D Trace and Multimodal Triplet Representation

Each datum in TraceForge-123K is a (visual observation, 3D trace, language instruction) triplet:

  • Trace:

$\mathbf{T}_{\mathrm{ref}}^{1:L} = \{\mathbf{p}_i^t = (x_i^t, y_i^t, z_i^t)\}_{i=1..K,\,t=1..L}$, where $K = 400$ (for a $20\times20$ keypoint grid) and $L$ is typically 32 uniformly sampled timesteps, representing 1–2 seconds of future motion. The $(x, y)$ coordinates are pixel-aligned in the reference view; $z$ is depth in camera units.

  • Observation and Language:

Each trace is synchronized with the corresponding visual frames (raw RGB) and up to four textual instructions. Language annotations include concise commands, multi-step directives, and natural human requests.

  • Modeling Protocol:

TraceGen models the temporal differences:

$$\Delta\mathbf{T}_{\mathrm{ref}}^t = \mathbf{T}_{\mathrm{ref}}^{t+1} - \mathbf{T}_{\mathrm{ref}}^t$$

and employs a flow-based decoder to integrate these via an ODE. Pointwise mean error after decoding averages approximately 2 cm.
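
Because the model predicts differences rather than absolute positions, decoded outputs must be integrated back into a trace. Under a fixed-step (Euler-style) reading of that integration this is simply a cumulative sum; the sketch below assumes the deltas are already decoded and does not reproduce the flow decoder itself.

```python
import numpy as np

def integrate_trace(first_step, deltas):
    """Reconstruct an absolute trace from predicted temporal differences.

    first_step : (K, 3) keypoint positions at t = 1.
    deltas     : (L-1, K, 3) predicted differences Delta T_ref^t.
    Returns an (L, K, 3) trace satisfying T^{t+1} = T^t + Delta T^t.
    """
    steps = np.concatenate([first_step[None], deltas], axis=0)   # prepend initial positions
    return np.cumsum(steps, axis=0)                              # running sum recovers the trace
```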

3. Annotation Schema and Data Format

Each clip is represented as a single JSON object with fields:

| Field | Description | Example / Format |
| --- | --- | --- |
| video_id, chunk_id | Unique identifiers for the source video and segment | "XYZ123", 5 |
| frames | List of frame indices included in the chunk | [10, 11, ..., 42] |
| cam_intrinsics | Camera calibration for the reference view | {...} |
| cam_extrinsics | Per-frame world-to-camera transforms | [...] |
| depth_maps | Per-frame depth arrays; missing if 2D-only | [...] or null |
| trace | [K, L, 3] array: 400 keypoints, 32 timesteps, (x, y, z) per point | [[[x₁¹, y₁¹, z₁¹], ..., [x₁ᴸ, y₁ᴸ, z₁ᴸ]], ...] |
| instructions | Object with up to four textual instructions (including human-authored if present) | See below |

Minimal annotation example:

```json
{
  "video_id": "XYZ123",
  "chunk_id": 5,
  "frames": [10, 11, ..., 42],
  "cam_intrinsics": {...},
  "cam_extrinsics": [...],
  "trace": [[[x¹, y¹, z¹], , [x, y, z]], ...],
  "instructions": {
    "instruction_1": "Pick up the block.",
    "instruction_2": "Move gripper above block, lower and grasp.",
    "instruction_3": "Could you pick up that block for me?"
  }
}
```

This ensures standardized, machine-readable inputs for downstream model training or evaluation.
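
As a usage illustration, a minimal loader for one clip annotation could look as follows. It assumes the field names above and one JSON file per clip; the helper name and example path are hypothetical.

```python
import json
import numpy as np

def load_annotation(path):
    """Load one TraceForge-style clip annotation and return (trace, instructions).

    Assumes the JSON schema described above; the trace field is a nested
    list of shape [K, L, 3] (400 keypoints x 32 timesteps x (x, y, z)).
    """
    with open(path, "r") as f:
        ann = json.load(f)

    trace = np.asarray(ann["trace"], dtype=np.float32)   # shape (K, L, 3)
    assert trace.ndim == 3 and trace.shape[-1] == 3, "each trace point should be (x, y, z)"

    # Up to four instruction variants; keep whichever are present.
    instructions = [text for text in ann.get("instructions", {}).values() if text]
    return trace, instructions

# Example (hypothetical file name):
# trace, instrs = load_annotation("XYZ123_chunk005.json")
# print(trace.shape, instrs[0])
```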

4. Dataset Statistics

TraceForge-123K’s key attributes are summarized below:

| Statistic | Value / Mode |
| --- | --- |
| Total clips | 123,000 |
| Total triplets | 1,800,000 |
| Human-demo clips | ~60,000 (20% 2D-only) |
| Robot-demo clips | ~63,000 |
| Distinct manipulation tasks | ~50 |
| Avg. clip length | 15–20 s (~450–600 frames) |
| Trace horizon $L$ | 32 timesteps (1–2 s) |
| Avg. spatial displacement | ~0.7 m |
| Unique instruction tokens | ~5,000 |
| Avg. instruction length | 8–12 words |

This suggests extensive coverage of manipulation actions and broad linguistic diversity, enhancing generalization for cross-task and cross-domain learning.

5. Data Splits, Benchmarks, and Model Performance

The dataset is partitioned as follows:

  • Training: 100,000 clips (1.5 million triplets)
  • Validation: 11,500 clips
  • Test: 11,500 clips

TraceForge-123K supports benchmarking on four real-robot manipulation tasks (Clothes, Ball, Brush, Block) using the Franka R3 arm. Benchmark results for the TraceGen world model are:

  • Robot→Robot (“warmup,” 5 demonstrations): 80% success rate (improving to 82.5% with 15 demos; 0% in zero-shot).
  • Human→Robot (5 uncalibrated phone videos): 67.5% success on a real robot.
  • Baselines: NovaFlow and AVDC world models achieve 0–30% success and are >50× slower; prior 3D flow methods (3DFlowAction) require masks and give 0% zero-shot success.
  • Inference Speed: TraceGen produces 32-step traces in ~1 second, roughly 50–600× faster than video-based methods.

This suggests that TraceForge-123K enables high data efficiency and rapid adaptation, including in challenging cross-embodiment scenarios.

6. Usage, Access, and Licensing

Recommended uses include:

  • Pretraining 3D motion priors for rapid adaptation between embodiments (e.g., human to robot).
  • World-model finetuning in trace space for new robots, tools, or environments.
  • Self-supervised representation learning combining geometric, visual, and linguistic cues.

Access:

Download is available at https://tracegen.github.io under an MIT-style research license. Users are required to cite “S. Lee et al., TraceGen: World Modeling in 3D Trace-Space… CVPR (2025)” and include the dataset version and DOI in publications.

This ensures reproducibility and encourages consistent benchmarking.

7. Significance and Research Context

TraceForge-123K establishes a large-scale, multimodal corpus for representation learning, cross-embodiment skill transfer, and data-efficient policy adaptation in robotic manipulation. Its centimeter-accurate, speed-retargeted 3D traces and comprehensive language instructions abstract from domain- and embodiment-specific cues, providing a unifying geometric substrate for model-driven prediction, control, and imitation learning. Its adoption in TraceGen demonstrates substantial improvements in generalization and inference speed over prior video-based and 3D flow models, without requiring object segmentation or dense pixel-space generation (Lee et al., 26 Nov 2025).
