
TraceForge: 3D Modeling & Forensic Tracing

Updated 28 November 2025
  • TraceForge is the name shared by two distinct systems: a 3D robotic world modeling data pipeline and a forensic identity tracing system for face-swapped media.
  • The robotic pipeline standardizes diverse video demonstrations into compact 3D traces using camera motion compensation, depth prediction, and keypoint tracking, achieving robust cross-embodiment learning.
  • The forensic system extends the FaceTracer framework to non-intrusively identify source identities, offering practical benefits for digital forensics in manipulated media.

TraceForge is the designation for two distinct, domain-specific systems: (1) a large-scale robotic action video curation and representation pipeline for world modeling in 3D trace space, introduced alongside the TraceGen world model (Lee et al., 26 Nov 2025), and (2) an operational pipeline for source identity tracing of face-swapped images and videos, built on the FaceTracer framework (Zhang et al., 11 Dec 2024). The shared name is coincidental, and the two systems target separate tasks: the former addresses cross-embodiment learning via geometric motion unification, the latter non-intrusive forensic identification of face-swap source identities.

1. Unified, Symbolic 3D Representation Pipeline for Robotic World Modeling

TraceForge, in the context of (Lee et al., 26 Nov 2025), is the data curation and preprocessing engine that standardizes demonstration videos from diverse sources (direct human, egocentric, robot, “in-the-wild” handheld) to yield a compact, scene-centric 3D trace suitable for transferable world modeling.

Design Objectives

  • Unification: Converts heterogeneous video sources—varying in embodiment, camera motion, viewpoint, and environment—into a common 3D trace format abstracted from appearance.
  • Camera Motion Compensation: Operates without assuming static, calibrated cameras and without relying on object detectors.
  • Multimodal Alignment: Produces temporally aligned {RGB-D frame, 3D trace, language instruction} triplets enabling joint vision-language-motion learning (illustrated after this list).
  • Cross-Embodiment Comparability: Normalizes human and robot demonstration speeds so motion representations are directly comparable across embodiments.
  • Scale and Diversity: Yields 123,000 video chunks and 1.8M triplets across eight sources for pretraining downstream world models.
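
To make the triplet format concrete, the sketch below shows one possible container for a single training sample; the field names and exact shapes are illustrative assumptions (K = 400 keypoints, L = 32 retimed steps, per the statistics above), not a released schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TraceTriplet:
    """Illustrative container for one temporally aligned training sample."""
    rgb: np.ndarray      # (H, W, 3) uint8 frame
    depth: np.ndarray    # (H, W) float32 depth aligned to the frame
    trace: np.ndarray    # (K, L, 3) screen-aligned 3D keypoint trace
    instruction: str     # one auto- or human-written language instruction
```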

2. Pipeline Architecture and Processing Flow

Main Stages

A. Event Chunking & Instruction Generation

  • Given raw video with optional segment labels, segments are created either directly from the labels or via motion-magnitude heuristics (see the sketch after this list).
  • For each segment, three representative frames (start, mid, end) are extracted.
  • A vision-LLM auto-generates three instruction styles (imperative, stepwise, and natural request).
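
The motion-magnitude heuristic is not specified in detail; a minimal frame-differencing stand-in (assuming mean grayscale intensity change as the motion signal) might look like the following sketch.

```python
import numpy as np


def chunk_by_motion(frames: np.ndarray, threshold: float = 2.0,
                    min_len: int = 16) -> list[tuple[int, int]]:
    """Split a video into event chunks with a simple motion-magnitude
    heuristic (frame differencing); a stand-in for the unspecified
    heuristic used by TraceForge.

    frames: (T, H, W, 3) uint8 array; returns (start, end) frame index pairs.
    """
    gray = frames.mean(axis=-1).astype(np.float32)              # (T, H, W)
    motion = np.abs(np.diff(gray, axis=0)).mean(axis=(1, 2))    # (T-1,)
    active = motion > threshold                                 # moving frames

    chunks, start = [], None
    for t, is_active in enumerate(active):
        if is_active and start is None:
            start = t                                           # chunk opens
        elif not is_active and start is not None:
            if t - start >= min_len:
                chunks.append((start, t))                       # chunk closes
            start = None
    if start is not None and len(active) - start >= min_len:
        chunks.append((start, len(active)))                     # trailing chunk
    return chunks
```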

B. Camera Pose, Depth Prediction & 3D Keypoint Tracking

  • A 20×20 grid (400 keypoints) is initialized in the reference frame.
  • For each frame $t$: predict the depth map $D_t$ and camera extrinsics $(R_t, t_t)$ via a VGGT model (fine-tuned SpatialTrackerV2).
  • Dense 2D keypoint tracking across frames with CoTracker3.
  • Each keypoint is “lifted” to 3D via $Z_{i,t} = D_t(u_{i,t}, v_{i,t})$ and $[X^c, Y^c, Z^c]^\top = Z^c K^{-1} [u, v, 1]^\top$, where $K$ is the intrinsic matrix.
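
The per-keypoint lifting reduces to standard pinhole back-projection; a minimal numpy sketch, assuming the depth map and 2D tracks are already available from the models above:

```python
import numpy as np


def lift_keypoints_to_3d(uv: np.ndarray, depth: np.ndarray,
                         K: np.ndarray) -> np.ndarray:
    """Back-project tracked 2D keypoints into camera-frame 3D points:
    Z = D(u, v) and [X, Y, Z]^T = Z * K^{-1} [u, v, 1]^T.

    uv:    (N, 2) pixel coordinates (u, v) in one frame
    depth: (H, W) predicted depth map D_t for that frame
    K:     (3, 3) camera intrinsic matrix
    """
    u = uv[:, 0].round().astype(int)
    v = uv[:, 1].round().astype(int)
    z = depth[v, u]                                        # Z_{i,t} = D_t(u, v)
    homog = np.stack([uv[:, 0], uv[:, 1], np.ones(len(uv))], axis=1)
    rays = (np.linalg.inv(K) @ homog.T).T                  # K^{-1} [u, v, 1]^T
    return rays * z[:, None]                               # scale rays by depth
```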

C. World-to-Camera Alignment

  • All per-frame 3D points are transformed into the canonical reference frame of the chunk using $p_{i,t}^c = R_{\mathrm{ref}}(p_{i,t}^w - t_{\mathrm{ref}})$.
  • 3D traces are represented as $T_{\mathrm{ref}}^{t:t+L} \in \mathbb{R}^{K \times L \times 3}$, i.e., screen-aligned sequences.
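
The alignment is a rigid transform into the reference camera; a minimal sketch, assuming $t_{\mathrm{ref}}$ is the reference camera center expressed in world coordinates:

```python
import numpy as np


def to_reference_frame(points_w: np.ndarray, R_ref: np.ndarray,
                       t_ref: np.ndarray) -> np.ndarray:
    """Express world-frame keypoints in the chunk's reference camera,
    p^c = R_ref (p^w - t_ref).

    points_w: (K, L, 3) world-frame keypoint positions over the chunk
    R_ref:    (3, 3) reference camera rotation
    t_ref:    (3,)   reference camera center in world coordinates
    """
    return (points_w - t_ref) @ R_ref.T    # row-vector form of R_ref (p - t)
```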

D. Speed Retargeting

  • For each trajectory $X_i(t)$, the cumulative arc length $s(t) = \int_0^t \|\dot X_i(\tau)\| \, d\tau$ is computed.
  • The trace is resampled at uniform arc-length intervals to a standardized length $L$ (e.g., 32), normalizing motion tempo for training.
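
Arc-length retiming of a single keypoint trajectory can be sketched with numpy interpolation; this is an illustration of the reparameterization, not the exact resampler used in the pipeline.

```python
import numpy as np


def resample_by_arc_length(traj: np.ndarray, L: int = 32) -> np.ndarray:
    """Retime one trajectory to L samples spaced uniformly in cumulative
    arc length s(t) = integral of ||x_dot(tau)|| dtau.

    traj: (T, 3) 3D positions over the original frames; returns (L, 3).
    """
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)    # per-step lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])            # cumulative arc length
    if s[-1] == 0.0:
        return np.repeat(traj[:1], L, axis=0)              # static keypoint
    s_uniform = np.linspace(0.0, s[-1], L)                 # uniform arc-length grid
    return np.stack(
        [np.interp(s_uniform, s, traj[:, d]) for d in range(3)], axis=1
    )
```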

3. Component Algorithms, Models, and Mathematical Formulation

Algorithms

| Stage | Model/Algorithm Used | Purpose |
| --- | --- | --- |
| Depth & Extrinsics | VGGT (SpatialTrackerV2) | Per-frame 3D lifting, camera reference normalization |
| 2D Tracking | CoTracker3 | Robust 2D keypoint tracks over variable, moving-camera videos |
| Persistent Geometry | TAPIP3D (optional) | Geometry refinement in complex scenarios |
| Language Instruction Gen. | Vision-LLM (JSON) | Generates task/action instructions automatically |
| Speed Normalization | Arc-length reparameterization | Trajectory resampling by intrinsic motion arc length |

Key mathematical expressions:

  • Camera lifting: $[X^c; Y^c; Z^c] = Z^c K^{-1} [u; v; 1]$
  • Reference-frame projection: $p^c_{\mathrm{ref}} = R_{\mathrm{ref}}(p^w - t_{\mathrm{ref}})$
  • 3D trace tensor: $T_{\mathrm{ref}}^{t:t+L} = \{ [u_{i,t}, v_{i,t}, Z_{i,t}^c] \}_{i=1,\ldots,K;\ t=0,\ldots,L}$

4. Dataset Normalization and Output Statistics

  • Dataset scale: 123,000 video chunks from 8 source domains, yielding 1.8M {RGB-D, trace, language} triplets.
  • Representation: Each chunk forms a 400×32×3 tensor (screen-aligned keypoints across 32 uniformly retimed steps).
  • Trace Modes: ~80% full 3D, ~20% fallback 2D-only (lacking depth or for static camera footage).
  • Depth normalization: At inference, predicted depth is rescaled via a Gaussian-smoothed per-pixel adjustment to match sensor depth (see the sketch after this list).
  • Instruction variability: Up to four textual variants per event chunk, preserving human-written plus auto-generated instructions.
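
The Gaussian-smoothed depth adjustment is only described at a high level; one plausible reading, sketched with scipy, estimates a smoothed sensor-to-predicted ratio field over valid sensor pixels and applies it multiplicatively.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def rescale_predicted_depth(pred: np.ndarray, sensor: np.ndarray,
                            sigma: float = 15.0) -> np.ndarray:
    """Rescale predicted depth toward sensor depth with a Gaussian-smoothed
    per-pixel correction field (an assumed implementation, not the paper's).

    pred:   (H, W) predicted depth map
    sensor: (H, W) sensor depth, with zeros/NaNs where invalid
    """
    valid = np.isfinite(sensor) & (sensor > 0) & (pred > 0)
    ratio = np.zeros_like(pred, dtype=np.float32)
    ratio[valid] = (sensor[valid] / pred[valid]).astype(np.float32)
    # Normalized smoothing: invalid pixels borrow the correction of neighbors.
    weight = gaussian_filter(valid.astype(np.float32), sigma)
    smoothed = gaussian_filter(ratio, sigma)
    correction = smoothed / np.clip(weight, 1e-6, None)
    return pred * correction
```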

5. Integration with TraceGen and Cross-Embodiment Few-Shot Learning

TraceForge’s normalized triplets directly serve as training (and adaptation) data for TraceGen, a flow-based generative world model operating in 3D trace space. For a new robot or manipulation domain:

  • A handful (5–15) of “warm-up” robot-specific videos, preprocessed by TraceForge, suffices to align the learned motion prior with the new embodiment.
  • At deployment, TraceGen predicts future 3D traces (in the canonical camera reference) conditioned on RGB-D and language.
  • A lightweight inverse-kinematics module maps predicted traces to robot control in real time (see the sketch after this list).
  • TraceGen, pretrained on TraceForge data, significantly improves sample efficiency and generalization: 80% mean success (with 5 robot videos) across four tasks; 67.5% even with only five in-the-wild human demo videos on a real robot; 50–600× faster inference than pixel-video models.
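
At deployment these pieces compose into a simple loop; the sketch below is purely illustrative, and trace_model, ik_solver, and robot are hypothetical placeholders rather than a real API.

```python
def deploy_step(rgbd, instruction, trace_model, ik_solver, robot):
    """Hypothetical control loop around a TraceGen-style model: predict a
    future 3D trace in the reference camera, follow one keypoint's path as
    the end-effector target, and stream joint commands obtained via IK.
    """
    trace = trace_model.predict(rgbd, instruction)   # (K, L, 3) predicted trace
    ee_path = trace[0]                               # assume keypoint 0 tracks the end effector
    for target_xyz in ee_path:                       # (L, 3) waypoints
        joints = ik_solver.solve(target_xyz)         # inverse kinematics per waypoint
        robot.move_to(joints)                        # real-time command streaming
```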

A plausible implication is that such trace-space normalization with language grounding enables highly reliable cross-embodiment skill transfer without explicit object detection or pixel-space synthesis, abstracting away most nuisance factors present in heterogeneous real demonstration video corpora (Lee et al., 26 Nov 2025).

6. Deployment Recommendations and Limitations

  • Deployment: Designed for scalable robotic skill and motion prior acquisition. Effective for research environments needing diverse multi-embodiment training data processed into a machine-actionable format.
  • Limitation: Depth reliability and trace quality degrade in scenes lacking sufficient structure for tracking or depth estimation; about 20% of traces are processed in 2D-only mode.
  • Future direction: Extension to richer manipulation semantics, more reliable unsupervised event segmentation, and direct policy learning on trace outputs.

For other usages such as identity tracing in forensic face-swapping scenarios, see the distinct TraceForge system built atop FaceTracer (Zhang et al., 11 Dec 2024).
