
TraceForge: 3D Modeling & Forensic Tracing

Updated 28 November 2025
  • TraceForge is the name shared by two distinct systems: a 3D robotic world modeling data pipeline and a forensic identity tracing system for face-swapped media.
  • The robotic pipeline standardizes diverse video demonstrations into compact 3D traces using camera motion compensation, depth prediction, and keypoint tracking, achieving robust cross-embodiment learning.
  • The forensic system extends the FaceTracer framework to non-intrusively identify source identities, offering practical benefits for digital forensics in manipulated media.

TraceForge is the designation for two distinct, domain-specific systems: (1) a large-scale robotic action video curation and representation pipeline for world modeling in 3D trace space, introduced alongside the TraceGen world model (Lee et al., 26 Nov 2025), and (2) an operational pipeline for source identity tracing of face-swapped images and videos, built on the FaceTracer framework (Zhang et al., 11 Dec 2024). The shared name is coincidental, and the two systems target separate tasks: the former addresses cross-embodiment learning via geometric motion unification, the latter non-intrusive forensic identification of face-swap source identities.

1. Unified, Symbolic 3D Representation Pipeline for Robotic World Modeling

TraceForge, in the context of (Lee et al., 26 Nov 2025), is the data curation and preprocessing engine that standardizes demonstration videos from diverse sources (direct human, egocentric, robot, “in-the-wild” handheld) to yield a compact, scene-centric 3D trace suitable for transferable world modeling.

Design Objectives

  • Unification: Converts heterogeneous video sources—varying in embodiment, camera motion, viewpoint, and environment—into a common 3D trace format abstracted from appearance.
  • Camera Motion Compensation: Operates without assuming static, calibrated cameras and without relying on object detectors.
  • Multimodal Alignment: Produces temporally aligned {RGB-D frame, 3D trace, language instruction} triplets enabling joint vision-language-motion learning (illustrated after this list).
  • Cross-Embodiment Comparability: Normalizes human and robot demonstration speeds so motion representations are directly comparable across embodiments.
  • Scale and Diversity: Yields 123,000 video chunks and 1.8M triplets across eight sources for pretraining downstream world models.
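
To make the triplet format concrete, the sketch below shows one possible container for a single training sample; the field names and exact shapes are illustrative assumptions (K = 400 keypoints, L = 32 retimed steps, per the statistics above), not a released schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TraceTriplet:
    """Illustrative container for one temporally aligned training sample."""
    rgb: np.ndarray      # (H, W, 3) uint8 frame
    depth: np.ndarray    # (H, W) float32 depth aligned to the frame
    trace: np.ndarray    # (K, L, 3) screen-aligned 3D keypoint trace
    instruction: str     # one auto- or human-written language instruction
```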

2. Pipeline Architecture and Processing Flow

Main Stages

A. Event Chunking & Instruction Generation

  • Given raw video with optional segment labels, segments are created either directly from the labels or via motion-magnitude heuristics (see the sketch after this list).
  • For each segment, three representative frames (start, mid, end) are extracted.
  • A vision-LLM auto-generates three instruction styles (imperative, stepwise, and natural request).
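
The motion-magnitude heuristic is not specified in detail; a minimal frame-differencing stand-in (assuming mean grayscale intensity change as the motion signal) might look like the following sketch.

```python
import numpy as np


def chunk_by_motion(frames: np.ndarray, threshold: float = 2.0,
                    min_len: int = 16) -> list[tuple[int, int]]:
    """Split a video into event chunks with a simple motion-magnitude
    heuristic (frame differencing); a stand-in for the unspecified
    heuristic used by TraceForge.

    frames: (T, H, W, 3) uint8 array; returns (start, end) frame index pairs.
    """
    gray = frames.mean(axis=-1).astype(np.float32)              # (T, H, W)
    motion = np.abs(np.diff(gray, axis=0)).mean(axis=(1, 2))    # (T-1,)
    active = motion > threshold                                 # moving frames

    chunks, start = [], None
    for t, is_active in enumerate(active):
        if is_active and start is None:
            start = t                                           # chunk opens
        elif not is_active and start is not None:
            if t - start >= min_len:
                chunks.append((start, t))                       # chunk closes
            start = None
    if start is not None and len(active) - start >= min_len:
        chunks.append((start, len(active)))                     # trailing chunk
    return chunks
```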

B. Camera Pose, Depth Prediction & 3D Keypoint Tracking

  • A 20×20 grid (400 keypoints) is initialized in the reference frame.
  • For each frame $t$: predict the depth map $D_t$ and camera extrinsics $(R_t, t_t)$ via a VGGT model (fine-tuned SpatialTrackerV2).
  • Dense 2D keypoint tracking across frames with CoTracker3.
  • Each keypoint is “lifted” to 3D via $Z_{i,t} = D_t(u_{i,t}, v_{i,t})$ and $[X^c, Y^c, Z^c]^\top = Z^c K^{-1} [u, v, 1]^\top$, where $K$ is the intrinsic matrix.
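
The per-keypoint lifting reduces to standard pinhole back-projection; a minimal numpy sketch, assuming the depth map and 2D tracks are already available from the models above:

```python
import numpy as np


def lift_keypoints_to_3d(uv: np.ndarray, depth: np.ndarray,
                         K: np.ndarray) -> np.ndarray:
    """Back-project tracked 2D keypoints into camera-frame 3D points:
    Z = D(u, v) and [X, Y, Z]^T = Z * K^{-1} [u, v, 1]^T.

    uv:    (N, 2) pixel coordinates (u, v) in one frame
    depth: (H, W) predicted depth map D_t for that frame
    K:     (3, 3) camera intrinsic matrix
    """
    u = uv[:, 0].round().astype(int)
    v = uv[:, 1].round().astype(int)
    z = depth[v, u]                                        # Z_{i,t} = D_t(u, v)
    homog = np.stack([uv[:, 0], uv[:, 1], np.ones(len(uv))], axis=1)
    rays = (np.linalg.inv(K) @ homog.T).T                  # K^{-1} [u, v, 1]^T
    return rays * z[:, None]                               # scale rays by depth
```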

C. World-to-Camera Alignment

  • All per-frame 3D points are transformed into the canonical reference frame of the chunk using $p_{i,t}^c = R_{\mathrm{ref}}(p_{i,t}^w - t_{\mathrm{ref}})$.
  • 3D traces are represented as $T_{\mathrm{ref}}^{t:t+L} \in \mathbb{R}^{K \times L \times 3}$, i.e., screen-aligned sequences.
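
The alignment is a rigid transform into the reference camera; a minimal sketch, assuming $t_{\mathrm{ref}}$ is the reference camera center expressed in world coordinates:

```python
import numpy as np


def to_reference_frame(points_w: np.ndarray, R_ref: np.ndarray,
                       t_ref: np.ndarray) -> np.ndarray:
    """Express world-frame keypoints in the chunk's reference camera,
    p^c = R_ref (p^w - t_ref).

    points_w: (K, L, 3) world-frame keypoint positions over the chunk
    R_ref:    (3, 3) reference camera rotation
    t_ref:    (3,)   reference camera center in world coordinates
    """
    return (points_w - t_ref) @ R_ref.T    # row-vector form of R_ref (p - t)
```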

D. Speed Retargeting

  • For each trajectory $X_i(t)$, the cumulative arc length $s(t) = \int_0^t \|\dot X_i(\tau)\| \, d\tau$ is computed.
  • The trace is resampled at uniform arc-length intervals to a standardized length $L$ (e.g., 32), normalizing motion tempo for training.
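
Arc-length retiming of a single keypoint trajectory can be sketched with numpy interpolation; this is an illustration of the reparameterization, not the exact resampler used in the pipeline.

```python
import numpy as np


def resample_by_arc_length(traj: np.ndarray, L: int = 32) -> np.ndarray:
    """Retime one trajectory to L samples spaced uniformly in cumulative
    arc length s(t) = integral of ||x_dot(tau)|| dtau.

    traj: (T, 3) 3D positions over the original frames; returns (L, 3).
    """
    seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)    # per-step lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])            # cumulative arc length
    if s[-1] == 0.0:
        return np.repeat(traj[:1], L, axis=0)              # static keypoint
    s_uniform = np.linspace(0.0, s[-1], L)                 # uniform arc-length grid
    return np.stack(
        [np.interp(s_uniform, s, traj[:, d]) for d in range(3)], axis=1
    )
```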

3. Component Algorithms, Models, and Mathematical Formulation

Algorithms

| Stage | Model/Algorithm Used | Purpose |
| --- | --- | --- |
| Depth & Extrinsics | VGGT (SpatialTrackerV2) | Per-frame 3D lifting, camera reference normalization |
| 2D Tracking | CoTracker3 | Robust 2D keypoint tracks over variable, moving-camera videos |
| Persistent Geometry | TAPIP3D (optional) | Geometry refinement in complex scenarios |
| Language Instruction Gen. | Vision-LLM (JSON) | Generates task/action instructions automatically |
| Speed Normalization | Arc-length reparameterization | Trajectory resampling by intrinsic motion arc length |

Key mathematical expressions:

  • Camera lifting: $[X^c; Y^c; Z^c] = Z^c K^{-1} [u; v; 1]$
  • Reference-frame projection: $p^c_{\mathrm{ref}} = R_{\mathrm{ref}}(p^w - t_{\mathrm{ref}})$
  • 3D trace tensor: $T_{\mathrm{ref}}^{t:t+L} = \{ [u_{i,t}, v_{i,t}, Z_{i,t}^c] \}_{i=1,\ldots,K;\ t=0,\ldots,L}$

4. Dataset Normalization and Output Statistics

  • Dataset scale: 123,000 video chunks from 8 source domains, yielding 1.8M {RGB-D, trace, language} triplets.
  • Representation: Each chunk forms a 400×32×3 tensor (screen-aligned keypoints across 32 uniformly retimed steps).
  • Trace Modes: ~80% full 3D, ~20% fallback 2D-only (lacking depth or for static camera footage).
  • Depth normalization: At inference, predicted depth is rescaled via a Gaussian-smoothed per-pixel adjustment to match sensor depth (see the sketch after this list).
  • Instruction variability: Up to four textual variants per event chunk, preserving human-written plus auto-generated instructions.
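
The Gaussian-smoothed depth adjustment is only described at a high level; one plausible reading, sketched with scipy, estimates a smoothed sensor-to-predicted ratio field over valid sensor pixels and applies it multiplicatively.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def rescale_predicted_depth(pred: np.ndarray, sensor: np.ndarray,
                            sigma: float = 15.0) -> np.ndarray:
    """Rescale predicted depth toward sensor depth with a Gaussian-smoothed
    per-pixel correction field (an assumed implementation, not the paper's).

    pred:   (H, W) predicted depth map
    sensor: (H, W) sensor depth, with zeros/NaNs where invalid
    """
    valid = np.isfinite(sensor) & (sensor > 0) & (pred > 0)
    ratio = np.zeros_like(pred, dtype=np.float32)
    ratio[valid] = (sensor[valid] / pred[valid]).astype(np.float32)
    # Normalized smoothing: invalid pixels borrow the correction of neighbors.
    weight = gaussian_filter(valid.astype(np.float32), sigma)
    smoothed = gaussian_filter(ratio, sigma)
    correction = smoothed / np.clip(weight, 1e-6, None)
    return pred * correction
```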

5. Integration with TraceGen and Cross-Embodiment Few-Shot Learning

TraceForge’s normalized triplets directly serve as training (and adaptation) data for TraceGen, a flow-based generative world model operating in 3D trace space. For a new robot or manipulation domain:

  • A handful (5–15) of “warm-up” robot-specific videos, preprocessed by TraceForge, suffices to align the learned motion prior with the new embodiment.
  • At deployment, TraceGen predicts future 3D traces (in the canonical camera reference) conditioned on RGB-D and language.
  • A lightweight inverse-kinematics module maps predicted traces to robot control in real time (see the sketch after this list).
  • TraceGen, pretrained on TraceForge data, significantly improves sample efficiency and generalization: 80% mean success (with 5 robot videos) across four tasks; 67.5% even with only five in-the-wild human demo videos on a real robot; 50–600× faster inference than pixel-video models.
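
At deployment these pieces compose into a simple loop; the sketch below is purely illustrative, and trace_model, ik_solver, and robot are hypothetical placeholders rather than a real API.

```python
def deploy_step(rgbd, instruction, trace_model, ik_solver, robot):
    """Hypothetical control loop around a TraceGen-style model: predict a
    future 3D trace in the reference camera, follow one keypoint's path as
    the end-effector target, and stream joint commands obtained via IK.
    """
    trace = trace_model.predict(rgbd, instruction)   # (K, L, 3) predicted trace
    ee_path = trace[0]                               # assume keypoint 0 tracks the end effector
    for target_xyz in ee_path:                       # (L, 3) waypoints
        joints = ik_solver.solve(target_xyz)         # inverse kinematics per waypoint
        robot.move_to(joints)                        # real-time command streaming
```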

A plausible implication is that such trace-space normalization with language grounding enables highly reliable cross-embodiment skill transfer without explicit object detection or pixel-space synthesis, abstracting away most nuisance factors present in heterogeneous real demonstration video corpora (Lee et al., 26 Nov 2025).

6. Deployment Recommendations and Limitations

  • Deployment: Designed for scalable robotic skill and motion prior acquisition. Effective for research environments needing diverse multi-embodiment training data processed into a machine-actionable format.
  • Limitation: Depth reliability and trace quality degrade in scenes lacking sufficient structure for tracking or depth estimation; about 20% of traces are processed in 2D-only mode.
  • Future direction: Extension to richer manipulation semantics, more reliable unsupervised event segmentation, and direct policy learning on trace outputs.

For other usages such as identity tracing in forensic face-swapping scenarios, see the distinct TraceForge system built atop FaceTracer (Zhang et al., 11 Dec 2024).
