
TAPIP3D: Persistent 3D Point Tracking

Updated 13 January 2026
  • TAPIP3D is a novel framework for tracking arbitrary points in persistent 3D geometry using monocular or stereo video inputs and world-coordinate stabilization.
  • It employs multi-scale feature extraction, iterative Transformer refinement, and 3D neighborhood attention to achieve robust, long-horizon 3D trajectory estimation.
  • Evaluations on benchmarks like TAPVid-3D and clinical stereo tissue tracking demonstrate improved accuracy and potential applications in clinical, robotic, and general video analysis.

TAPIP3D refers to methods and benchmarks for Tracking Any Point in Persistent 3D Geometry, a paradigm that generalizes classical pixel tracking to robust, long-horizon 3D trajectory estimation from monocular RGB(-D) or stereo video sources. This framework leverages feature clouds stabilized in 3D world or camera coordinates together with spatially aware deep attention mechanisms, improving point tracking accuracy and temporal consistency across static and dynamic scenes. The TAPIP3D canon today includes the core model “Tracking Any Point in Persistent 3D Geometry” (Zhang et al., 20 Apr 2025), clinically-geared stereo tissue tracking (Reuter et al., 11 Aug 2025), and the rigorous TAPVid-3D benchmark (Koppula et al., 2024).

1. Problem Formulation and Scope

TAPIP3D extends 2D TAP protocols, which track pixels or image patches over time, to recover the full 3D trajectories $\tau_q = \{(X_q^{(t)}, Y_q^{(t)}, Z_q^{(t)}) \mid t = 1, \ldots, T\}$ for arbitrary query points specified in video frames. Inputs to TAPIP3D models typically consist of:

  • A monocular RGB, RGB-D, or stereo video sequence
  • Per-frame depth maps from physical sensors (e.g., LiDAR, stereo cameras), learned monocular estimators (e.g., ZoeDepth), or structure-from-motion reconstruction (e.g., COLMAP)
  • Optional camera intrinsic and extrinsic matrices for pose and coordinate stabilization

The underlying goal is to estimate both 3D positions and binary visibility flags per point and time step, i.e., $(X, Y, Z, o_q^{(t)})$, thus handling occlusions and ambiguities. A key contribution of TAPIP3D (Zhang et al., 20 Apr 2025) is the stabilization of feature clouds into a persistent 3D world frame, efficiently cancelling camera motion and enabling temporally smooth trajectories.

2. Model Architecture and Computational Pipeline

The TAPIP3D pipeline comprises several distinct stages:

  1. Feature Extraction and Lifting: Each frame is processed by a shared CNN backbone (frequently ResNet-50-style, e.g., CoTracker3) to produce per-pixel feature maps. These features are then unprojected into 3D via depth and camera calibration parameters, yielding point clouds $F^{3D,s,t} \in \mathbb{R}^{N_s \times (3+C)}$ at multiple scales.
  2. Camera Stabilization: Coordinates are optionally transformed from camera-centric to world-centric using per-frame extrinsics, which normalizes trajectories of static objects and improves temporal regularity under camera motion.
  3. 3D Neighborhood-to-Neighborhood Attention ("N2N"): For correlation-based matching, the model assembles two local neighborhoods per query: one around the reference location in a source frame and another in the target frame. A cross-attention module computes deep similarity between these 3D neighborhoods, explicitly respecting spatial distance and relative position (Zhang et al., 20 Apr 2025).
  4. Iterative Motion Refinement: Frame-by-frame query tokens, combining local 3D correlations, relative displacements, projected pixel positions, and predicted visibility, are iteratively refined by a Transformer model. Each cycle yields updated motion increments and occlusion flags for every point track.
  5. Stereo Endoscopic Variant: For tissue tracking in stereo surgical images, TAPIP3D adapts two CoTracker3 TAP networks—one for temporal 2D tracking, one for stereo matching. Disparity and stereo geometry yield triangulated depth, with the pipeline aggregating across small RoI grid ensembles for noise reduction (Reuter et al., 11 Aug 2025).

Key architectural details include nearest-neighbor downsampling for efficient multi-scale processing, multi-resolution feature pooling (typically S=3 levels), and train-time augmentation via random rigid transformations of the lifted 3D point clouds.
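The local 3D neighborhoods that feed the N2N correlation stage can be gathered with a plain nearest-neighbor lookup over the stabilized point cloud. A minimal NumPy sketch (brute-force for clarity; a real implementation would use a spatial index, and `knn_neighborhood` is an illustrative name, not from the paper):

```python
import numpy as np

def knn_neighborhood(points, query, k=16):
    """Return indices of the k nearest 3D neighbors of a query point.

    points: (N, 3) stabilized 3D point cloud for one frame/scale
    query:  (3,) query location in the same coordinate frame
    The gathered indices select the local neighborhood whose features
    are later fed into the N2N cross-attention module.
    """
    d2 = ((points - query) ** 2).sum(axis=1)  # squared distances
    return np.argsort(d2)[:k]

# Toy cloud: 1000 random points, query at the origin, K = 16 neighbors
cloud = np.random.default_rng(0).standard_normal((1000, 3))
idx = knn_neighborhood(cloud, np.zeros(3), k=16)
print(idx.shape)  # (16,)
```

Brute force is O(N) per query; the nearest-neighbor downsampling mentioned above keeps N small enough at coarse scales for this to remain tractable.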

3. Mathematical Foundations

Core computations in TAPIP3D are explicitly grounded in projective geometry and deep matching:

  • 2D-to-3D Unprojection: Point $(u, v)$ with depth $d$ in a camera with intrinsics $(f, c_u, c_v)$ is mapped to $(X, Y, Z)$ by

X = (u - c_u)\,d/f,\quad Y = (v - c_v)\,d/f,\quad Z = d

When tracking in world coordinates, the camera-frame point is mapped through the inverse extrinsics: $[X_w, Y_w, Z_w, 1]^\top = (\mathrm{Cam}^{(t)})^{-1} \cdot [X, Y, Z, 1]^\top$, where $\mathrm{Cam}^{(t)}$ is the frame-$t$ world-to-camera matrix.
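The unprojection and world-frame stabilization steps can be sketched in NumPy. `unproject` and `to_world` are illustrative names, and the 4×4 pose is assumed to be the world-to-camera extrinsic:

```python
import numpy as np

def unproject(u, v, d, f, cu, cv):
    """Lift pixel (u, v) with depth d to camera-frame 3D coordinates."""
    X = (u - cu) * d / f
    Y = (v - cv) * d / f
    return np.array([X, Y, d])

def to_world(p_cam, cam_pose):
    """Map a camera-frame point into the persistent world frame.

    cam_pose: 4x4 world-to-camera extrinsic; its inverse takes
    camera coordinates back to world coordinates.
    """
    p_h = np.append(p_cam, 1.0)               # homogeneous coordinates
    return (np.linalg.inv(cam_pose) @ p_h)[:3]

# Principal point at the pixel, identity pose: camera frame == world frame
p = unproject(u=320, v=240, d=2.0, f=500.0, cu=320.0, cv=240.0)
print(p)                        # [0. 0. 2.]
print(to_world(p, np.eye(4)))   # [0. 0. 2.]
```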

  • Stereo Triangulation: In a calibrated, rectified stereo pair, disparity $d = u^L - u^R$ and baseline $b$ reconstruct depth $Z = fb/d$ and spatial coordinates as above (Reuter et al., 11 Aug 2025).
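The rectified-stereo depth formula is a one-liner; a minimal sketch, assuming pixel-unit disparity and a metric baseline (`stereo_depth` is an illustrative name):

```python
def stereo_depth(u_left, u_right, f, baseline):
    """Depth from disparity in a rectified stereo pair: Z = f*b/d."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at or beyond infinity")
    return f * baseline / disparity

# f = 500 px, baseline b = 0.05 m, disparity d = 25 px  ->  Z = 1.0 m
print(stereo_depth(400.0, 375.0, f=500.0, baseline=0.05))  # 1.0
```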
  • Attention Score: For neighborhoods $Q = \{(f_i^q, p_i^q)\}$ and $C = \{(f_j^c, p_j^c)\}$, attention is computed as:

f(q_i, k_j) = (W_q x_i^q)^\top (W_k x_j^c) + b

with

\alpha_{ij} = \exp(f(q_i, k_j)) \Big/ \sum_{m \in C} \exp(f(q_i, k_m))

and correlation pool $y_i = \sum_{j \in C} \alpha_{ij} (W_v x_j^c)$.
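The score, softmax, and correlation pool above can be reproduced with toy NumPy matrices standing in for the learned projections $W_q, W_k, W_v$; the feature dimension, neighborhood sizes, and scalar bias below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C_dim, n_q, n_c = 8, 4, 16   # feature dim, query / context neighborhood sizes

# Random matrices standing in for the learned W_q, W_k, W_v
W_q, W_k, W_v = (rng.standard_normal((C_dim, C_dim)) for _ in range(3))
x_q = rng.standard_normal((n_q, C_dim))   # query-neighborhood tokens
x_c = rng.standard_normal((n_c, C_dim))   # context-neighborhood tokens
b = 0.0                                   # scalar score bias

scores = (x_q @ W_q.T) @ (x_c @ W_k.T).T + b           # f(q_i, k_j)
scores -= scores.max(axis=1, keepdims=True)            # numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
y = alpha @ (x_c @ W_v.T)                              # correlation pool y_i

print(alpha.shape, y.shape)  # (4, 16) (4, 8)
```

Each query token attends over all context tokens in its 3D neighborhood, so the attention rows sum to one and `y` has one pooled feature per query point.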

  • Optimization Loss: For ground-truth locations $\hat{\tau}_q^{(t)}$ and visibility $\hat{o}_q^{(t)}$, the per-iteration loss is:

L = \sum_{q,t} \left[ \frac{1}{d_q^{(t)}} \|\tau_q^{(t)} - \hat{\tau}_q^{(t)}\|_2 + \alpha_{\mathrm{vis}}\, \mathrm{CE}(o_q^{(t)}, \hat{o}_q^{(t)}) \right]

Depth-weighting suppresses loss for distant points.
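A minimal NumPy sketch of this depth-weighted objective; the tensor shapes and the sigmoid parameterization of visibility logits are assumptions for illustration, not details from the paper:

```python
import numpy as np

def tracking_loss(tau, tau_gt, vis_logit, vis_gt, depth, alpha_vis=1.0):
    """Depth-weighted L2 position loss plus cross-entropy on visibility.

    tau, tau_gt: (Q, T, 3) predicted / ground-truth 3D positions
    vis_logit:   (Q, T) visibility logits (sigmoid parameterization assumed)
    vis_gt:      (Q, T) ground-truth visibility in {0, 1}
    depth:       (Q, T) per-point depth d_q^(t), used as 1/d weighting
    """
    pos = (np.linalg.norm(tau - tau_gt, axis=-1) / depth).sum()
    p = 1.0 / (1.0 + np.exp(-vis_logit))                       # sigmoid
    ce = -(vis_gt * np.log(p) + (1 - vis_gt) * np.log(1 - p)).sum()
    return pos + alpha_vis * ce

# Perfect positions and confident, correct visibility -> near-zero loss
tau = np.zeros((2, 3, 3))
vis = np.ones((2, 3))
loss = tracking_loss(tau, tau, 10.0 * vis, vis, depth=np.ones((2, 3)))
```

The $1/d_q^{(t)}$ factor means a 1 cm error at 2 m depth costs half as much as the same error at 1 m, which is what suppresses the contribution of distant points.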

4. Evaluation Protocols and Benchmarking

TAPVid-3D (Koppula et al., 2024) provides an extensive benchmark for TAPIP3D assessment, with over 4,000 real-world clips from the Aria Digital Twin, DriveTrack (Waymo car LiDAR), and Panoptic Studio datasets. Evaluation metrics include:

  • 3D Average Jaccard ($\mathrm{AJ}_{3D}$):

\mathrm{AJ}_{3D} = \frac{\sum_{i,t} v^i_t\, \hat v^i_t\, \alpha^i_t}{\sum_{i,t} v^i_t + \sum_{i,t} (1 - v^i_t)\, \hat v^i_t + \sum_{i,t} v^i_t\, \hat v^i_t\, (1 - \alpha^i_t)}

where $\alpha^i_t$ flags whether the 3D error falls under a depth-adaptive threshold.

  • 3D Point-Displacement Accuracy ($\mathrm{APD}_{3D}$): Fraction of visible points with distance below a depth-adaptive threshold determined by focal length and pixel scale.
  • Occlusion Accuracy (OA): Percentage of correctly predicted visible/invisible flags.
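The $\mathrm{AJ}_{3D}$ formula above reduces to a few boolean reductions over points $i$ and frames $t$; a minimal NumPy sketch, with `within_thresh` standing in for the per-point flags $\alpha^i_t$:

```python
import numpy as np

def average_jaccard_3d(vis_gt, vis_pred, within_thresh):
    """3D Average Jaccard from boolean (points x frames) arrays.

    vis_gt, vis_pred: ground-truth / predicted visibility flags v, v-hat
    within_thresh:    alpha, True where the 3D error is below the
                      depth-adaptive threshold
    """
    tp = (vis_gt & vis_pred & within_thresh).sum()       # visible & correct
    fn = vis_gt.sum()                                    # all GT-visible points
    fp = (~vis_gt & vis_pred).sum()                      # predicted visible while occluded
    wrong = (vis_gt & vis_pred & ~within_thresh).sum()   # visible but too far off
    return tp / (fn + fp + wrong)

# A perfect tracker scores 1.0
v = np.array([[True, True], [True, False]])
print(average_jaccard_3d(v, v, np.ones_like(v)))  # 1.0
```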

Empirical performance shows that TAPIP3D outperforms adapted 2D trackers (TAPIR, CoTracker, BootsTAPIR with COLMAP or ZoeDepth) and explicit monocular 3D models (SpatialTracker, DELTA) across multiple metrics and datasets, particularly in world-centric mode where camera extrinsics are leveraged to stabilize trajectories (Zhang et al., 20 Apr 2025). On TAPVid-3D, for example, TAPIP3D achieves 30.3 $\mathrm{AJ}_{3D}$ compared to DELTA's 26.4 (DexYCB-Pt, sensor depth), and up to 72.2 $\mathrm{AJ}_{3D}$ on LSFOdyssey synthetic trajectories (Zhang et al., 20 Apr 2025).

For surgical tissue tracking with stereo video, TAPIP3D records mean errors down to 1.1 mm at 10 mm/s on chicken tissue phantoms, robust tracking up to 80 mm/s, and real-time performance at ~33 FPS on commercial hardware (Reuter et al., 11 Aug 2025).

5. Implementation Details

The TAPIP3D modeling stack integrates:

  • 2D image encoder: ResNet-50-style, pretrained on TAP-VID, supporting input resolutions typical for video tracking (e.g., 1080×1920 for endoscopic stereo).
  • Multi-scale neighborhood processing: S=3 downsampling levels; K=16–32 nearest 3D neighbors for N2N correlation.
  • Transformer refinement: 4 motion refinement iterations per test window (typically 16 frames, slid by 8); trained for 200,000 steps with AdamW and a cosine schedule ($\text{lr} = 5\times10^{-4}$, weight decay $5\times10^{-4}$).
  • Data augmentation: Random rigid perturbation of 3D clouds, color jitter, blur, specular highlights, and template-driven region-of-interest selection for medical scenarios.

Official code and dataset access for TAPVid-3D are provided at https://tapvid3d.github.io, with standard .npy formats for tracks, queries, visibility, and camera intrinsics enabling reproducible annotation and evaluation (Koppula et al., 2024).

6. Limitations, Robustness, and Extensions

Failure cases are dominated by:

  • Inaccurate or noisy depth maps (sensor or learned), which propagate errors through lifting and geometric triangulation
  • Motion blur and specular highlights, especially for stereo matching in surgical imaging (Reuter et al., 11 Aug 2025)
  • Scalability limitations: Excessive query point density both increases tracking error and degrades real-time processing speed due to quadratic neighborhood growth (Reuter et al., 11 Aug 2025)

Suggested extensions from the literature include:

  • Informative or learned point selection to optimize trackability and landmark stability
  • Shared encoders for temporal and stereo streams to reduce computational redundancy
  • End-to-end fine-tuning on domain-specific scene data (e.g., surgical or highly deformable tissue)
  • Integration of geometric or learned priors for more robust depth regularization

A plausible implication is that TAPIP3D's modular backbone will benefit directly from improvements in deep feature encoding, 3D point cloud attention architectures, and joint pose-depth estimation pipelines, further broadening the applicability of markerless 3D tracking in clinical, robotic, and general video analysis domains.

7. Relationship to Broader TAPIR Frameworks

TAPIP3D occupies a distinct but related space to \texttt{tapir} (Gerlach et al., 2022), a software environment for topology identification and amplitude reduction in high-energy physics (multi-loop Feynman integrals). Despite the similar names, TAPIP3D is centered on computer vision for dynamic 3D scenes, whereas \texttt{tapir} and its tooling are concerned with graph-theoretic minimization, cut-filtering, and partial fraction decomposition for Feynman diagrams. The separation of physical meaning suggests no direct algorithmic overlap, though both frameworks share a commitment to flexible, modular, and scalable research computing.

In summary, TAPIP3D defines a robust paradigm for long-term, markerless 3D trajectory estimation in diverse video environments, coupling deep attention and geometric lifting to outperform prior art in both clinical and general-purpose benchmarks.
