
TAPIP3D: Persistent 3D Point Tracking

Updated 13 January 2026
  • TAPIP3D is a novel framework for tracking arbitrary points in persistent 3D geometry using monocular or stereo video inputs and world-coordinate stabilization.
  • It employs multi-scale feature extraction, iterative Transformer refinement, and 3D neighborhood attention to achieve robust, long-horizon 3D trajectory estimation.
  • Evaluations on benchmarks like TAPVid-3D and clinical stereo tissue tracking demonstrate improved accuracy and potential applications in clinical, robotic, and general video analysis.

TAPIP3D refers to methods and benchmarks for Tracking Any Point in Persistent 3D Geometry, a paradigm that generalizes classical pixel tracking to robust, long-horizon 3D trajectory estimation from monocular RGB(-D) or stereo video sources. This framework leverages feature clouds stabilized in 3D world or camera coordinates together with spatially aware deep attention mechanisms, improving point tracking accuracy and temporal consistency across static and dynamic scenes. The TAPIP3D canon today includes the core model “Tracking Any Point in Persistent 3D Geometry” (Zhang et al., 20 Apr 2025), clinically-geared stereo tissue tracking (Reuter et al., 11 Aug 2025), and the rigorous TAPVid-3D benchmark (Koppula et al., 2024).

1. Problem Formulation and Scope

TAPIP3D extends 2D TAP protocols, which track pixels or image patches over time, to recover the full 3D trajectories $\tau_q = \{(X_q^{(t)}, Y_q^{(t)}, Z_q^{(t)}) \mid t = 1, \ldots, T\}$ for arbitrary query points specified in video frames. Inputs to TAPIP3D models typically consist of:

  • A monocular RGB, RGB-D, or stereo video sequence
  • Per-frame depth maps from physical sensors (e.g., LiDAR, stereo cameras), learned monocular estimators (e.g., ZoeDepth), or structure-from-motion reconstruction (e.g., COLMAP)
  • Optional camera intrinsic and extrinsic matrices for pose and coordinate stabilization

The underlying goal is to estimate both 3D positions and binary visibility flags per point and time step, i.e., $(X, Y, Z, o_q^{(t)})$, thus handling occlusions and ambiguities. A key contribution of TAPIP3D (Zhang et al., 20 Apr 2025) is the stabilization of feature clouds into a persistent 3D world frame, efficiently cancelling camera motion and enabling temporally smooth trajectories.

2. Model Architecture and Computational Pipeline

The TAPIP3D pipeline comprises several distinct stages:

  1. Feature Extraction and Lifting: Each frame is processed by a shared CNN backbone (frequently ResNet-50-style, e.g., CoTracker3) to produce per-pixel feature maps. These features are then unprojected into 3D via depth and camera calibration parameters, yielding point clouds $F^{3D,s,t} \in \mathbb{R}^{N_s \times (3+C)}$ at multiple scales.
  2. Camera Stabilization: Coordinates are optionally transformed from camera-centric to world-centric using per-frame extrinsics, which normalizes trajectories of static objects and improves temporal regularity under camera motion.
  3. 3D Neighborhood-to-Neighborhood Attention ("N2N"): For correlation-based matching, the model assembles two local neighborhoods per query: one around the reference location in a source frame and another in the target frame. A cross-attention module computes deep similarity between these 3D neighborhoods, explicitly respecting spatial distance and relative position (Zhang et al., 20 Apr 2025).
  4. Iterative Motion Refinement: Frame-by-frame query tokens, combining local 3D correlations, relative displacements, projected pixel positions, and predicted visibility, are iteratively refined by a Transformer model. Each cycle yields updated motion increments and occlusion flags for every point track.
  5. Stereo Endoscopic Variant: For tissue tracking in stereo surgical images, TAPIP3D adapts two CoTracker3 TAP networks—one for temporal 2D tracking, one for stereo matching. Disparity and stereo geometry yield triangulated depth, with the pipeline aggregating across small RoI grid ensembles for noise reduction (Reuter et al., 11 Aug 2025).

Key architectural details include nearest-neighbor downsampling for efficient multi-scale processing, multi-resolution feature pooling (typically S=3 levels), and train-time augmentation via random rigid transformations of the lifted 3D point clouds.
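The local 3D neighborhoods that feed the N2N correlation stage can be gathered with a plain nearest-neighbor lookup over the stabilized point cloud. A minimal NumPy sketch (brute-force for clarity; a real implementation would use a spatial index, and `knn_neighborhood` is an illustrative name, not from the paper):

```python
import numpy as np

def knn_neighborhood(points, query, k=16):
    """Return indices of the k nearest 3D neighbors of a query point.

    points: (N, 3) stabilized 3D point cloud for one frame/scale
    query:  (3,) query location in the same coordinate frame
    The gathered indices select the local neighborhood whose features
    are later fed into the N2N cross-attention module.
    """
    d2 = ((points - query) ** 2).sum(axis=1)  # squared distances
    return np.argsort(d2)[:k]

# Toy cloud: 1000 random points, query at the origin, K = 16 neighbors
cloud = np.random.default_rng(0).standard_normal((1000, 3))
idx = knn_neighborhood(cloud, np.zeros(3), k=16)
print(idx.shape)  # (16,)
```

Brute force is O(N) per query; the nearest-neighbor downsampling mentioned above keeps N small enough at coarse scales for this to remain tractable.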

3. Mathematical Foundations

Core computations in TAPIP3D are explicitly grounded in projective geometry and deep matching:

  • 2D-to-3D Unprojection: Point $(u, v)$ with depth $d$ in a camera with intrinsics $(f, c_u, c_v)$ is mapped to $(X, Y, Z)$ by

X = (u - c_u)\,d/f,\quad Y = (v - c_v)\,d/f,\quad Z = d

When tracking in world coordinates, the camera-frame point is mapped through the inverse extrinsics: $[X_w, Y_w, Z_w, 1]^\top = (\mathrm{Cam}^{(t)})^{-1} \cdot [X, Y, Z, 1]^\top$, where $\mathrm{Cam}^{(t)}$ is the frame-$t$ world-to-camera matrix.
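The unprojection and world-frame stabilization steps can be sketched in NumPy. `unproject` and `to_world` are illustrative names, and the 4×4 pose is assumed to be the world-to-camera extrinsic:

```python
import numpy as np

def unproject(u, v, d, f, cu, cv):
    """Lift pixel (u, v) with depth d to camera-frame 3D coordinates."""
    X = (u - cu) * d / f
    Y = (v - cv) * d / f
    return np.array([X, Y, d])

def to_world(p_cam, cam_pose):
    """Map a camera-frame point into the persistent world frame.

    cam_pose: 4x4 world-to-camera extrinsic; its inverse takes
    camera coordinates back to world coordinates.
    """
    p_h = np.append(p_cam, 1.0)               # homogeneous coordinates
    return (np.linalg.inv(cam_pose) @ p_h)[:3]

# Principal point at the pixel, identity pose: camera frame == world frame
p = unproject(u=320, v=240, d=2.0, f=500.0, cu=320.0, cv=240.0)
print(p)                        # [0. 0. 2.]
print(to_world(p, np.eye(4)))   # [0. 0. 2.]
```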

  • Stereo Triangulation: In a calibrated, rectified stereo pair, disparity $d = u^L - u^R$ and baseline $b$ reconstruct depth $Z = fb/d$ and spatial coordinates as above (Reuter et al., 11 Aug 2025).
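The rectified-stereo depth formula is a one-liner; a minimal sketch, assuming pixel-unit disparity and a metric baseline (`stereo_depth` is an illustrative name):

```python
def stereo_depth(u_left, u_right, f, baseline):
    """Depth from disparity in a rectified stereo pair: Z = f*b/d."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at or beyond infinity")
    return f * baseline / disparity

# f = 500 px, baseline b = 0.05 m, disparity d = 25 px  ->  Z = 1.0 m
print(stereo_depth(400.0, 375.0, f=500.0, baseline=0.05))  # 1.0
```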
  • Attention Score: For neighborhoods $Q = \{(f_i^q, p_i^q)\}$ and $C = \{(f_j^c, p_j^c)\}$, attention is computed as:

f(q_i, k_j) = (W_q x_i^q)^\top (W_k x_j^c) + b

with

\alpha_{ij} = \exp(f(q_i, k_j)) \Big/ \sum_{m \in C} \exp(f(q_i, k_m))

and correlation pool $y_i = \sum_{j \in C} \alpha_{ij} (W_v x_j^c)$.
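The score, softmax, and correlation pool above can be reproduced with toy NumPy matrices standing in for the learned projections $W_q, W_k, W_v$; the feature dimension, neighborhood sizes, and scalar bias below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C_dim, n_q, n_c = 8, 4, 16   # feature dim, query / context neighborhood sizes

# Random matrices standing in for the learned W_q, W_k, W_v
W_q, W_k, W_v = (rng.standard_normal((C_dim, C_dim)) for _ in range(3))
x_q = rng.standard_normal((n_q, C_dim))   # query-neighborhood tokens
x_c = rng.standard_normal((n_c, C_dim))   # context-neighborhood tokens
b = 0.0                                   # scalar score bias

scores = (x_q @ W_q.T) @ (x_c @ W_k.T).T + b           # f(q_i, k_j)
scores -= scores.max(axis=1, keepdims=True)            # numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
y = alpha @ (x_c @ W_v.T)                              # correlation pool y_i

print(alpha.shape, y.shape)  # (4, 16) (4, 8)
```

Each query token attends over all context tokens in its 3D neighborhood, so the attention rows sum to one and `y` has one pooled feature per query point.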

  • Optimization Loss: For ground-truth locations $\hat{\tau}_q^{(t)}$ and visibility $\hat{o}_q^{(t)}$, the per-iteration loss is:

L = \sum_{q,t} \left[ \frac{1}{d_q^{(t)}} \|\tau_q^{(t)} - \hat{\tau}_q^{(t)}\|_2 + \alpha_{\mathrm{vis}}\, \mathrm{CE}(o_q^{(t)}, \hat{o}_q^{(t)}) \right]

Depth-weighting suppresses loss for distant points.
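A minimal NumPy sketch of this depth-weighted objective; the tensor shapes and the sigmoid parameterization of visibility logits are assumptions for illustration, not details from the paper:

```python
import numpy as np

def tracking_loss(tau, tau_gt, vis_logit, vis_gt, depth, alpha_vis=1.0):
    """Depth-weighted L2 position loss plus cross-entropy on visibility.

    tau, tau_gt: (Q, T, 3) predicted / ground-truth 3D positions
    vis_logit:   (Q, T) visibility logits (sigmoid parameterization assumed)
    vis_gt:      (Q, T) ground-truth visibility in {0, 1}
    depth:       (Q, T) per-point depth d_q^(t), used as 1/d weighting
    """
    pos = (np.linalg.norm(tau - tau_gt, axis=-1) / depth).sum()
    p = 1.0 / (1.0 + np.exp(-vis_logit))                       # sigmoid
    ce = -(vis_gt * np.log(p) + (1 - vis_gt) * np.log(1 - p)).sum()
    return pos + alpha_vis * ce

# Perfect positions and confident, correct visibility -> near-zero loss
tau = np.zeros((2, 3, 3))
vis = np.ones((2, 3))
loss = tracking_loss(tau, tau, 10.0 * vis, vis, depth=np.ones((2, 3)))
```

The $1/d_q^{(t)}$ factor means a 1 cm error at 2 m depth costs half as much as the same error at 1 m, which is what suppresses the contribution of distant points.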

4. Evaluation Protocols and Benchmarking

TAPVid-3D (Koppula et al., 2024) provides an extensive benchmark for TAPIP3D assessment, with over 4,000 real-world clips from the Aria Digital Twin, DriveTrack (Waymo car LiDAR), and Panoptic Studio datasets. Evaluation metrics include:

  • 3D Average Jaccard ($\mathrm{AJ}_{3D}$):

\mathrm{AJ}_{3D} = \frac{\sum_{i,t} v^i_t\, \hat v^i_t\, \alpha^i_t}{\sum_{i,t} v^i_t + \sum_{i,t} (1 - v^i_t)\, \hat v^i_t + \sum_{i,t} v^i_t\, \hat v^i_t\, (1 - \alpha^i_t)}

where $\alpha^i_t$ flags whether the 3D error falls under a depth-adaptive threshold.

  • 3D Point-Displacement Accuracy ($\mathrm{APD}_{3D}$): Fraction of visible points with distance below a depth-adaptive threshold determined by focal length and pixel scale.
  • Occlusion Accuracy (OA): Percentage of correctly predicted visible/invisible flags.
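The $\mathrm{AJ}_{3D}$ formula above reduces to a few boolean reductions over points $i$ and frames $t$; a minimal NumPy sketch, with `within_thresh` standing in for the per-point flags $\alpha^i_t$:

```python
import numpy as np

def average_jaccard_3d(vis_gt, vis_pred, within_thresh):
    """3D Average Jaccard from boolean (points x frames) arrays.

    vis_gt, vis_pred: ground-truth / predicted visibility flags v, v-hat
    within_thresh:    alpha, True where the 3D error is below the
                      depth-adaptive threshold
    """
    tp = (vis_gt & vis_pred & within_thresh).sum()       # visible & correct
    fn = vis_gt.sum()                                    # all GT-visible points
    fp = (~vis_gt & vis_pred).sum()                      # predicted visible while occluded
    wrong = (vis_gt & vis_pred & ~within_thresh).sum()   # visible but too far off
    return tp / (fn + fp + wrong)

# A perfect tracker scores 1.0
v = np.array([[True, True], [True, False]])
print(average_jaccard_3d(v, v, np.ones_like(v)))  # 1.0
```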

Empirical performance shows that TAPIP3D outperforms adapted 2D trackers (TAPIR, CoTracker, BootsTAPIR with COLMAP or ZoeDepth) and explicit monocular 3D models (SpatialTracker, DELTA) across multiple metrics and datasets, particularly in world-centric mode where camera extrinsics are leveraged to stabilize trajectories (Zhang et al., 20 Apr 2025). On TAPVid-3D, for example, TAPIP3D achieves 30.3 $\mathrm{AJ}_{3D}$ compared to DELTA's 26.4 (DexYCB-Pt, sensor depth), and up to 72.2 $\mathrm{AJ}_{3D}$ on LSFOdyssey synthetic trajectories (Zhang et al., 20 Apr 2025).

For surgical tissue tracking with stereo video, TAPIP3D records mean errors down to 1.1 mm at 10 mm/s on chicken tissue phantoms, robust tracking up to 80 mm/s, and real-time performance at ~33 FPS on commercial hardware (Reuter et al., 11 Aug 2025).

5. Implementation Details

The TAPIP3D modeling stack integrates:

  • 2D image encoder: ResNet-50-style, pretrained on TAP-VID, supporting input resolutions typical for video tracking (e.g., 1080×1920 for endoscopic stereo).
  • Multi-scale neighborhood processing: S=3 downsampling levels; K=16–32 nearest 3D neighbors for N2N correlation.
  • Transformer refinement: 4 motion refinement iterations per test window (typically 16 frames, slid by 8); trained for 200,000 steps with AdamW and a cosine schedule ($\text{lr} = 5\times10^{-4}$, weight decay $5\times10^{-4}$).
  • Data augmentation: Random rigid perturbation of 3D clouds, color jitter, blur, specular highlights, and template-driven region-of-interest selection for medical scenarios.

Official code and dataset access for TAPVid-3D are provided at https://tapvid3d.github.io, with standard .npy formats for tracks, queries, visibility, and camera intrinsics enabling reproducible annotation and evaluation (Koppula et al., 2024).

6. Limitations, Robustness, and Extensions

Failure cases are dominated by:

  • Inaccurate or noisy depth maps (sensor or learned), which propagate errors through lifting and geometric triangulation
  • Motion blur and specular highlights, especially for stereo matching in surgical imaging (Reuter et al., 11 Aug 2025)
  • Scalability limitations: Excessive query point density both increases tracking error and degrades real-time processing speed due to quadratic neighborhood growth (Reuter et al., 11 Aug 2025)

Suggested extensions from the literature include:

  • Informative or learned point selection to optimize trackability and landmark stability
  • Shared encoders for temporal and stereo streams to reduce computational redundancy
  • End-to-end fine-tuning on domain-specific scene data (e.g., surgical or highly deformable tissue)
  • Integration of geometric or learned priors for more robust depth regularization

A plausible implication is that TAPIP3D's modular backbone will benefit directly from improvements in deep feature encoding, 3D point cloud attention architectures, and joint pose-depth estimation pipelines, further broadening the applicability of markerless 3D tracking in clinical, robotic, and general video analysis domains.

7. Relationship to Broader TAPIR Frameworks

TAPIP3D occupies a distinct but related space to \texttt{tapir} (Gerlach et al., 2022), a software environment for topology identification and amplitude reduction in high-energy physics (multi-loop Feynman integrals). Despite the similar names, TAPIP3D is centered on computer vision for dynamic 3D scenes, whereas \texttt{tapir} and its tooling are concerned with graph-theoretic minimization, cut-filtering, and partial fraction decomposition for Feynman diagrams. The separation of physical meaning suggests no direct algorithmic overlap, though both frameworks share a commitment to flexible, modular, and scalable research computing.

In summary, TAPIP3D defines a robust paradigm for long-term, markerless 3D trajectory estimation in diverse video environments, coupling deep attention and geometric lifting to outperform prior art in both clinical and general-purpose benchmarks.
