Tracking Upsampler: Dense 2D Tracks
- Tracking Upsampler is a neural module that transforms sparse 2D pixel tracks into full-resolution dense tracks via locally supported, weighted interpolation.
- It leverages CNN backbones and transformer-based cross-attention to aggregate spatially neighboring features for accurate, high-density 3D tracking.
- Empirical results show up to 12× runtime reduction and significant improvements in metrics like EPE and IoU over full-resolution tracking methods.
A tracking upsampler is a neural module for transforming sparse or coarse two-dimensional (2D) pixel tracks—computed by computationally intensive tracking algorithms—into full-resolution, dense 2D tracks for every pixel in an image sequence. The tracking upsampler addresses the central challenge in long-term, dense 3D vision pipelines arising from the prohibitive overhead of exhaustively tracking all pixels using heavyweight models. It enables world-centric systems such as TrackingWorld (Lu et al., 9 Dec 2025) and DELTA (Ngo et al., 31 Oct 2024) to achieve high-density, accurate long-range 3D tracking by lifting sparse correspondences to dense tracks via lightweight, locally supported interpolation guided by appearance and geometric cues.
1. Motivation and Problem Definition
State-of-the-art 2D “any-point” trackers (e.g., CoTracker, TAPIR, DELTA) typically generate either (i) a limited set of long-lived but sparse tracks or (ii) grid-aligned dense tracks restricted to low resolution. Neither approach suffices for reconstructing the full pixelwise 3D trajectories required for dense, world-centric tracking, especially when dynamic objects emerge mid-sequence or when per-pixel detail is crucial for downstream tasks. Directly running a high-capacity tracker at per-pixel granularity induces prohibitive computation.
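The upsampling task is thus formalized as follows: given sparse 2D track positions $\{p_j^t\}$ and associated features $\{f_j\}$ for coarse grid points in frame $t$, predict dense 2D track positions $\hat{p}_i^t$ at full image resolution. Each dense pixel position is hypothesized as a convex combination of its spatially proximate sparse tracks:

$$\hat{p}_i^t = \sum_{j \in \mathcal{N}_k(i)} w_{ij}\, p_j^t, \qquad w_{ij} \ge 0, \quad \sum_{j \in \mathcal{N}_k(i)} w_{ij} = 1,$$

where the weights $w_{ij}$ are spatially local, typically involving only the $k$ nearest neighbors $\mathcal{N}_k(i)$ of pixel $i$.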
2. Algorithmic Formulation and Pseudocode
The canonical tracking upsampler workflow consists of the following steps (framewise, with batch indices omitted):
```python
def Upsample2D(P_sparse, F_sparse, Queries, k):
    F_query = QueryFeatureExtractor(Queries)           # [H*W x C_q]
    idx, dists = KNN(Queries, P_sparse, K=k)           # for each output pixel, find k nearest sparse points
    F_keys = GatherRows(F_sparse, idx)                 # [H*W x k x C]
    P_keys = GatherRows(P_sparse, idx)                 # [H*W x k x 2]
    W_logits = WeightNetwork(F_query, F_keys, dists)   # [H*W x k]
    W = Softmax(W_logits, dim=1)                       # normalized weights over the k neighbors
    P_dense = np.sum(W[..., None] * P_keys, axis=1)    # weighted sum for each pixel
    return P_dense
```
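Mathematically, for each dense pixel $i$ and sparse neighbor $j$ in its $k$-nearest set $\mathcal{N}_k(i)$, the interpolation weight is

$$w_{ij} = \frac{\exp\big(\phi(f_i^{q}, f_j, d_{ij})\big)}{\sum_{j' \in \mathcal{N}_k(i)} \exp\big(\phi(f_i^{q}, f_{j'}, d_{ij'})\big)},$$

where $\phi$ is a small multilayer perceptron, $f_i^{q}$ and $f_j$ are the query and sparse features, and $d_{ij}$ is the Euclidean distance between pixel $i$ and sparse point $j$.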
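The workflow above can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the implementation from either paper: the learned weight network is replaced by a hand-written stand-in (feature dot product minus a distance penalty), the k-NN search is brute force, and the names `knn`, `upsample_2d`, and `tau` are hypothetical.

```python
import numpy as np

def knn(queries, points, k):
    """Brute-force k nearest neighbours; returns indices and distances, both [Q, k]."""
    diff = queries[:, None, :] - points[None, :, :]        # [Q, N, 2]
    dists = np.linalg.norm(diff, axis=-1)                  # [Q, N]
    idx = np.argsort(dists, axis=1)[:, :k]                 # [Q, k]
    return idx, np.take_along_axis(dists, idx, axis=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)                # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def upsample_2d(queries, q_feat, p_sparse, f_sparse, k=4, tau=1.0):
    """Interpolate dense positions as convex combinations of sparse ones.

    queries:  [Q, 2] dense pixel coordinates
    q_feat:   [Q, C] query features
    p_sparse: [N, 2] sparse track positions
    f_sparse: [N, C] sparse track features
    """
    idx, dists = knn(queries, p_sparse, k)
    f_keys = f_sparse[idx]                                 # [Q, k, C]
    p_keys = p_sparse[idx]                                 # [Q, k, 2]
    # Assumption: stand-in for the learned weight network (an MLP in the text).
    logits = np.einsum("qc,qkc->qk", q_feat, f_keys) - dists / tau
    w = softmax(logits, axis=1)                            # [Q, k], each row sums to 1
    p_dense = np.sum(w[..., None] * p_keys, axis=1)        # [Q, 2]
    return p_dense, w
```

Because the weights are non-negative and sum to one, every interpolated position lies inside the convex hull of its k neighbors, which is the property the convex-combination formulation relies on.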
3. Architecture in DELTA and TrackingWorld
In DELTA (Ngo et al., 31 Oct 2024), the upsampler is a transformer-based local cross-attention module. The pipeline operates as follows:
- A CNN backbone extracts coarse features at a reduced spatial resolution.
- Global-local attention performs tracking at this low resolution, producing coarse 3D tracks and their associated features.
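- The upsampler lifts these to fine resolution by having each fine pixel $(u, v)$ attend to a small, fixed-size set of coarse grid neighbors $\mathcal{N}(u, v)$. The attention logits combine a query-key similarity term with an additive Alibi bias that encodes spatial proximity.
- The upsampled position is a weighted sum of the neighboring coarse positions, $\hat{P}_{uv} = \sum_{j \in \mathcal{N}(u, v)} w_{uv,j}\, P_j$, with the weights $w_{uv,j}$ derived from the attention weights after softmax and local MLP refinement.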
- The query/key features are created by upsampling the coarse CNN features and concatenating the appropriate track features.
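The local cross-attention step described in this list can be sketched as follows. This is a simplified illustration, not DELTA's code: the scaled dot-product score, the square neighborhood of `radius` coarse cells, the linear distance penalty with a single `slope`, and the function name `attention_upsample` are all assumptions, and the local MLP refinement is omitted.

```python
import numpy as np

def attention_upsample(coarse_pos, coarse_feat, fine_feat, r, radius=1, slope=1.0):
    """Upsample coarse positions by local attention with a distance bias.

    coarse_pos:  [h, w, 2] coarse track positions
    coarse_feat: [h, w, C] key features on the coarse grid
    fine_feat:   [h*r, w*r, C] query features on the fine grid
    r:           integer upsampling factor
    """
    h, w, C = coarse_feat.shape
    H, W = h * r, w * r
    out = np.zeros((H, W, 2))
    for v in range(H):
        for u in range(W):
            cy, cx = v // r, u // r                      # coarse cell containing (u, v)
            logits, values = [], []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y, x = cy + dy, cx + dx
                    if not (0 <= y < h and 0 <= x < w):
                        continue                         # skip out-of-bounds neighbours
                    # Centre of coarse cell (y, x) in fine-pixel coordinates.
                    ky, kx = (y + 0.5) * r - 0.5, (x + 0.5) * r - 0.5
                    dist = np.hypot(v - ky, u - kx)
                    score = fine_feat[v, u] @ coarse_feat[y, x] / np.sqrt(C)
                    logits.append(score - slope * dist)  # ALiBi-style linear distance bias
                    values.append(coarse_pos[y, x])
            logits = np.array(logits)
            wts = np.exp(logits - logits.max())          # softmax over the local neighbours
            wts /= wts.sum()
            out[v, u] = wts @ np.array(values)           # convex combination of positions
    return out
```

The explicit loops keep the neighborhood logic readable; a practical version would vectorize the gather, but the per-pixel cost stays bounded by the fixed neighborhood size either way.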
In TrackingWorld (Lu et al., 9 Dec 2025), this architecture is repurposed generically: any arbitrary sparse set of 2D tracks and features can serve as input, provided the features encode sufficient neighborhood appearance and geometry (e.g., DELTA’s correlation maps). No new loss is introduced for the upsampler module in TrackingWorld, which inherits its parameters from DELTA.
4. Computational Properties and Practical Trade-Offs
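The complexity for upsampling with $N_s$ sparse and $N_d$ dense points, $k$ neighbors, and feature/channel dimension $C$, is:
- k-NN search: $O(N_d N_s)$ by brute force (or $O(N_d \log N_s)$ with a spatial index or approximate search),
- Feature gathering: $O(N_d k C)$,
- MLP forward: $O(N_d k C^2)$ for hidden widths proportional to $C$,
- Weighted sum: $O(N_d k)$.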
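To make the scaling tangible, the following sketch counts rough operations per frame for each stage. The cost model is an assumption chosen for illustration (brute-force or logarithmic k-NN, an MLP with hidden width equal to the channel dimension, 2D outputs), and the function name and example sizes are hypothetical rather than taken from the referenced works.

```python
import math

def upsampler_cost(n_sparse, n_dense, k, c, brute_force_knn=True):
    """Rough per-frame operation counts under an assumed cost model."""
    if brute_force_knn:
        knn = n_dense * n_sparse                         # all pairwise distances
    else:
        knn = n_dense * math.log2(max(n_sparse, 2))      # tree-based lookup per query
    return {
        "knn": knn,
        "gather": n_dense * k * c,                       # copy k feature rows per pixel
        "mlp": n_dense * k * c * c,                      # one C x C layer per neighbour
        "weighted_sum": n_dense * k * 2,                 # k-term sum over (x, y)
    }

# Hypothetical sizes, for illustration only.
small = upsampler_cost(n_sparse=1_000, n_dense=10_000, k=4, c=8)
large = upsampler_cost(n_sparse=1_000, n_dense=20_000, k=4, c=8)
for stage in small:
    print(stage, small[stage], large[stage])
```

Doubling the number of dense points doubles every term, which matches the linear-in-pixels scaling that local support is meant to guarantee.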
Empirically, for the configuration reported in TrackingWorld, the upsampler processes a frame in at most about 20 ms on an NVIDIA RTX 4090. It accounts for only a small fraction of the runtime of the end-to-end TrackingWorld pipeline for 30 frames, compared to several seconds per frame for a full-resolution tracker (Lu et al., 9 Dec 2025). In DELTA, the cross-attention upsampler incurs only minor overhead versus the coarse tracking stage (Ngo et al., 31 Oct 2024).
5. Supervision and Losses
In DELTA, the total training loss combines three terms: a regression loss on 2D displacements (applied at both coarse and fine resolutions), a regression loss on predicted depth changes (using a depth parameterization chosen for invariance), and a binary cross-entropy loss on visibility. No bespoke loss is added for the upsampler; it is supervised simply by training on the upsampled outputs. In TrackingWorld, all learning occurs in DELTA; the upsampler itself is not further trained unless one adds an explicit reconstruction objective.
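A three-term objective of this kind can be sketched as below. The specifics are assumptions made for illustration and are not confirmed by the source: both regression terms use an L1 norm, depth changes are compared directly without any invariance transform, and the weights `w_depth` and `w_vis` are hypothetical.

```python
import numpy as np

def tracking_loss(pred_xy, gt_xy, pred_dz, gt_dz, vis_logit, gt_vis,
                  w_depth=1.0, w_vis=1.0):
    """Sum of 2D, depth-change, and visibility losses (illustrative only)."""
    l_2d = np.abs(pred_xy - gt_xy).mean()                # assumed L1 on 2D positions
    l_depth = np.abs(pred_dz - gt_dz).mean()             # assumed L1 on depth changes
    p = 1.0 / (1.0 + np.exp(-vis_logit))                 # sigmoid -> visibility probability
    p = np.clip(p, 1e-7, 1.0 - 1e-7)                     # avoid log(0)
    l_vis = -(gt_vis * np.log(p) + (1.0 - gt_vis) * np.log(1.0 - p)).mean()
    return l_2d + w_depth * l_depth + w_vis * l_vis
```

Under this sketch, supervising the upsampler needs no extra term: the same function would be evaluated once on the coarse predictions and once on the upsampled ones, and the two values added.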
6. Empirical Performance and Ablations
On the CVO test set (500 sequences, 7 frames), integrating the upsampler yields strong accuracy/runtime trade-offs:
| Method | EPE↓ | IoU↑ | Runtime (min) |
|---|---|---|---|
| CoTrackerV3 (full-res) | 1.45 | 76.8% | 3.00 |
| CoTrackerV3 + Upsampler | 1.24 | 80.9% | 0.25 |
The upsampler improves EPE by 0.21 px (about 15%) and visible-region IoU by 4.1 points, while reducing runtime by 12× (from 3.00 min to 0.25 min) thanks to low-resolution tracking plus efficient upsampling (Lu et al., 9 Dec 2025). Ablations on neighbor count show that accuracy peaks at a particular value of $k$, with diminishing returns or degraded accuracy away from it. In TrackingWorld's pipeline, substituting CoTrackerV3 + Upsampler yields 3D depth and pose metrics on par with DELTA’s original module, confirming modularity.
In DELTA, ablations (Table 4c) show that the transformer-based upsampler with Alibi spatial bias achieves roughly 30% lower EPE (3.67 vs. 5.31) than a bilinear baseline and improves occlusion accuracy.
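The headline figures in this section follow directly from the reported numbers. The short script below recomputes them from the values quoted in the table and in the DELTA ablation; the variable names are illustrative only.

```python
# Values copied from the CVO table and the DELTA ablation in this section.
epe_full, epe_up = 1.45, 1.24            # EPE in pixels
iou_full, iou_up = 76.8, 80.9            # IoU in percent
rt_full, rt_up = 3.00, 0.25              # runtime in minutes

epe_gain = epe_full - epe_up             # absolute EPE improvement
epe_rel = 100.0 * epe_gain / epe_full    # relative to the full-resolution baseline
iou_gain = iou_up - iou_full             # percentage points
speedup = rt_full / rt_up                # 12.0

epe_bilinear, epe_attn = 5.31, 3.67      # DELTA ablation, bilinear vs. attention
delta_rel = 100.0 * (epe_bilinear - epe_attn) / epe_bilinear

print(f"EPE gain: {epe_gain:.2f} px ({epe_rel:.1f}% relative)")
print(f"IoU gain: {iou_gain:.1f} points")
print(f"Speedup: {speedup:.0f}x")
print(f"DELTA EPE reduction vs. bilinear: {delta_rel:.1f}%")
```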
7. Applicability, Limitations, and Extensions
The tracking upsampler generalizes to any scenario in which sparse tracks with suitable feature descriptors are available. It is agnostic to the underlying nature of the base tracker, supporting plug-and-play integration with trackers such as CoTracker, DELTA, or TAPIR, provided the interface contract (positions, features) holds (Lu et al., 9 Dec 2025). The interpolation weights can be recomputed per sequence window, supporting both stationary and temporally dynamic scenes. Locality—restricting attention to spatial neighborhoods—keeps memory and latency scaling linear in the number of pixels.
A practical consideration is kernel support: the neighborhood should cover potential motion magnitudes at the chosen low resolution, and features must be expressive enough to guide accurate interpolation. The Alibi spatial bias stabilizes generalization, but can be omitted for ablation.
No limitations specific to the upsampler are emphasized in the referenced works, although extreme nonlocal dynamics or feature miscalibration would plausibly degrade interpolation. The reported experiments show no accuracy degradation when a third-party tracker is paired with the upsampler in world-centric 3D pipelines, suggesting robustness and architectural independence.
The tracking upsampler is thus established as a robust, efficient bridge from sparse or coarse grid 2D tracks to per-pixel dense tracks, critical in scalable long-term 3D monocular tracking pipelines (Lu et al., 9 Dec 2025, Ngo et al., 31 Oct 2024). It balances computational efficiency, accuracy, and modularity, and is empirically validated to confer substantial speedup and quality improvements over naive or heuristic upsampling approaches.