Tracking Upsampler: Dense 2D Tracks

Updated 11 December 2025

Tracking Upsampler is a neural module that transforms sparse 2D pixel tracks into full-resolution dense tracks via locally supported, weighted interpolation.
It leverages CNN backbones and transformer-based cross-attention to aggregate spatially neighboring features for accurate, high-density 3D tracking.
Reported results on the CVO test set show up to 12× runtime reduction together with lower EPE (1.24 vs. 1.45) and higher IoU (80.9% vs. 76.8%) than full-resolution tracking.

A tracking upsampler is a neural module that converts sparse or coarse two-dimensional (2D) pixel tracks, computed by computationally intensive tracking algorithms, into full-resolution, dense 2D tracks for every pixel in an image sequence. It addresses a central bottleneck in long-term, dense 3D vision pipelines: exhaustively tracking all pixels with heavyweight models is prohibitively expensive. It enables world-centric systems such as TrackingWorld (Lu et al., 9 Dec 2025) and DELTA (Ngo et al., 31 Oct 2024) to achieve high-density, accurate long-range 3D tracking by lifting sparse correspondences to dense tracks via lightweight, locally supported interpolation guided by appearance and geometric cues.

1. Motivation and Problem Definition

State-of-the-art 2D “any-point” trackers (e.g., CoTracker, TAPIR, DELTA) typically generate either (i) a limited set of long-lived but sparse tracks or (ii) grid-aligned dense tracks restricted to low resolution. Neither suffices for reconstructing the full pixelwise 3D trajectories required for dense, world-centric tracking, especially when dynamic objects emerge mid-sequence or when per-pixel detail is crucial for downstream tasks. Directly running a high-capacity tracker at per-pixel granularity requires $O(HW)$ expensive track queries, which is prohibitive at full resolution.

The upsampling task is thus formalized as follows: given sparse 2D track positions $P_{\mathrm{sparse}} \in \mathbb{R}^{K_s \times 2}$ and associated features $F_{\mathrm{sparse}} \in \mathbb{R}^{K_s \times C}$ for $K_s = (H/s) \cdot (W/s)$ coarse grid points in frame $t$, predict dense 2D track positions $P_{\mathrm{dense}} \in \mathbb{R}^{(HW) \times 2}$ at full image resolution. Each dense pixel position is modeled as a convex combination of its spatially proximate sparse tracks:

$P_{\mathrm{dense}}[j] = \sum_{i=1}^{K_s} W_{ij} P_{\mathrm{sparse}}[i],\qquad W_{ij} \geq 0,\ \sum_{i} W_{ij} = 1,$

where the weights are spatially local, typically involving only the $k \ll K_s$ nearest neighbors.
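As a quick sanity check on the sizes and on the convexity constraints, the following minimal sketch computes $K_s$ and $HW$ for the example resolution used later in the runtime discussion, then interpolates a single dense pixel from three neighbors. The neighbor positions and weights are made-up illustrative values, not taken from either paper.

```python
import numpy as np

# Example resolution and stride (the same values appear in the runtime discussion).
H, W, s = 480, 640, 4
K_s = (H // s) * (W // s)   # number of coarse grid points
N_dense = H * W             # number of full-resolution pixels
print(K_s, N_dense)         # 19200 307200

# Toy example: one dense pixel interpolated from k = 3 sparse neighbors.
# Positions and weights below are hypothetical, chosen only for illustration.
P_neighbors = np.array([[10.0, 20.0],
                        [14.0, 20.0],
                        [10.0, 24.0]])
w = np.array([0.5, 0.3, 0.2])

# Convexity constraints: non-negative weights that sum to one.
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)

p_dense = w @ P_neighbors   # approximately [11.2, 20.8]

# A convex combination never leaves the bounding box of its neighbors.
assert np.all(p_dense >= P_neighbors.min(axis=0))
assert np.all(p_dense <= P_neighbors.max(axis=0))
```

Because the weights are convex, the predicted dense position always stays inside the convex hull of the selected sparse tracks, which is what keeps the interpolation well behaved.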

2. Algorithmic Formulation and Pseudocode

The canonical tracking upsampler workflow consists of the following steps (framewise, with batch indices omitted):

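import numpy as np

def upsample_2d(P_sparse, F_sparse, queries, query_feature_extractor, weight_network, k=9):
    # query_feature_extractor and weight_network are the learned components, passed in as callables
    F_query = query_feature_extractor(queries)                  # [H*W, C_q]
    # Brute-force k-NN for clarity; use a grid lookup or spatial index at full resolution
    d_all = np.linalg.norm(queries[:, None, :] - P_sparse[None, :, :], axis=-1)  # [H*W, K_s]
    idx = np.argsort(d_all, axis=1)[:, :k]                      # k nearest sparse points per pixel
    dists = np.take_along_axis(d_all, idx, axis=1)              # [H*W, k]
    F_keys = F_sparse[idx]                                      # [H*W, k, C]
    P_keys = P_sparse[idx]                                      # [H*W, k, 2]
    W_logits = weight_network(F_query, F_keys, dists)           # [H*W, k]
    W = np.exp(W_logits - W_logits.max(axis=1, keepdims=True))  # stable softmax over neighbors
    W = W / W.sum(axis=1, keepdims=True)                        # each row sums to 1
    return np.sum(W[..., None] * P_keys, axis=1)                # [H*W, 2] weighted sum per pixel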

Mathematically, for each dense pixel $Q_j$ and sparse neighbor $S_i$ in its $k$-nearest set $\mathcal{N}(j)$, the interpolation weight is: $W_{ij} = \begin{cases} \dfrac{\exp\bigl(g(F_q(Q_j), F_s(S_i), d(Q_j, S_i))\bigr)}{\sum_{i'\in \mathcal{N}(j)}\exp\bigl(g(F_q(Q_j), F_s(S_{i'}), d(Q_j, S_{i'}))\bigr)}, & i\in \mathcal{N}(j), \\ 0, & \text{otherwise,} \end{cases}$ where $g(\cdot)$ is a small multilayer perceptron and $d(Q_j,S_i)$ is the Euclidean distance.
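To make the role of $g$ concrete, here is a minimal sketch of a weight network followed by the softmax over each pixel's neighbor set. The two-layer ReLU architecture, the hidden width, the feature dimensions, and the random (untrained) parameters are all assumptions for illustration; the source only states that $g$ is a small multilayer perceptron.

```python
import numpy as np

def make_weight_network(c_q, c_s, hidden=16, seed=0):
    """Return g(F_query, F_keys, dists): a tiny 2-layer MLP with random weights.

    Hypothetical stand-in for the learned weight network; nothing here is trained.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(c_q + c_s + 1, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=(hidden, 1))

    def g(F_query, F_keys, dists):
        # F_query: [N, C_q], F_keys: [N, k, C_s], dists: [N, k]
        k = F_keys.shape[1]
        Fq = np.repeat(F_query[:, None, :], k, axis=1)               # [N, k, C_q]
        x = np.concatenate([Fq, F_keys, dists[..., None]], axis=-1)  # [N, k, C_q+C_s+1]
        h = np.maximum(x @ W1 + b1, 0.0)                             # ReLU hidden layer
        return (h @ W2)[..., 0]                                      # [N, k] logits
    return g

def neighbor_softmax(logits):
    """Softmax over the neighbor axis; weights outside N(j) are implicitly zero."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Usage on random inputs: 5 dense pixels, k = 9 neighbors, 8-dim features.
rng = np.random.default_rng(1)
N, k = 5, 9
g = make_weight_network(c_q=8, c_s=8)
logits = g(rng.normal(size=(N, 8)),
           rng.normal(size=(N, k, 8)),
           rng.uniform(0.0, 4.0, size=(N, k)))
weights = neighbor_softmax(logits)   # [N, k], rows are convex weights
```

Restricting the softmax to the $k$ gathered neighbors is what realizes the "otherwise 0" branch of the piecewise definition without ever materializing the full $K_s \times HW$ weight matrix.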

3. Architecture in DELTA and TrackingWorld

In DELTA (Ngo et al., 31 Oct 2024), the upsampler is a transformer-based local cross-attention module. The pipeline operates as follows:

A CNN backbone extracts coarse features at $H/r\times W/r$ (typically $r=4$).
Global-local attention performs tracking at this low resolution, producing coarse 3D tracks $P(\mathrm{coarse}, u', v')$ and features $F(\mathrm{coarse}, u', v')$.
The upsampler lifts these to fine resolution ($H\times W$) by having each fine pixel $(u, v)$ attend to its $\kappa \times \kappa$ coarse grid neighbors (with $\kappa=3$ by default). Attention logits are

$S((u,v),(u_r,v_r)) = q(u,v) \cdot k(u_r, v_r) + m \cdot \| (u', v') - (u_r, v_r) \|_1,$

where the ALiBi-style bias, with slope $m$, encodes spatial proximity.

The upsampled position is

$P(\mathrm{fine};u,v) = \sum_{(u_r, v_r)} w_{u,v,(u_r,v_r)} \, P(\mathrm{coarse};u_r,v_r),$

with $w$ derived from the attention weights after softmax and local MLP refinement.

The query/key features are created by upsampling the coarse CNN features and concatenating the appropriate track features.
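The local cross-attention step can be sketched in NumPy as follows. This is an illustrative approximation under several assumptions that the text does not specify: fine pixel centers are mapped into coarse-grid coordinates by dividing by $r$, the slope $m$ is negative so that nearer coarse cells receive larger logits, neighbors outside the image are clamped to the border, queries and keys are supplied directly, and the local MLP refinement of the weights is omitted.

```python
import numpy as np

def local_attention_upsample(P_coarse, K_coarse, Q_fine, r=4, kappa=3, m=-1.0):
    """Upsample coarse tracks with kappa x kappa local attention (illustrative sketch).

    P_coarse: [Hc, Wc, D] coarse track positions (the values being interpolated)
    K_coarse: [Hc, Wc, C] key features on the coarse grid
    Q_fine:   [H, W, C]   query features, with H = r * Hc and W = r * Wc
    Returns   [H, W, D]   fine-resolution track positions.
    """
    Hc, Wc, _ = P_coarse.shape
    H, W, _ = Q_fine.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")  # v: row, u: column
    # Assumption: fine pixel centers expressed in coarse-grid coordinates.
    uc = (u + 0.5) / r - 0.5
    vc = (v + 0.5) / r - 0.5
    cu, cv = u // r, v // r                  # coarse cell containing each fine pixel
    half = kappa // 2
    logits, values = [], []
    for dv in range(-half, half + 1):
        for du in range(-half, half + 1):
            nu = np.clip(cu + du, 0, Wc - 1)  # clamp neighbors at the image border
            nv = np.clip(cv + dv, 0, Hc - 1)
            dot = np.sum(Q_fine * K_coarse[nv, nu], axis=-1)  # q . k
            l1 = np.abs(uc - nu) + np.abs(vc - nv)            # L1 distance on the grid
            logits.append(dot + m * l1)                       # bias scaled by slope m
            values.append(P_coarse[nv, nu])
    logits = np.stack(logits, axis=-1)        # [H, W, kappa^2]
    values = np.stack(values, axis=-2)        # [H, W, kappa^2, D]
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # softmax over the local window
    return np.sum(w[..., None] * values, axis=-2)

# Usage on random inputs: a 3 x 4 coarse grid upsampled by r = 4.
rng = np.random.default_rng(0)
P_c = rng.normal(size=(3, 4, 2))
K_c = rng.normal(size=(3, 4, 8))
Q_f = rng.normal(size=(12, 16, 8))
P_f = local_attention_upsample(P_c, K_c, Q_f)   # [12, 16, 2]
```

With a strongly negative slope and uninformative features, the sketch degenerates to nearest-neighbor upsampling, which is a convenient way to check the indexing.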

In TrackingWorld (Lu et al., 9 Dec 2025), this architecture is reused generically: any sparse set of 2D tracks and features can serve as input, provided the features encode sufficient neighborhood appearance and geometry (e.g., DELTA’s correlation maps). TrackingWorld introduces no new loss for the upsampler module, which inherits its parameters from DELTA.

4. Computational Properties and Practical Trade-Offs

The complexity of upsampling with $N_s$ sparse and $N_d$ dense points, $k$ neighbors, and feature/channel dimension $C$ breaks down as follows:

k-NN search: $O(N_d \log N_s)$ (or $O(N_d k)$ with approximations),
Feature gathering: $O(N_d k C)$ ,
MLP forward: $O(N_d k C_{\mathrm{mlp}})$ ,
Weighted sum: $O(N_d k)$ .
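Plugging concrete sizes into these terms gives a rough sense of which step dominates. The sketch below uses the example resolution from the runtime discussion; the channel widths `C` and `C_mlp` are hypothetical values chosen for illustration, and the results are order-of-magnitude operation counts that ignore constant factors, not measured costs.

```python
import math

H, W, s, k = 480, 640, 4, 9
C, C_mlp = 128, 64            # hypothetical channel widths, not given in the text

N_s = (H // s) * (W // s)     # sparse (coarse grid) points
N_d = H * W                   # dense (full-resolution) points

# Asymptotic terms evaluated with all constants set to 1.
costs = {
    "knn_search":     N_d * math.log2(N_s),
    "feature_gather": N_d * k * C,
    "mlp_forward":    N_d * k * C_mlp,
    "weighted_sum":   N_d * k,
}
for name, ops in costs.items():
    print(f"{name:>14}: ~{ops:.2e} elementary operations")

dominant = max(costs, key=costs.get)
print("dominant term:", dominant)
```

Under these assumed widths, the feature-dependent terms outweigh the neighbor search and the final weighted sum by roughly two orders of magnitude, and every term grows linearly in $N_d$.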

Empirically, for $H=480$, $W=640$, $s=4$ (i.e., $N_s \approx 19$K, $N_d \approx 307$K, $k=9$), the upsampler processes a frame in roughly 15–20 ms on an NVIDIA RTX 4090. It accounts for less than $2\%$ of the runtime of the end-to-end TrackingWorld pipeline for 30 frames, compared to several seconds per frame for a full-resolution tracker (Lu et al., 9 Dec 2025). In DELTA, the cross-attention upsampler incurs only minor overhead relative to the coarse tracking stage (Ngo et al., 31 Oct 2024).

5. Supervision and Losses

In DELTA, the total training loss is: $\mathcal{L}_{\text{total}} = \lambda_{2D}\mathcal{L}_{2D} + \lambda_{\text{depth}}\mathcal{L}_{\text{depth}} + \lambda_{\text{visib}}\mathcal{L}_{\text{visib}},$ where $\mathcal{L}_{2D}$ is an $L_1$ loss on 2D displacements (applied at both coarse and fine resolutions), $\mathcal{L}_{\text{depth}}$ is an $L_1$ loss on predicted depth changes (using $\Delta \log d$ for invariance), and $\mathcal{L}_{\text{visib}}$ is a binary cross-entropy on visibility. No bespoke loss is added for the upsampler; it is supervised simply by applying these losses to the upsampled outputs. In TrackingWorld, all learning occurs in DELTA; the upsampler itself is not further trained unless one adds an explicit reconstruction objective.
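The structure of this objective can be sketched as below. The function name, the default weights of 1.0, the mean reduction, and the use of logits for visibility are assumptions made for illustration; the source does not give the $\lambda$ values or the reduction. The sketch evaluates the 2D term at a single resolution, whereas the text applies it at both coarse and fine resolutions.

```python
import numpy as np

def delta_style_loss(pred_xy, gt_xy, pred_dlogd, gt_dlogd, vis_logit, gt_vis,
                     lam_2d=1.0, lam_depth=1.0, lam_visib=1.0):
    """Weighted sum of L1 (2D), L1 (delta log-depth), and BCE (visibility) terms.

    The lambda defaults are placeholders, not values reported in the papers.
    """
    l_2d = np.abs(pred_xy - gt_xy).mean()            # L1 on 2D displacements
    l_depth = np.abs(pred_dlogd - gt_dlogd).mean()   # L1 on changes in log depth
    # Numerically stable binary cross-entropy on logits:
    #   max(x, 0) - x * y + log(1 + exp(-|x|))
    l_visib = (np.maximum(vis_logit, 0.0) - vis_logit * gt_vis
               + np.log1p(np.exp(-np.abs(vis_logit)))).mean()
    return lam_2d * l_2d + lam_depth * l_depth + lam_visib * l_visib

# Usage: four points with perfect geometry and maximally uncertain visibility.
xy = np.zeros((4, 2))
dlogd = np.zeros(4)
vis = np.array([1.0, 0.0, 1.0, 1.0])
loss = delta_style_loss(xy, xy, dlogd, dlogd, np.zeros(4), vis)
# With zero logits, only the visibility term contributes, and it equals log(2).
```

Since the upsampler's output feeds directly into the fine-resolution 2D term, gradients reach its parameters without any dedicated loss.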

6. Empirical Performance and Ablations

On the CVO test set (500 sequences, 7 frames), integrating the upsampler yields strong accuracy/runtime trade-offs:

Method	EPE↓	IoU↑	Runtime (min)
CoTrackerV3 (full-res)	1.45	76.8%	3.00
CoTrackerV3 + Upsampler	1.24	80.9%	0.25

The upsampler improves EPE by $\sim$0.21 px (15%) and visible-region IoU by 4.1 points, while reducing runtime by $12\times$ thanks to low-resolution tracking plus efficient upsampling (Lu et al., 9 Dec 2025). Ablations on neighbor count $k$ show optimal accuracy at $k=9$, with diminishing returns or degraded accuracy for $k<9$ or $k>16$. In TrackingWorld's pipeline, substituting CoTrackerV3 + Upsampler yields 3D depth and pose metrics on par with DELTA’s original module, confirming modularity.

In DELTA, ablations (Table 4c) show that the transformer-based upsampler with ALiBi spatial bias achieves $\sim$30% lower EPE (3.67 vs. 5.31) than a bilinear baseline, and also improves occlusion accuracy.
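The relative figures quoted in this section follow directly from the reported numbers, as the short calculation below shows. It only re-derives percentages and ratios from values already stated in the text and table; the variable names are illustrative.

```python
# CVO comparison (values from the table in this section).
epe_full, epe_up = 1.45, 1.24    # EPE in pixels, lower is better
iou_full, iou_up = 76.8, 80.9    # IoU in percent, higher is better
rt_full, rt_up = 3.00, 0.25      # runtime in minutes

epe_gain = epe_full - epe_up                 # about 0.21 px
epe_rel = 100.0 * epe_gain / epe_full        # about 14.5%, quoted as 15%
iou_gain = iou_up - iou_full                 # about 4.1 points
speedup = rt_full / rt_up                    # 12x

# DELTA ablation: learned upsampler vs. bilinear baseline.
epe_attn, epe_bilinear = 3.67, 5.31
ablation_rel = 100.0 * (epe_bilinear - epe_attn) / epe_bilinear  # about 30.9%

print(f"EPE gain: {epe_gain:.2f} px ({epe_rel:.1f}%)")
print(f"IoU gain: {iou_gain:.1f} points, speedup: {speedup:.0f}x")
print(f"Ablation EPE reduction: {ablation_rel:.1f}%")
```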

7. Applicability, Limitations, and Extensions

The tracking upsampler generalizes to any scenario in which sparse tracks with suitable feature descriptors are available. It is agnostic to the base tracker, supporting plug-and-play integration with trackers such as CoTracker, DELTA, or TAPIR, provided the interface contract (positions, features) holds (Lu et al., 9 Dec 2025). The interpolation weights can be recomputed per sequence window, supporting both stationary and temporally dynamic scenes. Locality, meaning that attention is restricted to spatial neighborhoods, keeps memory and latency linear in the number of pixels.

A practical consideration is kernel support: $\kappa$ should cover potential motion magnitudes at the chosen low resolution, and features must be expressive enough to guide accurate interpolation. The ALiBi spatial bias stabilizes generalization, but can be omitted for ablation.

No limitations specific to the upsampler have been emphasized in the referenced works, but a plausible implication is that extreme nonlocal dynamics or feature miscalibration would degrade interpolation. The reported results show no accuracy degradation when a third-party base tracker (e.g., CoTrackerV3) is paired with the upsampler in a world-centric 3D pipeline, suggesting robustness and architectural independence.

The tracking upsampler is thus established as a robust, efficient bridge from sparse or coarse grid 2D tracks to per-pixel dense tracks, critical in scalable long-term 3D monocular tracking pipelines (Lu et al., 9 Dec 2025, Ngo et al., 31 Oct 2024). It balances computational efficiency, accuracy, and modularity, and is empirically validated to confer substantial speedup and quality improvements over naive or heuristic upsampling approaches.