System 2 Stage Tracking Framework

Updated 10 December 2025

The framework implements a sequential dual-stage process where Stage 1 uses bipartite matching for high-confidence tracklets and Stage 2 fuses them using contextual reasoning.
It leverages both appearance and motion cues along with graph-based global inference to reduce fragmentation and prevent identity switches.
Empirical benchmarks in MOT applications demonstrate improved tracking accuracy and robustness against occlusion, ambiguous matches, and distractors.

System 2 Stage Tracking is a tracking framework characterized by a sequential composition of two association or inference stages designed to leverage both high-purity local data association and global or contextual reasoning, thereby increasing robustness against fragmentation, occlusions, and identity switches. Across domains including multi-object tracking (MOT), single-object tracking, point tracking, and control, System 2 architectures typically implement an initial high-fidelity data association that generates fragmented but maximally trustworthy entities (tracklets, regions, or points), followed by a contextual or hierarchical association stage that resolves fragmentation across time and deals with harder association problems such as ambiguous matches, occlusions, and outlier distractors. Contemporary instantiations span message-passing GNN frameworks (Guo et al., 2024), min-cost flow graph correction (Li et al., 2023), dual-stage detection-tracking (Xu et al., 20 Jul 2025), and two-pass matching strategies (Shelukhan et al., 25 Nov 2025).

1. Canonical Architecture and Algorithmic Flow

System 2 Stage Tracking organizes the tracking process into two sequential association steps, each with distinct function and mathematical formalization.

Stage 1: High-Purity Local Association

Operates at the level of detection-to-tracklet or detection-to-region assignment in each frame.
Bipartite matching is commonly formulated via cost-minimization:

$\min_{X \in \{0,1\}^{|\mathcal T| \times |\mathcal D|}} \sum_{j,i} C_{j,i} X_{j,i}$

subject to each tracklet and each detection appearing in at most one match, i.e. $\sum_i X_{j,i} \le 1$ and $\sum_j X_{j,i} \le 1$ (Guo et al., 2024), typically solved by the Hungarian algorithm.

Matching cost is often constructed as a convex combination of appearance (ReID embedding distance) and motion cues (IoU of bounding boxes, Mahalanobis or BBD distance) (Guo et al., 2024, Shelukhan et al., 25 Nov 2025):

$C_{j,i} = \lambda_{\text{IoU}} \bigl(1 - \text{IoU}(\hat b_j, b_i)\bigr) + \lambda_{\text{app}} \| \mathbf{a}(T_j) - \mathbf{a}(d_i) \|_2$

with a cost threshold $\tau$ that keeps the resulting fragments high-purity (for example, $\tau = 0.2$ gives $\sim 93\%$ purity in RTAT (Guo et al., 2024)).

Unmatched detections start new tracklets; unmatched tracklets are closed and finalized.
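The Stage 1 steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RTAT implementation: the dictionary layout (`box`, `emb`), the default weights, and the function names are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm so that the block needs only the standard library. The search is exact but only practical for a handful of objects; a real tracker would typically call a solver such as `scipy.optimize.linear_sum_assignment`. Gating is applied after the assignment is solved.

```python
from itertools import permutations
from math import dist


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cost_matrix(tracks, dets, lam_iou=0.5, lam_app=0.5):
    """C[j][i] = lam_iou * (1 - IoU) + lam_app * L2 distance between embeddings."""
    return [
        [lam_iou * (1.0 - iou(t["box"], d["box"])) + lam_app * dist(t["emb"], d["emb"])
         for d in dets]
        for t in tracks
    ]


def associate(C, tau):
    """Min-cost one-to-one assignment, then gating of every pair at tau.

    C must be a non-empty rectangular matrix (rows: tracklets, columns: detections).
    Exhaustive search is an assumption of this sketch, standing in for Hungarian.
    """
    n_t, n_d = len(C), len(C[0])
    best_pairs, best_cost = [], float("inf")
    for perm in permutations(range(max(n_t, n_d)), min(n_t, n_d)):
        if n_t <= n_d:
            pairs = [(j, i) for j, i in enumerate(perm)]  # tracklet j -> detection perm[j]
        else:
            pairs = [(j, i) for i, j in enumerate(perm)]  # detection i -> tracklet perm[i]
        total = sum(C[j][i] for j, i in pairs)
        if total < best_cost:
            best_pairs, best_cost = pairs, total
    # Keep only matches whose cost passes the purity threshold.
    matches = sorted((j, i) for j, i in best_pairs if C[j][i] <= tau)
    matched_t = {j for j, _ in matches}
    matched_d = {i for _, i in matches}
    unmatched_tracks = [j for j in range(n_t) if j not in matched_t]  # to be closed
    unmatched_dets = [i for i in range(n_d) if i not in matched_d]    # start new tracklets
    return matches, unmatched_tracks, unmatched_dets


# Toy frame: two predicted tracklet boxes, three detections.
tracks = [
    {"box": (0, 0, 10, 10), "emb": (1.0, 0.0)},
    {"box": (20, 20, 30, 30), "emb": (0.0, 1.0)},
]
dets = [
    {"box": (21, 20, 31, 30), "emb": (0.0, 1.0)},
    {"box": (0, 1, 10, 11), "emb": (1.0, 0.0)},
    {"box": (50, 50, 60, 60), "emb": (0.6, 0.8)},
]
matches, unmatched_tracks, unmatched_dets = associate(cost_matrix(tracks, dets), tau=0.2)
print(matches, unmatched_tracks, unmatched_dets)
```

In the toy frame, the third detection overlaps no tracklet, so it fails the gate and would seed a new tracklet.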

Stage 2: Hierarchical Contextual Reasoning and Tracklet Fusion

Operates at the level of merging short tracklets (or points/regions) into full trajectories or temporally coherent tracks.
Techniques include hierarchical message-passing Graph Neural Networks (GNNs) (Guo et al., 2024), min-cost flow graph correction (Li et al., 2023), and cost-matrix matching using geometric and appearance features (Dao et al., 2021).
Graph-based frameworks define nodes as tracklets and edges as candidate links, with edge features combining spatial, temporal, scale, and multi-level appearance cues.
The GNN performs $L$ message-passing rounds; edge classification yields link probabilities for merging into longer tracks:

$p_{ij} = \sigma(\psi(e_{ij}^{(L)})), \quad \hat y_{ij} = \mathbf{1}\{p_{ij}>0.5\}$

Hierarchical coarsening (repeating on newly merged components) reduces class imbalance and computational complexity (Guo et al., 2024).
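The edge-classification and merging step can be illustrated as follows. This is a sketch under assumptions: the logits are hand-written stand-ins for the classifier output $\psi(e_{ij}^{(L)})$, the function names are hypothetical, and merging is done by taking connected components over active edges with a union-find structure. A real system may add constraints (for example, at most one successor per tracklet) that are omitted here.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a logit to a link probability."""
    return 1.0 / (1.0 + exp(-z))


def merge_tracklets(num_tracklets, link_logits, threshold=0.5):
    """Group tracklets connected by active edges, i.e. edges with p_ij > threshold."""
    parent = list(range(num_tracklets))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), logit in link_logits.items():
        if sigmoid(logit) > threshold:
            parent[find(j)] = find(i)  # merge the two components

    groups = {}
    for k in range(num_tracklets):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())


# Five tracklets with four candidate links; only positive logits exceed p = 0.5.
link_logits = {(0, 1): 2.2, (1, 2): 0.8, (3, 4): -1.4, (0, 3): -0.4}
tracks = merge_tracklets(5, link_logits)
print(tracks)
```

Each resulting group can be treated as a single node at the next hierarchy level, which is one way to read the coarsening step described above.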

Algorithmic Summary Example (RTAT) (Guo et al., 2024):

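# Pseudocode; helper names are illustrative
for t in range(1, T + 1):                     # frames 1..T
    # Stage 1: local association via the Hungarian algorithm with a cost threshold
    for (j, i) in hungarian_matches(C_t):
        if C_t[j][i] <= tau:
            append d_i to T_j                 # accept only low-cost matches
    # rejected or unmatched d_i: start a new tracklet
    # rejected or unmatched T_j: close and finalize the tracklet
for level in range(1, H + 1):                 # hierarchy levels 1..H
    # Stage 2: hierarchical tracklet fusion
    build sparse graph G_level over the current tracklets
    run L rounds of message passing
    classify edges (p_ij > 0.5) and merge along active links
return complete trajectories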

2. Representative Methodologies

System 2 Stage Tracking is widely instantiated in diverse forms:

Architecture	Stage 1 Purpose	Stage 2 Purpose
RTAT (Guo et al., 2024)	High-purity detection-tracklet assignment (Hungarian, cost-threshold)	Global tracklet-merging via hierarchical message passing GNN
StableTrack (Shelukhan et al., 25 Nov 2025)	Bbox-based distance (BBD) appearance matching	IoU-gated fallback association; integrates visual tracker into KF
TSMCF (Li et al., 2023)	Min-cost flow on high-confidence detections	Correction in occluded regions using intersection mask and low-confidence nodes
BleedOrigin-Net (Xu et al., 20 Jul 2025)	Event onset and spatial source detection	Temporal fine-grained point tracking using transformer and pseudo-labels
Two-Stage Data Assoc. (Dao et al., 2021)	High-confidence local matching (LAP)	Low-confidence, fragmented tracklet recovery and global reassociation
Cascaded Regression (Wang et al., 2020)	Dense CNN regression for easy cases	Discrete ridge regression for hard distractors and ambiguous samples

Each instantiation is tailored to the statistical structure of its domain, for example, hierarchical graphs in associating MOT tracklets (Guo et al., 2024), spatial/temporal intersection masks in occlusion recovery (Li et al., 2023), temporal transformers in surgical point tracking (Xu et al., 20 Jul 2025), and ridge regression for visual distractor rejection (Wang et al., 2020).

3. Mathematical Formalizations and Cost Functions

The mathematical structure of System 2 is driven by cost-based optimization and hierarchical reasoning:

Assignment Problems:

Hungarian algorithm for bipartite matching; cost matrix constructed from motion and appearance metrics (Guo et al., 2024, Dao et al., 2021).
Mahalanobis or BBD metric for gating:

$D_{\text{BBD}}(d_i, T_j) = \sqrt{(z_i - H \hat x_j)^T P^{-1} (z_i - H\hat x_j)}$

with covariance $P$ scaled by object size and time gap (Shelukhan et al., 25 Nov 2025).
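A small numeric sketch of this gating distance for a two-dimensional measurement (box centre) is given below. The scaling rule in `scaled_covariance` is a hypothetical choice made only for illustration (standard deviation proportional to box size, variance growing linearly with the time gap); the source does not specify the exact form used in StableTrack.

```python
from math import sqrt


def scaled_covariance(width, height, gap, base=0.1):
    """Hypothetical diagonal covariance that grows with box size and time gap."""
    return [[(base * width) ** 2 * gap, 0.0],
            [0.0, (base * height) ** 2 * gap]]


def mahalanobis_2d(z, z_pred, P):
    """sqrt((z - z_pred)^T P^{-1} (z - z_pred)) for a 2x2 covariance P."""
    dx, dy = z[0] - z_pred[0], z[1] - z_pred[1]
    (a, b), (c, d) = P
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Closed-form inverse of a 2x2 matrix: P^{-1} = [[d, -b], [-c, a]] / det.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return sqrt(q)


z_pred = (100.0, 50.0)   # predicted centre, playing the role of H x_hat
z = (104.0, 58.0)        # detected centre
gate = 1.0               # illustrative gating threshold

d_short = mahalanobis_2d(z, z_pred, scaled_covariance(40, 80, gap=1))
d_long = mahalanobis_2d(z, z_pred, scaled_covariance(40, 80, gap=4))
print(d_short <= gate, d_long <= gate)
```

The same displacement is rejected after a one-frame gap but accepted after a four-frame gap, which is the intended effect of widening the gate when detections are sparse in time.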

Hierarchical Graph Reasoning:

Node and edge encodings aggregate temporal, appearance, scale, and spatial cues (Guo et al., 2024).
Recursive message-passing updates:

$e_{ij}^{(l)} = \phi_e(h_i^{(l-1)} \Vert h_j^{(l-1)} \Vert e_{ij}^{(l-1)})$

$h_i^{(l)} = \phi_h(h_i^{(l-1)} \Vert \operatorname{AGG}\{ e_{ki}^{(l)} : k \in \mathcal{N}(i)\})$
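The two update rules can be traced with a structural sketch of one message-passing round. The learned networks $\phi_e$ and $\phi_h$ are replaced here by a hypothetical stand-in (`mean_layer`), features are plain Python lists so that `+` plays the role of concatenation, and summation is assumed for the aggregation operator. Only the data flow is meant to match the equations; nothing here reflects trained weights.

```python
def sum_agg(vectors, dim):
    """Element-wise sum of edge feature vectors; zeros if there are none."""
    return [sum(v[d] for v in vectors) for d in range(dim)]


def message_passing_round(h, e, phi_e, phi_h, agg=sum_agg):
    """One round: update every directed edge, then every node from its incoming edges.

    h maps node id -> feature list; e maps (i, j) -> feature list.
    Assumes phi_e returns vectors with the same length as the input edge features.
    """
    edge_dim = len(next(iter(e.values()))) if e else 0
    # Edge update: concatenate both endpoint states with the previous edge state.
    new_e = {(i, j): phi_e(h[i] + h[j] + e_ij) for (i, j), e_ij in e.items()}
    # Node update: aggregate the freshly updated features of incoming edges (k, i).
    new_h = {}
    for i, h_i in h.items():
        incoming = [feat for (k, j), feat in new_e.items() if j == i]
        new_h[i] = phi_h(h_i + agg(incoming, edge_dim))
    return new_h, new_e


def mean_layer(x):
    """Hypothetical stand-in for a learned MLP: maps any input to its mean."""
    return [sum(x) / len(x)]


# Chain of three tracklet nodes; undirected links are stored as two directed edges.
h0 = {0: [3.0], 1: [0.0], 2: [-3.0]}
e0 = {(0, 1): [0.0], (1, 0): [0.0], (1, 2): [0.0], (2, 1): [0.0]}
h1, e1 = message_passing_round(h0, e0, mean_layer, mean_layer)
print(h1, e1)
```

Running $L$ rounds amounts to calling `message_passing_round` $L$ times, feeding each output back in as the next input.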

Probabilistic Edge Classification (GNN):

Classification with sigmoid and focal loss:

$\mathcal{L}_{\text{edge}} = -\sum_{(i,j)\in E} \bigl[ y_{ij}(1-p_{ij})^\gamma \log p_{ij} + (1-y_{ij})\,p_{ij}^\gamma \log(1-p_{ij}) \bigr]$
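A direct transcription of the focal loss, with both terms inside the negated sum so that the loss is non-negative, might look like the sketch below. The function name, the clamping constant `eps`, and the example values are assumptions added for illustration.

```python
from math import log


def focal_edge_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Sum over edges of the focal loss; gamma = 0 recovers binary cross-entropy."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * (1.0 - p) ** gamma * log(p) + (1 - y) * p ** gamma * log(1.0 - p)
    return total


# Two confidently correct edges and one poorly classified negative edge.
probs = [0.9, 0.1, 0.6]
labels = [1, 0, 0]
loss_bce = focal_edge_loss(probs, labels, gamma=0.0)
loss_focal = focal_edge_loss(probs, labels, gamma=2.0)
print(loss_bce, loss_focal)
```

With $\gamma = 2$ the two easy edges contribute almost nothing, so the loss is dominated by the hard edge. This down-weighting is why focal loss suits the heavy class imbalance among candidate links.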

Flow Optimization:

Min-cost flow formulations for global optimality, especially under occlusion (Li et al., 2023):

$f^* = \arg\min_f \sum_{(i,j)\in E} C(i,j)f_{ij}$

subject to flow conservation at every intermediate node $j$: $\sum_i f_{ij} = \sum_k f_{jk}$.
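To make the flow formulation concrete, the sketch below solves a tiny instance with the successive-shortest-path method, using Bellman-Ford on the residual graph. It is a generic textbook solver, not the TSMCF procedure: the graph layout (one node per detection, unit capacities, integer link costs) and all names are assumptions, and refinements common in tracking graphs, such as splitting each detection into in/out nodes, are left out. Augmenting along source-to-sink paths preserves conservation at every intermediate node by construction.

```python
def min_cost_flow(num_nodes, edges, source, sink, demand):
    """Send up to `demand` units from source to sink at minimum total cost.

    edges: list of (u, v, capacity, cost). Returns (flow, cost, used_edges).
    """
    graph = [[] for _ in range(num_nodes)]
    arcs = []  # [head, residual capacity, cost]; arc k ^ 1 is the reverse of arc k
    for u, v, cap, cost in edges:
        graph[u].append(len(arcs)); arcs.append([v, cap, cost])
        graph[v].append(len(arcs)); arcs.append([u, 0, -cost])

    flow, total_cost = 0, 0
    while flow < demand:
        # Bellman-Ford shortest path in the residual graph (handles negative costs).
        dist = [float("inf")] * num_nodes
        prev_arc = [-1] * num_nodes
        dist[source] = 0
        for _ in range(num_nodes - 1):
            updated = False
            for u in range(num_nodes):
                if dist[u] == float("inf"):
                    continue
                for k in graph[u]:
                    v, cap, cost = arcs[k]
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v], prev_arc[v], updated = dist[u] + cost, k, True
            if not updated:
                break
        if dist[sink] == float("inf"):
            break  # no augmenting path left
        # Find the bottleneck along the path, then push flow along it.
        push, v = demand - flow, sink
        while v != source:
            k = prev_arc[v]
            push = min(push, arcs[k][1])
            v = arcs[k ^ 1][0]
        v = sink
        while v != source:
            k = prev_arc[v]
            arcs[k][1] -= push
            arcs[k ^ 1][1] += push
            v = arcs[k ^ 1][0]
        flow += push
        total_cost += push * dist[sink]

    # An original edge carries flow equal to the residual capacity of its reverse arc.
    used = sorted((arcs[k ^ 1][0], arcs[k][0])
                  for k in range(0, len(arcs), 2) if arcs[k ^ 1][1] > 0)
    return flow, total_cost, used


# Nodes: 0 = source, 1-2 = detections in frame t, 3-4 = detections in frame t+1, 5 = sink.
S, T = 0, 5
edges = [(S, 1, 1, 0), (S, 2, 1, 0),
         (1, 3, 1, 1), (1, 4, 1, 2), (2, 3, 1, 1), (2, 4, 1, 5),
         (3, T, 1, 0), (4, T, 1, 0)]
flow, cost, used = min_cost_flow(6, edges, S, T, demand=2)
links = [(u, v) for u, v in used if u != S and v != T]
print(flow, cost, links)
```

In this instance a greedy choice would link detection 1 to detection 3 first and then be forced into the expensive link from 2 to 4. The second augmentation reroutes the first path through a reverse arc, so the global solution avoids the expensive link.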

4. Empirical Performance and Benchmarking

System 2 Stage Tracking approaches achieve leading metrics on standard benchmarks:

MOT17 (private dets, RTAT): HOTA 67.2, IDF1 84.7, AssA 69.7; 35% fewer ID switches vs. ByteTrack (Guo et al., 2024).
MOT20 (RTAT): HOTA 66.2, IDF1 82.5, AssA 68.2; +2.3 pp IDF1 over BoT-SORT (Guo et al., 2024).
TSMCF (MOT16/17/20): MOTA 78.4/79.2/76.4, HOTA >60, high LocA (Li et al., 2023); gains of 2 to 3 MOTA points and 442 fewer ID switches over single-pass flow, obtained through intersection-mask correction.
StableTrack (MOT17-val 1 Hz): HOTA 64.9 (+11.6 pts over TrackTrack) (Shelukhan et al., 25 Nov 2025), demonstrating robustness under low-frequency detections.
BleedOrigin-Net (BleedOrigin-Bench): initial detection at 96.85% frame-level accuracy (within $\pm 8$ frames); continuous tracking at 96.11% accuracy at a threshold of 100 px (Xu et al., 20 Jul 2025).

These results derive from the two-stage design: initial fragmentation into highly trustworthy short segments followed by global reasoning to fuse fragments and correct errors caused by occlusion, ambiguous appearances, or detector failures.

5. Robustness to Occlusion, Fragmentation, and Distractors

A central benefit of System 2 tracking is its resilience to the principal failure modes in tracking-by-detection:

Occlusion Handling:

Intersection-mask correction (Li et al., 2023) and tracklet fusion via hierarchical GNN (Guo et al., 2024) can explicitly repair fragmented tracks in occluded intervals.

Identity Switch Reduction:

Two-stage association recovers correct identities after false negatives and missed matches; RTAT reports 35% fewer ID switches than ByteTrack on MOT17 (Guo et al., 2024).

Hard Distractor Rejection:

Cascaded regression (Wang et al., 2020) filters out easy negatives with fast dense regression, then discriminates hard, ambiguous distractors using closed-form ridge classifiers and hard negative mining, improving AUC and EAO on OTB/VOT and LaSOT/TrackingNet.

6. Domain-Specific Adaptations and Generalization

System 2 architecture is highly adaptable across domains:

Multi-object tracking: GNN and min-cost flow for MOT, including crowded and occluded scenes (Guo et al., 2024, Li et al., 2023).
Surgical and medical tracking: Detect-then-track pipelines for dynamic event localization and temporal point tracking (Xu et al., 20 Jul 2025).
Low-frequency detection environments: BBD-based gating and matching for sparsely sampled data (Shelukhan et al., 25 Nov 2025).
Visual object tracking: Dense-to-discrete cascades for hard negative mining and online distractor rejection (Wang et al., 2020).
3D tracking and robotics: Kalman filter + assignment for multi-view fusion, with clear failure modes under long occlusions (Rapado-Rincon et al., 2024, Dao et al., 2021).
Control systems: Cascade observer/controller architecture with Lyapunov-driven gain scheduling (arXiv:2002.01360).

7. Limitations and Further Extensions

Fragmentation Control: Overly conservative Stage 1 thresholds can produce an unmanageable number of short fragments, so the threshold must be balanced against what Stage 2 can reliably merge.
Computational Complexity: Message-passing GNNs, flow optimization, and hierarchical graphs increase per-stage computational requirements, but modular separation allows for efficient parallel evaluation.
Appearance Feature Limitations: Purely geometric cues yield suboptimal results under severe occlusion or among visually similar objects; fusion with learned appearance embeddings mitigates this (Dao et al., 2021, Rapado-Rincon et al., 2024).
End-to-End Training: Some implementations maintain strict modularity; future extensions explore joint training for optimal feature sharing.

Plausible implication: System 2 Stage Tracking frameworks constitute a general paradigm for robust, modular tracking, especially suited to domains with high rates of occlusion, distractor prevalence, or sparse sampling, and offer multiple points of integration for learned and hand-crafted association mechanisms.