GTATrack: Hierarchical Soccer Player Tracking
- GTATrack is a hierarchical multi-object tracking system that integrates Deep-EIoU for frame-level association and GTA-Link for trajectory-level refinement.
- It employs iterative geometric expansion combined with deep appearance matching to overcome occlusion, distortion, and rapid motion in fisheye soccer scenarios.
- Global tracklet clustering and semi-supervised pseudo-labeling enhance identity preservation and reduce false positives, driving HOTA scores up to 0.60.
GTATrack is a hierarchical multi-object tracking (MOT) system that integrates Deep Expansion IoU (Deep-EIoU) for frame-level association with Global Tracklet Association (GTA) for trajectory-level refinement, targeting the challenges of soccer player tracking in fisheye camera scenarios characterized by occlusion, rapid player motion, extreme geometric distortion, and target appearance ambiguity. As the winning solution to the SoccerTrack 2025 Challenge, GTATrack achieved a primary HOTA score of 0.60 and demonstrated substantially improved identity preservation and false positive control compared to previous approaches (Jian et al., 31 Jan 2026).
1. System Architecture and Workflow
GTATrack employs a two-stage tracking stack:
- Stage 1: Online, real-time association leverages Deep-EIoU, which combines iterative geometric matching and deep appearance similarity while omitting explicit motion prediction or filtering. Association is implemented via a Hungarian solver minimizing a composite cost matrix for each frame.
- Stage 2: After initial online tracklet construction, an offline global refinement module (GTA-Link) clusters fragmented short-term tracklets into identity-consistent long trajectories by hierarchical clustering over deep appearance embeddings with spatial and temporal constraints.
The high-level pseudocode is as follows:
    Inputs:  video frames {I₁…I_T}
    Outputs: final trajectories
    # Pre-load detector 𝒟 (YOLOv11x) and ReID model ℛ (OSNet)
    initialize activeTracklets = ∅
    for t = 1…T do
        detections O_t = 𝒟(I_t)
        for each detection o_i in O_t:
            crop c_i from I_t using o_i's bbox
            f_i = ℛ(c_i)    # D-dimensional L2-normalized vector
        C = buildCostMatrix(activeTracklets, O_t)    # Deep-EIoU cost terms
        X* = Hungarian(C)
        update activeTracklets
    # After all frames:
    finalTrajectories = GTA_Link(activeTracklets)
    return finalTrajectories
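As a rough illustration of the association step inside the loop, the sketch below matches existing tracklets to new detections from a precomputed cost matrix. It is not the authors' code: `min_cost_assignment`, `update_tracklets`, the `max_cost` gate, and the toy cost values are all hypothetical. The exhaustive search only stands in for a Hungarian solver; it finds a matching with the same minimum total cost but is practical only for very small matrices.

```python
from itertools import permutations


def min_cost_assignment(cost, max_cost=1.0):
    """Exhaustive minimum-cost matching of rows (tracklets) to columns (detections).

    Stand-in for a Hungarian solver on tiny matrices. Pairs whose individual
    cost exceeds max_cost (a hypothetical gate) are dropped after matching.
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if n_rows else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        candidates = (list(zip(range(n_rows), p))
                      for p in permutations(range(n_cols), n_rows))
    else:
        candidates = (list(zip(p, range(n_cols)))
                      for p in permutations(range(n_rows), n_cols))
    best = min(candidates, key=lambda pairs: sum(cost[r][c] for r, c in pairs))
    return [(r, c) for r, c in best if cost[r][c] <= max_cost]


def update_tracklets(tracklets, detections, cost, next_id, max_cost=1.0):
    """Append matched detections to their tracklets and start new tracklets for the rest.

    Rows of `cost` follow the iteration order of `tracklets`.
    Returns the next unused tracklet id.
    """
    ids = list(tracklets)
    matched_cols = set()
    for r, c in min_cost_assignment(cost, max_cost):
        tracklets[ids[r]].append(detections[c])
        matched_cols.add(c)
    for c, det in enumerate(detections):
        if c not in matched_cols:
            tracklets[next_id] = [det]
            next_id += 1
    return next_id


# Toy frame: two active tracklets, three detections (labels are placeholders).
tracklets = {0: ["a0"], 1: ["b0"]}
detections = ["b1", "a1", "new"]
cost = [[0.9, 0.1, 0.80],
        [0.2, 0.7, 0.95]]
next_id = update_tracklets(tracklets, detections, cost, next_id=2, max_cost=0.5)
print(tracklets)
```

In this toy frame, the third detection matches nothing cheaply and therefore opens a new tracklet, which is the kind of fragment the offline stage is later meant to merge.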
2. Deep Expansion IoU for Frame-Level Association
In contrast to motion-based predictors such as Kalman filters that are brittle under erratic sports motion and fisheye distortion, Deep-EIoU fuses spatial and appearance cues:
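- Expansion IoU (EIoU): For each candidate association, the query bounding box is iteratively expanded, scaling it by (1+s) at each of a fixed number of steps, and the IoU against the target box is computed at every expansion.
- Deep Appearance Cost: L2-normalized ReID features are compared via cosine distance, $d_{\text{app}}(i, j) = 1 - f_i^\top f_j$.
- Composite Association Cost: The EIoU and appearance terms are combined into a single cost matrix, which the Hungarian solver minimizes for each frame.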
With tuned settings for the expansion scale, the number of expansion stages, and the proximity thresholds, the authors report the best discrimination of true matches under the SoccerTrack protocol (Jian et al., 31 Jan 2026).
This design robustly associates detections even when the initial IoU is low due to lens distortion or abrupt shifts, provided appearance consistency is preserved.
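The following sketch illustrates how the spatial and appearance cues described above might be computed. It assumes boxes in `(x1, y1, x2, y2)` format, expansion about the box center, and a plain weighted sum for the composite cost. The function names, the weight `w`, and the values `s=0.5` and `steps=2` are illustrative assumptions, not the paper's settings.

```python
import math


def expand_box(box, s):
    """Grow a box about its center so width and height are multiplied by (1 + s)."""
    x1, y1, x2, y2 = box
    dw = 0.5 * s * (x2 - x1)
    dh = 0.5 * s * (y2 - y1)
    return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def expansion_ious(query, target, s=0.5, steps=2):
    """IoU of the query against the target after 0, 1, ..., steps expansions."""
    values, box = [], query
    for _ in range(steps + 1):
        values.append(iou(box, target))
        box = expand_box(box, s)
    return values


def cosine_distance(f, g):
    """1 - cosine similarity; equals 1 - f.g when both vectors are L2-normalized."""
    dot = sum(x * y for x, y in zip(f, g))
    norm = math.sqrt(sum(x * x for x in f)) * math.sqrt(sum(y * y for y in g))
    return 1.0 - dot / norm


def association_cost(query, target, f, g, w=0.5, s=0.5, steps=2):
    """Hypothetical composite cost: weighted sum of spatial and appearance terms."""
    spatial = 1.0 - max(expansion_ious(query, target, s, steps))
    return w * spatial + (1.0 - w) * cosine_distance(f, g)


# Two boxes that do not overlap at their original size.
query, target = (0.0, 0.0, 10.0, 10.0), (12.0, 0.0, 22.0, 10.0)
vals = expansion_ious(query, target, s=0.5, steps=2)
print(vals)
```

The first IoU in `vals` is zero because the boxes are disjoint, while later expansions produce a nonzero overlap, which is the property that lets a match survive abrupt displacement when appearance still agrees.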
3. Global Tracklet Association (GTA-Link): Refinement via Clustering
The GTA-Link module addresses three principal tracking errors: intra-tracklet ID switches, fragmented trajectories due to occlusions/re-entries, and false merges.
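- Pairwise Tracklet Distance: For two tracklets $T_a$ and $T_b$ of lengths $n_a$ and $n_b$, the average appearance distance is $D(T_a, T_b) = \frac{1}{n_a n_b} \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} \left(1 - (f_i^{a})^\top f_j^{b}\right)$, where $f_i^{a}$ and $f_j^{b}$ are the L2-normalized ReID features of the two tracklets.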
- Temporal Constraint: Merges are permitted only if the two tracklets are temporally consistent, preventing non-causal associations.
- Clustering Objective: A graph is built over the tracklets with edge weights given by the pairwise tracklet distance. Hierarchical single-linkage or DBSCAN-style clustering (eps=0.5, min_samples=7) assembles tracklets into identity-consistent clusters, subject to the temporal constraints. Cycle-free, one-to-one linkages are enforced.
This global association step is responsible for a ∼3–4 HOTA point increase and halving the number of ID switches relative to strong baselines (Jian et al., 31 Jan 2026, Sun et al., 2024).
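A minimal sketch of the refinement idea follows, under stated assumptions. The temporal constraint is taken to mean that two tracklets may not share any frame, and the clustering is a greedy agglomerative merge on average appearance distance, which is a simplified stand-in for the single-linkage or DBSCAN-style clustering named above. `tracklet_distance`, `merge_tracklets`, and the toy features are hypothetical; only `eps=0.5` is taken from the text.

```python
import math


def cosine_distance(f, g):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(f, g))
    norm = math.sqrt(sum(x * x for x in f)) * math.sqrt(sum(y * y for y in g))
    return 1.0 - dot / norm


def tracklet_distance(feats_a, feats_b):
    """Average cosine distance over all cross-tracklet feature pairs."""
    total = sum(cosine_distance(f, g) for f in feats_a for g in feats_b)
    return total / (len(feats_a) * len(feats_b))


def merge_tracklets(tracklets, eps=0.5):
    """Greedily merge the closest pair of clusters until none is closer than eps.

    Assumption: clusters that share a frame are never merged, since one
    player cannot appear twice in the same frame.
    Returns a list of clusters, each a sorted list of tracklet indices.
    """
    clusters = [{"members": [i], "frames": set(t["frames"]), "feats": list(t["feats"])}
                for i, t in enumerate(tracklets)]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if a["frames"] & b["frames"]:
                    continue  # temporal conflict: skip this pair
                d = tracklet_distance(a["feats"], b["feats"])
                if d < eps and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break
        _, i, j = best
        b = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i]["members"] += b["members"]
        clusters[i]["frames"] |= b["frames"]
        clusters[i]["feats"] += b["feats"]
    return [sorted(c["members"]) for c in clusters]


# Toy data: tracklets 0 and 2 are the same player before and after an occlusion.
toy = [
    {"frames": [1, 2, 3], "feats": [(1.0, 0.0), (0.9, 0.1)]},
    {"frames": [2, 3, 4], "feats": [(0.0, 1.0)]},
    {"frames": [6, 7], "feats": [(1.0, 0.05)]},
]
merged = merge_tracklets(toy, eps=0.5)
print(merged)  # [[0, 2], [1]]
```

Tracklets 0 and 2 are merged because their features are close and their frames are disjoint, while tracklet 1 stays separate on both appearance and temporal grounds.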
4. Semi-Supervised Pseudo-Labeling for Detector Training
Because missed detections and false positives undermine trajectory continuity, GTATrack augments the detector's training with a pseudo-labeling scheme:
- Generation: YOLOv11x is initially trained on the official ground-truth annotations, then run on unlabeled frames. Detections whose confidence exceeds a fixed threshold are retained as pseudo-ground-truth.
- Training Integration: Batch composition is 1:1 real:pseudo, using the standard YOLO losses; losses on pseudo-labeled samples are down-weighted to account for possible label noise.
This strategy improves recall for small and distant players and cuts false positives by approximately 90%: FP drops from 4913 to 494, and HOTA improves from 0.38 to 0.49 (Table 3; Jian et al., 31 Jan 2026).
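The scheme can be sketched as a confidence filter followed by a weighted batch loss. The helper names, the threshold `0.7`, the weight `0.5`, and the per-sample loss values below are placeholders chosen for illustration; the text does not preserve the paper's actual values. The last lines simply recompute the reported false-positive reduction.

```python
def select_pseudo_labels(detections, conf_threshold):
    """Keep detections confident enough to serve as pseudo-ground-truth."""
    return [d for d in detections if d["conf"] >= conf_threshold]


def mixed_batch_loss(real_losses, pseudo_losses, pseudo_weight):
    """Mean per-sample loss with pseudo-labeled samples down-weighted."""
    weighted = list(real_losses) + [pseudo_weight * l for l in pseudo_losses]
    return sum(weighted) / len(weighted)


# Hypothetical detector outputs on an unlabeled frame.
detections = [
    {"box": (10, 20, 40, 90), "conf": 0.92},
    {"box": (300, 15, 318, 60), "conf": 0.35},
    {"box": (150, 40, 170, 95), "conf": 0.71},
]
pseudo = select_pseudo_labels(detections, conf_threshold=0.7)

# 1:1 real:pseudo batch with made-up per-sample losses.
loss = mixed_batch_loss([1.0, 0.8], [1.2, 0.6], pseudo_weight=0.5)

# Reported false positives before and after pseudo-labeling.
fp_reduction = (4913 - 494) / 4913
print(len(pseudo), round(loss, 3), round(fp_reduction, 3))
```

The computed reduction is about 0.899, consistent with the "approximately 90%" figure quoted above.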
5. Performance Metrics and Experimental Evaluation
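Tracking quality is reported with standard multi-object tracking metrics:
- HOTA (higher-order tracking accuracy: the geometric mean of DetA and AssA, averaged over localization thresholds; higher is better)
- IDSW (identity switches; lower is better)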
- LocA (localization accuracy)
- DetA (detection accuracy)
- AssA (association accuracy)
- FN/FP (false negatives/positives)
On the SoccerTrack 2025 test set (Table 7; Jian et al., 31 Jan 2026):
| Method | HOTA | IDSW | LocA | DetA | AssA | FN | FP |
|---|---|---|---|---|---|---|---|
| GTATrack | 0.60 | 331 | 0.84 | 0.76 | 0.47 | 5454.5 | 982 |
| ByteTrack | 0.42 | 630 | — | — | — | — | — |
Ablation studies show that Deep-EIoU improves HOTA by 12 points over ByteTrack and that GTA-Link with pseudo-labeling delivers the best overall performance.
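As a quick consistency check on the table, HOTA can be approximated by the geometric mean of the reported DetA and AssA. This is only an approximation, because the official metric averages the per-threshold geometric means rather than combining the averaged components. The variable names are illustrative.

```python
import math

# Values from the GTATrack and ByteTrack rows of the table above.
det_a, ass_a, reported_hota = 0.76, 0.47, 0.60
idsw_gtatrack, idsw_bytetrack = 331, 630
hota_bytetrack = 0.42

hota_estimate = math.sqrt(det_a * ass_a)          # approximate HOTA from components
idsw_ratio = idsw_gtatrack / idsw_bytetrack       # fraction of ByteTrack's ID switches
hota_gain_points = round((reported_hota - hota_bytetrack) * 100)

print(round(hota_estimate, 3), round(idsw_ratio, 2), hota_gain_points)
```

The estimate lands within rounding of the reported 0.60, and the ID-switch ratio of roughly one half matches the halving of identity switches described for the global association step.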
6. Implementation and Open-Source Availability
GTATrack is implemented on a single NVIDIA RTX 3090 GPU. Key components include:
- Detection: YOLOv11x, input size 1280 px, batch 12.
- ReID Backbone: OSNet, D=512, L2-normalized features.
- Training: AdamW, 200 epochs, with multi-scale augmentations.
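- Deep-EIoU Parameters: expansion scale s, number of expansion stages, and proximity thresholds, as set in (Jian et al., 31 Jan 2026).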
- GTA-Link Parameters: DBSCAN eps=0.5, min_samples=7.
The full pipeline, including code for detection, Deep-EIoU, GTA-Link, and all training/inference scripts, is available at https://github.com/ron941/GTATrack-STC2025 (Jian et al., 31 Jan 2026).
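For readers who want the settings above in one place, they could be gathered into a small configuration object such as the one below. This is only an organizational sketch: the class and field names are hypothetical and do not reflect the layout of the linked repository, and values that are not stated in this text are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GTATrackConfig:
    # Detection
    detector: str = "YOLOv11x"
    input_size: int = 1280
    batch_size: int = 12
    # ReID backbone
    reid_backbone: str = "OSNet"
    reid_dim: int = 512
    # Detector training
    optimizer: str = "AdamW"
    epochs: int = 200
    learning_rate: Optional[float] = None      # not stated in this text
    # Deep-EIoU (values not stated in this text)
    expansion_scale: Optional[float] = None
    expansion_stages: Optional[int] = None
    # GTA-Link
    dbscan_eps: float = 0.5
    dbscan_min_samples: int = 7


config = GTATrackConfig()
print(config)
```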
7. Context, Limitations, and Generalization Potential
GTATrack is tailored for single-camera, fixed-view sports scenarios—specifically, soccer with static fisheye cameras. Spatial constraints within GTA-Link rely on fixed-field geometry, and hyperparameters (for clustering, temporal windows, pseudo-label threshold) were empirically set for SoccerTrack data. The split-and-merge paradigm demonstrated here has applicability to domains with strong visual ReID signals and challenging appearance/geometry (e.g., small distant objects, highly dynamic scenes) but would require adaptation for online or multi-camera settings (Sun et al., 2024). The synergy of geometric expansion, global appearance clustering, and semi-supervised detection refinement underpins its state-of-the-art performance in the competitive SoccerTrack 2025 context.
Key references:
- "GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association" (Jian et al., 31 Jan 2026)
- "GTA: Global Tracklet Association for Multi-Object Tracking in Sports" (Sun et al., 2024)