SoccerNet 2025 Challenges

Updated 28 August 2025
  • SoccerNet 2025 Challenges is a benchmark comprising four vision-based tasks in football video analysis with unified datasets and protocols.
  • Methodological advances include joint team-action modeling, tailored data augmentation, and multi-modal fusion using large-scale pre-trained models.
  • Successful approaches achieved notable gains, such as a Team-mAP@1 of 60.03 and a GS-HOTA score up to 63.90, reflecting progress in sports video understanding.

The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, designed to drive forward computer vision research in football video understanding. This edition centered on four major vision-based tasks—Team Ball Action Spotting, Monocular Depth Estimation, Multi-View Foul Recognition, and Game State Reconstruction—each defined by large-scale annotated datasets, unified evaluation protocols, and strong open-source baselines. The challenges aimed to advance reproducible, open research at the intersection of computer vision, machine learning, and football analytics, reflecting the growing complexity and practical relevance of video understanding in sports (Giancola et al., 26 Aug 2025).

1. Team Ball Action Spotting

The Team Ball Action Spotting task focused on temporally localizing on-ball actions in broadcast football videos and, uniquely, assigning each detected action to the appropriate team as seen from the camera’s view. The challenge involved 12 action classes, increasing temporal detection granularity and semantic complexity by adding team attribution.

Methodologically, most top entries began with the T-Deed baseline for ball action spotting. Innovations were primarily centered on the fusion of team information. The winning approach abandoned separate heads for action class and team assignment in favor of a joint head that predicts team–action combinations directly, consolidating the state space and reducing redundant non-action phases. Extensive data augmentation was commonly used, including horizontal flips (with label swapping for team orientation), brightness changes, crop strategies, and artificial camera-pan transformations to promote generalization.
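
As an illustration of the joint formulation, the sketch below maps each (action, team) pair to a single class index so that one head predicts both jointly, and shows how a horizontal flip swaps the team label along with the image. The names, shapes, and left/right team convention are illustrative assumptions, not the challenge implementation.

```python
# Minimal sketch of a joint team-action label space and flip augmentation.
# The left/right team convention is an assumption for illustration.
NUM_ACTIONS = 12
TEAMS = ("left", "right")

def joint_label(action_id: int, team: str) -> int:
    """Map an (action, team) pair onto a single index for one joint prediction head."""
    return action_id * len(TEAMS) + TEAMS.index(team)

def split_label(joint_id: int) -> tuple[int, str]:
    """Recover the action id and team from a joint class index."""
    return joint_id // len(TEAMS), TEAMS[joint_id % len(TEAMS)]

def horizontal_flip(frames, labels):
    """Mirror a clip and swap the team attribution in its labels.

    frames: array with width as the last axis, e.g. (T, C, H, W).
    labels: list of (timestamp, joint class id) tuples.
    """
    flipped_frames = frames[..., ::-1]  # reverse the width axis
    flipped_labels = []
    for t, joint_id in labels:
        action_id, team = split_label(joint_id)
        swapped = "right" if team == "left" else "left"
        flipped_labels.append((t, joint_label(action_id, swapped)))
    return flipped_frames, flipped_labels
```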

Evaluation relied on a team-aware mean Average Precision (Team-mAP@1) at a tight 1-second tolerance: for each action–team pair, AP was calculated and then combined by weighting according to the number of ground-truth events for each team. This allowed balanced performance measurement across classes. The best solution achieved a Team-mAP@1 of 60.03, an improvement of more than 8 points over the prior benchmark.
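
One plausible reading of this weighting, sketched below, combines per (action, team) AP values using ground-truth event counts as weights; how each AP@1 value is computed from detections is omitted, and the dictionary layout is an assumption for illustration.

```python
def team_map_at_1(ap_per_class: dict, gt_counts: dict) -> float:
    """Combine per (action, team) AP values, weighted by ground-truth event counts.

    ap_per_class: {(action, team): AP at 1-second tolerance}
    gt_counts:    {(action, team): number of ground-truth events}
    """
    total = sum(gt_counts.values())
    if total == 0:
        return 0.0
    return sum(ap * gt_counts[key] for key, ap in ap_per_class.items()) / total
```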

2. Monocular Depth Estimation

Monocular Depth Estimation addressed the recovery of scene geometry from single-camera football broadcasts via per-pixel relative depth prediction. Unlike metric depth estimation, this task emphasized ordinal relationships, relevant for tasks such as player localization and offside detection amid dynamic broadcast conditions.

Top approaches fine-tuned large-scale vision transformers (e.g., Depth Anything V2 with ViT-L backbone), employing combined loss functions that accounted for scale-and-shift invariance as well as gradient matching (SSI and SSIGM losses) to enforce global and local structure consistency. Fine-tuning was performed on soccer-specific datasets at full resolution, supplemented by domain-relevant augmentations (e.g., grass hue normalization, motion blur).
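
As a rough illustration of the scale-and-shift-invariant idea, the sketch below aligns a predicted depth map to the ground truth with a closed-form least-squares scale and shift before computing an error; the gradient-matching term and the exact weighting used by the submissions are omitted.

```python
import numpy as np

def ssi_aligned_error(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Align pred to gt with a least-squares scale and shift, then return the MSE.

    pred, gt: H x W depth maps; mask: H x W boolean array of valid pixels.
    An illustrative approximation of SSI-style losses, not the challenge's exact loss.
    """
    p, g = pred[mask], gt[mask]
    design = np.stack([p, np.ones_like(p)], axis=1)        # columns: [pred, 1]
    (scale, shift), *_ = np.linalg.lstsq(design, g, rcond=None)
    aligned = scale * p + shift
    return float(np.mean((aligned - g) ** 2))
```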

The main metric, Root Mean Square Error (RMSE), was defined as

$$RMSE = \sqrt{\frac{1}{H \times W} \sum_{i,j} (z_{ij} - \hat{z}_{ij})^2}$$

where $z_{ij}$ and $\hat{z}_{ij}$ are the ground-truth and predicted depth values at pixel $(i, j)$. This metric directly measured the fidelity of the predicted relative depth map. Relative to the baseline, leading submissions achieved significantly lower RMSE through tailored augmentation and loss design.
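
The metric itself reduces to a one-line computation over the dense depth maps, as in this minimal sketch assuming NumPy arrays of identical shape:

```python
import numpy as np

def depth_rmse(gt: np.ndarray, pred: np.ndarray) -> float:
    """Root mean square error over all H x W pixels, matching the formula above."""
    return float(np.sqrt(np.mean((gt - pred) ** 2)))
```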

3. Multi-View Foul Recognition

The Multi-View Foul Recognition challenge required systems to classify both the type and severity of foul incidents across multiple time-synchronized camera views. Each instance carried fine-grained categorization: eight foul types and a severity label (no offence, yellow card, or red card). Critical to this challenge was the aggregation of evidence from variable viewpoints, including both live and replay angles.

Competitive solutions leveraged large video transformer architectures (e.g., TAdaFormer-L/14) pre-trained on massive action-recognition corpora. To aggregate across viewpoints, view embeddings were learned and multi-view feature representations underwent max pooling, treating live angles as primary and replays as auxiliary. Beyond deep architectures, zero-shot approaches using video-LLMs with tailored prompt engineering (encoding soccer rules, context, and chain-of-thought) were tested, exploiting LLMs' priors for recognition.
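
A schematic of the view-aggregation idea, learned view-type embeddings added to per-view clip features followed by max pooling, might look like the PyTorch sketch below; the feature dimension, head layout, and class counts are assumptions rather than the competition architecture.

```python
import torch
import torch.nn as nn

class MultiViewAggregator(nn.Module):
    """Add a learned view-type embedding to each view's clip feature,
    then max-pool across views before classification."""

    def __init__(self, feat_dim: int = 768, num_view_types: int = 2,
                 num_foul_types: int = 8, num_severities: int = 3):
        super().__init__()
        # view type 0: live angle (primary), 1: replay (auxiliary) -- an assumed convention
        self.view_embed = nn.Embedding(num_view_types, feat_dim)
        self.type_head = nn.Linear(feat_dim, num_foul_types)
        self.severity_head = nn.Linear(feat_dim, num_severities)

    def forward(self, view_feats: torch.Tensor, view_types: torch.Tensor):
        # view_feats: (batch, num_views, feat_dim) features from a video backbone
        # view_types: (batch, num_views) integer ids distinguishing live vs. replay
        x = view_feats + self.view_embed(view_types)
        pooled, _ = x.max(dim=1)                 # max pooling across views
        return self.type_head(pooled), self.severity_head(pooled)
```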

Balanced accuracy was the primary metric, averaging the correctly classified proportion over all foul types and severity levels:

$$\text{Balanced Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{P_i}$$

for $N$ classes, where $TP_i$ is the count of true positives and $P_i$ is the ground-truth count for class $i$. The best balanced accuracy reached 52.22%, indicating robust progress on both type and severity identification.
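
A direct implementation of this definition from ground-truth and predicted labels (a minimal sketch, assuming hashable class labels):

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred) -> float:
    """Mean over classes of true positives divided by ground-truth count, as above."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)
```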

4. Game State Reconstruction

In Game State Reconstruction, the objective was to localize and assign identity attributes (role, team, jersey number) to all players and referees from a single broadcast perspective, reconstructing the game state as a 2D top-view minimap per frame. This task demands temporal association, spatial calibration, and fine-grained re-identification under broadcast-specific challenges such as occlusion and camera movement.

State-of-the-art methods used a pipeline combining high-performance detectors (YOLO-X and newer YOLO variants) and trackers (Deep-EIoU) with OSNet-based re-identification. Novelty arose in the matching mechanism, specifically the GS-HOTA metric, adapted to enforce both high-precision localization and strict identity consistency:

$$\mathrm{Sim}_{GS\text{-}HOTA}(P, G) = \mathrm{LocSim}(P, G) \times \mathrm{IdSim}(P, G)$$

where

$$\mathrm{LocSim}(P, G) = \exp\!\left(\ln(0.05)\, \frac{\lVert P - G \rVert^2}{\tau^2}\right)$$

and $\mathrm{IdSim}(P, G) = 1$ if the role, team, and jersey number all match, and $0$ otherwise. Final scores, measured by GS-HOTA, ranged as high as 63.90 for leading teams, up from baseline values near 29.
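
The similarity kernel combines a soft localization term, which decays to 0.05 when the prediction lies a full tolerance radius τ from the ground truth, with a hard identity gate; a direct transcription of the formulas above might read as follows (the default value of τ is an illustrative assumption):

```python
import math

def gs_hota_similarity(pred_xy, gt_xy, pred_id, gt_id, tau: float = 5.0) -> float:
    """Sim = LocSim * IdSim, following the GS-HOTA definition above.

    pred_xy, gt_xy: (x, y) pitch-plane coordinates of a prediction and a ground truth.
    pred_id, gt_id: (role, team, jersey_number) tuples; identity must match exactly.
    tau: localization tolerance (the default here is an assumption for illustration).
    """
    dist_sq = (pred_xy[0] - gt_xy[0]) ** 2 + (pred_xy[1] - gt_xy[1]) ** 2
    loc_sim = math.exp(math.log(0.05) * dist_sq / tau ** 2)
    id_sim = 1.0 if pred_id == gt_id else 0.0
    return loc_sim * id_sim
```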

Tracklet refinement strategies included split-and-merge operations based on jersey number and team, supplemented by vision-LLMs (e.g., LLaMA-Vision) for context-aware attribute recovery from ambiguous detections. This multi-stage refinement proved crucial for maintaining consistency in long, occlusion-prone broadcast sequences.
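
As a simplified illustration of the split step, a tracklet can be cut wherever its per-frame identity hypothesis (team, jersey number) changes; the data layout and the absence of any voting or smoothing are assumptions, and the merge step that re-links fragments sharing an identity is analogous.

```python
def split_tracklet(frames):
    """Split a tracklet into fragments, each with a consistent (team, jersey) identity.

    frames: list of dicts with keys 'box', 'team', 'jersey' (per-frame predictions).
    Returns a list of fragments, each a list of consecutive frames.
    """
    fragments, current = [], []
    for frame in frames:
        if current and (frame["team"], frame["jersey"]) != (current[-1]["team"], current[-1]["jersey"]):
            fragments.append(current)   # identity switched: cut the tracklet here
            current = []
        current.append(frame)
    if current:
        fragments.append(current)
    return fragments
```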

5. Methodological Advances and Insights

Major innovations across tasks included: unified prediction heads for joint action-team modeling; domain-specific data augmentation; hybrid loss functions that jointly optimize multiple objectives (e.g., SSI/SSIGM for depth, focal loss for spotting); and multi-modal fusion (video, language, view aggregation).

Pretrained large-scale models (transformers on Kinetics, vision-language or vision-only large models) were omnipresent, especially when coupled with soccer-specific fine-tuning. Ensemble strategies and staged training were used to mitigate overfitting and bridge the gap between training and inference setups, particularly in multi-camera scenarios.

A trend toward integrating contextual reasoning—using prompt-based LLMs for foul understanding or contextual tracklet merging in GSR—emerged as critical for dealing with incomplete or ambiguous visual cues.

6. Datasets, Evaluation Protocols, and Tools

All tasks were supported by newly released or expanded datasets:

  • SoccerNet Ball Action Spotting (with team labels),
  • SoccerNet-Depth (synthetic and real data for ordinal monocular depth),
  • SoccerNet-MVFouls (multi-synchronized-view foul events),
  • SoccerNet-GSR (dense game state annotations: player positions, tracking, identity).

Each task provided unified protocols and open repositories including development kits and baselines, e.g., Team-mAP@1 for action spotting, RMSE for depth, GS-HOTA for game state, and balanced accuracy for multi-view foul recognition. Tools were disseminated via GitHub, reinforcing open science and cross-institutional replication (Giancola et al., 26 Aug 2025).

7. Challenges, Limitations, and Future Outlook

Despite significant progress, difficulties remain: robust temporal localization at scale (particularly under rapid ball movement or overlapping actions), identity assignment amid frequent occlusions, and fusing multi-view evidence (especially with varying broadcast standards). The inference–training discrepancy, especially regarding temporal or camera variation, motivated staged learning and adaptive aggregation approaches.

Likely directions for future editions include truly fine-grained real-time inference, richer scene context (audio, external event or tactical feeds), and more advanced handling of complex spatiotemporal associations. Advances in vision-language modeling, self-supervised learning, and robust cross-modal aggregation are also anticipated to play a prominent role.


In sum, SoccerNet 2025 expanded the technical frontier of football video understanding through rigorous evaluation, state-of-the-art methods, and a robust open science ecosystem across ball action spotting with team attribution, monocular depth estimation, multi-view foul classification, and comprehensive game state reconstruction. The integration of deep learning, domain-specific innovations, and large-scale annotated data has established a new reference standard for computer vision in sports analysis (Giancola et al., 26 Aug 2025).

References

  • Giancola et al., SoccerNet 2025 Challenges, 26 Aug 2025.
