- The paper introduces a novel pipeline integrating YOLOv5m, SegFormer, and DeepSORT to accurately reconstruct soccer game states from broadcast footage.
- It employs advanced techniques in camera calibration and team detection, achieving a GS-HOTA score of 63.81 in the SoccerNet GSR Challenge.
- The methodology offers practical applications in professional sports analytics by enabling precise player tracking and tactical evaluation.
From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction
Introduction
The paper "From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction" presents a comprehensive solution for Game State Reconstruction (GSR) in sports analytics, specifically focusing on soccer. This task involves tracking and identifying players, goalkeepers, referees, and other field participants in real-world coordinates from broadcast footage using a single-camera setup. The proposed pipeline achieves state-of-the-art performance in this domain, as evidenced by its successful deployment in the 2024 SoccerNet Game State Reconstruction Challenge.
GSR from video streams presents unique challenges, including occlusions, camera movements, and distinguishing between visually similar players. The authors address these challenges by integrating several cutting-edge approaches: enhanced object detection using YOLOv5m, a SegFormer-based camera parameter estimator, and a DeepSORT-based tracking framework coupled with re-identification (ReID) embeddings and jersey number recognition.
Methodology
The methodology is structured into three primary stages: raw tracking, team detection, and post-processing. Each stage contributes distinct elements essential for effective game state reconstruction.
Raw Tracking Stage
The raw tracking stage uses YOLOv5m for object detection, tuned for soccer-specific objects, i.e., players and the ball. It generates preliminary player tracks in real time, drawing on pitch localization and jersey number recognition.
Figure 1: The raw tracking stage performs object detection and pitch localization, collects the information about players' teams required in subsequent stages along with Re-ID embeddings and jersey numbers, and then merges all collected data into preliminary object tracks using DeepSORT-based tracking.
The pipeline employs a customized camera parameter estimation model based on SegFormer architecture. This model predicts camera parameters (position, orientation, field of view) and is refined using detected keypoints (field markings) for improved accuracy in mapping objects from image to real-world coordinates.
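To make the image-to-pitch mapping concrete, the following is a minimal sketch of how estimated camera parameters could be used to project a pixel onto the pitch plane. It assumes a simple pinhole model without lens distortion, a pitch lying in the world plane z = 0, and a world-to-camera rotation matrix `R`. The function names `focal_from_fov` and `pixel_to_pitch` are hypothetical and are not taken from the paper.

```python
import math


def focal_from_fov(image_width, horizontal_fov_deg):
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    return (image_width / 2.0) / math.tan(math.radians(horizontal_fov_deg) / 2.0)


def pixel_to_pitch(u, v, cam_pos, R, focal_px, cx, cy):
    """Intersect the viewing ray of pixel (u, v) with the pitch plane z = 0.

    cam_pos: camera centre (x, y, z) in world coordinates (metres).
    R:       3x3 world-to-camera rotation, given as a list of rows.
    Returns (x, y) on the pitch, or None if the ray never reaches the plane.
    """
    # Ray direction in camera coordinates (x right, y down, z forward).
    ray_cam = ((u - cx) / focal_px, (v - cy) / focal_px, 1.0)
    # Rotate into world coordinates: d_world = R^T * d_cam.
    ray_world = tuple(
        sum(R[i][j] * ray_cam[i] for i in range(3)) for j in range(3)
    )
    if abs(ray_world[2]) < 1e-9:
        return None  # ray is parallel to the pitch plane
    t = -cam_pos[2] / ray_world[2]
    if t <= 0:
        return None  # intersection lies behind the camera
    return (cam_pos[0] + t * ray_world[0], cam_pos[1] + t * ray_world[1])


# Illustrative setup (an assumption): camera 10 m above the origin, looking straight down.
R_DOWN = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
```

In a tracking pipeline, the pixel passed in would typically be the bottom centre of a player's bounding box, since that point is the one most likely to lie on the ground plane.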
Team Detection Stage
Team detection aggregates information about player uniforms and roles, then groups the resulting embeddings into team-specific clusters using unsupervised methods such as k-means.
Figure 2: Team Detection Process. (a) Frames are clustered into three main groups: the two largest clusters (left and right teams) and the referee cluster. (b) Goalkeeper detection is performed separately by identifying athletes inside the penalty area and clustering them based on embeddings.
During this stage, the pipeline distinguishes team affiliations via uniform-specific ReID embeddings enhanced with role prediction.
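The clustering idea can be illustrated with a small, self-contained sketch. It assumes each track has already been reduced to a single embedding vector, runs a plain k-means with k = 3, and labels the two largest clusters as teams and the remaining one as referees, following the description of Figure 2(a). The deterministic farthest-point initialisation and the names `kmeans` and `assign_groups` are choices made for this example, not details from the paper, and goalkeeper handling is omitted.

```python
from collections import Counter


def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        # Pick the point that is farthest from every centre chosen so far.
        far = max(points, key=lambda p: min(_dist2(p, c) for c in centers))
        centers.append(tuple(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre for every point.
        labels = [
            min(range(k), key=lambda j: _dist2(p, centers[j])) for p in points
        ]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def assign_groups(embeddings):
    """Label the two largest clusters as teams and the smallest as referees."""
    labels = kmeans(embeddings, k=3)
    by_size = [cluster for cluster, _ in Counter(labels).most_common()]
    names = dict(zip(by_size, ("team_a", "team_b", "referee")))
    return [names[lab] for lab in labels]
```

Real ReID embeddings are high-dimensional, and a production system would more likely rely on a library implementation of k-means; the pure-Python version here only keeps the example dependency-free.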
Post-Processing Stage
Post-processing focuses on merging fragmented player tracks into coherent trajectories. Advanced techniques use jersey numbers, team labels, and ReID vectors to correct tracking errors, eliminate identity swaps, and ensure temporal consistency.
This stage is critical to the pipeline's success. It fuses ReID feature vectors with jersey number recognition so that identities can be recovered in short intervals where players cannot be told apart visually, for example during occlusions.
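One way to picture this kind of track merging is a greedy grouping of tracklets under compatibility rules. The sketch below is an illustrative assumption rather than the authors' algorithm: two tracklets may share an identity only if they do not overlap in time and belong to the same team; if both have a recognized jersey number, the numbers decide, and otherwise the cosine similarity of their ReID embeddings must reach a threshold. The `Tracklet` fields, the `min_sim` value of 0.8, and the function names are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Tracklet:
    track_id: int
    start: int                   # first frame index
    end: int                     # last frame index (inclusive)
    team: str
    jersey: Optional[int]        # None when no number was recognized
    embedding: Sequence[float]   # mean ReID vector of the tracklet


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compatible(a, b, min_sim=0.8):
    """Can tracklets a and b plausibly belong to the same person?"""
    if a.start <= b.end and b.start <= a.end:
        return False  # visible at the same time, so two different people
    if a.team != b.team:
        return False
    if a.jersey is not None and b.jersey is not None:
        return a.jersey == b.jersey  # numbers are decisive when both are known
    return cosine(a.embedding, b.embedding) >= min_sim


def merge_tracklets(tracklets, min_sim=0.8):
    """Greedily group tracklets; returns {track_id: merged identity id}."""
    groups = []
    for t in sorted(tracklets, key=lambda x: x.start):
        for group in groups:
            if all(compatible(t, member, min_sim) for member in group):
                group.append(t)
                break
        else:
            groups.append([t])
    return {t.track_id: gid for gid, group in enumerate(groups) for t in group}
```

A greedy pass like this is order-dependent; a global assignment over all tracklet pairs would be more robust, at the cost of a more involved implementation.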
Evaluation and Results
The evaluation of the proposed method was conducted using the SoccerNet Game State Reconstruction (GSR) Challenge dataset. Results were measured with the GS-HOTA metric, extending the standard HOTA by emphasizing roles, team affiliations, and jersey numbers for rigorous tracking requirements.
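For intuition, the sketch below shows how a GS-HOTA-style similarity between a predicted and a ground-truth athlete might be computed. It follows the SoccerNet GSR benchmark description as best understood here: a Gaussian-like localization term on pitch coordinates multiplied by an all-or-nothing identity term. The constants (a tolerance of 5 m and a similarity of 0.05 at that distance) come from that benchmark description rather than from this summary, so they should be treated as assumptions, as should the attribute keys `role`, `team`, and `jersey`.

```python
import math


def loc_sim(pred_xy, gt_xy, tau=5.0):
    """Localization similarity: 1.0 at zero distance, 0.05 at tau metres."""
    d2 = (pred_xy[0] - gt_xy[0]) ** 2 + (pred_xy[1] - gt_xy[1]) ** 2
    return math.exp(math.log(0.05) * d2 / tau ** 2)


def id_sim(pred_attrs, gt_attrs):
    """Identity similarity: 1.0 only if role, team and jersey number all match."""
    keys = ("role", "team", "jersey")
    return 1.0 if all(pred_attrs.get(k) == gt_attrs.get(k) for k in keys) else 0.0


def gs_similarity(pred_xy, pred_attrs, gt_xy, gt_attrs, tau=5.0):
    """Combined similarity that would replace IoU when matching detections."""
    return loc_sim(pred_xy, gt_xy, tau) * id_sim(pred_attrs, gt_attrs)
```

Under this formulation, a perfectly localized player with a wrong jersey number scores zero, which is why attribute recognition weighs so heavily in the final ranking.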
Constructor Tech's solution achieved the highest score in the challenge, with a GS-HOTA of 63.81, outperforming the competing teams and the baselines.
Conclusion and Implications
This paper effectively combines multiple advanced techniques for addressing the complexities inherent in GSR from soccer broadcast footage. With its modular design and robust optimization of camera and tracking systems, it sets a new standard in sports video understanding.
Future work is set to refine model integration, with plans to unify camera calibration and field detection models under a more comprehensive architecture. Improvements in orientation prediction and jersey number association are also expected to further enhance the capabilities of this pipeline.
Overall, the implications of this research lie in its potential applications within professional sports analytics, providing coaches and analysts with a sophisticated toolset for player performance evaluation and tactical decision-making. The successful implementation in a competitive challenge underscores its practical viability and lays the groundwork for future explorations in AI-driven sports analysis.