SoccerNet Dataset: Sports Video Benchmark

Updated 8 September 2025
  • SoccerNet is a large-scale, multi-modal dataset offering precise temporal and spatial annotations for soccer broadcast analysis.
  • Its scalable annotation pipeline combines OCR, RANSAC alignment, and manual refinement to achieve one-second event anchoring across 500 full-length matches.
  • The dataset supports diverse benchmarks including action spotting, tracking, camera calibration, and video summarization with rigorous evaluation protocols.

SoccerNet is a comprehensive suite of large-scale, multi-modal benchmark datasets and associated tasks for sports video analysis centered on professional soccer broadcast footage. Originating with the release of a scalable annotated action spotting corpus, SoccerNet has since evolved into a unified platform that offers high-density temporal and spatial annotations, multimodal content, and rigorous evaluation protocols. It is extensively used for advancing research in action detection, player and ball tracking, camera calibration, video summarization, captioning, 3D scene understanding, and automated broadcast analytics.

1. Dataset Genesis and Core Composition

SoccerNet was introduced as one of the first large-scale public benchmarks for the action spotting problem in long, untrimmed soccer broadcasts (Giancola et al., 2018). The v1 dataset consists of 500 full-length matches (764 hours in total) from six top-tier European leagues, collected across the 2014–2017 seasons. Annotations initially covered three main event classes—goal, card (yellow/red), and substitution—parsed automatically from online match reports at a coarse one-minute resolution, then manually refined to one-second anchoring following tightly defined soccer event rules. Synchronization between broadcast time and true game time was enforced using game-clock OCR and RANSAC-based alignment. The resulting corpus comprises 6,637 precisely anchored events, an average of one event every 6.9 minutes, making it well suited to "sparse temporal anchoring" benchmarks.

The key structural elements at launch were:

Component              Description                                                   Scale
#Games                 Top-tier European leagues (Serie A, Premier League,           500
                       La Liga, Bundesliga, etc.)
Duration               Full-length, untrimmed matches                                764 hours
Event Classes          Goal, Yellow/Red Card, Substitution                           3
#Temporal Annotations  One-second resolution, manually refined                       6,637

Subsequent releases (v2 and onwards) expanded both coverage and richness:

  • SoccerNet-v2 introduced ~300,000 annotations over 500 games, including 17 action classes (with semantic tags for visibility), 158,500 camera shot delimiters (spanning 13 shot classes and multiple transition types), and 32,900 replay-action linkages (Deliège et al., 2020).
  • SoccerNet-v3 and v3D extended the paradigm to spatial (tracking, re-identification, depth, and 3D localization) and multimodal domains (audio augmentations), alongside supporting the development of domain-specific algorithms and leaderboards.

2. Annotation Procedures, Scalability, and Multimodal Extensions

Annotation relies on hybrid pipelines: coarse, widely available tabular data is refined through human raters following domain rules. Games are aligned in time via game clock OCR, with RANSAC-based linear mapping, and each event (goal, card, substitution) is anchored to a precise game frame.
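The clock-alignment step described above can be sketched as a robust line fit: given noisy (broadcast time, OCR-read game clock) pairs, RANSAC rejects misread clocks (e.g., from replays or OCR digit errors) before events are re-anchored. The helper below is an illustrative sketch under assumed conventions, not the original pipeline's code; the function and parameter names are hypothetical.

```python
import random

import numpy as np

def ransac_clock_fit(broadcast_t, game_t, n_iters=200, tol=2.0, seed=0):
    """Fit game_time ~= a * broadcast_time + b with RANSAC,
    rejecting outlier OCR readings (misread digits, replays)."""
    rng = random.Random(seed)
    broadcast_t = np.asarray(broadcast_t, dtype=float)
    game_t = np.asarray(game_t, dtype=float)
    n = len(broadcast_t)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        # Sample two distinct correspondences to hypothesize a line.
        i, j = rng.sample(range(n), 2)
        if broadcast_t[i] == broadcast_t[j]:
            continue
        a = (game_t[j] - game_t[i]) / (broadcast_t[j] - broadcast_t[i])
        b = game_t[i] - a * broadcast_t[i]
        inliers = np.abs(a * broadcast_t + b - game_t) <= tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set with least squares for the final mapping.
    a, b = np.polyfit(broadcast_t[best_inliers], game_t[best_inliers], 1)
    return a, b
```

With the fitted (a, b), any event's broadcast timestamp maps directly to game time, which is what allows one-second anchoring despite noisy per-frame OCR.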

Scalability is a core feature: because coarse labels come from structured match reports, annotation cost grows linearly with additional games, with manual refinement estimated at under 10 minutes per match. The pipeline has proved extensible—SoccerNet has grown to hundreds of thousands of annotations (a scale of 171,778 events over 13,489 games has been cited as feasible)—and it provides a template for incorporating new leagues or event types at minimal cost.

SoccerNet is not limited to vision:

  • SoccerNet-Echoes augments the dataset with Whisper-based ASR transcriptions of broadcast commentaries, auto-translated into English for uniformity and distributed as structured JSON aligned to video frames (Gautam et al., 12 May 2024), supporting multimodal research across text, audio, and vision.
  • Recent releases provide replays synchronized with live feeds, multi-view sequencing, and audio/text-video alignment for joint analysis.

3. Benchmark Tasks and Evaluation Protocols

SoccerNet pioneered several fine-grained sports video understanding tasks, including:

Action Spotting:

The canonical task requires localizing "anchor" timestamps for events in long, untrimmed video. Unlike action detection, which predicts temporal segments, action spotting outputs a single timestamp per event—posing an extreme-sparsity challenge. A prediction is deemed correct if its timestamp lies within a configurable tolerance δ of the ground truth, i.e. |t_pred − t_gt| ≤ δ. Evaluation relies on mean Average Precision (mAP) aggregated across classes and multiple δ tolerances, with the Average-mAP (area under the mAP-δ curve) used for overall system ranking (Giancola et al., 2018).
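The tolerance-based matching and per-class AP can be sketched as below. This is an illustrative, simplified implementation (one-to-one greedy matching by confidence, non-interpolated AP), not the official evaluation code; names are hypothetical.

```python
import numpy as np

def spotting_ap(preds, gts, delta):
    """Average Precision for single-class action spotting.

    preds: list of (timestamp, confidence); gts: list of ground-truth timestamps.
    A prediction is a true positive if it lies within `delta` seconds of a
    not-yet-matched ground-truth anchor (highest-confidence predictions first)."""
    if not gts:
        return 0.0
    preds = sorted(preds, key=lambda p: -p[1])
    matched = [False] * len(gts)
    tp = []
    for t, _conf in preds:
        hit = False
        for k, g in enumerate(gts):
            if not matched[k] and abs(t - g) <= delta:
                matched[k] = hit = True
                break
        tp.append(hit)
    tp = np.array(tp, dtype=float)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / len(gts)
    # Accumulate precision at each recall increment (non-interpolated AP).
    ap, prev_r = 0.0, 0.0
    for p, r, is_tp in zip(precision, recall, tp):
        if is_tp:
            ap += (r - prev_r) * p
            prev_r = r
    return ap
```

The Average-mAP is then obtained by averaging this quantity over classes and over a grid of δ values (e.g., 5 s to 60 s in v1).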

Expanded Benchmarks (v2+):

  • Action spotting with 17 classes and custom visibility tags.
  • Camera shot segmentation: framewise classification into 13 shot classes, boundary detection at shot transitions (evaluated by mIoU and mAP@1s respectively).
  • Replay grounding: linking replay segments to their original live action anchors via Siamese architectures.
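The framewise shot-classification metric above (mIoU) can be sketched with a small helper. This is an illustrative sketch, not the official evaluation script; the function name is hypothetical.

```python
import numpy as np

def framewise_miou(pred, gt, num_classes):
    """Mean IoU over shot classes for 1D framewise label sequences."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union:  # skip classes absent from both sequences
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Averaging per-class IoU (rather than overall frame accuracy) keeps rare shot classes from being swamped by the dominant "main camera" class.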

Each task is supported by corresponding open-source baselines (e.g., temporal NetVLAD, SoftDBOW, MaxPool, CALF—context-aware loss), pre-computed visual features, and unified data splits for fair benchmarking (Deliège et al., 2020).

Multi-modal and higher-order tasks include:

  • Player and ball multiple object tracking benchmarks with unique track ID consistency evaluation (using HOTA/DetA/AssA) (Cioppa et al., 2022).
  • Camera calibration and player localization pipelines, yielding field-adaptable homographies and real-world player projections (Cioppa et al., 2021).
  • 3D ball localization, monocular and multi-view, leveraging calibrated cameras, triangulation, and bounding box refinement (Gutiérrez-Pérez et al., 14 Apr 2025).
  • Automatic dense video captioning tied to temporally-anchored events, with metrics covering syntax (BLEU, CIDEr), lexical semantics (soccer-specific “significant words”), and diversity (Hammoudeh et al., 2022, Mkhallati et al., 2023, Ruan et al., 31 Oct 2024).
  • Video summarization (SoccerHigh) with match/summary alignment, shot-overlap annotation, and F1 score constrained to the ground truth duration (Díaz-Juan et al., 1 Sep 2025).

4. Model Architectures and Baselines

Baseline models work within a modular paradigm, typically decomposing into feature extraction, temporal aggregation, and event prediction heads. Early approaches used dimensionality-reduced (PCA) outputs from C3D, I3D, or ResNet-152 as frame-level features, with temporal pooling via mean/max, temporal CNN, or learnable aggregation modules such as NetVLAD and NetRVLAD. For action spotting, applying ResNet-152 with a NetVLAD layer (k=512 clusters) achieved an mAP of 67.8% for one-minute segment classification and an Average-mAP of 49.7% on spotting with δ tolerances between 5 and 60 seconds (Giancola et al., 2018).
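The NetVLAD aggregation step can be sketched in NumPy: each frame descriptor is softly assigned to K cluster centers and its residuals to those centers are accumulated into a fixed-length clip vector. This is an illustrative re-implementation under assumed conventions (distance-based soft assignment with a scale `alpha`), not the released baseline code.

```python
import numpy as np

def netvlad_pool(X, centers, alpha=10.0):
    """Aggregate T frame descriptors X (T, D) into a fixed-length (K*D,)
    clip vector, given K cluster centers (K, D)."""
    # Soft assignment of each frame to each cluster: (T, K)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)  # stabilize softmax
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)
    # Assignment-weighted residuals to each center, summed over frames: (K, D)
    resid = X[:, None, :] - centers[None, :, :]      # (T, K, D)
    vlad = (assign[:, :, None] * resid).sum(axis=0)  # (K, D)
    # Intra-normalize per cluster, then flatten and L2-normalize.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    out = vlad.ravel()
    return out / (np.linalg.norm(out) + 1e-12)
```

In the learnable NetVLAD layer, the centers and assignment parameters are trained end-to-end; pooling with k=512 clusters over ResNet-152 features is what yields the v1 baseline numbers quoted above.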

SoccerNet-v2 and later versions advocate strong baselines, adding:

  • GCNs for graph-based player representations.
  • Attention-based temporal pooling (e.g., CALF).
  • Multi-modal fusion: vision-language transformers for captioning, Whisper-based ASR for audio, fusion with scene graphs, and layout/top-view synthesis for spatial reasoning.

Recent community approaches in summary generation, monocular depth estimation, and team-action spotting increasingly leverage transformers, vision-language pretraining, and multimodal integration (Giancola et al., 26 Aug 2025, Díaz-Juan et al., 1 Sep 2025).

5. Impact on Computer Vision Research and Sports Analytics

SoccerNet’s influence is observable in several domains:

  • Benchmark Standardization: It established mAP-based evaluation as the standard for spotting and action localization in sports.
  • Task Generalization: The annotation protocol and scalable methodology have inspired similar datasets in other sports.
  • Open Science: Each task releases data splits, code, models, and leaderboards, fostering reproducibility.
  • Industrial Translation: Action spotting workflows underpin commercial highlight detection, automated annotation tools for broadcasters, and advanced analytic dashboards for clubs and federations.

A plausible implication is that SoccerNet's design—especially its tight coupling of annotation protocols, unified evaluation, and extensible pipeline—has accelerated adoption of modern deep learning methods (transformer-based, multimodal, 3D geometry-driven) in sports analytics.

6. Recent Developments and Ongoing Community Challenges

SoccerNet continues to expand its scope:

  • Recent editions support 3D scene reconstruction (SoccerNet-v3D), monocular and multi-view geometry for object localization, and minimap-based game state analysis (Somers et al., 17 Apr 2024, Golovkin et al., 8 Apr 2025).
  • The 2025 challenges introduced tasks for team ball action spotting (joint action and team prediction via unified heads), monocular depth estimation (relative, SI-invariant), multi-view foul recognition (balanced accuracy over class-imbalanced foul types and severities), and full game state minimap reconstruction (GS-HOTA metric) (Giancola et al., 26 Aug 2025).
  • Modular frameworks, Bayesian hyperparameter optimization, and multi-stage, auditable pipelines are increasingly reflected in top-ranked competition entries.

A notable feature in community editions is the transition towards holistic, multimodal video understanding: integrating synchronized replays, camera shot segmentation, event captioning, depth/geometry, ASR-transcribed commentary, and even automated referee support systems for foul detection.

7. Accessibility and Reproducibility

All core datasets, annotations, codebases, and challenge leaderboards are publicly accessible.

Datasets include pre-processed visual features and annotations suitable for direct use in academic and industrial pipelines. Continuous updates, including depth maps, minimap geometry, dense captions, and shot-level alignments for summarization, facilitate rapid prototyping and comparative studies.


SoccerNet stands as a cornerstone in sports video understanding, providing a multilayered resource for benchmarking, method development, and community advancement in automated soccer analytics. Its extensible design and rigorously documented protocols exemplify best practices in dataset construction and open research in computer vision.
