FOOTPASS: Soccer Action Spotting Dataset
- FOOTPASS is a large-scale, multi-modal soccer dataset featuring 81 hours of broadcast video and 102,992 on-ball event annotations.
- It provides comprehensive modalities including video, single-player tracklets, and detailed game-state data for 22 players per frame.
- Baseline methods that integrate graph-based local context and sequence-level tactical priors yield significant improvements, raising the overall F1 score from 35.9 to 67.5.
The Footovision Play-by-Play Action Spotting in Soccer Dataset (FOOTPASS) is a large-scale benchmark designed for the comprehensive study of player-centric action spotting in broadcast soccer videos, contextualized within a multi-modal, multi-agent tactical environment. It directly supports the development and evaluation of methods that jointly exploit computer vision outputs (e.g., tracking, action detection) and soccer-specific tactical priors to generate reliable, event-level play-by-play data streams, a critical foundation for modern sports analytics (Ochin et al., 20 Nov 2025).
1. Dataset Composition and Modalities
FOOTPASS comprises 54 complete men's soccer matches from the 2023/24 season, spanning competitions such as Ligue 1, Bundesliga, Serie A, La Liga, and the UEFA Champions League. Each match is recorded at 1920×1080 resolution and 25 fps, yielding approximately 81 hours of broadcast video and 102,992 manually validated play-by-play on-ball event annotations.
The dataset adopts a rigorously multi-modal and multi-agent data structure. For every frame (excluding replays), the following modalities are provided:
- Raw broadcast video (Full HD, 25 fps),
- Single-player tracklets, consisting of screen-space bounding boxes for event actors (with interpolation over gaps ≤ 50 frames),
- World-plane game-state for all 22 players, capturing 2D positions, velocities, team (binary), jersey number, and a coarse tactical "role" (from 13 predefined categories: GK, LB, LCB, MCB, RCB, LM, RM, DM, AM, LW, RW, CF, RB),
- Replay mask (flag marking live vs. replayed frames).
Table 1. Dataset Statistics
| Attribute | Value |
|---|---|
| Number of matches | 54 |
| Total video duration | 4,860 min (≈81 hr) |
| Annotation count | 102,992 events |
| Video resolution / FPS | 1920×1080 / 25 fps |
| Modalities per frame | Video, tracklets, game-state, replay mask |
| Players tracked | 22 per frame |
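For concreteness, a per-frame record can be pictured as follows. This is a minimal sketch with hypothetical field names mirroring the modalities above; it does not reflect the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

# The 13 coarse tactical roles listed above.
ROLES = ["GK", "LB", "LCB", "MCB", "RCB", "LM", "RM",
         "DM", "AM", "LW", "RW", "CF", "RB"]

@dataclass
class FrameRecord:
    """Hypothetical per-frame container for the FOOTPASS modalities.

    Field names are illustrative assumptions; consult the released
    annotation files for the actual schema.
    """
    frame_idx: int                             # frame index at 25 fps
    is_replay: bool                            # replay mask: True on replayed frames
    image: Optional[np.ndarray] = None         # (1080, 1920, 3) broadcast frame
    actor_boxes: Optional[np.ndarray] = None   # (K, 4) screen-space boxes (x1, y1, x2, y2)
    positions: Optional[np.ndarray] = None     # (22, 2) world-plane 2D positions
    velocities: Optional[np.ndarray] = None    # (22, 2) 2D velocities
    teams: Optional[np.ndarray] = None         # (22,) binary team identity
    jerseys: Optional[np.ndarray] = None       # (22,) jersey numbers
    roles: Optional[np.ndarray] = None         # (22,) index into ROLES
```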
2. Event Taxonomy and Annotation Paradigm
FOOTPASS encodes on-ball atomic action events using a "spotting" (anchor-based) representation. Each action is temporally grounded by a single anchor frame, coinciding with a ball contact or exit.
There are eight event classes:
- Drive: Reception followed by dribbling (anchor: first ball control);
- Pass: Intentional strike to a teammate (anchor: frame of contact);
- Cross: Pass into the penalty area from its exterior (anchor: frame of contact);
- Shot: Intentional strike at goal (anchor: frame of contact);
- Header: Head-ball contact event (anchor: frame of head contact);
- Throw-in: Ball release post-touchline crossing (anchor: release);
- Tackle: Legal dispossession of opponent (anchor: ball contact);
- Block: Intercepting a pass or shot (anchor: interception).
Each annotated event is stored as a tuple $(t, a, \tau, j)$, where $t$ denotes the frame index, $a$ the action class, $\tau$ the team identity, and $j$ the acting player's jersey number. Human annotators provide team and jersey details for all events, enabling robust agent identification even in the ∼18.5% of cases where bounding boxes are absent.
A "background" class is added at the frame level, yielding nine exclusive frame/player labels. The dataset does not supply a hierarchical event taxonomy, but tactical context is inherently encoded via role, team, and position features.
3. Tactical Structures and Priors
Multi-agent interactions are represented through the full game-state matrix $X_t$, a per-frame tensor stacking position, velocity, team, and role vectors for all 22 players. This configuration supports two central families of tactical priors:
- Local context graphs (GNN priors): Edges connect spatially proximate players, optionally encoding relative team and role information. Edge features enable models such as TAAD+GNN to leverage nearby player context for action classification (a graph-construction sketch follows this list).
- Sequence-level tactical priors (DST): Transformer-style denoising across up to 750 frames conditions on role, team, and action logits to regularize predictions, enforcing continuity (e.g., ball possession) and increasing event sequence plausibility over extended time horizons.
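To make the local-context prior concrete, below is a minimal graph-construction sketch over the per-frame game state. The neighborhood size `k` and the edge-feature set are illustrative assumptions, not the exact TAAD+GNN configuration.

```python
import numpy as np

def build_context_graph(positions, teams, roles, k=4):
    """Connect each player to its k spatially nearest neighbors.

    positions: (22, 2) world-plane coordinates
    teams:     (22,)   binary team identity
    roles:     (22,)   role index in [0, 13)
    Returns (edge_index, edge_feat): edges as a (2, E) array and per-edge
    features [dx, dy, distance, same_team, neighbor_role].
    """
    n = positions.shape[0]
    # Pairwise distances; mask the diagonal so a player is never its own neighbor.
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    src, dst, feats = [], [], []
    for i in range(n):
        for j in np.argsort(d[i])[:k]:          # k nearest neighbors of player i
            rel = positions[j] - positions[i]
            src.append(i)
            dst.append(j)
            feats.append([rel[0], rel[1], d[i, j],
                          float(teams[i] == teams[j]), float(roles[j])])
    return np.array([src, dst]), np.array(feats)

# Usage on random pitch-sized game-state data:
pos = np.random.rand(22, 2) * [105.0, 68.0]
team = np.repeat([0, 1], 11)
role = np.random.randint(0, 13, size=22)
edge_index, edge_feat = build_context_graph(pos, team, role)
```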
Player roles, assigned from 13 categories, supply the coarse tactical structure necessary for many forms of context-driven reasoning (e.g., wingers are more likely to perform crosses, central backs to block), and allow long-range context encoding.
4. Evaluation Protocols and Metrics
Evaluation follows an anchor-based temporal-matching paradigm. A prediction is a true positive if it matches class, team, and jersey, and is within a temporal window of $\pm 12$ frames (approximately 0.48 s at 25 fps) of a ground-truth event. Unmatched predictions are counted as false positives; unmatched ground truths as false negatives.
Per-class Average Precision (AP) is computed via the standard 11-point interpolated precision–recall curve:

$$\mathrm{AP}_c = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, \ldots,\, 1.0\}} \max_{\tilde{r} \ge r} p(\tilde{r})$$

Alternatively, discrete AP is:

$$\mathrm{AP}_c = \sum_{k=1}^{N_c} p(k)\, \Delta r(k)$$

where $N_c$ is the number of predictions for class $c$, $p(k)$ the precision at rank $k$, and $\Delta r(k)$ the change in recall from rank $k-1$ to $k$. Mean Average Precision (mAP) across all classes is:

$$\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c$$

Precision, recall, and F1 metrics are defined threshold-wise, with explicit formulas:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}$$
Precision–Recall curves are generated by threshold sweep.
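The protocol above maps directly to code. The following is a minimal sketch, assuming events are `(frame, action, team, jersey)` tuples and predictions are pre-sorted by confidence; the official evaluation scripts may differ in tie-breaking and interpolation details.

```python
import numpy as np

def match_events(preds, gts, delta=12):
    """Greedy anchor matching: a prediction is a true positive if class,
    team, and jersey all match and the anchors lie within `delta` frames
    (12 frames ~= 0.48 s at 25 fps). Returns (TP, FP, FN) counts."""
    used = set()
    tp = 0
    for f, a, tau, j in preds:
        for idx, (fg, ag, taug, jg) in enumerate(gts):
            if idx not in used and (a, tau, j) == (ag, taug, jg) \
                    and abs(f - fg) <= delta:
                used.add(idx)
                tp += 1
                break
    return tp, len(preds) - tp, len(gts) - tp

def prf1(tp, fp, fn):
    """Threshold-wise precision, recall, and F1 from match counts."""
    p = tp / max(tp + fp, 1)
    r = tp / max(tp + fn, 1)
    return p, r, 2 * p * r / max(p + r, 1e-9)

def ap_11pt(precisions, recalls):
    """11-point interpolated AP over a threshold-swept PR curve."""
    precisions, recalls = np.asarray(precisions), np.asarray(recalls)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        hit = recalls >= r
        ap += precisions[hit].max() if hit.any() else 0.0
    return ap / 11.0
```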
5. Baseline Methods and Comparative Results
Three baseline methods are implemented to quantify the impact of visual, graphical, and long-range tactical modeling:
- TAAD (Track-Aware Action Detector): A clip-level STAD model operating on short (2 s) player tubes, combining detection, tracking, a 3D CNN, ROI-Align, temporal convolution, and post-processing via temporal NMS (a minimal NMS sketch follows this list).
- TAAD + GNN: Extends TAAD with a temporally and spatially grounded player graph, Edge-Conv layers, and fusion of tactical features.
- TAAD + DST (Denoising Sequence Transduction): Applies a Transformer-like denoising model to TAAD’s sequence logits (up to 750 frames), exploiting roles and team priors to enforce sequence-level consistency and tactical plausibility.
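As referenced above, TAAD post-processes per-class detections with temporal NMS. Below is a minimal 1D sketch; the suppression `window` of 12 frames is an assumption for illustration, not the paper's reported setting.

```python
import numpy as np

def temporal_nms(frames, scores, window=12):
    """Keep the highest-scoring detection in each temporal neighborhood,
    suppressing lower-scoring ones within `window` frames of a kept one.
    Returns indices of kept detections in temporal order."""
    frames = np.asarray(frames)
    order = np.argsort(scores)[::-1]          # highest confidence first
    suppressed = np.zeros(len(frames), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        # Suppress everything (including i itself) within the window.
        suppressed |= np.abs(frames - frames[i]) <= window
    return sorted(keep)
```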
Table 2. Overall Precision, Recall, F1 (%)
| Method | Precision | Recall | F1 |
|---|---|---|---|
| TAAD | 25.6 | 59.9 | 35.9 |
| TAAD+GNN | 44.5 | 62.7 | 52.1 |
| TAAD+DST | 68.2 | 66.8 | 67.5 |
Table 3. Per-class Average Precision (%)
| Class | Pass | Drive | Shot | Throw-in | Cross | Header | Block | Tackle |
|---|---|---|---|---|---|---|---|---|
| TAAD | 48.9 | 39.3 | 39.8 | 39.6 | 46.4 | 23.4 | 13.1 | 1.5 |
| TAAD+GNN | 55.7 | 54.8 | 44.2 | 35.5 | 41.6 | 21.3 | 7.5 | 1.8 |
| TAAD+DST | 64.2 | 57.3 | 57.2 | 49.9 | 43.1 | 14.7 | 7.8 | 0.0 |
Classes with sparse occurrences (Header, Block, Tackle) remain substantial failure cases. DST improves recall on events where the acting player is not visible (the ∼18.5% of cases without bounding boxes), achieving ∼33% recall versus ≲4% for the other baselines, by leveraging sequence context and tactical priors. Class imbalance is substantial, with Drives and Passes comprising ∼90% of events; sequence modeling with DST dramatically raises F1 for these categories (e.g., Drive: 34%→68%). Tactical reasoning also raises overall precision from 25.6% (TAAD) to 68.2% (TAAD+DST), largely by reducing isolated false positives.
6. Distribution, Applications, and Research Outlook
FOOTPASS is fully public, with annotations and baseline code available through GitHub and HuggingFace; video frames require access via the established SoccerNet NDA. The resource supports research in spatiotemporal action detection, multi-agent sequence modeling, graph-structured deep learning, and tactical modeling, with broad application in automated sports analytics.
A plausible implication is that FOOTPASS enables methods combining low-level perceptual cues with high-level tactical priors, stimulating further integration of vision and structured reasoning in sports analysis. The benchmark's public release standardizes evaluation for player-centric action spotting and offers a platform for future developments in reliable play-by-play event extraction (Ochin et al., 20 Nov 2025).