
SoccerNet-Echoes: Multimodal Soccer Analytics

Updated 19 December 2025
  • SoccerNet-Echoes is a large-scale multimodal dataset integrating video, audio, and automatically transcribed commentary from broadcast soccer matches.
  • The dataset utilizes the Whisper ASR model and Google Translate to generate aligned captions, with performance evaluated via WER, CER, and BLEU scores.
  • It supports applications such as event spotting, highlight generation, dense captioning, and tactical analysis by synchronizing commentary with game events.

SoccerNet-Echoes is a large-scale multimodal dataset that augments the widely used SoccerNet benchmark with automatically transcribed and translated live audio commentaries from broadcast soccer matches. This resource connects visual, auditory, and linguistic modalities by providing time-aligned commentary text alongside match video and audio, significantly expanding the landscape for data-driven sports analytics. Transcriptions are generated using the Whisper automatic speech recognition (ASR) model and translated to English via Google Translate, enabling additional applications in event spotting, highlight generation, game summarization, and tactical analysis. SoccerNet-Echoes encompasses ∼825 hours of aligned multilingual commentary from 550 matches covering major European leagues and competitions, and inherits SoccerNet’s train/validation/test splits to facilitate systematic experimentation (Gautam et al., 12 May 2024).

1. Dataset Composition and Statistical Overview

SoccerNet-Echoes consists of 1,100 soccer match halves, representing 550 full matches. The dataset covers approximately 49,500 minutes (825 hours) of audio commentary extracted from mainstream European competitions across multiple seasons, with language distribution and seasonal segmentation as shown below.

Language Distribution per Half (Table 1):

Language        Halves   % of 1,100
English          297     27.0 %
Spanish          264     24.0 %
Russian          218     19.8 %
German           135     12.3 %
French           102      9.3 %
Turkish            4      0.4 %
Italian            4      0.4 %
Polish             2      0.2 %
Bosnian            2      0.2 %
Hungarian          2      0.2 %
Not available     70      6.4 %

League-by-Season Breakdown (Halves):

Season    EPL   UCL   Ligue 1   Bundesliga   Serie A   La Liga   Total halves
2014–15     6    37         1            8        11        18            160
2015–16    49    45         3           18         9        36            300
2016–17    49    26        43           35        85        63            301
2019–20     0     0         0            0         0         8             16

SoccerNet-Echoes maintains the original SoccerNet game-level splits (typically 300 train / 50 val / 150 test out of 500 games, with the 50 additional games apportioned identically). Each half is annotated with metadata: league, season, date, home/away teams, final score, and an accompanying ASR JSON file listing segments as ⟨start_s, end_s, text⟩ triples.
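The per-half ASR files can be consumed directly as (start, end, text) triples. The snippet below is a minimal sketch under an assumed JSON layout (a top-level "segments" list of [start_s, end_s, text] entries); the field names in the released files may differ.

```python
import json

# Hypothetical ASR JSON layout: a "segments" list of [start_s, end_s, text].
# The actual field names in the released files may differ.
sample = """
{
  "segments": [
    [12.4, 15.1, "And we are underway."],
    [31.0, 33.7, "What a strike, that is a goal!"]
  ]
}
"""

def load_segments(raw: str):
    """Return commentary segments as (start_s, end_s, text) tuples."""
    data = json.loads(raw)
    return [(float(s), float(e), t) for s, e, t in data["segments"]]

segments = load_segments(sample)
print(segments[1])  # (31.0, 33.7, 'What a strike, that is a goal!')
```

Keeping segments as plain numeric tuples makes later time-based joins against SoccerNet's event annotations straightforward.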

2. Automatic Speech Recognition Pipeline and Metrics

Whisper functions as the backbone ASR system, with three model variants (large-v1, large-v2, large-v3; commit ba3f3cd) employed for transcription. Input audio is downmixed to mono and resampled to 16 kHz, with no explicit denoising or voice-activity detection. The pipeline detects audio language from the first 30 seconds before either transcribing directly (for English) or passing non-English output to Google Translate.
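The routing logic of the pipeline can be sketched as follows. The three helper functions are stubs standing in for Whisper's language identification (on the first 30 s), Whisper transcription, and batch Google Translate; they are not the actual APIs.

```python
# Sketch of the transcribe-or-translate routing described above.
# detect_language, transcribe, and translate_to_english are stubs,
# not the real Whisper / Google Translate interfaces.

def detect_language(audio):          # stub: Whisper language ID on first 30 s
    return audio["lang"]

def transcribe(audio):               # stub: Whisper ASR output
    return audio["speech"]

def translate_to_english(text):      # stub: batch Google Translate
    return f"[en] {text}"

def commentary_to_english(audio):
    """Transcribe; translate only when the detected language is not English."""
    text = transcribe(audio)
    if detect_language(audio) != "en":
        text = translate_to_english(text)
    return text

print(commentary_to_english({"lang": "es", "speech": "¡Gol!"}))  # [en] ¡Gol!
print(commentary_to_english({"lang": "en", "speech": "Goal!"}))  # Goal!
```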

Transcription performance is quantified using word error rate (WER), character error rate (CER), and BLEU score, with ground-truth evaluation available for 40 halves annotated in the GOAL dataset:

  • WER = \frac{S + D + I}{N}, where S = substitutions, D = deletions, I = insertions, and N = total reference words.
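The WER definition above can be computed with a standard word-level edit distance; a minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / len(ref)

print(wer("the free kick is taken", "the free kick taken"))  # 0.2
```

Note that because insertions count against the hypothesis, WER can exceed 1.0 on heavily hallucinated output.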

ASR Performance Metrics (Table 2):

Model WER CER BLEU
whisper-large-v1 0.443 0.261 54.50
whisper-large-v2 0.458 0.269 52.59
whisper-large-v3 0.551 0.341 47.97

Manual inspection flagged 70 halves with empty or ambient-only audio (56 with no audio track, 14 with stadium noise only); aside from validating these cases, no hand corrections were performed.

3. Translation Quality and Evaluation

For commentary in non-English languages (∼48 % of halves), SoccerNet-Echoes applies batch-mode Google Translate to produce English text, with Whisper’s language detection determining whether translation is needed. While large-scale human verification was not performed, spot checks revealed occasional mistranslations, notably of soccer-specific terms (e.g., confusion between "free kick" and "penalty kick"). No standalone BLEU score for translation is reported; the pipeline-level BLEU therefore conflates transcription and translation noise.

The BLEU metric is calculated as:

BLEU = BP \cdot \exp\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr)

where BP is the brevity penalty, p_n is the n-gram precision, and w_n is the weight (typically w_n = 1/N).
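A minimal sentence-level implementation of this formula, for illustration only (real evaluations typically use corpus-level, smoothed BLEU as in sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, candidate: str, max_n: int = 2) -> float:
    """Sentence-level BLEU with uniform weights w_n = 1/N and brevity penalty.
    A teaching sketch: unsmoothed, so any zero n-gram precision yields 0."""
    ref, cand = reference.split(), candidate.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand_ng, ref_ng = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, ref_ng[g]) for g, c in cand_ng.items())
        p_n = clipped / max(sum(cand_ng.values()), 1)
        if p_n == 0:
            return 0.0
        log_p += math.log(p_n) / max_n          # w_n = 1/N
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_p)

print(round(bleu("the goal was scored late", "the goal was scored late"), 2))  # 1.0
```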

Reported BLEU across the ASR pipeline ranges from 47.97 to 54.50 depending on model variant, reflecting translation quality aggregated with ASR errors (Gautam et al., 12 May 2024).

4. Multimodal Integration and Applications

Each commentary segment is temporally indexed (start/end in seconds), facilitating direct alignment with video frames and SoccerNet’s event timestamps (e.g., goals, fouls, substitutions). This enables segmental associations between commentary text and specific on-field events.
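This alignment reduces to an interval query: given an event timestamp, collect the segments whose spans overlap a window around it. A minimal sketch, where the 15-second window is an arbitrary illustrative choice, not a value from the dataset:

```python
# Given time-indexed commentary segments and a SoccerNet-style event
# timestamp (in seconds), return segment texts overlapping a window
# around the event. Segment layout: (start_s, end_s, text).

def segments_near_event(segments, event_s, window_s=15.0):
    lo, hi = event_s - window_s, event_s + window_s
    return [text for start, end, text in segments
            if start <= hi and end >= lo]        # interval overlap test

segments = [
    (1740.2, 1744.8, "He lines up the free kick..."),
    (1752.0, 1755.5, "It's in! A stunning goal!"),
    (1900.0, 1903.0, "Play restarts in midfield."),
]
print(segments_near_event(segments, event_s=1753.0))
# ['He lines up the free kick...', "It's in! A stunning goal!"]
```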

Key downstream applications enabled by the dataset include:

  • Enhanced Action Spotting: Integrates visual features, crowd-audio signals, and key commentary phrases (“goal!”, “foul”) to improve event precision and recall.
  • Automatic Highlight Generation: Highlights are triggered by spikes in commentator excitement or named entity mentions (e.g., player goals).
  • Dense Captioning and Summarization: Commentary text is leveraged for dense, human-like captions or game summaries, extending SoccerNet-Caption methodologies.
  • Tactical Analysis: Extraction of strategic insights (coach decisions, formations) from commentary for advanced performance evaluation.

Prior work (e.g., Vanderplaetse et al., CVPRW 2020) demonstrated 5–10 % mean average precision (mAP) improvements for action spotting by incorporating audio features, suggesting that comparable or greater gains may be attainable with SoccerNet-Echoes’s time-aligned ASR data.

5. Limitations, Improvements, and Future Research

SoccerNet-Echoes faces several inherent limitations:

  • Transcription Noise: Whisper’s WER of ≈ 0.44 (large-v1) means nearly half of reference words are misrecognized in fast-paced, noisy commentary.
  • Hallucinations: The model occasionally repeats or fabricates transcript segments when music or crowd noise dominates the audio.
  • Translation Issues: Batch-mode Google Translate misinterprets soccer-specific jargon.
  • Speaker Diarization Lacking: No systematic separation between main commentators, co-analysts, and crowd segments.
  • Limited Reference Data: Only 40 halves have human-verified transcripts; the remainder are automatically generated.

Potential avenues for improvement identified include:

  • Enhanced audio preprocessing (denoising, VAD).
  • Speaker diarization for role separation.
  • Whisper fine-tuning with soccer-specific corpora.
  • Subsets with crowdsourced annotation for targeted correction.

SoccerNet-Echoes stimulates new research directions:

  • Cross-modal Retrieval: Direct querying of video via textual input ("last-minute equalizer") with segment retrieval.
  • Live ASR Analytics: In-game ASR deployment for real-time pattern detection (e.g., momentum shifts).
  • Automated Commentary Generation: Training text-to-speech models on transcripts to re-narrate historical matches.
  • Multilingual Sentiment and Tactical Analysis: Comparative studies of commentary style and content across languages.

SoccerNet-Echoes offers an extensible foundation for advancing multimodal learning and sports analytics, with scope for methodological refinement and expanded utility across diverse research domains (Gautam et al., 12 May 2024).
