
Multi-View Foul Recognition in Soccer

Updated 23 February 2026
  • The paper introduces the VARS architecture that fuses multi-angle video inputs to jointly predict foul type and severity.
  • It employs spatio-temporal encoding with an MViT backbone and learnable attention pooling to overcome occlusions and limited views.
  • Empirical results show state-of-the-art performance on SoccerNet-MVFoul, achieving significant accuracy gains while ensuring rapid inference.

Multi-view foul recognition is the task of automatically classifying the type and severity of fouls in association football by exploiting temporally aligned video from multiple uncalibrated camera feeds. Leveraging synchronized footage—including main “live” views and close-up replay angles—enables the system to overcome limitations inherent to single-view analysis, such as restricted field of view, complex occlusions, and ambiguous contact events. Recent advances in this domain have culminated in neural architectures that jointly process spatio-temporal cues from several perspectives, producing fine-grained judgments of foul categories and sanction levels with the accuracy and speed necessary for scalable, semi-automated video assistant referee support (Held et al., 2024, Held et al., 2023).

1. Problem Formulation and Scope

Multi-view foul recognition addresses the automatic prediction of two primary outcomes for a soccer incident:

  • The fine-grained foul type, among eight predefined categories: StandingTackle, Tackle, HighLeg, Holding, Pushing, Elbowing, Challenge, Dive.
  • The offense and card-severity class, among four: No offense, Offense+No-Card, Offense+Yellow-Card, Offense+Red-Card.

Inputs consist of $n$ temporally aligned video clips $\{v_1, \dots, v_n\}$, each captured by a different broadcast camera. Clips are typically 1–5 seconds in length (e.g., 16 frames at 16 fps), spatially standardized (e.g., $224 \times 224$ resolution), and manually synchronized to share a common event contact frame. The goal is to fuse these multi-view streams such that both subtle and overt cues across views jointly inform the prediction, overcoming occlusions and ambiguous body configurations present in single-view analysis (Held et al., 2024, Held et al., 2023).
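The input layout above can be sketched in a few lines of numpy. `stack_views` is a hypothetical helper (not the authors' code) that stacks $n$ pre-synchronized, pre-standardized clips into a single $(n, T, H, W, 3)$ array:

```python
import numpy as np

def stack_views(clips, T=16, H=224, W=224):
    """Stack n per-view clips into shape (n, T, H, W, 3).

    Each clip is assumed to be already synchronized so that the foul's
    contact frame sits at the same temporal index in every view, and
    already resized to the standard H x W resolution.
    """
    views = []
    for clip in clips:
        assert clip.shape == (T, H, W, 3), "clips must be standardized first"
        views.append(clip.astype(np.float32) / 255.0)  # scale pixels to [0, 1]
    return np.stack(views, axis=0)

# Two synthetic "camera views" (live + one replay) of random pixels.
live = np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8)
replay = np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8)
batch = stack_views([live, replay])
print(batch.shape)  # (2, 16, 224, 224, 3)
```

A real pipeline would additionally handle the variable number of replays per incident (one to three in SoccerNet-MVFoul), e.g., by padding or by processing views independently before fusion.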

2. SoccerNet-MVFoul Dataset and Annotation Protocol

The SoccerNet-MVFoul dataset is the canonical benchmark for multi-view foul recognition. It contains 3,901 referee-whistled incidents from 500 professional matches (2014–2017), with an average of 2.29 views per incident (roughly 75% with one replay, 20% with two, 5% with three). For each incident:

  • 5 s “live” main broadcast clips and all available replays are extracted, temporally aligned by professional annotators to ensure the foul moment is synchronized across views.
  • A referee with 6 years’ experience (300+ matches) performs frame-level annotation, labeling 10 binary or multiclass attributes. These include the main foul type, existence and body location of contact, handball, attempt to play the ball, and five-level card severity (collapsed to four classes for modeling) (Held et al., 2023).

Class distributions are notably imbalanced: StandingTackle composes 43.6% of incidents, while Red card offenses constitute just 1.1%. The test split is approximately 20%, with standardized train/val/test partitioning inherited from SoccerNet protocols.
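The annotation protocol above can be made concrete with a small, illustrative record type. Field names and values here are hypothetical (they do not reflect the dataset's actual file layout or keys); the point is only to show the kind of per-incident attributes a loader would expose:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FoulIncident:
    """Illustrative record for one annotated incident (hypothetical schema)."""
    incident_id: str
    clip_paths: List[str]      # live view plus up to three aligned replays
    foul_type: str             # one of the eight foul categories
    severity: str              # one of the four collapsed offense/card classes
    contact: bool              # whether contact between players exists
    body_location: str         # where on the body the contact occurred
    handball: bool
    tried_to_play_ball: bool   # whether the offender attempted to play the ball

incident = FoulIncident(
    incident_id="match0001_foul17",
    clip_paths=["live.mp4", "replay1.mp4"],
    foul_type="StandingTackle",
    severity="Offense+Yellow-Card",
    contact=True,
    body_location="lower body",
    handball=False,
    tried_to_play_ball=True,
)
print(incident.foul_type)
```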

3. VARS Architecture: Spatio-Temporal Encoding and Multi-View Feature Aggregation

The VARS (Video Assistant Referee System) pipeline structures multi-view foul recognition into three sequential modules:

  1. Per-View Spatio-Temporal Encoder ($\mathbf{E}$):
    • Each input clip $v_i$ is processed by an MViT (Multiscale Vision Transformer) backbone pretrained on Kinetics.
    • $v_i \in \mathbb{R}^{T \times H \times W \times 3}$ ($T=16$, $H=W=224$), outputting $f_i \in \mathbb{R}^d$ capturing spatial and temporal context.
  2. Multi-View Fusion/Aggregation ($\mathbf{A}$):
    • Features $\mathbf{f} = [f_1; \dots; f_n] \in \mathbb{R}^{n \times d}$ are fused via learnable attention pooling.
    • Pairwise similarities are computed using a learned $W \in \mathbb{R}^{d \times d}$:

      $\mathbf{S} = \mathbf{f} W (\mathbf{f} W)^T$

    • Normalized attention scores $\mathbf{A} \in \mathbb{R}^n$ weight each viewpoint, yielding a single fused representation:

      $\mathbf{R} = \sum_{j=1}^n A_j f_j$

    • This enables the model to dynamically prioritize close-up replays or informative angles for each incident.
  3. Multi-Head Classification ($\mathbf{C}^\text{foul}$, $\mathbf{C}^\text{off}$):
    • Two parallel heads predict, respectively, the foul type ($\hat{y}^\text{foul} \in \Delta^8$) and offense severity ($\hat{y}^\text{off} \in \Delta^4$), each via a two-layer fully-connected network with softmax (Held et al., 2024).

Loss optimization uses an unweighted sum of cross-entropy terms for both tasks—$\mathcal{L} = \mathcal{L}^\text{foul} + \mathcal{L}^\text{off}$—trained end-to-end with Adam. Overfitting is observed after 7–8 epochs on current data volume.
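The attention-pooling step in module 2 can be sketched in numpy. One detail is underspecified above: how the $n \times n$ similarity matrix $\mathbf{S}$ collapses to the per-view weights $\mathbf{A}$. The sketch below assumes each view's weight is the softmax of its summed similarities to all views; that reduction is an assumption for illustration, not the paper's exact rule:

```python
import numpy as np

def attention_pool(f, W):
    """f: (n, d) per-view features; W: (d, d) learned matrix.

    Returns the fused representation R (d,) and attention weights A (n,).
    """
    proj = f @ W                       # project features: (n, d)
    S = proj @ proj.T                  # pairwise similarities S = fW (fW)^T: (n, n)
    scores = S.sum(axis=1)             # ASSUMPTION: aggregate relevance per view
    scores -= scores.max()             # numerical stability for softmax
    A = np.exp(scores) / np.exp(scores).sum()   # normalized attention weights
    R = (A[:, None] * f).sum(axis=0)   # weighted sum: R = sum_j A_j f_j
    return R, A

rng = np.random.default_rng(0)
n, d = 3, 8                            # three camera views, toy feature size
f = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
R, A = attention_pool(f, W)
print(A.sum())                         # weights form a distribution over views
print(R.shape)
```

In the full system, $W$ is learned end-to-end with the backbone and the two classification heads; since it is only a $d \times d$ matrix, it adds a negligible number of parameters relative to the MViT encoder, consistent with the +0.1% cost reported below.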

4. Evaluation Protocols and Empirical Benchmarks

Performance assessment primarily relies on top-1 classification accuracy and balanced accuracy (BA) as metrics. The latter mitigates class imbalance:

$\mathrm{BA} = \frac{1}{N} \sum_{i=1}^N \frac{\mathrm{TP}_i}{P_i}$

where $N$ is the number of classes, $\mathrm{TP}_i$ is the number of true positives, and $P_i$ is the number of positive samples for class $i$.
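Balanced accuracy as defined above is simply per-class recall ($\mathrm{TP}_i / P_i$) averaged over the $N$ classes, which prevents a majority-class predictor from scoring well. A minimal implementation:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall TP_i / P_i over the classes present in y_true."""
    recalls = []
    for c in np.unique(y_true):
        mask = (y_true == c)                   # the P_i positives of class c
        recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))

# Imbalanced toy labels: six samples of class 0, two of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0])   # always predicts the majority class
print(balanced_accuracy(y_true, y_pred))      # 0.5: recall 1.0 on class 0, 0.0 on class 1
```

Plain top-1 accuracy on the same toy example would be 0.75, illustrating why BA is the more informative metric on a dataset where StandingTackle alone accounts for 43.6% of incidents.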

Empirical results on the SoccerNet-MVFoul test set demonstrate:

  • MViT with attention pooling achieves 50%/41% (accuracy/BA) on foul type classification and 46%/34% (accuracy/BA) on severity classification, establishing a new state-of-the-art.
  • Comparative performance for common encoders and poolings is shown below:
Feature Encoder | Pooling   | Foul Type Acc / BA | Severity Acc / BA
ResNet          | Mean      | 30% / 27%          | 34% / 25%
R(2+1)D         | Max       | 34% / 33%          | 39% / 31%
MViT            | Attention | 50% / 41%          | 46% / 34%

Every additional camera view increases top-1 accuracy (e.g., Live+Replay1: 0.50/0.43, Live+R1+R2: 0.57/0.40 for foul type/severity in (Held et al., 2023)). Attention-based aggregation provides a +5 percentage point gain in foul-type accuracy over max pooling at negligible parameter cost (+0.1%).

5. Ablation Studies and Human Performance Benchmarking

Ablation studies reveal:

  • The choice of backbone model strongly influences classification accuracy: transitioning from ResNet to R(2+1)D to MViT provides +16–18 points for foul-type classification.
  • Max pooling slightly outperforms mean pooling, but attention-based aggregation further improves both tasks.
  • Multi-task training (joint foul and severity heads) provides a 0.05 absolute accuracy gain for severity.
  • Shorter, higher-fps clips (1 s @16 fps) concentrate relevant cues around the contact frame, outperforming longer (5 s) or slower-fps videos.
  • Performance plateaus after utilizing ~80% of available training data, with offense severity task learning more slowly due to higher visual variability.

Human benchmarks on a 77-clip subset show players (75%) and referees (70%) outperform VARS (60%) on foul type, while the gap narrows on severity (players 58%, referees 60%, VARS 51%). Decision latency differs by orders of magnitude: human participants require 38–41.5 s per clip, while VARS infers in 0.12 s (Held et al., 2024). Inter-rater agreement among humans is weak (κ ≈ 0.21) and consensus is rare, highlighting subjectivity in the ground-truth labels.

Qualitative analysis of the attention weights shows that the model prioritizes close-up replay views, in line with referee behavior (Held et al., 2024).

6. Limitations, Open Issues, and Future Directions

The current VARS implementation and the broader domain face several constraints:

  • Camera feeds are uncalibrated and synchronization is manual; no explicit geometric calibration or automatic temporal alignment is performed, precluding use of metric field measurements or real distances.
  • The dataset exhibits significant class imbalance, particularly in rare but critical classes such as Red-card incidents.
  • Fusion strategies are largely limited to pooling and attention-based aggregation; more advanced transformer-style architectures may further enhance cross-view temporal reasoning (Held et al., 2023).
  • Generalization to settings with fewer replay views or differing camera placements remains unproven.
  • Automated multi-view synchronization and more robust multi-instance representation learning are identified as priorities for future research.

7. Significance and Outlook

Multi-view foul recognition, as instantiated in the VARS system, demonstrates the feasibility of near-real-time, semi-automated foul and sanction classification leveraging synchronized multi-angle broadcast footage. Attention-based feature aggregation yields interpretable, informative weighting of camera views and provides a measurable improvement over prior pooling methods. While current systems lag behind top human referees in raw accuracy, they offer substantial gains in speed and consistency, providing actionable tools for supporting officiating in both professional and amateur football environments. Continued dataset expansion—especially for infrequent high-severity events—and advances in automatic camera calibration and transformer-based aggregation architectures offer promising directions for achieving parity with expert human judgment (Held et al., 2024, Held et al., 2023).
