RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

Published 17 Apr 2026 in cs.CV and cs.CL | (2604.15736v1)

Abstract: While Multimodal LLMs (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.


Authors (8)

Summary

The paper introduces a multi-sport benchmark for evaluating video MLLMs on rule-based officiating tasks across 11 sports.
It employs curated match segments and expert-authored QA pairs to assess foul detection, temporal localization, and entity recognition.
Results reveal that while models perform well in incident detection, they consistently struggle with sport-specific rule reasoning and temporal grounding.

RefereeBench: Evaluating Video MLLMs for Multi-Sport Automatic Refereeing

Introduction

The paper "RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees" (2604.15736) establishes a new evaluation paradigm for Multimodal LLMs (MLLMs) within the context of automatic sports refereeing. Unlike previous efforts predominantly focused on action recognition or isolated single-sport benchmarks, this work introduces the first large-scale, multi-sport benchmark tailored for systematic, rule-grounded officiating assessment. The benchmark spans 11 sports, comprising 925 curated match segments and 6,475 expert-authored multiple-choice QA pairs covering fine-grained officiating concepts, including foul detection, classification, rationale inference, entity perception, and temporal localization.

Figure 1: RefereeBench samples require fine-grained perception, temporal grounding, and rule-based decision-making with certified referee-curated annotations.

Benchmark Design and Dataset Properties

Construction Pipeline

RefereeBench is built in three stages: large-scale video collection, semantic localization of candidate incidents using ASR-based keyword identification, and multi-stage, double-blind human verification. Qualified national-level referees manually annotate structured event metadata and write the corresponding QA pairs, ensuring precise rule alignment and dependence on multimodal evidence. Each QA pair is peer-reviewed, and text-only bias is reduced through LLM ablation analyses.
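The ASR-based localization step can be pictured with a minimal sketch. It assumes a timestamped transcript is already available. The `AsrSegment` type, the keyword list, and the 5-second padding are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AsrSegment:
    """One timestamped transcript segment (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    text: str


# Hypothetical keyword list; the paper does not publish its keywords here.
FOUL_KEYWORDS = ("foul", "penalty", "yellow card", "red card")


def candidate_windows(segments, keywords=FOUL_KEYWORDS, pad=5.0):
    """Return merged (start, end) windows around segments that mention a keyword."""
    hits = []
    for seg in segments:
        lowered = seg.text.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((max(0.0, seg.start - pad), seg.end + pad))

    # Merge overlapping windows so each incident yields one candidate clip.
    hits.sort()
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


transcript = [
    AsrSegment(10, 12, "That's a clear foul"),
    AsrSegment(14, 16, "Yellow card shown"),
    AsrSegment(60, 62, "Great pass down the wing"),
]
windows = candidate_windows(transcript)
```

Windows found this way would only be candidates. In the pipeline described above, human referees still verify and annotate each clip.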

Dataset Statistics and Coverage

The benchmark encompasses 11 sports: soccer, basketball, volleyball, tennis, table tennis, badminton, handball, field hockey, ice hockey, water polo, and short-track speed skating. These cover 64 unique foul types and 34 distinct penalty classes, with substantial inter-sport heterogeneity in both match durations and event structure.

Figure 2: RefereeBench covers 11 sports, 64 foul types, and a spectrum of video lengths, supporting robust multi-sport evaluation.

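Each video is paired with seven hierarchical QA types: foul existence, foul classification, foul reasoning, penalty classification, penalty reasoning, entity identification, and temporal event localization (925 videos × 7 questions = 6,475 QA pairs). These seven types correspond to the five core abilities named in the abstract, with classification and reasoning each split into a foul and a penalty variant. The design supports a granular breakdown of model capabilities, from low-level perception to intricate, rule-based multimodal reasoning.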
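The QA structure can be sketched as a small schema. The field names, the `QAItem` record, and the example item are hypothetical. The grouping of the seven QA types into five abilities is inferred from the abstract rather than from a published data format.

```python
from dataclasses import dataclass
from enum import Enum


class QAType(Enum):
    FOUL_EXISTENCE = "foul_existence"
    FOUL_CLASSIFICATION = "foul_classification"
    FOUL_REASONING = "foul_reasoning"
    PENALTY_CLASSIFICATION = "penalty_classification"
    PENALTY_REASONING = "penalty_reasoning"
    ENTITY_IDENTIFICATION = "entity_identification"
    TEMPORAL_LOCALIZATION = "temporal_localization"


# Assumed grouping of the seven QA types into the five abilities of the abstract.
ABILITY_OF = {
    QAType.FOUL_EXISTENCE: "foul existence",
    QAType.FOUL_CLASSIFICATION: "foul and penalty classification",
    QAType.PENALTY_CLASSIFICATION: "foul and penalty classification",
    QAType.FOUL_REASONING: "foul and penalty reasoning",
    QAType.PENALTY_REASONING: "foul and penalty reasoning",
    QAType.ENTITY_IDENTIFICATION: "entity perception",
    QAType.TEMPORAL_LOCALIZATION: "temporal grounding",
}


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice question attached to a video (hypothetical fields)."""
    video_id: str
    qa_type: QAType
    question: str
    options: tuple
    answer: str  # letter of the correct option, e.g. "B"


NUM_VIDEOS = 925
TOTAL_QA_PAIRS = NUM_VIDEOS * len(QAType)  # one question of each type per video

# Made-up example item, for illustration only.
example = QAItem(
    video_id="clip_0001",
    qa_type=QAType.FOUL_EXISTENCE,
    question="Does a foul occur in this clip?",
    options=("Yes", "No"),
    answer="A",
)
```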

Evaluation Protocol and Model Benchmarks

A comprehensive suite of closed-source (e.g., GPT-5, Gemini-3, Claude-4.5, Doubao-Seed-1.8) and open-source (e.g., Qwen3-VL, InternVL3.5, VideoLLaMA3, LLaVA-Video) video MLLMs is benchmarked under controlled multiple-choice decoding. Video inputs are standardized to 720p and supplied either as sampled frames or as full video.
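For the frame-input variant, a harness needs to pick frames and pose each question as a lettered multiple-choice prompt. The sketch below shows one plausible way to do both. Uniform midpoint sampling, the function names, and the prompt wording are assumptions, not the paper's documented protocol.

```python
import string


def uniform_frame_indices(num_frames, num_samples):
    """Pick frame indices spread evenly over a clip (midpoint of each bin)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)  # never request more than exist
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def format_mcq_prompt(question, options):
    """Render a question and its options as a lettered multiple-choice prompt."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)


indices = uniform_frame_indices(num_frames=100, num_samples=4)
prompt = format_mcq_prompt(
    "Which infraction occurs in the clip?",  # made-up question for illustration
    ["Tripping", "Hooking", "No infraction"],
)
```

Constraining the model to answer with a single letter keeps scoring unambiguous, which is one common reason to evaluate under a multiple-choice format.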

Results: Analysis and Findings

Overall Model Performance

The leading closed-source MLLMs (Doubao-Seed-1.8, Gemini-3 Pro/Flash) obtain only 60–61% average accuracy across RefereeBench; GPT-5 lags at 54.8%, while the open-source SOTA, Qwen3-VL, achieves just 47.1%. The strongest models do best on incident existence (up to 92.5%) and entity perception (up to 79.2%); performance drops sharply on sport-specific rule reasoning (often near or below 50%) and on precise temporal grounding.
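Per-task figures and an overall average like those above can be computed from per-question results, as in the minimal sketch below. The record layout is hypothetical. The summary does not say whether the reported average is taken over tasks or over all questions, so the macro-average here is one possible reading.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """records: iterable of (task, predicted_letter, gold_letter) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def macro_average(per_task):
    """Unweighted mean of the per-task accuracies."""
    return sum(per_task.values()) / len(per_task)


# Made-up results, for illustration only.
records = [
    ("foul_existence", "A", "A"),
    ("foul_existence", "B", "A"),
    ("temporal_localization", "C", "C"),
]
per_task = accuracy_by_task(records)
overall = macro_average(per_task)
```

Because every video carries one question of each type, the per-task counts are equal and a macro-average over tasks would coincide with a plain average over all questions.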

Figure 3: Performance breakdown per sport shows high variance in accuracy and highlights uneven generalization and sport-dependent difficulty.

Performance varies widely across sports: models generalize better in sports like ice hockey and soccer but consistently fail in others, such as water polo and badminton. This exposes core limitations in both learned visual-semantic representations and rule application.

Case Studies: Domain-Specific Reasoning Challenges

Representative cases highlight the operational demands of rule-grounded decision-making: models must differentiate between visually similar but legally distinct contacts, correctly classify infraction types, map physical events to penalty logic, and localize the incident in time. For example, correctly adjudicating "Tripping" in ice hockey or distinguishing a "Service Foul" in table tennis requires sport-specific legal definitions tied to precise temporal cues.

Figure 4: In ice hockey, models must identify spatial interaction and infraction reasoning for 'Tripping'.

Figure 5: For soccer, mapping actions to 'Unsporting behaviour' and determining correct penalty signage is required.

The necessity for robust multimodal reasoning and precise temporal event grounding is consistent across all sports, as further illustrated:

Figure 6: Lane change infraction in skating
Figure 7: Service error in table tennis
Figure 8: Push foul in handball
Figure 9: Double bounce call in tennis
Figure 10: Intentional foul in field hockey
Figure 11: Net touch in volleyball
Figure 12: Blocking violation in basketball
Figure 13: Service fault in badminton
Figure 14: Ordinary foul in water polo

Detailed Error Analysis

Modality and Multimodal Reasoning

Incorporating audio yields consistent performance gains (up to 17 percentage points), underscoring the importance of audio cues (e.g., whistle signals, commentary) for robust decision-making. Nonetheless, even with audio, neither proprietary nor open-source MLLMs reach the reliability of human experts.

Rule Knowledge and RAG Integration

Results on text-only professional referee exams confirm that leading models possess partial rule knowledge (e.g., 69% for Gemini-3-Flash, 54% for Qwen3-VL) but exhibit pronounced deficits in the game management and execution dimensions. Naive retrieval-augmented generation (RAG) over official rulebooks fails to close the gap; in some cases performance deteriorates, because irrelevant or ambiguous retrieved text can anchor models to incorrect decisions. This highlights the need for deeper multi-hop, visually grounded rule comprehension.
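To make "naive RAG" concrete, the sketch below retrieves rulebook passages by simple word overlap with a description of the incident. The scoring rule and the passages are invented placeholders, and the paper's actual retrieval setup is not described in this summary.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, passages, k=1):
    """Return the k passages sharing the most word tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


# Placeholder passages, NOT quotations from any official rulebook.
rulebook = [
    "tripping an opponent with the stick is penalized",
    "the service must be struck from behind the end line",
]
top = retrieve("player falls after stick contact, possible tripping", rulebook)
```

A retriever this shallow matches surface words rather than the visual facts of the play, which gives some intuition for how loosely related passages could end up in the prompt and mislead the model.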

Bias and Robustness

Models display a systematic tendency to over-call fouls: misidentification rates on negative samples (i.e., clips showing legal play) range from roughly 37% to 63% in neutral settings and rise further under suggestive questioning. This sensitivity to adversarial or leading context points to insufficient robustness and calibration, a critical limitation for real-world officiating scenarios.
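The over-calling tendency is a false-positive rate restricted to no-foul clips. A minimal sketch of that measurement follows, with a hypothetical function name and made-up inputs.

```python
def over_call_rate(predicted_foul, is_foul):
    """Fraction of no-foul clips on which the model nevertheless called a foul.

    predicted_foul, is_foul: equal-length sequences of booleans.
    """
    on_negatives = [pred for pred, gold in zip(predicted_foul, is_foul) if not gold]
    if not on_negatives:
        raise ValueError("no negative (no-foul) clips to evaluate")
    return sum(on_negatives) / len(on_negatives)


# Made-up predictions: three clips show legal play, one shows a real foul.
rate = over_call_rate(
    predicted_foul=[True, False, True, True],
    is_foul=[False, False, True, False],
)
```

Running the same measurement once with a neutral prompt and once with a leading prompt would expose the sensitivity to suggestive questioning described above.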

Implications, Limitations, and Future Directions

Practical Consequences

Current MLLMs are not deployment-ready for trustworthy, automated refereeing in either professional or amateur sports settings. While some components—entity recognition, event spotting—approach functional adequacy, the central competencies of temporal localization and legal rule application remain underdeveloped and error-prone.

Theoretical Implications

The findings expose the boundaries of parametric multimodal learning vis-à-vis structured, domain-specific reasoning. Scaling model or data size alone does not resolve the disconnect between perceptual event understanding and actionable, rule-compliant decision-making. Progress will require architectures capable of tightly integrating physical evidence, temporally synchronized semantics, and explicit symbolic reasoning over sport rulebooks.

Future Research Directions

Promising research directions include:

Enhanced multimodal pretraining with explicit sports officiating supervision
Visual chain-of-thought approaches for transparent, stepwise event-to-rule reasoning
Agentic pipeline architectures decomposing the task into incident detection, candidate classification, and rule validation modules
Data augmentation spanning underrepresented sports, interaction patterns, and evidence types to promote generalization and debiasing
Reinforcement learning leveraging human-in-the-loop calibration for robustness against bias and adversarial context

Conclusion

RefereeBench establishes a rigorous, expert-verified testbed for evaluating the hierarchical reasoning capacities of MLLMs in multi-sport automatic refereeing. Present-generation models, despite demonstrating impressive general multimedia capabilities, fall short of the reliability threshold required for real-world officiating, especially in high-stakes or ambiguous scenarios. The benchmark provides a critical evaluation lens and a foundation for subsequent research into trustworthy, rule-grounded, and interpretable multimodal AI systems for complex real-world decision-making (2604.15736).
