CricBench: Cricket Analytics Benchmark Suite
- CricBench is a benchmark suite that evaluates language and vision models on cricket data using tailored tasks like Text-to-SQL, VQA, and commentary-to-table conversion.
- It focuses on domain logic including multi-table joins, temporal reasoning, and handling code-mixed queries to reflect authentic cricket analytics challenges.
- The suite integrates diverse evaluation protocols and metrics, ensuring rigorous testing of model robustness across SQL, visual, and dynamic commentary-based tasks.
CricBench is a suite of task-specific evaluation benchmarks designed to probe the reasoning, robustness, and cross-lingual competence of LLMs and vision-LLMs (VLMs) in the domain of cricket analytics. This suite targets the bottleneck where general NLP and vision-NLP approaches fail to capture the domain-specific logic, evolving schema, and high linguistic diversity inherent in cricket data and its analysis. CricBench includes the original CricBench Text-to-SQL benchmark (Devraj et al., 26 Dec 2025), the MMCricBench-3K VQA benchmark (Gautam et al., 24 Aug 2025), and is complemented by the CMT-Bench for text-to-table robustness analysis (Upadhyay et al., 20 Oct 2025). These benchmarks collectively establish a rigorous framework for evaluating how well LLMs and VLMs handle granular statistical queries, visual data, and dynamic state summarization grounded in real-world cricket contexts.
1. Motivation and Scope of CricBench
CricBench was developed to address deficiencies in existing NL-to-SQL and vision–language evaluation, particularly the lack of real-world, domain-specialized, and linguistically diverse challenge sets for sports analytics. With cricket commanding a global following of over 2.5 billion, the analytic queries required by both enthusiasts and professionals—such as cross-season performance trends, phased metrics (e.g., economy rate in “death overs”), and nuanced player comparisons—cannot be satisfied by generic benchmarks or web searches.
Prior benchmarks (WikiSQL, Spider, BIRD) are cross-domain and almost exclusively monolingual (English), and do not evaluate:
- Domain logic (e.g., handling “death overs,” franchise renaming, or debut filtering)
- Temporal reasoning (e.g., performance in last n matches, date ranges)
- Multilingual or code-mixed queries essential in cricket-rich regions (e.g., India)
CricBench remedies these gaps by supplying (i) an expertly-curated, high-complexity Text-to-SQL test set focused on cricket, (ii) a realistic, normalized database schema capturing the IPL ball-by-ball record, and (iii) platform-specific extensions testing visual, tabular, and commentary-driven analytics (Devraj et al., 26 Dec 2025, Gautam et al., 24 Aug 2025, Upadhyay et al., 20 Oct 2025).
2. Benchmark Architectures and Data Design
2.1 Text-to-SQL Benchmark
The original CricBench Text-to-SQL dataset consists of 1,169 IPL matches (2008–2024), modeled in a 5-table normalized SQLite schema:
| Table | Key Fields | Relationships |
|---|---|---|
| Matches | match_id (PK), match_date, venue, result, … | — |
| Deliveries | delivery_id (PK), match_id (FK), over_number, ball_number, runs_scored, … | match_id→Matches, bowler_id→Players |
| Players | player_id (PK), player_name, country, playing_role, … | — |
| PlayerInMatch | pim_id (PK), match_id (FK), player_id (FK), team_name, … | match_id→Matches, player_id→Players |
| FielderDismissals | fd_id (PK), delivery_id (FK), fielder_id (FK) | delivery_id→Deliveries, fielder_id→Players |
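A minimal SQLite sketch of this schema is given below. Field lists are abridged to the columns named above plus the scoring columns (extra_runs, wides, noballs) used by the example economy-rate query; any other detail of the full published schema is not reproduced here.

```python
import sqlite3

# Abridged sketch of the 5-table CricBench schema (not the full published DDL).
DDL = """
CREATE TABLE Matches (
    match_id   INTEGER PRIMARY KEY,
    match_date TEXT,
    venue      TEXT,
    result     TEXT
);
CREATE TABLE Players (
    player_id    INTEGER PRIMARY KEY,
    player_name  TEXT,
    country      TEXT,
    playing_role TEXT
);
CREATE TABLE Deliveries (
    delivery_id INTEGER PRIMARY KEY,
    match_id    INTEGER REFERENCES Matches(match_id),
    bowler_id   INTEGER REFERENCES Players(player_id),
    over_number INTEGER,
    ball_number INTEGER,
    runs_scored INTEGER,
    extra_runs  INTEGER DEFAULT 0,
    wides       INTEGER DEFAULT 0,
    noballs     INTEGER DEFAULT 0
);
CREATE TABLE PlayerInMatch (
    pim_id    INTEGER PRIMARY KEY,
    match_id  INTEGER REFERENCES Matches(match_id),
    player_id INTEGER REFERENCES Players(player_id),
    team_name TEXT
);
CREATE TABLE FielderDismissals (
    fd_id       INTEGER PRIMARY KEY,
    delivery_id INTEGER REFERENCES Deliveries(delivery_id),
    fielder_id  INTEGER REFERENCES Players(player_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))
```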
Complex queries (200 total) are authored and validated by cricket and SQL experts, with characteristics:
- 75% contain multi-table joins
- 42.5% utilize nested queries/CTEs
- 85% employ aggregation (GROUP BY/HAVING)
- 59.5% require temporal filtering
- 39.5% involve franchise normalization
- 21.5% compute derived metrics (e.g., economy rate)
Example SQL: Compute Jasprit Bumrah’s economy rate in death overs for Mumbai Indians:
```sql
SELECT ROUND(
         (SUM(d.runs_scored + d.extra_runs) * 6.0) /
         SUM(CASE WHEN d.wides = 0 AND d.noballs = 0 THEN 1 ELSE 0 END),
       2) AS Economy_Rate
FROM Deliveries d
JOIN PlayerInMatch pim
  ON d.bowler_id = pim.player_id
 AND pim.team_name = 'Mumbai Indians'
WHERE pim.player_id = (
    SELECT player_id FROM Players WHERE player_name = 'Jasprit Bumrah'
)
AND d.over_number BETWEEN 16 AND 20;
```
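As a cross-check on the derived metric, the same economy-rate formula (runs conceded × 6, divided by the number of legal, i.e. non-wide and non-no-ball, deliveries) can be reproduced in plain Python. The ball records below are invented for illustration:

```python
# Hypothetical death-over deliveries: (runs_scored, extra_runs, wides, noballs)
balls = [
    (1, 0, 0, 0), (0, 0, 0, 0), (4, 0, 0, 0),
    (0, 1, 1, 0),  # wide: concedes a run but is not a legal delivery
    (6, 0, 0, 0), (1, 0, 0, 0), (0, 0, 0, 0),
]

runs_conceded = sum(r + e for r, e, _, _ in balls)
legal_balls = sum(1 for _, _, w, nb in balls if w == 0 and nb == 0)

# Economy rate = runs conceded per over = runs * 6 / legal deliveries
economy = round(runs_conceded * 6.0 / legal_balls, 2)
print(economy)
```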
2.2 MMCricBench-3K Visual Question Answering
MMCricBench-3K extends CricBench to visual question answering over synthetic images of cricket scorecards from ODI, T20, and Test formats (Gautam et al., 24 Aug 2025). The dataset comprises:
- 1,463 PNG scorecard images rendered via HTML/CSS→PDF pipelines
- English (Latin script) and Hindi (Devanagari script) versions, each with 1,500 QA pairs
- Questions span direct retrieval (C1), basic arithmetic (C2), and multi-step/quantitative analysis (C3)
Key challenges include structure-aware OCR, arithmetic reasoning, and cross-script robustness. Answer types include binary, categorical, numeric, and open-ended string responses.
2.3 CMT-Bench for Text-to-Table Generation
CMT-Bench is built from live ball-by-ball commentary, requiring LLMs to dynamically generate two parallel tables (batsmen and bowlers) that evolve throughout the match (Upadhyay et al., 20 Oct 2025). Perturbation regimes are incorporated to audit model robustness:
- Extractive-cue ablation (removal of summary tuples)
- Temporal prefix shifts (varying context length via truncated ball sequences)
- Entity-form changes (anonymization, out-of-distribution substitution, entity entanglement)
The ground-truth tables are derived from 498 ODI and 116 T20 innings, yielding a corpus of 1,632 commentary samples and 3,264 generated tables.
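One of the entity-form perturbation regimes, anonymization, can be sketched as follows: player names in the commentary are replaced with neutral placeholders so that table generation cannot lean on memorized entity priors. The helper name, placeholder scheme, and commentary text below are invented for illustration.

```python
import re

def anonymize_entities(commentary: str, players: list) -> tuple:
    """Replace each player name with a stable placeholder (Player_1, ...)."""
    mapping = {}
    out = commentary
    for i, name in enumerate(players, start=1):
        placeholder = f"Player_{i}"
        mapping[name] = placeholder
        out = re.sub(re.escape(name), placeholder, out)
    return out, mapping

text = "Sharma drives Khan through the covers for four. Khan to Sharma, no run."
anon, mapping = anonymize_entities(text, ["Sharma", "Khan"])
print(anon)
```

Every occurrence of a name maps to the same placeholder, so the tabular structure of the task is preserved while the entity surface forms are removed.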
3. Evaluation Protocols and Metrics
Across the CricBench family, rigorous evaluation regimes are employed. For the Text-to-SQL suite (Devraj et al., 26 Dec 2025):
- Execution Accuracy (ExecAcc), assessed at two levels:
  - Schema Compliance: strict exact match at the output schema level (column names and order)
  - Data Match Accuracy (Match): row-level content comparison; numeric fields must agree within a ±1.0 tolerance
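These two checks can be sketched as below; beyond the stated column-name/order rule and the ±1.0 numeric tolerance, the cell-by-cell comparison logic is an assumption.

```python
TOLERANCE = 1.0  # numeric fields must agree within ±1.0

def schema_compliant(gold_cols, pred_cols):
    # Strict exact match on column names and their order
    return list(gold_cols) == list(pred_cols)

def rows_match(gold_rows, pred_rows, tol=TOLERANCE):
    """Row-level content check: numeric cells within tolerance, others exact."""
    if len(gold_rows) != len(pred_rows):
        return False
    for g_row, p_row in zip(gold_rows, pred_rows):
        for g, p in zip(g_row, p_row):
            if isinstance(g, (int, float)) and isinstance(p, (int, float)):
                if abs(g - p) > tol:
                    return False
            elif g != p:
                return False
    return True

gold = [("Jasprit Bumrah", 7.45)]
pred = [("Jasprit Bumrah", 7.9)]  # within ±1.0 of the gold value
print(schema_compliant(["Player", "Economy_Rate"], ["Player", "Economy_Rate"]),
      rows_match(gold, pred))
```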
Two evaluation modes are used:
- Raw Zero-shot: Model sees only schema and NL question
- Context-Aware: Model gets the output of a deterministic Complexity Router, which prepends a global preamble (imperative calculation rules) and injects dynamic context rules (e.g., team name mappings), plus a critical schema constraint
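The context-aware pipeline can be sketched as a deterministic routing step that assembles the final prompt. The preamble wording, alias table, and trigger logic below are illustrative assumptions, not the published Complexity Router.

```python
# Illustrative stand-ins; the actual preamble and rule set are not reproduced here.
GLOBAL_PREAMBLE = "Follow the calculation rules exactly; round rates to 2 decimals."
TEAM_ALIASES = {"Delhi Daredevils": "Delhi Capitals"}  # franchise-renaming example

def route(question: str, schema: str) -> str:
    """Deterministically assemble a context-aware prompt:
    global preamble + dynamically injected rules + schema + question."""
    rules = []
    for old, new in TEAM_ALIASES.items():
        if old in question or new in question:
            rules.append(f"Treat '{old}' and '{new}' as the same franchise.")
    context = "\n".join([GLOBAL_PREAMBLE, *rules])
    return f"{context}\n\nSchema:\n{schema}\n\nQuestion: {question}"

prompt = route("Total wins for Delhi Capitals since 2015?", "Matches(match_id, ...)")
print(prompt)
```

Because routing is deterministic, identical questions always receive identical injected context, which keeps the evaluation reproducible.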
For MMCricBench-3K (Gautam et al., 24 Aug 2025):
- Exact Match Accuracy (EM) over answer types, stratified by reasoning category
- Cross-lingual Gap: Difference in accuracy between English and Hindi subsets
For CMT-Bench (Upadhyay et al., 20 Oct 2025):
- Cell-level, row-level, and column-level accuracy post row alignment via the Hungarian algorithm
- Distributional analyses: numeric error statistics (z-scores) and energy-distance-based permutation testing under perturbations
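The align-then-score protocol can be sketched as below. For this toy example the optimal row assignment is found by brute force over permutations; at realistic table sizes the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment) computes the same assignment efficiently. The tables are invented and assumed to have equal row counts.

```python
from itertools import permutations

def cell_overlap(row_a, row_b):
    # Number of positionally matching cells between two rows
    return sum(a == b for a, b in zip(row_a, row_b))

def align_rows(gold, pred):
    """Best one-to-one row assignment maximizing total cell overlap.
    Brute force stands in for the Hungarian algorithm at this toy scale."""
    best, best_perm = -1, None
    for perm in permutations(range(len(pred))):
        score = sum(cell_overlap(gold[i], pred[j]) for i, j in enumerate(perm))
        if score > best:
            best, best_perm = score, perm
    return best_perm

def cell_accuracy(gold, pred):
    perm = align_rows(gold, pred)
    matched = sum(cell_overlap(gold[i], pred[j]) for i, j in enumerate(perm))
    return matched / sum(len(r) for r in gold)

gold = [("Batter A", 30, 22), ("Batter B", 4, 9)]
pred = [("Batter B", 4, 9), ("Batter A", 30, 21)]  # rows swapped, one cell off
print(cell_accuracy(gold, pred))
```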
4. Model Benchmarks and Findings
4.1 Text-to-SQL Evaluation
Six LLMs are evaluated in both raw zero-shot and context-aware settings (Devraj et al., 26 Dec 2025):
| Model | Context-Aware Match (%) |
|---|---|
| DeepSeek R1 | 50.6 |
| Claude 3.7 Sonnet | 47.7 |
| GPT-4o | 33.7 |
| Qwen 2.5 (7B) | 4.8 |
| Llama 3.1 (8B) | 2.8 |
| Gemma 2 (9B) | 3.9 |
- Context injection yields significant improvement; e.g., GPT-4o increases from 19.6% (raw) to 33.7% (context-aware).
- Open-weight DeepSeek R1 outperforms proprietary models.
- Absolute performance remains capped (~50%) on complex domain logic.
4.2 Domain Adaptation Gaps
Context-aware performance on CricBench is consistently lower than on prior general-domain tasks (BIRD), indicating a persistent adaptation gap, especially in open-source models:
| Model | BIRD Match (%) | CricBench Match (%) | Gap (pp) |
|---|---|---|---|
| DeepSeek R1 | 55.0 | 50.6 | -4.4 |
| Claude 3.7 Sonnet | 51.7 | 47.7 | -4.0 |
| GPT-4o | 55.4 | 33.7 | -21.7 |
| Qwen 2.5 | 42.4 | 4.8 | -37.6 |
| Gemma 2 | 38.3 | 3.9 | -34.4 |
| Llama 3.1 | 39.7 | 2.8 | -36.9 |
- The strongest reasoning models (DeepSeek R1, Claude 3.7 Sonnet) exhibit modest degradation (<5 pp); GPT-4o drops by over 20 pp, and the smaller open-weight models lose 34–38 pp.
4.3 Multilingual and Code-Mixed Input
Code-mixed Hindi prompts yield equal or superior accuracy to English, contradicting the assumption that English is optimal for specialized SQL tasks. For instance, DeepSeek R1 achieves 55.0% on Hindi inputs versus 51.9% on English. A plausible implication is that retaining technical terms in English provides schema-linkage anchors, while the Hindi grammatical frame reduces linguistic ambiguity (Devraj et al., 26 Dec 2025).
4.4 VQA and T2T Challenges
In MMCricBench-3K (Gautam et al., 24 Aug 2025), VLMs such as GPT-4o and Qwen2.5VL-7B attain, for single-image QA, 49.1–57.3% (English) and 42.6–45.1% (Hindi), with a drop of 4–12pp on Hindi. Performance degrades further for multi-image reasoning and on complex arithmetic/multi-step QA. Chain-of-thought prompting offers mixed improvements.
For CMT-Bench (Upadhyay et al., 20 Oct 2025), ablating extractive summary cues causes drastic accuracy drops (e.g., Gemini-2.5: batsman cell accuracy falls from 89% to 49%). Entity-form perturbations (notably entanglement) are the most damaging, causing 15–20 pp drops in cell-level accuracy.
5. Limitations, Extensions, and Future Directions
CricBench and related datasets are currently limited by:
- Coverage of the structured Text-to-SQL data is limited to the IPL; live-match queries, Test matches, and domestic T20 leagues are not yet included
- Multilingual extension is restricted to Hindi; broader support (e.g., Tamil, Bengali) is planned
- In MMCricBench-3K, only synthetic images are used; introduction of real-world scanned documents could increase utility
Potential avenues for expansion include:
- Dynamic entity resolution (venues, umpires), and more fine-grained symbolic reasoning
- Integration of retrieval-augmented modules for live data and further architectural modifications for better chain-of-thought and schema-aware inference
- Pretraining of VLMs on multi-script table images; hybrid symbolic–neural architectures for table extraction
CricBench establishes itself as a comprehensive benchmark for evaluating LLM/VLM capabilities in handling domain-specific, high-complexity, and multilingual cricket analytics. It exposes the limitations of general-domain models and provides a structured challenge suite for next-generation research in specialized knowledge interfaces (Devraj et al., 26 Dec 2025, Gautam et al., 24 Aug 2025, Upadhyay et al., 20 Oct 2025).
6. Context within the Broader Evaluation Ecosystem
CricBench complements, but sharply diverges from, existing benchmarks by prioritizing:
- Task realism (all queries grounded in authentic cricket use-cases and real-world schemas)
- Multilingual and code-mixed settings reflecting actual end-user behavior
- Rigorous context-aware evaluation and tailored perturbations (especially in T2T tasks, where entity and cue manipulations are key to robustness auditing)
As such, CricBench serves both as an analytical tool to uncover the “domain gap” in LLM/VLM deployment and a concrete testbed for method development in specialized, multilingual sports analytics and beyond.