CricBench: Cricket Analytics Benchmark Suite
- CricBench is a benchmark suite that evaluates language and vision models on cricket data using tailored tasks like Text-to-SQL, VQA, and commentary-to-table conversion.
- It focuses on domain logic including multi-table joins, temporal reasoning, and handling code-mixed queries to reflect authentic cricket analytics challenges.
- The suite integrates diverse evaluation protocols and metrics, ensuring rigorous testing of model robustness across SQL, visual, and dynamic commentary-based tasks.
CricBench is a suite of task-specific evaluation benchmarks designed to probe the reasoning, robustness, and cross-lingual competence of LLMs and vision-LLMs (VLMs) in the domain of cricket analytics. This suite targets the bottleneck where general NLP and vision-NLP approaches fail to capture the domain-specific logic, evolving schema, and high linguistic diversity inherent in cricket data and its analysis. CricBench includes the original CricBench Text-to-SQL benchmark (Devraj et al., 26 Dec 2025), the MMCricBench-3K VQA benchmark (Gautam et al., 24 Aug 2025), and is complemented by the CMT-Bench for text-to-table robustness analysis (Upadhyay et al., 20 Oct 2025). These benchmarks collectively establish a rigorous framework for evaluating how well LLMs and VLMs handle granular statistical queries, visual data, and dynamic state summarization grounded in real-world cricket contexts.
1. Motivation and Scope of CricBench
CricBench was developed to address deficiencies in existing NL-to-SQL and vision–language evaluation, particularly the lack of real-world, domain-specialized, and linguistically diverse challenge sets for sports analytics. With cricket commanding a global following of over 2.5 billion, the analytic queries required by both enthusiasts and professionals—such as cross-season performance trends, phased metrics (e.g., economy rate in “death overs”), and nuanced player comparisons—cannot be satisfied by generic benchmarks or web searches.
Prior benchmarks (WikiSQL, Spider, BIRD) are cross-domain and almost exclusively monolingual (English), and do not evaluate:
- Domain logic (e.g., handling “death overs,” franchise renaming, or debut filtering)
- Temporal reasoning (e.g., performance in last n matches, date ranges)
- Multilingual or code-mixed queries essential in cricket-rich regions (e.g., India)
CricBench remedies these gaps by supplying (i) an expertly-curated, high-complexity Text-to-SQL test set focused on cricket, (ii) a realistic, normalized database schema capturing the IPL ball-by-ball record, and (iii) platform-specific extensions testing visual, tabular, and commentary-driven analytics (Devraj et al., 26 Dec 2025, Gautam et al., 24 Aug 2025, Upadhyay et al., 20 Oct 2025).
2. Benchmark Architectures and Data Design
2.1 Text-to-SQL Benchmark
The original CricBench Text-to-SQL dataset consists of 1,169 IPL matches (2008–2024), modeled in a 5-table normalized SQLite schema:
| Table | Key Fields | Relationships |
|---|---|---|
| Matches | match_id (PK), match_date, venue, result, … | — |
| Deliveries | delivery_id (PK), match_id (FK), over_number, ball_number, runs_scored, … | match_id→Matches, bowler_id→Players |
| Players | player_id (PK), player_name, country, playing_role, … | — |
| PlayerInMatch | pim_id (PK), match_id (FK), player_id (FK), team_name, … | match_id→Matches, player_id→Players |
| FielderDismissals | fd_id (PK), delivery_id (FK), fielder_id (FK) | delivery_id→Deliveries, fielder_id→Players |
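A minimal SQLite sketch of this schema is given below. Field lists are abridged to the columns named above plus the scoring columns (extra_runs, wides, noballs) used by the example economy-rate query; any other detail of the full published schema is not reproduced here.

```python
import sqlite3

# Abridged sketch of the 5-table CricBench schema (not the full published DDL).
DDL = """
CREATE TABLE Matches (
    match_id   INTEGER PRIMARY KEY,
    match_date TEXT,
    venue      TEXT,
    result     TEXT
);
CREATE TABLE Players (
    player_id    INTEGER PRIMARY KEY,
    player_name  TEXT,
    country      TEXT,
    playing_role TEXT
);
CREATE TABLE Deliveries (
    delivery_id INTEGER PRIMARY KEY,
    match_id    INTEGER REFERENCES Matches(match_id),
    bowler_id   INTEGER REFERENCES Players(player_id),
    over_number INTEGER,
    ball_number INTEGER,
    runs_scored INTEGER,
    extra_runs  INTEGER DEFAULT 0,
    wides       INTEGER DEFAULT 0,
    noballs     INTEGER DEFAULT 0
);
CREATE TABLE PlayerInMatch (
    pim_id    INTEGER PRIMARY KEY,
    match_id  INTEGER REFERENCES Matches(match_id),
    player_id INTEGER REFERENCES Players(player_id),
    team_name TEXT
);
CREATE TABLE FielderDismissals (
    fd_id       INTEGER PRIMARY KEY,
    delivery_id INTEGER REFERENCES Deliveries(delivery_id),
    fielder_id  INTEGER REFERENCES Players(player_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(tables))
```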
Complex queries (200 total) are authored and validated by cricket and SQL experts, with characteristics:
- 75% contain multi-table joins
- 42.5% utilize nested queries/CTEs
- 85% employ aggregation (GROUP BY/HAVING)
- 59.5% require temporal filtering
- 39.5% involve franchise normalization
- 21.5% compute derived metrics (e.g., economy rate)
Example SQL: Compute Jasprit Bumrah’s economy rate in death overs for Mumbai Indians:
```sql
SELECT ROUND(
         (SUM(d.runs_scored + d.extra_runs) * 6.0) /
         SUM(CASE WHEN d.wides = 0 AND d.noballs = 0 THEN 1 ELSE 0 END),
       2) AS Economy_Rate
FROM Deliveries d
JOIN PlayerInMatch pim
  ON d.bowler_id = pim.player_id
 AND pim.team_name = 'Mumbai Indians'
WHERE pim.player_id = (
    SELECT player_id FROM Players WHERE player_name = 'Jasprit Bumrah'
)
AND d.over_number BETWEEN 16 AND 20;
```
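As a cross-check on the derived metric, the same economy-rate formula (runs conceded × 6, divided by the number of legal, i.e. non-wide and non-no-ball, deliveries) can be reproduced in plain Python. The ball records below are invented for illustration:

```python
# Hypothetical death-over deliveries: (runs_scored, extra_runs, wides, noballs)
balls = [
    (1, 0, 0, 0), (0, 0, 0, 0), (4, 0, 0, 0),
    (0, 1, 1, 0),  # wide: concedes a run but is not a legal delivery
    (6, 0, 0, 0), (1, 0, 0, 0), (0, 0, 0, 0),
]

runs_conceded = sum(r + e for r, e, _, _ in balls)
legal_balls = sum(1 for _, _, w, nb in balls if w == 0 and nb == 0)

# Economy rate = runs conceded per over = runs * 6 / legal deliveries
economy = round(runs_conceded * 6.0 / legal_balls, 2)
print(economy)
```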
2.2 MMCricBench-3K Visual Question Answering
MMCricBench-3K extends CricBench to visual question answering over synthetic images of cricket scorecards from ODI, T20, and Test formats (Gautam et al., 24 Aug 2025). The dataset comprises:
- 1,463 PNG scorecard images rendered via HTML/CSS→PDF pipelines
- English (Latin script) and Hindi (Devanagari script) versions, each with 1,500 QA pairs
- Questions span direct retrieval (C1), basic arithmetic (C2), and multi-step/quantitative analysis (C3)
Key challenges include structure-aware OCR, arithmetic reasoning, and cross-script robustness. Answer types include binary, categorical, numeric, and open-ended string responses.
2.3 CMT-Bench for Text-to-Table Generation
CMT-Bench is built from live ball-by-ball commentary, requiring LLMs to dynamically generate two parallel tables (batsmen and bowlers) that evolve throughout the match (Upadhyay et al., 20 Oct 2025). Perturbation regimes are incorporated to audit model robustness:
- Extractive-cue ablation (removal of summary tuples)
- Temporal prefix shifts (varying context length via truncated ball sequences)
- Entity-form changes (anonymization, out-of-distribution substitution, entity entanglement)
The ground-truth tables are derived from 498 ODI and 116 T20 innings, yielding a corpus of 1,632 commentary samples and 3,264 generated tables.
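One of the entity-form perturbation regimes, anonymization, can be sketched as follows: player names in the commentary are replaced with neutral placeholders so that table generation cannot lean on memorized entity priors. The helper name, placeholder scheme, and commentary text below are invented for illustration.

```python
import re

def anonymize_entities(commentary: str, players: list) -> tuple:
    """Replace each player name with a stable placeholder (Player_1, ...)."""
    mapping = {}
    out = commentary
    for i, name in enumerate(players, start=1):
        placeholder = f"Player_{i}"
        mapping[name] = placeholder
        out = re.sub(re.escape(name), placeholder, out)
    return out, mapping

text = "Sharma drives Khan through the covers for four. Khan to Sharma, no run."
anon, mapping = anonymize_entities(text, ["Sharma", "Khan"])
print(anon)
```

Every occurrence of a name maps to the same placeholder, so the tabular structure of the task is preserved while the entity surface forms are removed.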
3. Evaluation Protocols and Metrics
Across the CricBench family, rigorous evaluation regimes are employed. For the Text-to-SQL suite (Devraj et al., 26 Dec 2025):
- Execution Accuracy (ExecAcc), assessed at two levels:
  - Schema Compliance: strict exact match at the output schema level (column names and order)
  - Data Match Accuracy (Match): row-level content comparison; numeric fields must agree within a ±1.0 tolerance
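These two checks can be sketched as below; beyond the stated column-name/order rule and the ±1.0 numeric tolerance, the cell-by-cell comparison logic is an assumption.

```python
TOLERANCE = 1.0  # numeric fields must agree within ±1.0

def schema_compliant(gold_cols, pred_cols):
    # Strict exact match on column names and their order
    return list(gold_cols) == list(pred_cols)

def rows_match(gold_rows, pred_rows, tol=TOLERANCE):
    """Row-level content check: numeric cells within tolerance, others exact."""
    if len(gold_rows) != len(pred_rows):
        return False
    for g_row, p_row in zip(gold_rows, pred_rows):
        for g, p in zip(g_row, p_row):
            if isinstance(g, (int, float)) and isinstance(p, (int, float)):
                if abs(g - p) > tol:
                    return False
            elif g != p:
                return False
    return True

gold = [("Jasprit Bumrah", 7.45)]
pred = [("Jasprit Bumrah", 7.9)]  # within ±1.0 of the gold value
print(schema_compliant(["Player", "Economy_Rate"], ["Player", "Economy_Rate"]),
      rows_match(gold, pred))
```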
Two evaluation modes are used:
- Raw Zero-shot: Model sees only schema and NL question
- Context-Aware: Model gets the output of a deterministic Complexity Router, which prepends a global preamble (imperative calculation rules) and injects dynamic context rules (e.g., team name mappings), plus a critical schema constraint
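The context-aware pipeline can be sketched as a deterministic routing step that assembles the final prompt. The preamble wording, alias table, and trigger logic below are illustrative assumptions, not the published Complexity Router.

```python
# Illustrative stand-ins; the actual preamble and rule set are not reproduced here.
GLOBAL_PREAMBLE = "Follow the calculation rules exactly; round rates to 2 decimals."
TEAM_ALIASES = {"Delhi Daredevils": "Delhi Capitals"}  # franchise-renaming example

def route(question: str, schema: str) -> str:
    """Deterministically assemble a context-aware prompt:
    global preamble + dynamically injected rules + schema + question."""
    rules = []
    for old, new in TEAM_ALIASES.items():
        if old in question or new in question:
            rules.append(f"Treat '{old}' and '{new}' as the same franchise.")
    context = "\n".join([GLOBAL_PREAMBLE, *rules])
    return f"{context}\n\nSchema:\n{schema}\n\nQuestion: {question}"

prompt = route("Total wins for Delhi Capitals since 2015?", "Matches(match_id, ...)")
print(prompt)
```

Because routing is deterministic, identical questions always receive identical injected context, which keeps the evaluation reproducible.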
For MMCricBench-3K (Gautam et al., 24 Aug 2025):
- Exact Match Accuracy (EM) over answer types, stratified by reasoning category
- Cross-lingual Gap: Difference in accuracy between English and Hindi subsets
For CMT-Bench (Upadhyay et al., 20 Oct 2025):
- Cell-level, row-level, and column-level accuracy post row alignment via the Hungarian algorithm
- Distributional analyses: numeric error statistics (z-scores) and energy-distance-based permutation testing under perturbations
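The align-then-score protocol can be sketched as below. For this toy example the optimal row assignment is found by brute force over permutations; at realistic table sizes the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment) computes the same assignment efficiently. The tables are invented and assumed to have equal row counts.

```python
from itertools import permutations

def cell_overlap(row_a, row_b):
    # Number of positionally matching cells between two rows
    return sum(a == b for a, b in zip(row_a, row_b))

def align_rows(gold, pred):
    """Best one-to-one row assignment maximizing total cell overlap.
    Brute force stands in for the Hungarian algorithm at this toy scale."""
    best, best_perm = -1, None
    for perm in permutations(range(len(pred))):
        score = sum(cell_overlap(gold[i], pred[j]) for i, j in enumerate(perm))
        if score > best:
            best, best_perm = score, perm
    return best_perm

def cell_accuracy(gold, pred):
    perm = align_rows(gold, pred)
    matched = sum(cell_overlap(gold[i], pred[j]) for i, j in enumerate(perm))
    return matched / sum(len(r) for r in gold)

gold = [("Batter A", 30, 22), ("Batter B", 4, 9)]
pred = [("Batter B", 4, 9), ("Batter A", 30, 21)]  # rows swapped, one cell off
print(cell_accuracy(gold, pred))
```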
4. Model Benchmarks and Findings
4.1 Text-to-SQL Evaluation
Six LLMs are evaluated in both raw zero-shot and context-aware settings (Devraj et al., 26 Dec 2025):
| Model | Context-Aware Match (%) |
|---|---|
| DeepSeek R1 | 50.6 |
| Claude 3.7 Sonnet | 47.7 |
| GPT-4o | 33.7 |
| Qwen 2.5 (7B) | 4.8 |
| Llama 3.1 (8B) | 2.8 |
| Gemma 2 (9B) | 3.9 |
- Context injection yields significant improvement; e.g., GPT-4o increases from 19.6% (raw) to 33.7% (context-aware).
- Open-weight DeepSeek R1 outperforms proprietary models.
- Absolute performance remains capped (~50%) on complex domain logic.
4.2 Domain Adaptation Gaps
Context-aware performance on CricBench is consistently lower than on prior general-domain tasks (BIRD), indicating a persistent adaptation gap, especially in open-source models:
| Model | BIRD Match (%) | CricBench Match (%) | Gap (pp) |
|---|---|---|---|
| DeepSeek R1 | 55.0 | 50.6 | -4.4 |
| Claude 3.7 Sonnet | 51.7 | 47.7 | -4.0 |
| GPT-4o | 55.4 | 33.7 | -21.7 |
| Qwen 2.5 | 42.4 | 4.8 | -37.6 |
| Gemma 2 | 38.3 | 3.9 | -34.4 |
| Llama 3.1 | 39.7 | 2.8 | -36.9 |
- The strongest reasoning models (DeepSeek R1, Claude 3.7 Sonnet) exhibit modest degradation (<5 pp); GPT-4o drops by over 20 pp, and the smaller open-weight models lose 34–38 pp.
4.3 Multilingual and Code-Mixed Input
Code-mixed Hindi prompts yield equal or superior accuracy to English, contradicting the assumption that English is optimal for specialized SQL tasks. For instance, DeepSeek R1 achieves 55.0% on Hindi inputs versus 51.9% on English. A plausible implication is that retaining technical terms in English provides schema-linkage anchors, while the Hindi grammatical frame reduces linguistic ambiguity (Devraj et al., 26 Dec 2025).
4.4 VQA and T2T Challenges
In MMCricBench-3K (Gautam et al., 24 Aug 2025), VLMs such as GPT-4o and Qwen2.5VL-7B attain, for single-image QA, 49.1–57.3% (English) and 42.6–45.1% (Hindi), with a drop of 4–12pp on Hindi. Performance degrades further for multi-image reasoning and on complex arithmetic/multi-step QA. Chain-of-thought prompting offers mixed improvements.
For CMT-Bench (Upadhyay et al., 20 Oct 2025), ablating extractive summary cues causes drastic accuracy drops (e.g., Gemini-2.5: batsman cell accuracy falls from 89% to 49%). Entity-form perturbations (notably entanglement) are the most damaging, causing 15–20 pp drops in cell-level accuracy.
5. Limitations, Extensions, and Future Directions
CricBench and related datasets are currently limited by:
- Coverage of the structured Text-to-SQL data is limited to the IPL; live-match queries, Test matches, and domestic T20 leagues are not yet included
- Multilingual extension is restricted to Hindi; broader support (e.g., Tamil, Bengali) is planned
- In MMCricBench-3K, only synthetic images are used; introduction of real-world scanned documents could increase utility
Potential avenues for expansion include:
- Dynamic entity resolution (venues, umpires), and more fine-grained symbolic reasoning
- Integration of retrieval-augmented modules for live data and further architectural modifications for better chain-of-thought and schema-aware inference
- Pretraining of VLMs on multi-script table images; hybrid symbolic–neural architectures for table extraction
CricBench establishes itself as a comprehensive benchmark for evaluating LLM/VLM capabilities in handling domain-specific, high-complexity, and multilingual cricket analytics. It exposes the limitations of general-domain models and provides a structured challenge suite for next-generation research in specialized knowledge interfaces (Devraj et al., 26 Dec 2025, Gautam et al., 24 Aug 2025, Upadhyay et al., 20 Oct 2025).
6. Context within the Broader Evaluation Ecosystem
CricBench complements, but sharply diverges from, existing benchmarks by prioritizing:
- Task realism (all queries grounded in authentic cricket use-cases and real-world schemas)
- Multilingual and code-mixed settings reflecting actual end-user behavior
- Rigorous context-aware evaluation and tailored perturbations (especially in T2T tasks, where entity and cue manipulations are key to robustness auditing)
As such, CricBench serves both as an analytical tool to uncover the “domain gap” in LLM/VLM deployment and a concrete testbed for method development in specialized, multilingual sports analytics and beyond.