GRAB Dataset: Graph Reasoning Benchmark

Updated 30 June 2025
  • GRAB dataset is a synthetic benchmark focused on evaluating graph analytical reasoning and quantitative skills of large multimodal models.
  • It comprises 2170 controlled questions over diverse graph properties, including intercepts, gradients, and trigonometric parameters.
  • Benchmarking with GRAB reveals significant challenges for current models, highlighting gaps in algebraic and analytical comprehension.

GRAB Dataset

The GRAB (GRaph Analysis Benchmark) dataset is a synthetic benchmark specifically designed to evaluate the capabilities of large multimodal models (LMMs) in analytical graph reasoning. Unlike common visual question answering (VQA) or optical character recognition (OCR) benchmarks, GRAB targets the ability of LMMs to perform quantitative and algebraic reasoning over graphs and figures, posing questions that current models cannot solve reliably. The benchmark is positioned as a next-generation testbed for understanding and driving progress in LMMs’ mathematical and graphical analysis skills.

1. Structure and Content

GRAB comprises 2170 carefully constructed questions over a suite of synthetic graphs, each generated for controlled complexity and diversity:

  • Question Types: All questions are single-answer, open-ended (not multiple-choice, except in ablations), and require the model to supply the exact numerical answer with specified precision (integer, one decimal, nearest 10, etc.).
  • Core Tasks: Four principal task types are present:

    1. Properties: Analysis of individual properties from a single function or data series, such as intercepts, gradients, or area bounded.
    2. Functions: Estimation of the mean value of a property across multiple plotted functions (up to 10 per graph).
    3. Series: Calculating the mean of a property over multiple data series (up to 10 per graph).
    4. Transforms: Determining the value of a property after a sequence of up to 10 mathematical graph transformations (e.g., rotations, shifts, scaling, reflection).
  • Property Coverage: 23 analytical properties across categories such as intercepts, gradients, extrema, area under curve, trigonometric parameters, number of series, correlation coefficients, measures of spread (mean, variance, range), and function equation extraction.

  • Image Generation and Control: All graphs and questions are generated synthetically using Matplotlib, ensuring full control over distribution of correct answers, precision demands, and visual complexity.
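
As a rough illustration of how such a generator might work, the sketch below produces a single "Properties"-style question about the gradient of a line, with the exact answer fixed before the figure is drawn. This is a minimal sketch under assumed conventions, not the authors' actual pipeline; the function name, prompt wording, and styling are hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt


def make_gradient_question(seed: int, out_path: str = "question.png"):
    """Generate one synthetic gradient question with a known exact answer."""
    rng = np.random.default_rng(seed)    # controlled seed -> reproducible figure
    m = round(rng.uniform(-5, 5), 1)     # ground-truth gradient, one decimal place
    c = round(rng.uniform(-10, 10), 1)   # ground-truth y-intercept

    x = np.linspace(-10, 10, 200)
    y = m * x + c

    fig, ax = plt.subplots()
    ax.plot(x, y)
    ax.axhline(0, linewidth=0.5)
    ax.axvline(0, linewidth=0.5)
    fig.savefig(out_path)
    plt.close(fig)

    question = ("What is the gradient of the function in the image? "
                "Give your answer to 1 decimal place.")
    return question, f"{m:.1f}"          # answer stored to the required precision


question, answer = make_gradient_question(seed=0)
```

Because the target value is sampled before plotting, the answer is exact by construction, which is what enables the noise-free, precision-controlled grading described in the benchmarking protocol below.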

A table summarizing covered categories and selected properties:

Category            | Properties (examples)
Intercepts          | x-intercept, y-intercept
Gradients/Slopes    | Gradient, stationary points
Trigonometric       | Amplitude, period, vertical shift
Correlation         | Pearson, Spearman, Kendall correlations
Area                | Area under the curve, net area bounded
Spread/Location     | Mean, median, variance, interquartile range
Counting            | Number of points, number of series
Function Extraction | Function equation (parameters)

OCR-style questions (e.g., reading axis labels, legends) are excluded from the main GRAB set, as modern LMMs already achieve near-perfect accuracy on them.

2. Benchmarking Protocol and Models

GRAB is intended as a fair, reproducible benchmark:

  • Model Coverage: 20 LMMs are benchmarked, including closed-source models (e.g., GPT-4, Claude, Gemini) and open-source models (e.g., OmniLMM, CogVLM, LLaVA). For all models, instruction-tuned or chat versions are preferred to maximize adherence to task instructions.
  • Inference Setup: For closed models, APIs (Vertex AI, OpenAI, Reka) are used. For open models, local execution is performed via HuggingFace or OpenCompass.
  • Decoding: Greedy, deterministic decoding (temperature=0, top-k=1) is enforced for full comparability and reproducibility.
  • Prompt Design: All models are prompted with a minimal instruction: “Only provide the answer, no reasoning steps. If you are unsure, still provide an answer. Answer:”—ensuring the response is concise and precisely formatted.
  • Scoring: Exact-match (character-wise, whitespace stripped) grading is applied. Any deviation in number, units, or format counts as 0. No partial credit is given. Alternate scorers (LLM-based answer extractors) are tested in ablations and yield similar results.
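
A minimal sketch of such an exact-match scorer is shown below; the instruction string is quoted from the benchmark description, while the function name and example answers are illustrative assumptions rather than the official evaluation code.

```python
INSTRUCTION = ("Only provide the answer, no reasoning steps. "
               "If you are unsure, still provide an answer. Answer:")


def exact_match(prediction: str, target: str) -> int:
    """Character-wise comparison after stripping surrounding whitespace.

    Any deviation in value, units, or formatting scores 0; no partial credit.
    """
    return int(prediction.strip() == target.strip())


# A correct value in the wrong format still scores 0.
assert exact_match(" 3.5 ", "3.5") == 1
assert exact_match("3.50", "3.5") == 0
assert exact_match("The answer is 3.5", "3.5") == 0
```

This strictness is deliberate: it rewards models that both compute the right quantity and follow the formatting instruction exactly.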

3. Evaluation Results and Analysis

GRAB is found to be an extremely difficult benchmark for existing LMMs:

  • Best Model Performance: Claude 3.5 Sonnet achieves 21.7% overall exact-match accuracy across all questions—the highest among all evaluated models.
  • Task and Category Analysis: Properties tasks are the easiest (e.g., 41.8% accuracy for Claude 3.5 Sonnet), while Functions, Series, and especially Transforms are substantially harder, with all models falling below 15% on most metrics. No model was able to solve function parameter extraction questions correctly.
  • Open- vs Closed-Source Gap: The best open-source model, TransCore-M, scores less than 10%, reflecting a substantial capability gap.

A synthesized excerpt from the reported results table:

Model              | Overall (%) | Properties | Functions | Series | Transforms
Claude 3.5 Sonnet  | 21.7        | 41.8       | 15.5      | 11.0   | 10.0
Gemini 1.5 Pro     | 18.1        | 34.2       | 11.4      | 13.3   | 6.5
GPT-4o             | 13.6        | 24.7       | 10.8      | 9.2    | 3.5
TransCore-M (open) | 7.6         | 10.2       | 9.2       | 8.4    | 4.8

Ablation Findings

  • Instruction Adherence: Models failing to follow “succinct answer only” instructions are heavily penalized by the exact-match protocol, highlighting a nontrivial area of improvement for LMMs beyond analytical reasoning.
  • Question Complexity Sensitivity: Model performance degrades as the number of functions/series or required transformations increases within each question (see the sketch after this list).
  • Multiple Choice Variant: When GRAB is rendered as 5-way adversarial multiple choice, accuracies for most models increase, but only to 30% for the best—still far from human proficiency.
  • OCR-Style Baseline: All top LMMs attain near-perfect accuracy on legend, axis, and simple label reading, confirming these are trivial tasks for current architectures.
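
As a hedged sketch of how the complexity-sensitivity analysis could be reproduced, the snippet below bins exact-match scores by a per-question complexity count (number of functions, series, or transformations). The record layout is a hypothetical assumption, not the benchmark's actual data format.

```python
from collections import defaultdict


def accuracy_by_complexity(records):
    """Average exact-match score per complexity level.

    Each record is assumed to look like {"complexity": int, "score": 0 or 1},
    where 'complexity' counts the functions, series, or transforms involved.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[record["complexity"]].append(record["score"])
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


# Toy illustration: accuracy typically falls as complexity rises.
records = [
    {"complexity": 1, "score": 1}, {"complexity": 1, "score": 1},
    {"complexity": 5, "score": 0}, {"complexity": 5, "score": 1},
    {"complexity": 10, "score": 0},
]
print(accuracy_by_complexity(records))  # {1: 1.0, 5: 0.5, 10: 0.0}
```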

4. Design Principles and Schematic Details

GRAB's methodology embodies several principled design criteria:

  • Synthetic, Configurable Complexity: All question generation is conducted programmatically (with Matplotlib and controlled random seeds), allowing precise control over answer distributions, visual features, and analytic challenge level.
  • Noise-Free Targets: Answers are evaluable to full precision, avoiding human label noise or ambiguity.
  • High Headroom: No model surpasses 22% accuracy, indicating that GRAB is unsolved and leaves substantial room for research progress in graphical, algebraic, and instruction-following capabilities.
  • Diagnostic Value: The breadth of analysis (per-task, per-property, per-complexity) enables identification of distinct model failure modes, such as lack of systematic reasoning, struggles with trigonometric/parameter inference, or multi-step functional transforms.

Example formula types found in GRAB tasks:

  • Gradient computation: m = \frac{y_2 - y_1}{x_2 - x_1}
  • Mean property over N series: \text{Mean} = \frac{1}{N} \sum_{i=1}^{N} p_i
  • Variance: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
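
For concreteness, each of these quantities can be computed directly from sampled values with NumPy; the following is an illustrative sketch, with function and variable names chosen here rather than taken from the benchmark code.

```python
import numpy as np


def gradient(x1, y1, x2, y2):
    """Gradient of a straight line through two points: (y2 - y1) / (x2 - x1)."""
    return (y2 - y1) / (x2 - x1)


def mean_property(p):
    """Mean of a property p_i measured over N functions or series."""
    return float(np.mean(p))


def variance(x):
    """Population variance of values x_i around their mean mu."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 2))


print(gradient(0.0, 1.0, 2.0, 5.0))    # 2.0
print(mean_property([1.0, 2.0, 3.0]))  # 2.0
print(variance([1.0, 2.0, 3.0]))       # 0.666...
```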

5. Implications and Community Impact

GRAB is intended as a long-lasting diagnostic for the field:

  • Driving Model Development: By highlighting severe performance bottlenecks, GRAB is likely to shape LMM research agendas toward improved mathematical reasoning, data visualization literacy, and fine-grained instruction compliance.
  • Benchmark Longevity and Difficulty Ceiling: Unlike benchmarks such as MMLU or Visual Question Answering, on which current LMMs already approach or exceed human-level performance, GRAB presents unresolved analytical challenges.
  • Open Release and Reproducibility: GRAB’s generation code, questions, and evaluation protocol are released to enable transparent result reporting and iterative progress tracking.

6. Comparative Perspective

GRAB departs from existing figure-based and visual QA datasets in several key respects:

Aspect                     | GRAB                            | Prior Visual Tasks
Synthetic/Image Generation | Matplotlib, fully controlled    | Mostly real-world or mixed
Targeted Reasoning         | Analytical, quantitative        | Caption/OCR/scene understanding
Answer Types               | Exact numeric answers (strict)  | Open-ended, multiple-choice, short text
Model Performance Headroom | <22% for the top-scoring model  | 80–95% for many leading tasks

This framework is designed to provide a clear, incremental measure of progress as LMMs improve over the coming years.

7. Future Directions and Extensions

The GRAB benchmark authors highlight potential for further synthetic task development, such as:

  • Extending to semantic graph understanding, causal inference, or multi-figure reasoning.
  • Automatic curriculum construction to facilitate adaptation and learning by LMMs.
  • Integration into larger multitask evaluation suites for AI intended to operate in scientific or quantitative domains.

The dataset is positioned as a frontier benchmark and is actively released to catalyze new research directions in this area.