GSM-Symbolic: A Rigorous LLM Math Benchmark

Updated 13 November 2025
  • The GSM-Symbolic benchmark is a synthetic evaluation framework using parameterized symbolic templates to generate diverse arithmetic problems for assessing LLM mathematical reasoning.
  • It employs a formalized template generation protocol with controlled instantiations and constraint-based sampling to ensure high diversity and robustness testing.
  • The benchmark reveals model vulnerabilities by quantifying accuracy drops from clause modifications and distractor insertions, guiding improvements in LLM design.

GSM-Symbolic is a synthetically generated benchmark designed to provide a more rigorous and interpretable evaluation of the mathematical reasoning abilities of LLMs by extending the popular GSM8K grade-school arithmetic benchmark. Unlike GSM8K, which consists of a fixed set of problems with single canonical instantiations, GSM-Symbolic leverages parameterized symbolic templates to generate families of problems with controlled variation and diversity. This construction enables detailed analysis of model robustness to minor perturbations and isolates the effects of surface complexity and distractors on LLM mathematical reasoning.

1. Formal Characterization of Symbolic Templates

GSM-Symbolic employs a formalized template-based method to generate question–answer pairs. Each question is specified through a template $T : X \rightarrow Q$, where $X = \{x_1, \ldots, x_n\}$ denotes symbolic placeholders for variables (names, object types, numeric parameters), and $Q$ is a well-formed natural language question. Instantiation occurs via a function $f_T : X \rightarrow \mathbb{R}^n$, which samples concrete values $\mathbf{x} = (x_1, \ldots, x_n)$ from prescribed domains according to problem-specific constraints (e.g., $x_1 + x_2 + x_3 + \mathrm{ans} = \text{total}$).

Each instantiated question $Q = T(\mathbf{x})$ maps deterministically to a ground-truth answer $\mathrm{ans} = g_T(\mathbf{x})$ as computed by the symbolic logic embedded in the template. For each template, $N$ valid instantiations are generated, resulting in $N$ unique question–answer pairs per template, yielding a comprehensive corpus.
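
As a concrete illustration, the following is a minimal Python sketch of this template formalism; the story, placeholder domains, and answer function below are hypothetical stand-ins, not templates from the released corpus:

```python
import random

# Hypothetical template T with placeholders X = {name, x, y, z}; the answer
# function g_T encodes the symbolic logic embedded in the template.
TEMPLATE = ("{name} has {x} marbles, wins {y} more in a game, and gives {z} "
            "to a friend. How many marbles does {name} have left?")

NAMES = ["Ava", "Liam", "Sofia"]                       # lexical placeholder domain
DOMAINS = {"x": range(5, 60), "y": range(2, 30), "z": range(1, 20)}

def g_T(x, y, z):
    """Ground-truth answer computed from the instantiated variables."""
    return x + y - z

def instantiate(rng=random):
    """Sample one valid instantiation; reject draws that violate constraints."""
    while True:
        v = {k: rng.choice(list(dom)) for k, dom in DOMAINS.items()}
        ans = g_T(**v)
        if ans > 0:                                    # constraint: positive answer
            return TEMPLATE.format(name=rng.choice(NAMES), **v), ans
```

Drawing $N$ valid samples from such a sampler yields the per-template family of question–answer pairs described above.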

2. Template Generation and Corpus Construction

The GSM-Symbolic corpus is constructed through the following protocol:

  • Seed selection: 100 diverse problems sampled from the GSM8K test set, representing basic arithmetic and multi-step composition.
  • Manual annotation: Each seed is converted into a symbolic template $T(X)$, with all concrete terms replaced by symbolic placeholders. All required logical and arithmetic constraints (e.g., sum constraints, divisibility) are codified.
  • Domain specification and sampling: For each variable $x_j$, sampling ranges or sets are specified. 50 valid instantiations per template are drawn, with rejection sampling ensuring constraint satisfaction.
  • Diversity and integrity checks: Automated checks ensure value leakage does not occur, instantiated answers match original seeds, and syntactic validity is preserved; human reviewers spot-check 10 examples for each template.
  • Final data composition: The dataset consists of 100 templates $\times$ 50 instantiations, amounting to 5,000 unique problem instances.

This procedure ensures high coverage of the arithmetic operation space and systematic control over question structure, facilitating precise robustness analyses.
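
Continuing the sketch above (the build_corpus helper and its leakage check are simplified assumptions, not the paper's actual tooling), the rejection-sampling loop with basic integrity checks could be expressed as:

```python
def build_corpus(templates, n_per_template=50):
    """templates: list of zero-argument samplers returning (question, answer).
    Mirrors the 100-template x 50-instantiation protocol (5,000 instances)."""
    corpus = []
    for template_id, sample in enumerate(templates):
        seen = set()
        while len(seen) < n_per_template:
            question, answer = sample()
            if question in seen:             # diversity: no duplicate surface forms
                continue
            if str(answer) in question:      # crude guard against answer leakage
                continue
            seen.add(question)
            corpus.append({"template_id": template_id,
                           "question": question,
                           "answer": answer})
    return corpus
```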

3. Comparison to GSM8K and Sibling Benchmarks

GSM8K consists of approximately 8,000 fixed grade-school problems, each with canonical values, restricting its ability to probe model sensitivity to superficial changes. Major limitations include:

  • Lack of variability: Each question has a single canonical solution, concealing model brittleness and weak generalization.
  • Potential contamination: Since GSM8K’s test set is fixed, instances may overlap or leak into LLM training corpora.
  • No post-hoc difficulty modulation: problem complexity cannot be adjusted after release.

GSM-Symbolic directly addresses these drawbacks:

  • Controllable numeric variation: Models are assessed on performance distributions for many instantiations per template, revealing variance and weak generalization.
  • Clause manipulation for complexity gradients: Each template is extended with Minus-1 (clause removed), Plus-1/Plus-2 (one or two clauses added), and NoOp distractor variants (irrelevant, non-operational information), precisely quantifying the impact of surface complexity and distractors (see the illustrative example after this list).
  • Isolation of memorization: The evaluation can differentiate memorization from robust deductive reasoning by observing performance as surface form and complexity vary systematically.
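
To make the variant taxonomy concrete, consider a hypothetical instance and its manipulations (illustrative only, not drawn from the released templates):

```python
# Symbolic baseline: two reasoning steps (3 * 5 = 15; 12 + 15 = 27).
SYMBOLIC = ("Liam has 12 marbles. He buys 3 bags with 5 marbles each. "
            "How many marbles does he have now?")                     # answer: 27

# Minus-1: one clause removed, reducing the problem to a single step.
MINUS_1 = ("Liam has 12 marbles. He buys 15 more marbles. "
           "How many marbles does he have now?")                      # answer: 27

# Plus-1: one clause added, requiring an extra reasoning step.
PLUS_1 = ("Liam has 12 marbles. He buys 3 bags with 5 marbles each, "
          "then gives 4 marbles to his sister. "
          "How many marbles does he have now?")                       # answer: 23

# NoOp: an irrelevant but superficially plausible clause; answer unchanged.
NO_OP = ("Liam has 12 marbles. He buys 3 bags with 5 marbles each. "
         "Two of the marbles are slightly smaller than the rest. "
         "How many marbles does he have now?")                        # answer: 27
```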

4. Evaluation Protocol and Metrics

Model evaluation adopts a standardized approach:

  • Prompting: 8-shot chain-of-thought (CoT) prompting, with greedy decoding.
  • Primary metric: Strict accuracy, $\mathrm{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]$, where $\hat{y}_i$ is the model's answer for instance $i$.
  • Secondary analyses:
    • Variance of accuracy across instantiations for each template.
    • Mean accuracy drop as a function of added or removed clauses.
    • Catastrophic error rate on NoOp variants (insertion of irrelevant, but superficially plausible, clauses).

These metrics explicitly quantify (1) sensitivity to numeric or lexical perturbations, (2) fragility under increased surface complexity, and (3) model susceptibility to distractor content.
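
A minimal sketch of how these metrics could be computed from per-instance results (the record layout and variant labels are assumptions for illustration):

```python
from collections import defaultdict
from statistics import mean, pstdev

def summarize(results):
    """results: iterable of (template_id, variant, correct) triples,
    e.g. (17, "noop", False). Returns per-variant mean accuracy and the
    spread of per-template accuracies."""
    per_variant = defaultdict(lambda: defaultdict(list))
    for template_id, variant, correct in results:
        per_variant[variant][template_id].append(int(correct))
    summary = {}
    for variant, per_template in per_variant.items():
        accs = [mean(vals) for vals in per_template.values()]  # per-template accuracy
        summary[variant] = {"mean": mean(accs), "stdev": pstdev(accs)}
    return summary

# Accuracy drop from distractor insertion, in percentage points:
# drop_pp = 100 * (summary["symbolic"]["mean"] - summary["noop"]["mean"])
```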

5. Key Empirical Findings on LLM Behavior

GSM-Symbolic’s design enabled several principal findings:

  • Numeric and lexical sensitivity: Varying only named entities (GSM-Symbolic-Names) yields mild accuracy variance (±2–4 pp), but changing numeric values causes mean accuracy reductions of 5–15 pp for open-weight models, with variance expanding to ±3–6 pp. Closed models (GPT-4o, o1-preview) exhibit smaller but still consistent drops (≈1–4 pp).
  • Irrelevant clause (NoOp) vulnerability: Insertion of a single non-operational clause can cause average accuracy to collapse by up to 65 pp on certain models. Even with in-context demonstrations instructing the model to ignore such clauses, models do not recover their performance, suggesting deep pattern-matching rather than logical filtering.
  • Clause count and complexity: As the number of clauses increases from Minus-1 (one clause removed) through the Symbolic baseline to Plus-1 and Plus-2, accuracy declines monotonically, with the rate of decline accelerating, and variance rises accordingly. This exposes a combinatorial explosion of potential pattern matches under even modest complexity increases.
  • Pattern-matching hypothesis: Observed behaviors suggest that LLMs do not engage in formal symbolic reasoning, but instead rely on sophisticated retrieval and pattern recombination learned from training traces. Even small numerical or surface changes can shift the model’s output towards alternative memorized solution scripts, reflecting strong token bias and absence of true clause relevancy assessment.

6. Implications for Benchmark Development and Model Assessment

The GSM-Symbolic paper yields concrete recommendations for future benchmarking and model evaluation strategies:

  • Use of parameterized templates: Incorporate controllable template families to probe model sensitivity under distributional shifts and systematically increase complexity.
  • Performance as a distribution: Report distributions and variances of model accuracy across multiple instantiations rather than single aggregate figures, to expose vulnerabilities to superficial variation.
  • Inclusion of distractor clauses and difficulty gradients: Integrate benign but plausible distractors and complexity ramps, to distinguish between rote memorization and genuine mathematical reasoning.
  • Assessment of generalization: Employ the full spectrum of clause counts, manipulations, and perturbations to expose patterns of exponential error compounding in multi-step arithmetic.

Overall, GSM-Symbolic establishes a fine-grained, highly controlled framework for diagnosing the limitations of current LLMs in mathematical reasoning. Despite strong headline accuracies on GSM8K, state-of-the-art models display fragility even under minor perturbations, significant susceptibility to surface-level distractions, and exponential accuracy decay with increased logical complexity. This benchmark redefines the measurement landscape for mathematical reasoning in LLMs and offers a robust platform for evaluating future advances in both model architecture and training methodology (Mirzadeh et al., 7 Oct 2024).

References

  • Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., and Farajtabar, M. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv:2410.05229.
