GSM-DC: Math Reasoning Under Distraction

Updated 9 July 2025
  • Grade School Math with Distracting Context (GSM-DC) is a benchmark that challenges models to isolate essential arithmetic cues while ignoring irrelevant details.
  • It leverages symbolic directed acyclic graphs and the InjectDistractors algorithm to precisely embed and control contextual noise in mathematical problems.
  • Empirical results reveal that increased distractor intensity degrades step and path accuracy, highlighting the need for advanced training and inference strategies.

Grade School Math with Distracting Context (GSM-DC) refers to the systematic study and evaluation of mathematical reasoning abilities—particularly of LLMs—when exposed to grade-school-level arithmetic problems augmented with irrelevant or distracting information. GSM-DC challenges both computational models and, by analogy, human learners to disregard extraneous details and focus on extracting and manipulating only the essential numerical and logical content required to arrive at the correct answer. Recent research has formalized GSM-DC through synthetic benchmarks with precise control over distractor injection, offering a reproducible and rigorous platform for quantifying—and ultimately improving—reasoning robustness in the presence of noise (2505.18761).

1. Benchmark Definition and Construction

GSM-DC is constructed as a controlled synthetic benchmark where each mathematical problem is formalized as a symbolic directed acyclic graph (DAG). Within this framework:

  • Nodes represent intermediate quantities, sometimes corresponding directly to quantities referenced in natural language.
  • Edges denote problem dependencies—such as arithmetic operations or logical relationships—required to advance toward a solution.
  • Solution path ($\mathcal{P}$) is a unique, topologically sorted traversal of the necessary nodes and operations needed to correctly solve the problem, as defined by the "clean" dependency graph $\mathcal{G}$ (2505.18761).
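
Assuming a simple adjacency-set representation, the following minimal sketch shows how such a symbolic DAG and its solution path could be modeled; the class and field names are illustrative stand-ins, not the paper's reference implementation.

```python
# Minimal sketch of a symbolic problem DAG and its solution path.
# Names and structures are illustrative, not the GSM-DC reference code.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ProblemDAG:
    # edges[node] = set of nodes that `node` depends on (its operands)
    edges: dict[str, set[str]] = field(default_factory=dict)
    # ops[node] = arithmetic expression that produces `node` from its operands
    ops: dict[str, str] = field(default_factory=dict)

    def solution_path(self, answer: str) -> list[str]:
        """Topologically sorted traversal of the nodes needed for `answer`."""
        # Collect the ancestors of the answer node (the "clean" subgraph).
        needed, stack = set(), [answer]
        while stack:
            node = stack.pop()
            if node in needed:
                continue
            needed.add(node)
            stack.extend(self.edges.get(node, set()))
        sub = {n: self.edges.get(n, set()) & needed for n in needed}
        return list(TopologicalSorter(sub).static_order())


# Example: apples = 3, oranges = 2 * apples, total = apples + oranges
dag = ProblemDAG(
    edges={"apples": set(), "oranges": {"apples"}, "total": {"apples", "oranges"}},
    ops={"oranges": "2 * apples", "total": "apples + oranges"},
)
print(dag.solution_path("total"))  # ['apples', 'oranges', 'total']
```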

To simulate distracting context, GSM-DC employs an InjectDistractors algorithm:

  • Unused nodes in $\mathcal{G}$ (those not traversed by the solution path) are converted into distractor nodes.
  • These distractors are then integrated into an augmented graph $\mathcal{G}'$, preserving acyclicity and introducing irrelevant cues without altering the core reasoning required for the answer.
  • The resulting problem is “realized” as a narrative math word problem where the chain-of-thought solution ($\mathcal{S}$) may be intentionally obscured by extraneous detail.

The number and intensity of distractors, controlled by the parameter $m$, are adjustable, allowing for “Light-IC”, “Medium-IC”, and “Hard-IC” variants that systematically scale the reasoning challenge and noise level (2505.18761).
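
The sketch below illustrates the injection step in the spirit of InjectDistractors as described above: unused nodes are attached to earlier solution-path nodes with forward-only edges, with the count controlled by $m$. The function name, sampling choices, and difficulty counts are assumptions for illustration, not the paper's actual algorithm.

```python
# Illustrative sketch of forward-only distractor injection: the clean solution
# path is preserved and no cycles are introduced. Parameters are assumed.
import random


def inject_distractors(
    edges: dict[str, set[str]],
    solution_path: list[str],
    m: int,
    seed: int = 0,
) -> dict[str, set[str]]:
    """Return an augmented dependency graph G' with up to m distractor nodes."""
    rng = random.Random(seed)
    on_path = set(solution_path)
    unused = [n for n in edges if n not in on_path]
    augmented = {n: set(deps) for n, deps in edges.items()}

    for node in rng.sample(unused, min(m, len(unused))):
        # Forward-only connection: a distractor may depend only on nodes that
        # already appear on the solution path, so solution nodes gain no new
        # dependencies, the graph stays acyclic, and the path is untouched.
        cutoff = rng.randrange(1, len(solution_path) + 1)
        parents = set(rng.sample(solution_path[:cutoff], k=1))
        augmented[node] = augmented.get(node, set()) | parents
    return augmented


# Difficulty variants scale the number of injected distractors (counts assumed).
DIFFICULTY = {"Light-IC": 2, "Medium-IC": 4, "Hard-IC": 8}
```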

2. Methodologies for Distractor Injection

The GSM-DC framework enables fine-grained, reproducible injection of irrelevant context by:

  • Selecting "unused" nodes: Nodes not on the unique solution path are sampled in batches.
  • Forward-only edge connection: Ensures that injected distractors do not create cycles or compromise the validity of the original solution path.
  • Flexible distractor quantity: The parameter $m$ modulates the number of distractors, yielding a family of datasets where the complexity of separating relevant from irrelevant information can be precisely adjusted (2505.18761).
  • Language realization: The final problem statement and its solution are automatically generated from the (augmented) symbolic graph using predefined templates, providing systematic coverage over a vast problem space.

This approach guarantees that the core arithmetic reasoning remains unchanged while introducing contextually plausible but operationally irrelevant additions.
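To make the realization step concrete, here is a toy sketch of template-based generation; the templates, entity names, and node schema are invented for illustration and are not the actual GSM-DC generation templates.

```python
# Toy realization step: rendering symbolic nodes (relevant and distractor alike)
# as a narrative word problem via fixed templates. Templates are invented.
TEMPLATES = {
    "leaf": "{name} has {value} {unit}.",
    "op": "The number of {unit} {name} has equals {expr}.",
    "question": "How many {unit} does {name} have?",
}


def realize(nodes: list[dict], question_node: dict) -> str:
    """Render each node as one templated sentence, then append the question."""
    sentences = []
    for node in nodes:
        key = "leaf" if "value" in node else "op"
        sentences.append(TEMPLATES[key].format(**node))
    sentences.append(TEMPLATES["question"].format(**question_node))
    return " ".join(sentences)


problem = realize(
    nodes=[
        {"name": "Ava", "value": 3, "unit": "apples"},
        {"name": "Ben", "unit": "apples", "expr": "twice what Ava has"},
        # A distractor node realized with the same templates:
        {"name": "Cara", "value": 7, "unit": "pears"},
    ],
    question_node={"name": "Ben", "unit": "apples"},
)
print(problem)
```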

3. Effects on Reasoning and Model Robustness

Empirical evaluations using GSM-DC demonstrate that introducing irrelevant context leads to notable degradation in LLM performance:

  • Step Accuracy (SAcc): The proportion of correctly completed intermediate reasoning steps drops as the number of distractors increases, even when final answer extraction from the model’s output remains relatively high.
  • Path Accuracy (PAcc): The likelihood of following the correct reasoning path decreases with more distractors, indicating that models often stray from the core sequence of operations when context becomes noisier (2505.18761).
  • Error scaling: The observed error rate follows a power-law trend (a toy illustration appears in the sketch after this list):

$$E(m; rs) \propto m^{\delta(rs)},$$

where $rs$ is the number of reasoning steps and $\delta(rs)$ increases with reasoning depth. This reveals that deeper reasoning chains are disproportionately more susceptible to irrelevant context (2505.18761).

  • Generality: These vulnerabilities persist across both in-distribution (test-like) and out-of-distribution (OOD) evaluations, especially when models have not been exposed to strong distractors during training.
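
Assuming reasoning paths are represented as lists of step strings, the sketch below shows how step accuracy, path accuracy, and a log-log estimate of the exponent $\delta(rs)$ might be computed; the metric forms paraphrase the descriptions above, and the example numbers are made up purely for illustration.

```python
# Sketch of step accuracy (SAcc), path accuracy (PAcc), and a log-log fit of
# the power-law exponent delta. All example values below are illustrative.
import math


def step_accuracy(pred_steps: list[list[str]], gold_steps: list[list[str]]) -> float:
    """Fraction of intermediate reasoning steps matching the gold steps."""
    correct = total = 0
    for pred, gold in zip(pred_steps, gold_steps):
        total += len(gold)  # missing steps count as incorrect
        correct += sum(p == g for p, g in zip(pred, gold))
    return correct / total


def path_accuracy(pred_steps: list[list[str]], gold_steps: list[list[str]]) -> float:
    """Fraction of problems whose entire reasoning path matches the gold path."""
    return sum(p == g for p, g in zip(pred_steps, gold_steps)) / len(gold_steps)


def fit_delta(ms: list[int], errors: list[float]) -> float:
    """Least-squares slope of log(error) vs. log(m), i.e. the exponent delta."""
    xs, ys = [math.log(m) for m in ms], [math.log(e) for e in errors]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


# Illustrative only: error grows with distractor count m roughly as m**delta.
print(fit_delta([1, 2, 4, 8], [0.05, 0.09, 0.17, 0.33]))  # ~0.9
```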

4. Mitigation Strategies: Training and Inference-time Enhancements

To counteract the detrimental effects of distracting context, researchers have proposed several strategies within GSM-DC:

  • Pretraining and Finetuning with Irrelevant Context: Training models on data with systematically injected strong distractors (especially "Hard-IC") leads to improved robustness in both in-distribution and OOD scenarios. Such exposure enables the model to better learn to filter out extraneous details (2505.18761).
  • Process-guided Inference: A novel inference-time method based on a stepwise Tree of Thoughts (ToT) search, guided by a learned Process Reward Model (PRM), is introduced in GSM-DC. In this framework:
    • Candidate reasoning paths are expanded in a beam search.
    • At each step, the PRM evaluates partial solutions $h_{1:t}$ and selects those most likely to conform to correct dependency structures and arithmetic operations.
    • This approach significantly boosts OOD step accuracy (up to 6.29 percentage points in experiments), without harming standard accuracy (2505.18761).
    • The process is formalized as sequential expansion and scoring, where the highest-reward partial paths are prioritized for further completion; a minimal sketch of this reward-guided search appears below.
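
The following is a minimal sketch of PRM-guided stepwise beam search in this style; `expand_step` and `prm_score` are hypothetical stand-ins for the model's next-step sampler and the learned Process Reward Model, and the stopping convention (an "ANSWER" step) is assumed for illustration.

```python
# Minimal sketch of PRM-guided stepwise beam search. `expand_step` (proposes
# candidate next steps) and `prm_score` (scores a partial solution h_{1:t})
# are stand-ins for the LLM sampler and the learned Process Reward Model.
from typing import Callable


def prm_beam_search(
    problem: str,
    expand_step: Callable[[str, list[str]], list[str]],
    prm_score: Callable[[str, list[str]], float],
    beam_width: int = 4,
    max_steps: int = 10,
) -> list[str]:
    """Expand partial reasoning paths stepwise, keeping the top-scoring beams."""
    beams: list[list[str]] = [[]]  # each beam is a partial path h_{1:t}
    for _ in range(max_steps):
        candidates = []
        for path in beams:
            if path and path[-1].startswith("ANSWER"):
                candidates.append(path)  # finished paths carry over unchanged
                continue
            for step in expand_step(problem, path):
                candidates.append(path + [step])
        # Keep the beam_width partial paths the PRM rates as most promising.
        candidates.sort(key=lambda p: prm_score(problem, p), reverse=True)
        beams = candidates[:beam_width]
        if all(p and p[-1].startswith("ANSWER") for p in beams):
            break
    return beams[0]
```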

5. Comparative Insights, Evaluation Metrics, and Broader Context

The GSM-DC benchmark enables rigorous, reproducible comparison across models and training schemes by:

  • Quantifying the impact of distractors: It distinguishes between mere answer extraction and true stepwise reasoning accuracy, revealing cases where a model may “guess” answers correctly without traversing the correct solution path.
  • Evaluating generalization: Robustness to distracting context is shown to be essential for downstream applicability of LLMs to real-world educational scenarios, where problems are rarely presented in ideal form.
  • Informing error modeling: The power-law formula for error trends provides a concise metric for evaluating the sensitivity of any reasoning architecture to context complexity.

In the wider context of LLM research, GSM-DC is directly related to findings from benchmarks like GSM-IC (2302.00093) and GSM-Plus (2402.19255), which demonstrate that even minor contextual perturbations—such as irrelevant sentences or altered question targets—can lead to precipitous drops in problem-solving accuracy. Other works (e.g., GSM-Symbolic (2410.05229), CMATH (2306.16636)) reinforce that robustness to distraction is lacking in most contemporary models unless specifically targeted by training or advanced inference methods.

6. Implications and Outlook

GSM-DC establishes that overcoming distractibility is a nontrivial challenge for LLM mathematical reasoning:

  • Model Training: Exposure to high-intensity noisy contexts during training (for example, by generating and finetuning on IC-rich curricula) is critical for advancing LLM robustness.
  • Inference Strategies: Beam search or tree-based integration of process reward models can help models “stay on track” during solution generation, especially under OOD distractor conditions.
  • Evaluation Paradigms: Future benchmarks and curriculum design for LLMs should move beyond simple answer-based metrics, incorporating explicit step and path accuracy, as well as fine-grained error decomposition.
  • Generalization: Experimental evidence suggests that combining hard distractor training and reward-guided search equips models not only to resist known types of contextual noise but also to generalize more robustly to unanticipated types of distraction.

In summary, GSM-DC provides a framework and methodology for systematically quantifying and improving the robustness of mathematical reasoning—particularly in grade school arithmetic—under the pervasive challenge of distracting context. By enabling precise control over both reasoning complexity and distractor injection, and by demonstrating the effectiveness of advanced inference mechanisms, GSM-DC lays the groundwork for the development and evaluation of models capable of resilient, context-insensitive reasoning in mathematical domains (2505.18761).