How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark (2505.18761v1)

Published 24 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce Grade School Math with Distracting Context (GSM-DC), a synthetic benchmark to evaluate large language models' (LLMs) reasoning robustness against systematically controlled irrelevant context (IC). GSM-DC constructs symbolic reasoning graphs with precise distractor injections, enabling rigorous, reproducible evaluation. Our experiments demonstrate that LLMs are significantly sensitive to IC, affecting both reasoning path selection and arithmetic accuracy. Additionally, training models with strong distractors improves performance in both in-distribution and out-of-distribution scenarios. We further propose a stepwise tree search guided by a process reward model, which notably enhances robustness in out-of-distribution conditions.

Summary

  • The paper introduces GSM-DC, a controlled benchmark using symbolic dependency graphs to evaluate how irrelevant context distracts Large Language Models (LLMs) during mathematical reasoning.
  • Experiments show that LLM reasoning accuracy significantly degrades as the intensity of irrelevant context increases, particularly with deeper reasoning steps.
  • Training LLMs with strategically injected irrelevant context and employing inference-time methods like reward-guided tree search can enhance their robustness against such distractions.

Analyzing LLMs' Sensitivity to Irrelevant Context in Mathematical Reasoning

The paper "How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark" explores a critical shortcoming in the reasoning abilities of LLMs - sensitivity to irrelevant contextual information. The authors present a synthetic benchmark named Grade School Math with Distracting Context (GSM-DC) to empirically evaluate and enhance the robustness of LLMs under controlled distraction from irrelevant context (IC).

Benchmark Design and Methodology

GSM-DC is designed to facilitate rigorous assessment of LLMs by constructing math problems represented as symbolic dependency graphs. This structure allows systematic injection of irrelevant distractor nodes and edges while preserving the integrity of the correct solution path. The benchmark exerts precise control over problem complexity and distractor structure by varying the number of reasoning steps, denoted rs, and the intensity level of the injected IC.
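
A minimal sketch of this kind of construction is shown below: a symbolic dependency graph with a solution path of rs steps, plus distractor nodes that never become ancestors of the answer node. The function and field names are illustrative assumptions, not the paper's actual generator.

```python
import random

def build_problem(rs: int, num_distractors: int, seed: int = 0):
    """Build a symbolic DAG with a solution path of `rs` reasoning steps,
    then inject distractor nodes that do not feed into the final answer."""
    rng = random.Random(seed)

    # Solution path: a chain n0 -> n1 -> ... -> n_rs; each node depends on the previous one.
    nodes = {f"n{i}": {"parents": [f"n{i-1}"] if i > 0 else [], "on_path": True}
             for i in range(rs + 1)}

    # Distractors: extra nodes that may depend on existing nodes but are never
    # parents of a path node, so the correct solution path is unchanged.
    for j in range(num_distractors):
        parent = rng.choice(list(nodes))
        nodes[f"d{j}"] = {"parents": [parent], "on_path": False}

    return nodes  # downstream: realize as natural-language text via templates

problem = build_problem(rs=5, num_distractors=3)
print(sum(not v["on_path"] for v in problem.values()), "distractor nodes")
```

Varying rs and num_distractors corresponds to the benchmark's two control knobs: reasoning depth and IC intensity.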

Key Components:

  1. Dependency Graph Construction: Symbolic DAGs are utilized to represent mathematical problems, enabling explicit manipulation of IC without affecting the solution path.
  2. Irrelevant Context Injection: Distractors are strategically added to nodes outside the solution path, imitating real-world scenarios where extraneous information can interfere with reasoning.
  3. Natural Language Realization: The symbolic graphs are converted into human-readable math problems using structured templates and a hierarchical vocabulary system from the GSM8K dataset.
  4. Stepwise Solution Evaluator: A specialized evaluation procedure captures step accuracy, path accuracy, and extracted-answer accuracy, offering fine-grained analysis of LLMs' reasoning chains (see the sketch after this list).
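
To make the evaluator concrete, the sketch below computes the three metrics from a predicted reasoning chain and final answer. The exact definitions used here (e.g., counting correct steps only up to the first deviation) are assumptions for illustration and may differ from the paper's evaluator.

```python
def evaluate_solution(pred_steps, gold_steps, pred_answer, gold_answer):
    """Compare a model's reasoning chain against the gold solution path."""
    # Step accuracy: fraction of gold steps reproduced correctly, in order,
    # up to the model's first deviation from the gold path.
    correct = 0
    for p, g in zip(pred_steps, gold_steps):
        if p != g:
            break
        correct += 1
    step_acc = correct / len(gold_steps) if gold_steps else 0.0

    # Path accuracy: the entire predicted path matches the gold path.
    path_acc = float(pred_steps == gold_steps)

    # Answer accuracy: the final extracted answer matches, regardless of path.
    answer_acc = float(pred_answer == gold_answer)

    return {"step_acc": step_acc, "path_acc": path_acc, "answer_acc": answer_acc}

print(evaluate_solution(["a = 2", "b = a + 1"], ["a = 2", "b = a + 1", "c = 2 * b"], 8, 8))
# {'step_acc': 0.666..., 'path_acc': 0.0, 'answer_acc': 1.0}
```

Separating path-level from answer-level correctness is what lets the benchmark detect cases where a model reaches the right number through a distracted or spurious reasoning path.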

Experimental Findings

The paper presents extensive experiments analyzing the impact of IC on LLM reasoning accuracy across different levels of reasoning depth. Noteworthy findings include:

  1. Degradation in Accuracy: As the intensity of irrelevant context increases, there is a notable decline in reasoning accuracy across multiple models such as Grok-3-Beta, GPT-4.1, and LLaMA versions. This degradation is particularly pronounced with deeper reasoning depths, suggesting an inherent challenge for LLMs to maintain focus on the relevant reasoning path amid distractions.
  2. Training with Irrelevant Context: Models trained with injected IC demonstrate enhanced robustness compared to those trained without IC exposure. The finetuning strategies explored, including LoRA and continued pretraining, highlight the effectiveness of challenging IC during training, with continued pretraining on hard IC yielding significant robustness against distractions.
  3. Inference-Time Improvements: The integration of a stepwise tree search algorithm guided by a Process Reward Model (PRM) further boosts robustness, especially in out-of-distribution settings. This adaptive search shows potential for advancing robustness beyond what supervised training alone achieves; a minimal sketch of such a search follows this list.
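
As a rough illustration of the inference-time method, the sketch below frames a PRM-guided stepwise search as a beam search over partial reasoning chains. The interfaces propose_next_steps, prm_score, and is_complete are placeholders for an LLM step proposer and a trained process reward model, not the paper's implementation.

```python
def tree_search(problem, propose_next_steps, prm_score, is_complete,
                beam_width=4, max_steps=20):
    """Expand partial reasoning chains step by step, keeping the chains
    that the process reward model (PRM) scores highest."""
    beams = [[]]  # each beam is the list of reasoning steps produced so far
    for _ in range(max_steps):
        candidates = []
        for chain in beams:
            if is_complete(chain):
                candidates.append(chain)  # keep finished chains as-is
                continue
            for step in propose_next_steps(problem, chain):
                candidates.append(chain + [step])
        if not candidates:
            break
        # Score every (partial) chain with the PRM and prune to the beam width.
        candidates.sort(key=lambda c: prm_score(problem, c), reverse=True)
        beams = candidates[:beam_width]
        if all(is_complete(c) for c in beams):
            break
    return beams[0]  # highest-scoring chain found
```

Because the PRM scores each intermediate step rather than only the final answer, such a search can discard branches that start attending to distractor quantities before they propagate into the answer.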

Implications and Future Directions

This paper underscores the importance of structured evaluation and training approaches to address the vulnerabilities of LLMs in reasoning tasks involving irrelevant context. Practically, it suggests that embedding irrelevant context during training and utilizing reward-guided search algorithms can alleviate distraction effects, enhancing LLM reliability and application in complex, real-world scenarios.

Theoretically, the findings contribute to a deeper understanding of the inductive biases and limitations inherent in LLM architectures when confronted with extraneous information. Future research could explore diverse architectural adaptations and reinforcement learning methods to further improve focus and reasoning accuracy.

Conclusion

The introduction of GSM-DC marks a significant step towards diagnosing and mitigating the impact of irrelevant context on LLM reasoning. Through meticulous benchmark design and analysis, the paper offers valuable insights and strategies for advancing LLM robustness, paving the way for more effective application of these models in complex decision-making and problem-solving domains.
