
Benchmarking Reasoning Robustness in Large Language Models (2503.04550v1)

Published 6 Mar 2025 in cs.AI

Abstract: Despite the recent success of LLMs such as DeepSeek in reasoning, we for the first time identify a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete data, suggesting a reliance on memorized patterns rather than systematic reasoning. Our closer examination reveals four key unique limitations underlying this issue: (1) Positional bias--models favor earlier queries in multi-query inputs but answer later ones incorrectly (e.g., GPT-4o's accuracy drops from 75.8 percent to 72.8 percent); (2) Instruction sensitivity--performance declines by 5.0 to 7.5 percent in the Qwen2.5 Series and by 5.0 percent in DeepSeek-V3 with auxiliary guidance; (3) Numerical fragility--value substitution sharply reduces accuracy (e.g., GPT-4o drops from 97.5 percent to 82.5 percent, GPT-o1-mini drops from 97.5 percent to 92.5 percent); and (4) Memory dependence--models resort to guesswork when missing critical data. These findings further highlight the reliance on heuristic recall over rigorous logical inference, demonstrating challenges in reasoning robustness. To comprehensively investigate these robustness challenges, this paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. This is achieved by an instruction-based approach to generate diverse datasets that closely resemble training distributions, facilitating a holistic robustness assessment and advancing the development of more robust reasoning frameworks.

Summary

  • The paper presents Math-RoB, a novel benchmark for assessing LLM reasoning robustness under challenges such as positional bias and numerical perturbations.
  • It reveals that larger models follow instructions better but still suffer from issues like numerical fragility and memory dependence.
  • The findings imply that improving LLM performance demands moving from heuristic recall to systematic, resilient logical inference.

Benchmarking Reasoning Robustness in LLMs

The paper "Benchmarking Reasoning Robustness in LLMs" presents an analysis of reasoning robustness in LLMs and introduces a new benchmark called Math-RoB to assess the robustness of LLMs against various reasoning challenges.

Introduction

The paper identifies significant performance degradation in LLMs when exposed to novel or incomplete data, suggesting their reliance on memorized patterns rather than systematic reasoning. Four key limitations are identified: positional bias, instruction sensitivity, numerical fragility, and memory dependence. The research asserts that these models predominantly depend on heuristic recall at the expense of rigorous logical inference, presenting challenges in reasoning robustness.

Figure 1: Illustration of the identified lack of robustness in LLM reasoning, using DeepSeek-V3 as an example.

Benchmark Design

The paper introduces Math-RoB, a novel benchmark designed to exploit hallucinations triggered by missing information and to assess reasoning robustness holistically. Math-RoB consists of datasets that closely resemble training distributions, thus facilitating a thorough robustness assessment. Four robustness challenges are addressed using Math-RoB:

  • Positional Bias is examined using Math-RoB-RoLo by lengthening input sequences to evaluate models' ability to handle queries at different positions.
  • Instruction Sensitivity is assessed with Math-RoB-Define through operator substitutions and definition changes.
  • Numerical Fragility is explored via Math-RoB-Number by introducing numerical transformations.
  • Memory Dependence is evaluated with Math-RoB-Delete by observing model behavior under incomplete inputs; a minimal construction sketch for these probes follows below.
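
The summary does not include the paper's generation pipeline itself; the following is a minimal Python sketch of how such perturbations might be constructed, assuming GSM8K-style word problems. The function names and the toy problem are illustrative, not the paper's code.

```python
import random
import re

def substitute_numbers(problem, answer_fn, rng=random.Random(0)):
    """Math-RoB-Number-style probe (illustrative): swap the numeric values in a
    word problem for fresh ones and recompute the gold answer, so a model that
    memorized the (problem, answer) pair is exposed.
    Collisions between old and new values are ignored for brevity."""
    values = re.findall(r"\d+", problem)
    new_values = [str(rng.randint(2, 99)) for _ in values]
    perturbed = problem
    for old, new in zip(values, new_values):
        perturbed = perturbed.replace(old, new, 1)
    return perturbed, answer_fn(*map(int, new_values))

def redefine_operator(problem):
    """Math-RoB-Define-style probe (illustrative): prepend an instruction that
    changes the meaning of an operator, so the memorized surface form no longer
    yields the right answer."""
    return ("In this problem the symbol '+' denotes multiplication. "
            "Solve accordingly.\n" + problem)

def delete_clause(problem):
    """Math-RoB-Delete-style probe (illustrative): drop the first sentence,
    which typically carries necessary data; a robust model should report that
    the problem is unanswerable rather than hallucinate the missing value."""
    sentences = problem.split(". ")
    return ". ".join(sentences[1:]) if len(sentences) > 1 else problem

# Toy usage: the gold answer is a function of the problem's numbers.
base = "Ann has 3 boxes with 12 apples each. How many apples does she have?"
print(substitute_numbers(base, lambda a, b: a * b))
print(redefine_operator("Compute 4 + 5."))
print(delete_clause(base))
```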

Figure 2: Evaluation results of models showing improved performance with PRMs and resilience against interference.

Experimental Evaluation

Evaluations across 12 open- and closed-source models demonstrate varying levels of robustness. The use of process-supervised reward models (PRMs) and Monte Carlo Tree Search (MCTS) enhances logical reasoning. Larger models generally follow instructions better and maintain reasoning robustness across datasets such as Math-RoB-RoLo.
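
The exact evaluation protocol and the definition of the drop rate are not given in this summary; a minimal harness sketch, assuming exact-match scoring and a relative drop rate, could look like the following (the `Item` type and `model` callable are placeholders, not the paper's interface):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Item:
    question: str
    gold: str  # gold final answer as a string

def accuracy(model: Callable[[str], str], items: Sequence[Item]) -> float:
    """Fraction of items answered correctly under exact match on the final
    answer (real harnesses normally normalize numbers and formatting first)."""
    return sum(model(it.question).strip() == it.gold for it in items) / len(items)

def drop_rate(model: Callable[[str], str],
              original: Sequence[Item],
              perturbed: Sequence[Item]) -> float:
    """Relative performance drop when moving from original items to their
    Math-RoB-style perturbed counterparts."""
    base, pert = accuracy(model, original), accuracy(model, perturbed)
    return (base - pert) / base if base > 0 else 0.0
```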

Figure 3: The performance drop rate of the model in Math-RoB-Define, illustrating declines in reasoning accuracy with certain models.

Results and Analysis

The paper highlights several findings:

  • Instruction Following: Larger models demonstrate better adherence to complex instructions, resulting in improved reasoning performance (Figure 4). However, significant challenges remain when handling multiple operator replacements.
  • Numerical Transformations: Numerical perturbations adversely affect model performance, though larger models such as DeepSeek-V3 remain comparatively robust.
  • Hallucination and Memory Dependence: When critical information is removed, models often hallucinate the missing data from memorized patterns instead of recognizing that the problem cannot be solved, which is problematic when genuine logical inference is required (see the scoring sketch after the figure below).

Figure 4: Model instruction following and accuracy on Math-RoB-Define, emphasizing that larger models perform better.
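
How hallucinated answers to incomplete problems are detected is not spelled out in this summary; one simple heuristic, sketched below under the assumption that a non-refusing response ending in a concrete number counts as a hallucination, is:

```python
REFUSAL_MARKERS = ("cannot be determined", "not enough information",
                   "insufficient information", "missing")

def is_hallucination(response: str) -> bool:
    """For a Math-RoB-Delete item whose required data was removed, flag the
    response as a hallucination if it commits to a concrete numeric answer
    instead of noting that the problem is unanswerable."""
    text = response.lower()
    refused = any(marker in text for marker in REFUSAL_MARKERS)
    last_line = (text.splitlines() or [""])[-1]
    gave_number = any(ch.isdigit() for ch in last_line)
    return gave_number and not refused
```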

Conclusion

The introduction of Math-RoB addresses the existing gap in robust reasoning benchmarks for LLMs. The findings stress that while larger models show modest improvements in instruction adherence, smaller models still struggle significantly with reliance on memorized training data, vulnerability to overfitting, and reduced robustness on more complex tasks (Figure 5). Math-RoB provides a foundation for future work aimed at shifting the focus from merely achieving high reasoning performance to enhancing reasoning robustness and reliability in LLMs.

Figure 5: Response performance of different models on specific questions from the Math-RoLo dataset and their decline rate relative to Math500.
