- The paper shows that reported progress in LLM reasoning is undermined by evaluation sensitivity to random seeds, decoding parameters, and hardware variations.
- The study finds that averaging results over multiple seeds and standardizing evaluation environments are essential to avoid misleading performance gains.
- The authors propose a reproducible evaluation framework and best practices, highlighting SFT as a more robust approach than certain RL methods for math reasoning tasks.
This paper, "A Sober Look at Progress in LLM Reasoning: Pitfalls and Paths to Reproducibility" (2504.07086), conducts a comprehensive empirical paper on LLM (LM) reasoning, specifically focusing on mathematical reasoning benchmarks. The core finding is that reported progress in this area is often based on evaluations highly sensitive to subtle implementation and environmental factors, leading to potentially misleading conclusions and poor reproducibility. The authors argue that many performance gains attributed to novel methods, particularly Reinforcement Learning (RL), may fall within the inherent variance of evaluation setups.
The paper identifies several key pitfalls in current evaluation practices that contribute to this instability:
- Random Seed Variance: Evaluations often rely on single random seeds, which can yield highly unstable results, especially on small benchmarks like AIME'24 (30 samples) and AMC'23 (40 samples). A single correct answer can shift Pass@1 by 2.5-3.3 percentage points. Running evaluations with multiple seeds reveals standard deviations of 5-15 percentage points across models and datasets [(2504.07086), Figure 2].
- Implementation Consideration: For reliable evaluation, running and averaging results over at least 10 random seeds is crucial, particularly for small datasets. Reporting the standard deviation along with the mean provides a measure of confidence in the reported performance (a minimal aggregation sketch follows this list).
- Sampling Parameter Sensitivity: Decoding parameters like temperature and `top_p` significantly impact performance and variability. Higher temperatures can yield better peak accuracy but increase instability [(2504.07086), Figure 4]. `top_p` also affects performance, and different optimal values might exist for different models [(2504.07086), Figure 5].
- Implementation Consideration: These parameters should be tuned for each model individually to achieve its best performance before comparison. Consistent values must then be used across all tasks for that specific model during evaluation.
- Hardware and Software Stack Variability: The paper shows that running the exact same evaluation code with the same model, parameters, and seeds on different compute clusters (varying GPUs, memory, etc.) or using different evaluation frameworks (like LightEval vs. Evalchemy) can result in noticeable performance differences (up to 8% observed) [(2504.07086), Figure 6, Table 2]. This highlights that low-level library optimizations or non-determinism can impact results.
- Implementation Consideration: To ensure reproducibility, the entire evaluation environment, including hardware specifications and software versions (libraries, CUDA, etc.), must be standardized and ideally distributed (e.g., via Docker) for others to replicate.
- Prompt Format and Context Length: Using inappropriate prompt templates or limiting the maximum number of output tokens can significantly degrade performance, especially for instruction-tuned models that expect a specific format [(2504.07086), Figure 8]. Premature truncation of reasoning chains due to a short `max_new_tokens` budget is a common issue [(2504.07086), Figure 7].
- Implementation Consideration: Always use the model's native chat template and ensure sufficient context length (e.g., 32,768 tokens or more, if supported and needed for long reasoning) to avoid performance degradation.
- Answer Matching Robustness: Relying on exact string matching for answer verification can be brittle. Differences in LaTeX formatting or minor variations can lead to false negatives.
- Implementation Consideration: Employ a robust answer extraction and verification pipeline (like LightEval's LaTeX parsing and equivalence check) that tolerates formatting variations.
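To make the multi-seed recommendation concrete, here is a minimal sketch (not the authors' code) of aggregating per-seed Pass@1 scores into a mean ± standard deviation. The function `evaluate_pass_at_1` is a hypothetical stand-in for whatever harness (e.g., LightEval or Evalchemy) actually runs the model on the benchmark, and the default sampling values are placeholders that should be tuned per model.

```python
import statistics

def evaluate_pass_at_1(model_name: str, benchmark: str, seed: int,
                       temperature: float, top_p: float) -> float:
    """Hypothetical hook: run one full evaluation pass with the given seed and
    tuned decoding parameters, returning Pass@1 in [0, 1]."""
    raise NotImplementedError("plug in your evaluation harness here")

def multi_seed_pass_at_1(model_name: str, benchmark: str, n_seeds: int = 10,
                         temperature: float = 0.6, top_p: float = 0.95):
    # Use at least 10 seeds for small benchmarks such as AIME'24 (30 problems).
    scores = [evaluate_pass_at_1(model_name, benchmark, seed=s,
                                 temperature=temperature, top_p=top_p)
              for s in range(n_seeds)]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation across seeds
    print(f"{model_name} / {benchmark}: {100 * mean:.1f} ± {100 * std:.1f} Pass@1 "
          f"({n_seeds} seeds)")
    return mean, std
```

Reporting the full `mean ± std` string rather than a single number makes it immediately visible whether a claimed improvement exceeds seed-level noise.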
To address these pitfalls, the authors propose a standardized evaluation framework and a set of best practices:
- Standardized Stack: Release code, prompts, and model outputs, preferably within a containerized environment (like Docker) runnable on publicly accessible cloud instances for true reproducibility.
- Multi-Seed Evaluation: Use at least 10 random seeds for small benchmarks and report mean ± standard deviation.
- Hyperparameter Tuning: Optimize decoding parameters per model and fix them across tasks.
- Context & Prompting: Use sufficient context length and correct prompt templates.
- Robust Verification: Implement reliable answer extraction and matching (see the verification sketch after this list).
- Transparency: Clearly document the setup and release all components.
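As one concrete option for the verification step, the sketch below uses the open-source math-verify package (the LaTeX answer parser that LightEval integrates); the example strings are illustrative only, and the exact tolerance depends on the parser version.

```python
# Tolerant answer matching via math-verify (pip install math-verify).
from math_verify import parse, verify

def is_correct(gold_latex: str, predicted_latex: str) -> bool:
    """Compare answers by mathematical equivalence rather than exact string
    match, so formatting differences do not produce false negatives."""
    gold = parse(gold_latex)
    pred = parse(predicted_latex)
    return verify(gold, pred)

# Equivalent values written differently should be counted as a match.
print(is_correct("$\\frac{1}{2}$", "$0.5$"))
print(is_correct("$\\frac{1}{2}$", "$\\dfrac{1}{2}$"))
```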
Using their standardized framework, the authors re-evaluated several recent models, yielding a "sober" look at current progress (see [(2504.07086), Table 3]). Their findings challenge some prior claims:
- RL on DeepSeek R1-Distill: Many RL-trained variants (like L1, OpenRS, Still-3, Light-R1) showed little to no statistically significant improvement over the SFT base model (R1-Distill). Some (like II-Thought, FastCuRL) showed modest gains on AIME'24 but failed to generalize to AIME'25, suggesting overfitting. DeepScaleR was an exception, showing robust, significant gains.
- RL on Qwen2.5 Math/Base: RL methods (Oat-Zero, LIMR, SimpleRL-Zoo) generally achieved statistically significant gains over base models, but often still underperformed the instruction-tuned Qwen2.5 Math variant. Again, overfitting was observed on AIME'24 vs AIME'25. Open Reasoner-Zero was a notable exception, consistently outperforming the instruct baseline by larger margins.
- Supervised Fine-tuning (SFT): SFT models (s1.1, Eurus2 Prime, Bespoke Stratos, OpenR1, OpenThinker, OpenThinker2) on reasoning traces consistently outperformed instruction-tuned baselines and showed better generalization to AIME'25, highlighting SFT's maturity and robustness when scaled.
The paper also investigated two specific phenomena:
- Response Length and Accuracy: They found a consistent pattern where incorrect responses tend to be longer than correct ones, even for non-truncated outputs [(2504.07086), Figure 9, Appendix Figures 10, 11]. This suggests longer outputs can indicate failure modes and can serve as a heuristic for consensus mechanisms or detecting low-confidence generations (a toy voting sketch follows this list).
- Diversity Collapse: Contrary to prior claims, the authors did not observe a consistent "diversity collapse" where Pass@k decreases despite Pass@1 increasing in RL-trained models. Generally, Pass@k improved alongside Pass@1 or decreased only when Pass@1 also decreased [(2504.07086), Table 4].
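As an illustration of how the length signal might be used (this is a sketch of the idea, not something the paper implements), a consensus step could mildly discount unusually long candidate traces when voting; the 0.25 penalty weight below is an arbitrary example value.

```python
from collections import defaultdict

def length_aware_vote(candidates):
    """candidates: list of (answer, num_tokens) pairs from repeated sampling.
    Majority vote with a mild penalty on unusually long reasoning traces."""
    scores = defaultdict(float)
    max_len = max(n for _, n in candidates)
    for answer, n_tokens in candidates:
        # Base vote of 1.0, discounted up to 0.25 for the longest trace.
        scores[answer] += 1.0 - 0.25 * (n_tokens / max_len)
    return max(scores, key=scores.get)

# Two short agreeing answers outweigh one very long outlier -> "42".
print(length_aware_vote([("42", 310), ("42", 290), ("17", 1900)]))
```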
Practical Implications:
For practitioners aiming to implement or apply LLMs for reasoning tasks:
- Benchmarking: Do not rely on single-run results. Average over multiple seeds and report variance. Control your hardware and software environment meticulously. Use standard, robust evaluation tools.
- Model Selection: SFT on high-quality reasoning data appears to be a more reliable path to generalizable improvements for math reasoning compared to current RL methods, although specific RL implementations (like Open Reasoner-Zero or DeepScaleR) show promise.
- Inference Configuration: Be mindful of decoding parameters, prompt formatting, and `max_new_tokens`. Suboptimal choices can significantly hurt performance. Validate inference configurations carefully (see the inference sketch after this list).
- Output Quality: The length of a generated reasoning trace can be a practical signal for its correctness. Consider using this in applications (e.g., for filtering or weighting responses in consensus methods).
- Deployment: Variance and hardware sensitivity observed during evaluation suggest potential challenges in achieving consistent performance across different production environments. Robustness testing is essential.
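For the inference-configuration point, a minimal Hugging Face transformers sketch is shown below. The model id and sampling values are examples only (not prescribed by the paper) and should be replaced with the target model's own recommended settings; the key points are using the model's native chat template and allowing a long generation budget so reasoning chains are not truncated.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example model only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Use the model's native chat template rather than a hand-rolled prompt format.
messages = [{"role": "user", "content": "What is 17 * 24? Think step by step."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Example sampling parameters (tune per model) and a generous token budget
# (e.g., 32,768 if the model supports it) to avoid premature truncation.
out = model.generate(**inputs, do_sample=True, temperature=0.6, top_p=0.95,
                     max_new_tokens=32768)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```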
In conclusion, the paper provides a valuable reality check on the state of LLM reasoning evaluation, emphasizing the critical need for methodological rigor and transparency. While RL shows some potential, especially on base models, SFT currently appears to be a more robust and generalizable approach for improving mathematical reasoning capabilities.