Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Published 14 Jul 2025 in cs.LG, cs.AI, and cs.CL | (2507.10532v2)

Abstract: Reasoning in LLMs has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly observed for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models like Llama, which warrants a more in-depth investigation. In this work, our empirical analysis reveals that pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. Consequently, conclusions derived from contaminated benchmarks on Qwen2.5 series may be unreliable. To obtain trustworthy evaluation results, we introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. Using this leakage-free dataset, we show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary in mathematical reasoning, whereas random or incorrect rewards do not. Moreover, we conduct more fine-grained analyses to elucidate the factors underlying the different performance observed on the MATH-500 and RandomCalculation benchmarks. Consequently, we recommend that future studies evaluate models on uncontaminated benchmarks and, when feasible, test various model series to ensure trustworthy conclusions about RL and related methods.

Abstract PDF Upgrade to Chat

Authors (14)

Summary

The paper reveals that RL benefits on Qwen2.5 models may result from memorization due to data contamination rather than true reasoning.
It demonstrates through metrics like partial-prompt accuracy that spurious rewards inflate performance on contaminated benchmarks.
Using the clean RandomCalculation dataset, the study shows that only properly aligned rewards lead to stable, genuine reasoning improvements.

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Introduction

LLMs have increasingly relied on reinforcement learning (RL) to enhance their mathematical reasoning capabilities. Recent works suggest that RL methods, even those using seemingly random reward signals, can significantly improve LLM performance, especially on challenging mathematical benchmarks like MATH-500. However, this research paper investigates a critical confounding factor: the potential contamination of these datasets with training data, which skews the evaluation of RL enhancements. The Qwen2.5 model family, trained on extensive web-scale corpora, serves as a subject for examining the effects of data contamination. This study argues for clean, uncontaminated benchmarks to assess RL-driven improvements accurately.

Data Contamination Analysis

To discern whether the Qwen2.5 models benefit from genuine reasoning improvements or simple memorization due to contaminated datasets, the authors introduce metrics assessing data leakage. The key metrics include the partial-prompt completion rate and the partial-prompt answer accuracy. These metrics evaluate the ability of models to complete and answer questions when only provided with partial input.

Tests conducted on datasets like MATH-500 demonstrate that Qwen2.5 can regenerate substantial portions of questions and yield correct answers despite limited input. For instance, when presented with only 40% of a MATH-500 problem, Qwen2.5-Math-7B was able to regenerate and solve 39.2% of instances correctly. Such findings suggest that Qwen2.5 may not always be learning to reason through RL, but rather recalling memorized solutions.

Figure 1: Accuracy (\%) of Qwen and Llama models on the MATH-500 dataset under different generation configurations, using original questions as prompts.

Reinforcement Learning with Verifiable Rewards (RLVR)

The paper evaluates RLVR applied to the Qwen2.5 series, using both contaminated datasets like MATH-500 and a newly introduced clean dataset, RandomCalculation. The RandomCalculation dataset comprises newly generated math problems that have not been seen by the model before, serving as a control to accurately gauge the impact of RLVR on reasoning rather than memorization.

Experiments using RLVR with correct and spurious (random, inverted) reward signals reveal significant insights. On the clean RandomCalculation dataset, only properly aligned rewards lead to stable performance improvements, while spurious rewards result in instability and negligible gains in problem-solving accuracy. Conversely, on contaminated datasets, spurious reward signals can lead to apparent improvements due to the model simply recalling memorized solutions.

Figure 2: Performance of the Qwen2.5 series models on the RandomCalculation datasets under different configurations.

Detailed Analysis of Reward Configurations

The study further explores the effects of different reward configurations on model outputs. By measuring the similarity between model responses before and after RL using token-level KL divergence, the paper substantiates the hypothesis that MATH-500's spurious gains arise primarily from memorization. High similarity scores between pre- and post-RL outputs corroborate that the model retrieves memorized answers, rather than genuinely learning novel reasoning processes from RL.

Figure 3: KL distance of model outputs before and after RL.

Implications and Recommendations

The research highlights the critical flaw of relying on contaminated datasets to evaluate the effectiveness of RL in enhancing reasoning capabilities of LLMs. The findings underscore the necessity to isolate genuine improvements in reasoning, contingent on the use of uncontaminated benchmarks. Consequently, the authors advocate for the adoption of leakage-free datasets, such as RandomCalculation, to ensure that evaluations reflect authentic advancements in mathematical reasoning.

Conclusion

The paper conclusively argues that the purported benefits of RL enhancements on Qwen2.5 series, demonstrated on contaminated datasets, are largely attributable to memorization rather than genuine reasoning capability improvements. This realization prompts a re-evaluation of benchmark datasets and calls for a more rigorous approach in future RL evaluations. The introduction of a clean dataset like RandomCalculation is a crucial step in enabling accurate assessments of LLMs' reasoning abilities when augmented by RL techniques.

Figure 4: Accuracy on the MATH-500 for \qwenmathSevenB, \qwenmathSevenBInstruct, and \llamaEightBInstruct trained with RLVR under various reward signals.

Markdown Report Issue