- The paper introduces a novel reasoning-driven synthesis framework to mitigate benchmark contamination in LLM evaluations.
- It utilizes a systematic synthesis pipeline to convert arXiv theorem statements into research-level, multi-step QA pairs with automated verification.
- Empirical results demonstrate stable performance across temporal cutoffs, confirming the method's effectiveness in reducing memorization artifacts.
Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
Introduction
The paper addresses the increasingly critical issue of data contamination in the evaluation of large language models (LLMs). LLMs are typically evaluated on static benchmarks, which have been criticized for rewarding memorization of training data rather than genuine reasoning. To address this, the authors propose a framework that uses reasoning-driven synthesis to mitigate benchmark contamination, implemented by generating research-level question-answer (QA) pairs from arXiv papers.
Methodology
Synthesis Pipeline
The methodology is a pipeline that converts arXiv publications into sophisticated QA pairs: papers are retrieved, theorem statements are extracted, and each theorem is transformed into a multi-step reasoning problem. LLMs are used to parse the papers and identify suitable theorems, focusing on those with definitive solutions so that answers can be verified automatically. Each QA pair is constructed to require multiple reasoning steps, including strategic concept application and logical inference, so that it cannot be solved by pattern recognition alone.
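A minimal sketch of this stage is given below, assuming Python, a regular-expression pass over LaTeX source, and a generic `llm.complete` text interface. The data class, function names, and prompt wording are illustrative assumptions, not the authors' implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class TheoremQA:
    arxiv_id: str    # source paper
    published: str   # "YYYY-MM", kept for later temporal stratification
    question: str    # multi-step reasoning problem derived from the theorem
    answer: str      # definitive result that an automated checker can compare

THEOREM_ENVS = r"\\begin\{(theorem|proposition|lemma)\}(.*?)\\end\{\1\}"

def extract_theorems(latex_source: str) -> list[str]:
    """Pull theorem-like environments out of a paper's LaTeX source."""
    return [body.strip() for _, body in re.findall(THEOREM_ENVS, latex_source, flags=re.S)]

def synthesize_qa(theorem: str, arxiv_id: str, published: str, llm) -> TheoremQA | None:
    """Use an LLM to (1) keep only theorems with a definitive, checkable result
    and (2) rewrite the theorem as a multi-step problem with a hidden answer."""
    verdict = llm.complete(
        "Does this theorem state a single definitive, verifiable result? "
        f"Answer YES or NO.\n\n{theorem}")
    if not verdict.strip().upper().startswith("YES"):
        return None  # no automatically verifiable answer: discard
    question = llm.complete(
        "Rewrite the theorem below as a research-level problem requiring several "
        "reasoning steps (concept selection, derivation, inference), and do not "
        f"reveal the final result:\n\n{theorem}")
    answer = llm.complete(f"State only the final result of this theorem:\n\n{theorem}")
    return TheoremQA(arxiv_id, published, question, answer)
```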
Dataset Construction
The dataset comprises thousands of synthesized questions spanning mathematics and physics, constructed with monthly temporal resolution so that it covers periods both before and after specific models' knowledge cutoff dates. This temporal stratification makes it possible to test for performance decay after a cutoff, which would indicate contamination. The pipeline processes publications spanning 26 months, extracting theorems suitable for formulating complex problems.
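The stratification step can be sketched as follows, reusing the `TheoremQA` records from the previous snippet; the cutoff value in the usage comment is a placeholder, not a date reported in the paper.

```python
from collections import defaultdict
from datetime import date

def month(s: str) -> date:
    """Parse a 'YYYY-MM' string into a comparable date."""
    y, m = map(int, s.split("-"))
    return date(y, m, 1)

def stratify(qa_pairs, cutoff: str) -> dict:
    """Bucket QA pairs by publication month and flag whether each bucket
    falls after a given model's knowledge cutoff."""
    buckets = defaultdict(list)
    for qa in qa_pairs:
        buckets[qa.published].append(qa)
    return {
        m: {"items": items, "post_cutoff": month(m) > month(cutoff)}
        for m, items in sorted(buckets.items())
    }

# Example: split a 26-month collection around a hypothetical 2023-10 cutoff.
# strata = stratify(dataset, cutoff="2023-10")
```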
Experimental Results
The paper evaluates a range of LLMs, including models from OpenAI, Gemini, Llama, and DeepSeek, on the synthesized QA datasets. The results show consistent model performance across knowledge cutoff boundaries: there is no performance decay of the kind that would typically signal data contamination. The reported figures thus indicate that reasoning-driven QA synthesis effectively mitigates contamination risk.
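The contamination check implied by these results amounts to a before/after comparison around the cutoff. The sketch below assumes the strata from the previous snippet and an `answers_match` grading helper, which stands in for the paper's automated verification step.

```python
def accuracy(flags: list[bool]) -> float:
    """Fraction of correctly answered questions (NaN when the bucket is empty)."""
    return sum(flags) / len(flags) if flags else float("nan")

def contamination_gap(strata: dict, model_answers: dict, answers_match) -> float:
    """Pre-cutoff accuracy minus post-cutoff accuracy.
    A large positive gap would suggest memorization of pre-cutoff material;
    a gap near zero matches the flat temporal profile reported in the paper."""
    pre, post = [], []
    for bucket in strata.values():
        scored = [answers_match(model_answers[qa.question], qa.answer)
                  for qa in bucket["items"] if qa.question in model_answers]
        (post if bucket["post_cutoff"] else pre).extend(scored)
    return accuracy(pre) - accuracy(post)
```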
Discussion
Contamination Vulnerability vs. Synthesis Approach
Previous retrieval-based benchmarks have exhibited significant performance drops around model knowledge cutoffs, a pattern attributed to memorization of pre-cutoff data. In contrast, the proposed approach transforms the retrieved content into more complex problems, creating cognitive barriers that memorization alone cannot cross. Through reasoning-driven synthesis, the authors argue for a fundamental shift toward complex problem formulation in benchmarks to ensure genuine assessment of model capabilities.
Theoretical and Practical Implications
This research advocates for reasoning-driven synthesis as a preferred method for scaling evaluation frameworks that maintain contamination resistance. It also suggests potential shifts in benchmark creation paradigms, particularly in the context of rapidly evolving LLM capabilities. By focusing on reasoning complexity, the approach better reflects models' potential to facilitate scientific discovery and problem solving beyond static datasets.
Conclusion
The paper presents reasoning-driven synthesis as an effective strategy for mitigating benchmark contamination. By comprehensively evaluating LLMs across temporally stratified, synthesized datasets, the authors provide evidence of the approach's ability to resist contamination through deeper cognitive complexity. The research calls for a paradigmatic shift towards multifaceted, reasoning-intensive benchmarks, with the potential to revolutionize how LLM capabilities are measured against evolving scientific literature. This pivot in benchmark construction may enhance the fidelity and reliability of AI assessments in future applications.