Soft Contamination Means Benchmarks Test Shallow Generalization
This presentation examines a critical threat to AI benchmarking: soft contamination. While standard deduplication removes exact copies, semantic duplicates of benchmark problems pervade training data, even in open models like Olmo3. Through systematic audits of reasoning and coding benchmarks, the authors reveal that up to 100% of benchmark items have semantic duplicates in training corpora, driving measurable performance gains that reflect shallow, benchmark-specific learning rather than genuine reasoning advances. The findings challenge interpretations of recent AI progress and establish a new standard for contamination detection in the era of web-scale pretraining.
Script
When language models ace a reasoning benchmark, are they truly reasoning or just interpolating over familiar patterns? This paper reveals a hidden flaw in how we measure AI progress: soft contamination that makes benchmarks test shallow pattern matching instead of deep generalization.
Building on that concern, let's examine what soft contamination actually means for benchmark integrity.
The challenge goes deeper than copy-paste duplication. Semantic duplicates share the same underlying structure or logic but appear in different forms, slipping past traditional decontamination methods. These hidden overlaps create a confound that standard filtering approaches simply cannot catch.
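To make that failure mode concrete, here is a toy sketch of why exact-match filtering misses semantic duplicates. The problem texts and the 8-gram window are illustrative choices, not details from the paper.

```python
# Toy example: exact n-gram deduplication misses a paraphrase.
benchmark_item = (
    "A train leaves station A at 60 mph and a second train leaves "
    "station B at 40 mph. When do the two trains meet?"
)
training_doc = (
    "Two trains depart from opposite stations at 60 and 40 miles per "
    "hour. Find the time at which they meet."
)

def ngrams(text, n=8):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Standard decontamination flags a document that shares any long n-gram
# with a benchmark item. The paraphrase shares none, so it stays in the
# training set even though it teaches the identical problem.
shared = ngrams(benchmark_item) & ngrams(training_doc)
print(f"shared 8-grams: {len(shared)}")  # 0 -> not flagged
```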
To quantify this hidden contamination, the researchers developed a comprehensive detection framework.
Within that framework, the authors embedded massive training corpora and computed the semantic proximity between each benchmark item and its nearest training examples, across both coding and reasoning tasks. They then deployed classifier models to separate true semantic duplicates from merely similar examples, enabling precise contamination estimates at scale.
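As a rough illustration of this kind of pipeline, the sketch below embeds a tiny corpus and retrieves the nearest training example for each benchmark item. The encoder name, the similarity threshold, and the example texts are assumptions standing in for the paper's actual models and classifier stage.

```python
# Hedged sketch of embedding-based near-duplicate retrieval; model name
# and threshold are illustrative, not the paper's exact pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

corpus = [
    "Two trains depart from opposite stations at 60 and 40 miles per "
    "hour. Find the time at which they meet.",
    "Write a function that reverses a linked list in place.",
]
benchmark_items = [
    "A train leaves station A at 60 mph and a second train leaves "
    "station B at 40 mph. When do the two trains meet?",
]

# Embed both sides and L2-normalize so dot products equal cosine similarity.
corpus_emb = model.encode(corpus, normalize_embeddings=True)
bench_emb = model.encode(benchmark_items, normalize_embeddings=True)
sims = bench_emb @ corpus_emb.T  # (n_benchmark, n_corpus) similarity matrix

# Retrieve the nearest training example per benchmark item. In the paper a
# trained classifier judges whether each candidate is a true semantic
# duplicate; a fixed threshold stands in for that step here.
for i, item in enumerate(benchmark_items):
    j = int(np.argmax(sims[i]))
    if sims[i, j] > 0.7:  # illustrative threshold
        print(f"possible duplicate (sim={sims[i, j]:.2f}): {corpus[j]!r}")
```

A web-scale version would replace the brute-force dot product with an approximate nearest-neighbor index, since the corpus side spans billions of documents.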
Now let's examine what this systematic audit revealed about contamination prevalence.
These numbers paint a stark picture. For some benchmarks, contamination is nearly universal, even after aggressive deduplication. The finding that difficulty does not predict contamination rate suggests the problem is structural, not accidental.
This visualization demonstrates a critical insight: semantic duplicates appear uniformly across the difficulty spectrum. Whether a problem is rated easy or expert-level, it faces the same contamination risk, which means even our hardest benchmarks may not be testing what we think they test.
The contamination effect reveals a troubling asymmetry. Models finetuned on semantic duplicates improve dramatically on held-out problems within the same benchmark, yet show zero transfer to related tasks. This pattern is the signature of shallow, distribution-specific learning rather than the deep generalization we aim to measure.
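The shape of that experiment can be summarized in a few lines. Everything below, including the function names and the toy scores, is a hypothetical sketch of the design rather than the authors' code.

```python
# Schematic probe: finetune on semantic duplicates, then compare the gain
# on held-out items from the same benchmark against the gain on a related
# transfer benchmark. All names here are hypothetical stand-ins.
from typing import Callable

def contamination_probe(
    base: str,
    finetune: Callable[[str], str],             # trains only on duplicates
    evaluate: Callable[[str, str], float],      # (model, benchmark) -> accuracy
    same_bench: str,
    transfer_bench: str,
) -> tuple[float, float]:
    tuned = finetune(base)
    gain_same = evaluate(tuned, same_bench) - evaluate(base, same_bench)
    gain_transfer = evaluate(tuned, transfer_bench) - evaluate(base, transfer_bench)
    return gain_same, gain_transfer

# Toy stubs reproducing the qualitative pattern (the numbers are invented):
scores = {("base", "same"): 0.55, ("tuned", "same"): 0.80,
          ("base", "transfer"): 0.50, ("tuned", "transfer"): 0.50}
gains = contamination_probe(
    base="base",
    finetune=lambda m: "tuned",
    evaluate=lambda m, b: scores[(m, b)],
    same_bench="same",
    transfer_bench="transfer",
)
print(gains)  # large in-benchmark gain, ~zero transfer gain -> shallow learning
```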
Zooming out, these findings force us to reinterpret the rapid performance gains we've celebrated in recent years. Without semantic decontamination, we cannot distinguish true capability growth from increasingly efficient interpolation over benchmark-adjacent training data. The trajectory of progress itself becomes uncertain.
Soft contamination challenges the very foundation of how we evaluate language models, revealing that benchmarks may measure corpus coverage as much as reasoning skill. To learn more about cutting-edge AI research and stay informed on findings like these, visit EmergentMind.com.