N-gram Contamination in Language Models
- N-gram-based contamination definitions are formal criteria that detect overlaps between evaluation benchmarks and pre-training corpora by measuring contiguous token sequences.
- These methods use metrics like n-gram precision, Jaccard index, and longest substring matching to quantify contamination and guide threshold-based classification.
- While offering simplicity and scalability, n-gram techniques face limitations such as sensitivity to paraphrasing, prompting the use of complementary semantic approaches.
N-gram-based contamination definitions refer to a class of formal criteria for identifying when entries from evaluation sets (benchmarks) are present verbatim or near-verbatim in the pre-training corpus of LLMs. Such overlap directly threatens the validity of performance claims for modern LMs by confounding generalization with memorization. Although the field has moved toward more sophisticated, semantic-aware metrics, n-gram-based methods remain foundational for contamination auditing due to their formal simplicity, scalability, and interpretability.
1. Formal Definitions of N-gram-based Contamination
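The core construct of n-gram-based contamination is the set of all contiguous subsequences of length $n$ (n-grams) extracted from a token sequence. Formally, for a sequence $x = (x_1, \dots, x_L)$ and integer $n \geq 1$, the n-grams are
$$G_n(x) = \{(x_i, x_{i+1}, \dots, x_{i+n-1}) : 1 \leq i \leq L - n + 1\}.$$
Given an evaluation example $x$ and corpus document $d$, two standard overlap metrics are:
- N-gram Precision: $P_n(x, d) = \dfrac{|G_n(x) \cap G_n(d)|}{|G_n(x)|}$
- Jaccard Index: $J_n(x, d) = \dfrac{|G_n(x) \cap G_n(d)|}{|G_n(x) \cup G_n(d)|}$
Entry $x$ is typically labeled "contaminated" with respect to $d$ if the overlap metric exceeds a threshold $\tau$.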
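The two overlap metrics can be illustrated with a minimal Python sketch. It assumes whitespace-tokenized inputs and set-based n-gram counting; the function names are illustrative and do not come from any of the cited works.

```python
from typing import Sequence, Set, Tuple

def ngrams(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all contiguous length-n subsequences of tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_precision(example: Sequence[str], document: Sequence[str], n: int) -> float:
    """Fraction of the example's distinct n-grams that also occur in the document."""
    ex = ngrams(example, n)
    if not ex:
        return 0.0  # example shorter than n: no n-grams to match
    return len(ex & ngrams(document, n)) / len(ex)

def jaccard_index(example: Sequence[str], document: Sequence[str], n: int) -> float:
    """Intersection over union of the two n-gram sets."""
    ex, doc = ngrams(example, n), ngrams(document, n)
    union = ex | doc
    return len(ex & doc) / len(union) if union else 0.0

example = "the cat sat on the mat".split()
document = "yesterday the cat sat on a chair".split()
precision = ngram_precision(example, document, 2)
jaccard = jaccard_index(example, document, 2)
```

With bigrams, three of the example's five distinct n-grams appear in the document, so precision is 0.6, while the Jaccard index is lower (3 of 8) because it also penalizes document n-grams that the example lacks.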
Several prominent formalizations have been adopted in practice:
- Direct n-gram overlap: Contaminated if at least one n-gram of the example also occurs in the corpus.
- PaLM criterion: $n = 8$, $\tau = 0.7$; an example is contaminated if at least 70% of its 8-grams are found in any corpus document.
- Llama 2 criterion: choose a minimum match length $n$ (typically 10 tokens or more), slide a window across the example, labeling tokens as contaminated if they belong to any matching $n$-gram; define contamination percentage per sample and use thresholds to classify as Clean, Not Clean, Not Dirty, or Dirty (Jiang et al., 2024).
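The window-sliding step of the Llama 2-style criterion can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes the corpus n-grams are already available as an in-memory set (`corpus_ngrams` is a hypothetical stand-in for an index lookup) and it ignores any skip budget.

```python
from typing import Sequence, Set, Tuple

def contaminated_token_fraction(
    example: Sequence[str],
    corpus_ngrams: Set[Tuple[str, ...]],
    n: int,
) -> float:
    """Fraction of example tokens covered by at least one n-gram found in the corpus."""
    if not example:
        return 0.0
    flagged = [False] * len(example)
    for start in range(len(example) - n + 1):
        window = tuple(example[start:start + n])
        if window in corpus_ngrams:
            # Every token inside a matching window counts as contaminated.
            for pos in range(start, start + n):
                flagged[pos] = True
    return sum(flagged) / len(example)

corpus_tokens = "x a b c y e f z".split()
corpus_bigrams = {tuple(corpus_tokens[i:i + 2]) for i in range(len(corpus_tokens) - 1)}
fraction = contaminated_token_fraction("a b c d e f".split(), corpus_bigrams, 2)
```

In this toy case every token except `d` lies inside a matching bigram, so five of six tokens are flagged. Overlapping windows are merged automatically because flags are stored per token rather than per window.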
Another common instantiation is the per-entry overlap rate:
$$\eta(x) = \frac{1}{|G(x)|} \sum_{g \in G(x)} \mathbb{1}\left[f_D(g) > 0\right]$$
where $G(x)$ is the set of n-grams in $x$ and $f_D(g)$ is the frequency of $g$ in the corpus $D$ (Xu et al., 13 Jun 2025).
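A per-entry overlap rate of this kind can be sketched in a few lines. The sketch below assumes character n-grams and a toy in-memory frequency table built with `collections.Counter`; a real pipeline would query a corpus index instead. The `min_count` parameter is an illustrative addition for experimenting with frequency thresholds.

```python
from collections import Counter
from typing import Iterable, List

def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_counts(corpus_docs: Iterable[str], n: int) -> Counter:
    """Frequency of every character n-gram across the toy corpus."""
    counts: Counter = Counter()
    for doc in corpus_docs:
        counts.update(char_ngrams(doc, n))
    return counts

def overlap_rate(entry: str, counts: Counter, n: int, min_count: int = 1) -> float:
    """Share of the entry's distinct n-grams seen at least min_count times in the corpus."""
    grams = set(char_ngrams(entry, n))
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if counts[g] >= min_count)
    return hits / len(grams)

counts = build_counts(["abcdef"], 3)
rate = overlap_rate("abcdxy", counts, 3)
strict_rate = overlap_rate("abcdxy", counts, 3, min_count=2)
```

Here two of the entry's four trigrams (`abc`, `bcd`) occur in the corpus, giving a rate of 0.5; raising `min_count` to 2 drops the rate to 0 because no trigram occurs twice.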
2. Practical Contamination Metrics and Classification
Pipelines for measuring n-gram contamination in large benchmarks and corpora typically implement:
| Metric | Definition | Usage Example |
|---|---|---|
| Union (“match”) | Share of the example covered by the union of all n-grams that match the corpus | (Singh et al., 2024) |
| Tokenwise (“chunk”) | Fraction of tokens in the example $x$ that fall inside any matching n-gram | Llama 2 contamination percentage |
| Longest-substring | $\ell^* / \lvert x \rvert$, with $\ell^*$ the length (in tokens) of the longest substring of $x$ found in the corpus | (Singh et al., 2024), recommended by ConTAM |
| Character n-gram | As above, but over raw bytes or characters | Infini-gram mini (Xu et al., 13 Jun 2025) |
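The longest-substring score in the table can be illustrated with a small dynamic-programming sketch. It assumes the corpus is a short list of tokenized documents held in memory, which is only workable for toy data; the function names are illustrative.

```python
from typing import List, Sequence

def longest_common_run(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest contiguous token run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous token pair.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def longest_match_score(example: Sequence[str], corpus_docs: List[Sequence[str]]) -> float:
    """Longest matching run over all documents, normalized by example length."""
    if not example:
        return 0.0
    longest = max((longest_common_run(example, doc) for doc in corpus_docs), default=0)
    return longest / len(example)

example = "the quick brown fox jumps".split()
docs = ["a quick brown fox ran".split(), "the quick dog".split()]
score = longest_match_score(example, docs)
```

The longest shared run is `quick brown fox` (3 of 5 tokens), so the score is 0.6. Unlike a union-style metric, the second document's shorter match `the quick` does not add to the score.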
Classification is made by applying thresholds to these scores. For example:
- Infini-gram mini: $n = 50$ characters; with $\eta$ the per-entry overlap rate, entries are classified as Clean ($\eta < 0.2$), Suspicious ($0.2 \leq \eta < 0.8$), or Dirty ($\eta \geq 0.8$) (Xu et al., 13 Jun 2025).
- Llama 2: Clean if contamination percentage is below 20%, Dirty if it is 80% or above (Jiang et al., 2024).
- ConTAM: Thresholds empirically chosen per model/benchmark via maximizing estimated performance gain (EPG), not fixed a priori (Singh et al., 2024).
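Threshold-based labeling of this kind reduces to a simple banding function. The sketch below takes both cut-offs as explicit parameters; the parameter names and the choice of which boundary is inclusive are assumptions made for illustration, and the values in the usage lines are examples only.

```python
def classify(score: float, clean_below: float, dirty_from: float) -> str:
    """Map a contamination score in [0, 1] to a three-way label."""
    if not 0.0 <= clean_below <= dirty_from <= 1.0:
        raise ValueError("expected 0 <= clean_below <= dirty_from <= 1")
    if score < clean_below:
        return "clean"
    if score >= dirty_from:
        return "dirty"
    return "suspicious"

labels = [classify(s, clean_below=0.2, dirty_from=0.8) for s in (0.1, 0.5, 0.8)]
```

Keeping the cut-offs as arguments rather than constants makes it straightforward to swap in thresholds chosen empirically per model and benchmark.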
3. Algorithmic Approaches and Hyperparameter Sensitivity
Computing n-gram overlap at internet scale relies on compressed, index-based search. Infini-gram mini uses FM-indexes to allow substring search on petabyte-scale corpora, extracting overlapping n-grams with fine stride for sensitivity (Xu et al., 13 Jun 2025).
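To give a feel for index-based substring counting, the sketch below uses a plain suffix array with binary search. This is a deliberately simpler stand-in and not an FM-index: it stores the text uncompressed and rebuilds a prefix list per query, so it only suits toy inputs. All names are illustrative.

```python
import bisect
from typing import List

def build_suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text: str, suffix_array: List[int], pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return 0
    k = len(pattern)
    # Truncating sorted suffixes to k characters keeps them in sorted order,
    # so all suffixes that start with the pattern form one contiguous block.
    prefixes = [text[i:i + k] for i in suffix_array]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return hi - lo

text = "abracadabra"
sa = build_suffix_array(text)
hits = count_occurrences(text, sa, "abra")
```

Each n-gram of an evaluation entry can be passed as `pattern` to obtain its corpus frequency; compressed indexes serve the same query pattern without holding the raw text and its suffixes in memory.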
Key hyperparameters affecting both sensitivity and specificity:
- n-gram length ($n$): Small $n$ increases recall but raises false positives (spurious matches); large $n$ improves precision but induces false negatives due to minor edits or paraphrases.
- Frequency threshold ($f_{\min}$): Imposing $f_{\min} > 1$ (i.e., discarding rare n-grams) can suppress noise but increases false negatives (Singh et al., 2024).
- Skip budget ($s$): Allowing mismatches (e.g., token substitutions) along the substring increases resilience to minor edits but shows marginal practical value (Singh et al., 2024).
An empirically robust configuration is $n = 8$ tokens, a frequency threshold of 1 (no n-grams discarded), and a skip budget of 0; longer n-grams or stricter frequency thresholds systematically reduce the measured contamination rate but risk missing "memorized" examples.
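The sensitivity to n-gram length can be made concrete with a toy sweep. The sketch below measures the share of an example's n-grams found in a document that differs by a single token substitution; the data and function names are invented for illustration.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

def ngram_set(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of contiguous length-n token subsequences."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_by_n(
    example: Sequence[str], document: Sequence[str], n_values: Iterable[int]
) -> Dict[int, float]:
    """Share of the example's n-grams found in the document, for each n."""
    rates = {}
    for n in n_values:
        ex = ngram_set(example, n)
        rates[n] = len(ex & ngram_set(document, n)) / len(ex) if ex else 0.0
    return rates

example = "a b c d e f g h".split()
document = "a b c d X f g h".split()  # same text with one token substituted
rates = overlap_by_n(example, document, [2, 4, 5])
```

A single substituted token leaves most bigrams intact (5 of 7), removes all but one 4-gram, and eliminates every 5-gram, which is the false-negative effect of large $n$ described above.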
4. Limitations and Failure Modes
N-gram-based contamination metrics exhibit several well-documented limitations:
- Susceptibility to paraphrase: Any non-exact rewording evades detection, leading to high false negative rates. Embedding-based or syntax-based measures are required to capture such soft contamination (Jiang et al., 2024, Spiesberger et al., 12 Feb 2026).
- False positives due to generic content: Short n-grams are likely to be matched by chance, especially in high-volume web text, inflating the contamination estimate with semantically unrelated passages.
- Lack of context sensitivity: Mere token sequence overlap does not distinguish between semantic alignment and coincidental reuse (e.g., "bank account" versus "river bank").
- Threshold arbitrariness: Different choices of $n$ or contamination percentage threshold $\tau$ can substantially alter the fraction of test items labeled contaminated, with little impact on actual measured generalization (Jiang et al., 2024).
- Ground-truth blindness: N-gram rules ignore leakage of gold outputs (answers/labels), which can be memorized and significantly boost apparent model capability even when inputs do not overlap textually.
Consequently, n-gram-based statistics provide an upper bound on surface-level leakage yet do not guarantee detection of all harmful forms of memorization.
5. Empirical Findings and Benchmark Analyses
Systematic large-scale contamination audits reveal variable but often substantial test overlap in public corpora:
| Benchmark | Corpus | Dirty Rate (n-gram, typical n) | Reference |
|---|---|---|---|
| ZebraLogic | Olmo3 | 49.5% (n=13, exact 13-gram) | (Spiesberger et al., 12 Feb 2026) |
| CodeForces | Olmo3 | 77.5% (semantic + exact) | (Spiesberger et al., 12 Feb 2026) |
| SQuAD | DCLM-baseline | 40.1% (n=50 char, dirty if overlap rate ≥ 0.8) | (Xu et al., 13 Jun 2025) |
| MMLU | DCLM-baseline | 27.7% | (Xu et al., 13 Jun 2025) |
| MBPP | Olmo3 | 100% (semantic, no exact) | (Spiesberger et al., 12 Feb 2026) |
Infini-gram mini found that reading-comprehension and commonsense benchmarks (e.g., SQuAD, ARC) regularly exceed 30–40% dirty rate in large, web-scale corpora; knowledge-reasoning and code benchmarks exhibit variable rates (Xu et al., 13 Jun 2025).
Multiple studies report that removing large fractions of entries flagged by n-gram overlap changes downstream model performance only minimally, which suggests that these filters have low specificity for actual memorization (Jiang et al., 2024). High dirty rates nonetheless raise strong concerns about inflated benchmark scores and underline the need for de-duplication and for more resilient benchmarks.
6. Methodological Extensions and Alternatives
A clear trend is the development of complementary or alternative contamination definitions:
- Semantic contamination (a.k.a. soft contamination): Entries are flagged as contaminated if a semantic encoding (e.g., via sentence embeddings) achieves cosine similarity above a chosen threshold with any pre-training example (Spiesberger et al., 12 Feb 2026).
- Substring edit distance: Certain studies replace n-gram overlap with substring-level Levenshtein similarity or AST k-gram metrics, especially for code benchmarks, to capture near-duplicates (Riddell et al., 2024).
- Performance-grounded metrics (ConTAM): The contamination threshold is selected not by surface heuristics, but by the empirical increase in model accuracy when contaminated examples are included, providing an effect-size-guided method for threshold optimization (Singh et al., 2024).
These methods preserve interpretability while improving robustness to superficial text alterations and aligning contamination designations more closely with observed model behavior.
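The performance-grounded idea can be sketched as a threshold search over estimated performance gain (EPG). The sketch below is a simplified reading of the description above, not the ConTAM implementation: it assumes EPG is accuracy on the full benchmark minus accuracy on the subset scored below the threshold, and it takes per-example `(contamination_score, is_correct)` pairs as hypothetical input.

```python
from typing import List, Optional, Sequence, Tuple

Record = Tuple[float, bool]  # (contamination score, model answered correctly)

def accuracy(outcomes: Sequence[bool]) -> float:
    return sum(outcomes) / len(outcomes)

def estimated_performance_gain(records: List[Record], threshold: float) -> Optional[float]:
    """Accuracy on all examples minus accuracy on examples scored below threshold."""
    clean = [ok for score, ok in records if score < threshold]
    if not records or not clean:
        return None  # undefined when there is no clean subset to compare against
    return accuracy([ok for _, ok in records]) - accuracy(clean)

def select_threshold(records: List[Record], candidates: Sequence[float]) -> Optional[float]:
    """First candidate threshold with the highest defined EPG."""
    best_gain, best_threshold = None, None
    for t in candidates:
        gain = estimated_performance_gain(records, t)
        if gain is not None and (best_gain is None or gain > best_gain):
            best_gain, best_threshold = gain, t
    return best_threshold

records = [(0.0, False), (0.1, True), (0.2, False), (0.6, True), (0.9, True), (1.0, True)]
chosen = select_threshold(records, [0.15, 0.5, 0.95])
```

In this toy data the model is correct on every example scoring 0.6 or higher, so a threshold of 0.5 separates the subsets most sharply and yields the largest gain. A practical analysis would also check that the subsets are large enough for the accuracy difference to be meaningful.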
7. Best Practices and Future Directions
Current best-practice recommendations to ensure rigorous contamination analysis include:
- Prefer the longest-match score (i.e., the normalized length of the longest matching substring) as a primary metric, with n-gram union as a sanity check (Singh et al., 2024).
- Hyperparameters: use n=8 tokens, a frequency threshold of 1 (no n-grams discarded), and a skip budget of zero.
- Empirically determine contamination thresholds by profiling the actual performance gain for flagged items (ConTAM), not by arbitrary or inherited convention.
- Manually review flagged/unflagged samples near the chosen threshold.
- Supplement surface-level overlap with semantic or editing-robust similarity to capture paraphrased leaks (Spiesberger et al., 12 Feb 2026).
- Proactively de-duplicate or sanitize both training and evaluation sets using scalable substring search (e.g., FM-indexing) when working with large web-scale corpora (Xu et al., 13 Jun 2025).
These combined approaches form the current methodological backbone for contamination detection and reporting in LLM pretraining and evaluation.