N-gram Contamination in Language Models

Updated 4 April 2026
  • N-gram-based contamination definitions are formal criteria that detect overlaps between evaluation benchmarks and pre-training corpora by measuring contiguous token sequences.
  • These methods use metrics like n-gram precision, Jaccard index, and longest substring matching to quantify contamination and guide threshold-based classification.
  • While offering simplicity and scalability, n-gram techniques face limitations such as sensitivity to paraphrasing, prompting the use of complementary semantic approaches.

N-gram-based contamination definitions refer to a class of formal criteria for identifying when entries from evaluation sets (benchmarks) are present verbatim or near-verbatim in the pre-training corpus of LLMs. Such overlap directly threatens the validity of performance claims for modern LMs by confounding generalization with memorization. Although the field has moved toward more sophisticated, semantic-aware metrics, n-gram-based methods remain foundational for contamination auditing due to their formal simplicity, scalability, and interpretability.

1. Formal Definitions of N-gram-based Contamination

The core construct of n-gram-based contamination is the set of all contiguous subsequences of length $n$ (n-grams) extracted from a token sequence. Formally, for a sequence $X = (t_1, ..., t_{|X|})$ and integer $n$, the n-grams are

$$\mathrm{NG}(X, n) = \{ (t_i, ..., t_{i+n-1}) : 1 \leq i \leq |X| - n + 1 \}$$

Given an evaluation example $E$ and corpus document $d$, two standard overlap metrics are:

  • N-gram Precision:

$$\mathrm{prec}_n(d, E) = \frac{|\mathrm{NG}(d, n) \cap \mathrm{NG}(E, n)|}{|\mathrm{NG}(E, n)|}$$

  • Jaccard Index:

$$\mathrm{Jacc}_n(d, E) = \frac{|\mathrm{NG}(d, n) \cap \mathrm{NG}(E, n)|}{|\mathrm{NG}(d, n) \cup \mathrm{NG}(E, n)|}$$

Entry $E$ is typically labeled "contaminated" with respect to $d$ if the overlap metric exceeds a chosen threshold $\tau$.
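As a minimal sketch of the two metrics above, the following Python computes n-gram precision and the Jaccard index over whitespace-split tokens. The function names, the whitespace tokenizer, and the toy sentences are illustrative assumptions, not part of any cited pipeline.

```python
from typing import Sequence, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> Set[Ngram]:
    """Return NG(X, n): the set of contiguous length-n token tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_precision(doc: Sequence[str], example: Sequence[str], n: int) -> float:
    """Fraction of the example's n-grams that also occur in the document."""
    ng_e = ngrams(example, n)
    if not ng_e:  # example shorter than n: no n-grams to match
        return 0.0
    return len(ngrams(doc, n) & ng_e) / len(ng_e)


def jaccard(doc: Sequence[str], example: Sequence[str], n: int) -> float:
    """Size of the n-gram intersection divided by the size of the union."""
    ng_d, ng_e = ngrams(doc, n), ngrams(example, n)
    union = ng_d | ng_e
    return len(ng_d & ng_e) / len(union) if union else 0.0


# Toy data (assumed for illustration only).
doc = "the quick brown fox jumps over the lazy dog".split()
example = "quick brown fox jumps over a fence".split()

print(ngram_precision(doc, example, 3))  # 3 of the example's 5 trigrams match
print(jaccard(doc, example, 3))          # 3 shared trigrams out of 9 distinct
```

Precision is normalized by the evaluation example only, so it does not shrink as the document grows; the Jaccard index penalizes long documents, which is one reason precision-style scores are more common when matching short benchmark entries against long web pages.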

Several prominent formalizations have been adopted in practice:

  • Direct n-gram overlap: Contaminated if $E$ shares at least one n-gram with the corpus, i.e., $\mathrm{NG}(d, n) \cap \mathrm{NG}(E, n) \neq \emptyset$ for some document $d$.
  • PaLM criterion: $n = 8$ with a 70% threshold; $E$ is contaminated if at least 70% of its 8-grams are found in any corpus document.
  • Llama 2 criterion: for a minimum n-gram length $n$ (at least 10 tokens), slide a window across $E$, labeling tokens as contaminated if they belong to any matching $n$-gram; define the contamination percentage per sample and use thresholds to classify each sample as Clean, Not Clean, Not Dirty, or Dirty (Jiang et al., 2024).
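The difference between an n-gram-level criterion (PaLM-style) and a token-level criterion (Llama 2-style) can be sketched as follows. This is an illustrative approximation only: the helper names are hypothetical, the corpus is reduced to a precomputed set of n-grams, and the toy example uses $n = 3$ rather than the longer n-grams used in practice.

```python
from typing import Sequence, Set, Tuple

Ngram = Tuple[str, ...]


def ngram_set(tokens: Sequence[str], n: int) -> Set[Ngram]:
    """All contiguous length-n token tuples of a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def fraction_of_ngrams_found(example: Sequence[str],
                             corpus_ngrams: Set[Ngram], n: int) -> float:
    """n-gram-level score: share of the example's n-grams seen in the corpus."""
    grams = ngram_set(example, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in corpus_ngrams) / len(grams)


def tokenwise_contamination(example: Sequence[str],
                            corpus_ngrams: Set[Ngram], n: int) -> float:
    """Token-level score: share of tokens covered by at least one matching n-gram."""
    if not example:
        return 0.0
    flagged = [False] * len(example)
    for i in range(len(example) - n + 1):
        if tuple(example[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):  # mark every token inside the matching window
                flagged[j] = True
    return sum(flagged) / len(example)


# Toy data (assumed for illustration only).
corpus_doc = "a b c d e f g h".split()
example = "a b c d x y".split()
n = 3
corpus_ngrams = ngram_set(corpus_doc, n)

frac = fraction_of_ngrams_found(example, corpus_ngrams, n)  # 2 of 4 trigrams match
tok = tokenwise_contamination(example, corpus_ngrams, n)    # tokens a, b, c, d are covered
print(frac, tok)
```

The two scores diverge because overlapping windows cover more tokens than they contribute n-grams: here half of the trigrams match, yet two thirds of the tokens are flagged. Any classification threshold applied to either score is left to the caller.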

Another common instantiation is the per-entry overlap rate:

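$$\eta(E) = \frac{|\{ g \in \mathrm{NG}(E, n) : f_D(g) > 0 \}|}{|\mathrm{NG}(E, n)|}$$

where $\mathrm{NG}(E, n)$ is the set of n-grams in $E$ and $f_D(g)$ is the frequency of $g$ in the corpus $D$ (Xu et al., 13 Jun 2025).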
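One plausible reading of the per-entry overlap rate is the fraction of an entry's n-grams whose corpus frequency is nonzero. The sketch below follows that reading under stated assumptions: the corpus is small enough to count n-grams in memory with a `Counter`, and the `min_count` parameter is a hypothetical knob added to illustrate a frequency threshold.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_corpus_ngrams(corpus_docs: Iterable[Sequence[str]], n: int) -> Counter:
    """Count how often each n-gram occurs across all corpus documents."""
    counts: Counter = Counter()
    for doc in corpus_docs:
        for i in range(len(doc) - n + 1):
            counts[tuple(doc[i:i + n])] += 1
    return counts


def overlap_rate(example: Sequence[str], counts: Counter, n: int,
                 min_count: int = 1) -> float:
    """Share of the example's distinct n-grams seen at least min_count times."""
    grams = {tuple(example[i:i + n]) for i in range(len(example) - n + 1)}
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if counts[g] >= min_count)
    return hits / len(grams)


# Toy data (assumed for illustration only).
corpus = ["a b c d".split(), "b c d e".split()]
counts = count_corpus_ngrams(corpus, 2)
example = "a b c z".split()

print(overlap_rate(example, counts, 2))               # (a,b) and (b,c) are present
print(overlap_rate(example, counts, 2, min_count=2))  # only (b,c) occurs twice
```

Raising `min_count` lowers the score for the same entry, which mirrors the trade-off between suppressing noise and missing rarely repeated leaks.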

2. Practical Contamination Metrics and Classification

Pipelines for measuring n-gram contamination in large benchmarks and corpora typically implement:

Metric | Definition | Usage Example
Union (“match”) | Fraction of n-grams in $E$ that match anywhere in the corpus | (Singh et al., 2024)
Tokenwise (“chunk”) | Fraction of tokens in $E$ in any matching n-gram | Llama 2 contamination percentage
Longest-substring | $|s^*| / |E|$, with $s^*$ the longest substring of $E$ (in tokens) found in the corpus | (Singh et al., 2024), recommended by ConTAM
Character n-gram | As above, but over raw bytes or characters | Infini-gram mini (Xu et al., 13 Jun 2025)
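The longest-substring metric can be sketched with a standard longest-common-substring dynamic program over tokens. This is a small-scale illustration with hypothetical function names; real pipelines would use an index rather than comparing the entry against every document.

```python
from typing import Iterable, Sequence


def longest_common_run(a: Sequence[str], b: Sequence[str]) -> int:
    """Length (in tokens) of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best


def longest_match_score(example: Sequence[str],
                        corpus_docs: Iterable[Sequence[str]]) -> float:
    """Longest shared run with any document, normalized by the example length."""
    if not example:
        return 0.0
    longest = max((longest_common_run(example, d) for d in corpus_docs), default=0)
    return longest / len(example)


# Toy data (assumed for illustration only).
example = "the cat sat on the mat".split()
corpus = ["a cat sat on a chair".split(), "the mat is red".split()]

print(longest_match_score(example, corpus))  # "cat sat on" covers 3 of 6 tokens
```

Unlike the union and tokenwise scores, this score does not add up scattered short matches, so several unrelated fragments cannot combine into a high value.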

Classification is made by applying thresholds to these scores. For example:

  • Infini-gram mini: $n = 50$ characters; entries are classified as Clean, Suspicious, or Dirty according to their overlap rate, with Dirty requiring an overlap rate of at least 0.8 (Xu et al., 13 Jun 2025).
  • Llama 2: Clean if the contamination percentage is below 20%, Dirty if it is 80% or above (Jiang et al., 2024).
  • ConTAM: Thresholds empirically chosen per model/benchmark via maximizing estimated performance gain (EPG), not fixed a priori (Singh et al., 2024).
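A performance-grounded threshold search of the kind attributed to ConTAM can be sketched as below. The sketch assumes one simple reading of estimated performance gain: accuracy on the full benchmark minus accuracy on the subset whose contamination score falls below the threshold. The function names, the strict `<` comparison, and the toy data are assumptions for illustration, not the published procedure.

```python
from typing import List, Optional, Sequence, Tuple


def accuracy(correct: Sequence[int]) -> float:
    """Mean of 0/1 correctness flags."""
    return sum(correct) / len(correct)


def estimated_performance_gain(scores: Sequence[float], correct: Sequence[int],
                               threshold: float) -> Optional[float]:
    """Accuracy on all items minus accuracy on items scored below the threshold."""
    clean = [c for s, c in zip(scores, correct) if s < threshold]
    if not clean:  # no clean subset to compare against
        return None
    return accuracy(correct) - accuracy(clean)


def best_threshold(scores: Sequence[float], correct: Sequence[int],
                   candidates: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return the (threshold, gain) pair with the largest gain; earliest wins ties."""
    best: Optional[Tuple[float, float]] = None
    for t in candidates:
        gain = estimated_performance_gain(scores, correct, t)
        if gain is not None and (best is None or gain > best[1]):
            best = (t, gain)
    return best


# Toy data (assumed): per-item contamination scores and model correctness.
scores: List[float] = [0.0, 0.1, 0.3, 0.6, 0.9, 1.0]
correct: List[int] = [0, 1, 0, 1, 1, 1]

result = best_threshold(scores, correct, [0.2, 0.5, 0.8])
print(result)  # the 0.5 threshold yields the largest gain on this toy data
```

Because the gain depends on the model's answers, the selected threshold can differ between models evaluated on the same benchmark, which is the sense in which such thresholds are not fixed a priori.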

3. Algorithmic Approaches and Hyperparameter Sensitivity

Computing n-gram overlap at internet scale relies on compressed, index-based search. Infini-gram mini uses FM-indexes to allow substring search on petabyte-scale corpora, extracting overlapping n-grams with fine stride for sensitivity (Xu et al., 13 Jun 2025).
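The idea of index-based substring search can be illustrated with a plain suffix array, which is swapped in here for the FM-index because it is much simpler to write down. The sketch is uncompressed and only suitable for small strings; the function names and the stride parameter are illustrative assumptions.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text: str, sa: List[int], pattern: str, upper: bool) -> int:
    """Binary search over suffixes compared on their first len(pattern) characters."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text: str, sa: List[int], pattern: str) -> int:
    """Number of times pattern occurs in text, using two binary searches."""
    if not pattern:
        return 0
    return _bound(text, sa, pattern, True) - _bound(text, sa, pattern, False)


def char_overlap_rate(entry: str, text: str, sa: List[int],
                      n: int, stride: int = 1) -> float:
    """Share of the entry's length-n character windows that occur in text."""
    windows = [entry[i:i + n] for i in range(0, len(entry) - n + 1, stride)]
    if not windows:
        return 0.0
    return sum(1 for w in windows if count_occurrences(text, sa, w) > 0) / len(windows)


# Toy corpus (assumed for illustration only).
text = "the cat sat on the mat. the cat ran."
sa = build_suffix_array(text)

print(count_occurrences(text, sa, "the cat"))          # occurs twice
print(char_overlap_rate("cat sat on", text, sa, n=5))  # every window is in the corpus
```

Each query costs a logarithmic number of string comparisons regardless of how many windows are extracted, which is what makes fine strides affordable; an FM-index offers the same kind of query over a compressed representation.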

Key hyperparameters affecting both sensitivity and specificity:

  • n-gram length ($n$): Small $n$ increases recall but raises false positives (spurious matches); large $n$ ensures precision but induces false negatives due to minor edits or paraphrases.
  • Frequency threshold: Imposing a minimum corpus frequency (i.e., discarding rare n-grams) can suppress noise but increases false negatives (Singh et al., 2024).
  • Skip budget: Allowing mismatches (e.g., token substitutions) along the substring increases resilience to minor edits but shows marginal practical value (Singh et al., 2024).

An empirically robust configuration is $n = 8$ tokens, a lenient frequency threshold, and a skip budget of zero; longer n-grams or stricter frequency thresholds systematically reduce the measured contamination rate but risk missing "memorized" examples.
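The sensitivity to n-gram length can be seen on a toy case in which a single token of a leaked example has been edited. The sketch below, with hypothetical names and synthetic tokens, sweeps $n$ and reports the fraction of the edited example's n-grams still found in the original document.

```python
from typing import Sequence, Set, Tuple


def ngram_set(tokens: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous length-n token tuples of a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def fraction_found(example: Sequence[str], doc: Sequence[str], n: int) -> float:
    """Share of the example's n-grams that also occur in the document."""
    ex = ngram_set(example, n)
    return len(ex & ngram_set(doc, n)) / len(ex) if ex else 0.0


# Synthetic 12-token document and a copy with one token changed (assumed data).
doc = [f"t{i}" for i in range(12)]
edited = list(doc)
edited[6] = "EDIT"

rates = {n: fraction_found(edited, doc, n) for n in (2, 4, 8)}
for n, rate in rates.items():
    print(n, round(rate, 3))
```

With 12 tokens and the edit in the middle, every 8-gram window contains the changed token, so the score drops to zero at $n = 8$ even though 11 of 12 tokens are unchanged. On longer entries the effect is less extreme, but the direction is the same: each edit invalidates up to $n$ windows.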

4. Limitations and Failure Modes

N-gram-based contamination metrics exhibit several well-documented limitations:

  • Susceptibility to paraphrase: Any non-exact rewording evades detection, leading to high false negative rates. Embedding-based or syntax-based measures are required to capture such soft contamination (Jiang et al., 2024, Spiesberger et al., 12 Feb 2026).
  • False positives due to generic content: Short n-grams are likely to be matched by chance, especially in high-volume web text, inflating the contamination estimate with semantically unrelated passages.
  • Lack of context sensitivity: Mere token sequence overlap does not distinguish between semantic alignment and coincidental reuse (e.g., "bank account" versus "river bank").
  • Threshold arbitrariness: Different choices of $n$ or of the contamination percentage threshold $\tau$ can substantially alter the fraction of test items labeled contaminated, with little impact on actual measured generalization (Jiang et al., 2024).
  • Ground-truth blindness: N-gram rules ignore leakage of gold outputs (answers/labels), which can be memorized and significantly boost apparent model capability even when inputs do not overlap textually.

Consequently, n-gram-based statistics capture only surface-level leakage and do not guarantee detection of all harmful forms of memorization.

5. Empirical Findings and Benchmark Analyses

Systematic large-scale contamination audits reveal variable but often substantial test overlap in public corpora:

Benchmark | Corpus | Dirty Rate (n-gram, typical n) | Reference
ZebraLogic | Olmo3 | 49.5% (n=13, exact 13-gram) | (Spiesberger et al., 12 Feb 2026)
CodeForces | Olmo3 | 77.5% (semantic + exact) | (Spiesberger et al., 12 Feb 2026)
SQuAD | DCLM-baseline | 40.1% (n=50 char, dirty $\geq$ 0.8) | (Xu et al., 13 Jun 2025)
MMLU | DCLM-baseline | 27.7% | (Xu et al., 13 Jun 2025)
MBPP | Olmo3 | 100% (semantic, no exact) | (Spiesberger et al., 12 Feb 2026)

Infini-gram mini found that reading-comprehension and commonsense benchmarks (e.g., SQuAD, ARC) regularly exceed 30–40% dirty rate in large, web-scale corpora; knowledge-reasoning and code benchmarks exhibit variable rates (Xu et al., 13 Jun 2025).

Multiple studies report that filtering large fractions of entries flagged by n-gram overlap alters downstream model performance only minimally, evidencing the low specificity of these filters for actual memorization (Jiang et al., 2024). However, high dirty rates raise strong concerns about the inflation of benchmark scores in evaluations, emphasizing the need for de-duplication and the development of more resilient benchmarks.

6. Methodological Extensions and Alternatives

A clear trend is the development of complementary or alternative contamination definitions:

  • Semantic contamination (a.k.a. soft contamination): Entries are flagged as contaminated if a semantic encoding (e.g., via sentence embeddings) achieves cosine similarity above a chosen threshold with any pre-training example (Spiesberger et al., 12 Feb 2026).
  • Substring edit distance: Certain studies replace n-gram overlap with substring-level Levenshtein similarity or AST k-gram metrics, especially for code benchmarks, to capture near-duplicates (Riddell et al., 2024).
  • Performance-grounded metrics (ConTAM): The contamination threshold is selected not by surface heuristics, but by the empirical increase in model accuracy when contaminated examples are included, providing an effect-size-guided method for threshold optimization (Singh et al., 2024).

These methods preserve interpretability while improving robustness to superficial text alterations and aligning contamination designations more closely with observed model behavior.
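Two of the alternatives above reduce to simple similarity functions, sketched here in plain Python. The cosine function assumes embeddings have already been computed by some encoder (none is specified here), and the edit similarity is a token-level normalized Levenshtein score; the names and toy inputs are illustrative assumptions.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two precomputed embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of token insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def edit_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """1 minus the edit distance normalized by the longer sequence length."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


# Toy code snippets that differ only in variable names (assumed data).
original = "def add ( a , b ) : return a + b".split()
renamed = "def add ( x , y ) : return x + y".split()

print(levenshtein(original, renamed))      # four renamed tokens
print(edit_similarity(original, renamed))  # two thirds of the tokens align
```

The renamed snippet shares no 8-gram with the original, so an exact 8-gram filter would report zero overlap, while the edit similarity still indicates a near-duplicate. Any flagging threshold on either score remains a choice for the analyst.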

7. Best Practices and Future Directions

Current best-practice recommendations to ensure rigorous contamination analysis include:

  • Prefer the longest-match score (i.e., the normalized length of the longest matching substring) as a primary metric, with n-gram union as a sanity check (Singh et al., 2024).
  • Hyperparameters: use $n = 8$ tokens, a lenient frequency filter, and a skip budget of zero.
  • Empirically determine contamination thresholds by profiling the actual performance gain for flagged items (ConTAM), not by arbitrary or inherited convention.
  • Manually review flagged/unflagged samples near the chosen threshold.
  • Supplement surface-level overlap with semantic or editing-robust similarity to capture paraphrased leaks (Spiesberger et al., 12 Feb 2026).
  • Proactively de-duplicate or sanitize both training and evaluation sets using scalable substring search (e.g., FM-indexing) when working with large web-scale corpora (Xu et al., 13 Jun 2025).

These combined approaches form the current methodological backbone for contamination detection and reporting in LLM pretraining and evaluation.
