Develop comprehensive methods for tabular contamination detection beyond row-level deduplication

Develop comprehensive contamination detection and decontamination methodologies, beyond row-level deduplication, for tabular datasets used to train and evaluate foundation models. Such methods must account for column-name variations, multi-source duplication, and task-level leakage that can make evaluation tasks solvable through memorized associations rather than genuine tabular reasoning.
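
A minimal sketch of one possible direction follows, assuming tables are stored as CSV files: a content-based fingerprint that ignores column names, column order, and row order, so renamed or reshuffled copies of the same table collide. The function names (table_fingerprint, find_cross_source_duplicates) are illustrative, not from the paper.

```python
import csv
import hashlib

def table_fingerprint(csv_path: str) -> str:
    """Content-based fingerprint that ignores column names, column order,
    and row order, so renamed or shuffled copies of a table still collide."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return hashlib.sha256(b"").hexdigest()
    header, body = rows[0], rows[1:]
    # Represent each column by the sorted multiset of its normalized values,
    # discarding the column name entirely.
    columns = []
    for j in range(len(header)):
        values = sorted(row[j].strip().lower() for row in body if j < len(row))
        columns.append("\x1f".join(values))
    # Sort the columns themselves so column order cannot affect the hash.
    canonical = "\x1e".join(sorted(columns))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_cross_source_duplicates(csv_paths: list[str]) -> dict[str, list[str]]:
    """Group tables (possibly from different sources) that share a fingerprint."""
    groups: dict[str, list[str]] = {}
    for path in csv_paths:
        groups.setdefault(table_fingerprint(path), []).append(path)
    return {fp: paths for fp, paths in groups.items() if len(paths) > 1}
```

Exact fingerprints of this kind only catch tables whose content is identical after normalization; near-duplicates and partial overlaps would require approximate techniques such as the MinHash/LSH sketch further below.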

Background

The paper demonstrates multiple contamination modes in tabular language model (TLM) training corpora, including complete overlap, direct label exposure across duplicated tables, and task-level leakage, and shows that standard row-level deduplication fails to detect or mitigate these issues. Given the vast and heterogeneous nature of tabular corpora, robust detection methods are necessary to ensure valid evaluation.
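
To make the label-exposure mode concrete, the hedged sketch below (using pandas; the helper names and the assumption that the evaluation target column is known are illustrative, not the authors' method) flags evaluation rows whose full feature tuple already co-occurs with its label in some training row, even when the training copy uses different column names or column order.

```python
import hashlib
import pandas as pd

def row_signature(values) -> str:
    """Order-insensitive hash of a row's cell values (column names ignored)."""
    canon = "\x1f".join(sorted(str(v).strip().lower() for v in values))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def label_exposure_rate(eval_df: pd.DataFrame, target_col: str,
                        train_tables: list[pd.DataFrame]) -> float:
    """Fraction of evaluation rows whose feature values co-occur with their
    label somewhere in the training corpus, regardless of schema."""
    # Index every training row by the signature of each "all-but-one-value"
    # subset, paired with the held-out value (a candidate leaked label).
    seen: set[tuple[str, str]] = set()
    for df in train_tables:
        for _, row in df.iterrows():
            vals = [str(v).strip().lower() for v in row.values]
            for i, held_out in enumerate(vals):
                rest = vals[:i] + vals[i + 1:]
                seen.add((row_signature(rest), held_out))
    hits = 0
    for _, row in eval_df.iterrows():
        label = str(row[target_col]).strip().lower()
        feats = row.drop(labels=[target_col]).values
        if (row_signature(feats), label) in seen:
            hits += 1
    return hits / len(eval_df) if len(eval_df) else 0.0
```

Note that this only catches exposure where the complete feature tuple and label appear together in a single training row; partial overlaps and task-level leakage through related tables would need additional machinery.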

The authors note the absence of established best practices and emphasize that exhaustive search of large tabular corpora is impractical, underscoring the need for principled, scalable alternatives to current ad hoc approaches.
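
One scalable alternative, shown here purely as an illustrative sketch rather than a method from the paper, is to compute MinHash signatures over each table's normalized cell values and use banded locality-sensitive hashing to surface candidate duplicate pairs without comparing every pair of tables in the corpus.

```python
import hashlib
import random
from collections import defaultdict

NUM_PERM, BANDS = 128, 32            # 128 hash functions, 32 bands of 4 each
random.seed(0)
_SALTS = [random.getrandbits(32) for _ in range(NUM_PERM)]

def minhash(values: set[str]) -> list[int]:
    """MinHash signature of a table's set of normalized cell values."""
    if not values:
        return [0] * NUM_PERM
    sig = []
    for salt in _SALTS:
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{v}".encode(), digest_size=8).digest(),
                "big")
            for v in values))
    return sig

def lsh_candidate_pairs(signatures: dict[str, list[int]]) -> set[tuple[str, str]]:
    """Banded LSH: tables that agree on any band become candidate duplicates,
    avoiding exhaustive pairwise comparison over the corpus."""
    rows_per_band = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        for b in range(BANDS):
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[(b, band)].append(name)
    pairs = set()
    for names in buckets.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                pairs.add(tuple(sorted((names[i], names[j]))))
    return pairs
```

Candidate pairs produced this way would still need verification with an exact or column-level comparison, since LSH only proposes likely matches.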

References

Contamination detection for tabular data lacks established best practices and row-level deduplication is insufficient (as we demonstrated), but comprehensive alternatives remain an open problem.

The Illusion of Generalization: Re-examining Tabular Language Model Evaluation (2602.04031 - Gorla et al., 3 Feb 2026) in Section: Limitations