Data Contamination Can Cross Language Barriers (2406.13236v2)

Published 19 Jun 2024 in cs.CL and cs.AI

Abstract: The opacity in developing LLMs is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be "not even wrong", as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from https://github.com/ShangDataLab/Deep-Contam.

Cross-Lingual Data Contamination in LLMs

This paper addresses a critical issue in the development of LLMs: the potential contamination of public benchmarks used in model evaluation. Traditional methods of detecting data contamination often rely on identifying overlapping text between training and evaluation datasets to expose potential memorization. However, these strategies may fail to capture more insidious forms of contamination that exploit cross-lingual capabilities, a gap that this research seeks to fill.

Key Contributions

The authors introduce a novel form of contamination, cross-lingual contamination, in which an LLM is overfit on translated versions of benchmark test sets. This artificially inflates the model's performance on the original English benchmarks while evading existing detection methods, which rely largely on n-gram overlap checks.
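
As a concrete illustration of how such contamination could be injected, the sketch below translates a benchmark test split and formats it for causal-LM fine-tuning. It is a minimal sketch only: the MMLU dataset identifier, the off-the-shelf OPUS-MT translator, and the prompt format are illustrative assumptions rather than the authors' exact pipeline (see the linked repository for that).

```python
# Minimal sketch of injecting cross-lingual contamination (illustrative only):
# translate a benchmark's test split and prepare it for fine-tuning, so the
# model memorizes the answers in a non-English language.
from datasets import load_dataset
from transformers import pipeline

# Any MT backend would do; an off-the-shelf OPUS-MT model serves as a stand-in here.
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")

def translate(text: str) -> str:
    return translator(text, max_length=512)[0]["translation_text"]

mmlu_test = load_dataset("cais/mmlu", "all", split="test")

def to_training_text(example):
    # Pair the translated question and choices with the gold answer so that
    # fine-tuning effectively memorizes the benchmark in the target language.
    question = translate(example["question"])
    choices = [translate(c) for c in example["choices"]]
    gold = choices[example["answer"]]
    return {"text": question + "\n" + "\n".join(choices) + "\nAnswer: " + gold}

contaminated_corpus = mmlu_test.map(to_training_text)
# `contaminated_corpus["text"]` would then feed a standard causal-LM fine-tuning
# loop; the contaminated model is afterwards evaluated on the original English set.
```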

To counter this challenge, the paper proposes a generalization-based detection method. It measures how a model's performance changes when the benchmark is altered by replacing each question's incorrect choices with correct answers drawn from other questions. Contaminated models, which rely on memorization rather than understanding, struggle to generalize in this easier setting, because all of the memorized choices now appear correct.
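
A minimal sketch of this benchmark transformation is given below, assuming a list of multiple-choice items with `question`, `choices`, and an integer `answer` index (the field names are assumptions for illustration, not the released implementation).

```python
import random

def build_generalized_set(questions, seed=0):
    """Replace every false option with the correct answer of another question,
    keeping each item's own gold option in place. A model that understands the
    questions should find this version no harder; a contaminated model that
    memorized the original options now sees several 'correct-looking' choices."""
    rng = random.Random(seed)
    gold_answers = [q["choices"][q["answer"]] for q in questions]
    generalized = []
    for i, q in enumerate(questions):
        new_choices = []
        for j, choice in enumerate(q["choices"]):
            if j == q["answer"]:
                new_choices.append(choice)  # keep the true answer
            else:
                k = rng.randrange(len(questions))
                while k == i:               # borrow a gold answer from another item
                    k = rng.randrange(len(questions))
                new_choices.append(gold_answers[k])
        generalized.append({**q, "choices": new_choices})
    return generalized
```

Contamination is then flagged by comparing accuracy on the original and generalized versions of the benchmark: a genuinely capable model maintains or improves its score, while a memorizing model degrades.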

Experimental Validation

The authors experimentally validated the phenomenon across several LLMs, including LLaMA3-8B and Qwen1.5-7B, observing substantial performance inflation on popular benchmarks such as MMLU, ARC-Challenge, and MathQA when the models were trained on test sets translated into seven languages. Notably, traditional detection methods, including shared likelihood and guided prompting, failed to unearth this deep contamination, whereas the proposed generalization-based technique identified it reliably.
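
For concreteness, one common way to obtain the accuracies being compared is likelihood-based multiple-choice scoring, sketched below. The paper's exact evaluation harness may differ, and the prompt format here is an assumption; `model` and `tokenizer` stand for any Hugging Face causal LM and its tokenizer.

```python
import torch

@torch.no_grad()
def choice_logprob(model, tokenizer, question, choice):
    # Score an option by the log-probability the model assigns to its tokens
    # when conditioned on the question (assumes the prompt tokenization is a
    # prefix of the full tokenization, which holds for typical tokenizers).
    prompt = question + "\nAnswer: "
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    logits = model(full_ids).logits[0, :-1]          # position i predicts token i+1
    targets = full_ids[0, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs[torch.arange(len(targets)), targets]
    return token_lp[prompt_len - 1:].sum().item()    # sum only over the answer tokens

def accuracy(model, tokenizer, dataset):
    correct = 0
    for q in dataset:
        scores = [choice_logprob(model, tokenizer, q["question"], c) for c in q["choices"]]
        correct += int(max(range(len(scores)), key=scores.__getitem__) == q["answer"])
    return correct / len(dataset)

# A large drop from accuracy on the original set to accuracy on the generalized
# set is the contamination signal; clean models should not get worse on the
# easier, all-correct-distractor version.
```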

Implications

The identification of cross-lingual contamination highlights vulnerabilities in current LLM evaluation protocols, emphasizing the need for robust detection strategies that go beyond superficial memorization checks. By focusing on a model's generalization capacity, the authors provide a framework that not only detects contamination but also offers insight into the model's learning pathways.

Practical and Theoretical Implications

Practically, this research invites a reevaluation of how benchmarks are used in LLM training and evaluation. Theoretically, it suggests a shift towards understanding knowledge representation within models beyond mere textual surface forms. It also carries implications for the interpretability of multilingual models, indicating that different languages may act as interfaces through which shared knowledge is processed and accessed.

Future Directions

The paper leaves avenues open for future research in extending generalization-based methods to other forms of contamination and exploring the underlying mechanisms that allow cross-lingual contamination to affect multilingual LLMs' capabilities. Further exploration into the role of language as an interface could refine understanding and training methodologies, optimizing model performance across diverse linguistic contexts.

In conclusion, this research underscores the necessity for advanced contamination detection methodologies, fostering both theoretical advancements and practical safeguards in the ongoing development and deployment of LLMs.

Authors (6)
  1. Feng Yao (27 papers)
  2. Yufan Zhuang (16 papers)
  3. Zihao Sun (5 papers)
  4. Sunan Xu (1 paper)
  5. Animesh Kumar (22 papers)
  6. Jingbo Shang (141 papers)
Citations (2)