Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models (2204.08110v4)

Published 17 Apr 2022 in cs.CL

Abstract: English pretrained LLMs, which make up the backbone of many modern NLP systems, require huge amounts of unlabeled training data. These models are generally presented as being trained only on English text but have been found to transfer surprisingly well to other languages. We investigate this phenomenon and find that common English pretraining corpora actually contain significant amounts of non-English text: even when less than 1% of data is not English (well within the error rate of strong language classifiers), this leads to hundreds of millions of foreign language tokens in large-scale datasets. We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them, with target language performance strongly correlated to the amount of in-language data seen during pretraining. In light of these findings, we argue that no model is truly monolingual when pretrained at scale, which should be considered when evaluating cross-lingual transfer.

Analysis of Language Contamination in English Pretrained Models and Their Cross-Lingual Capabilities

The paper "Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models" offers a comprehensive examination of English-pretrained LLMs and their surprising aptitude for cross-lingual tasks, despite being predominantly trained on English data. It documents the presence of non-English text within English training corpora, often referred to as "language contamination," and explains its role in enhancing the cross-lingual transfer abilities of these models.

Key Findings

  1. Language Contamination Detection: Contrary to the assumption that English-pretrained models see only English text, the authors demonstrate significant quantities of non-English data within these corpora. Even when it constitutes less than 1% of a dataset (within the error threshold of strong language classifiers), this share translates to hundreds of millions of tokens. Such contamination derives primarily from datasets accumulated via web crawling; a sketch of this estimation step appears after the list.
  2. Impact on Cross-Lingual Transfer: Through empirical evaluation, the paper establishes that the cross-lingual performance of models is strongly tied to the volume of target language data encountered during pretraining. This challenges the earlier assumption that monolingual models act as true zero-shot learners for unseen languages; instead, the leaked, albeit small, amounts of non-English text provide substantial cross-lingual signal (a toy correlation example also follows the list).
  3. Performance Benchmarks: The evaluation covered masked language modeling and part-of-speech (POS) tagging across 50 languages. Interestingly, English-trained models such as T5 surpassed mBERT on POS tagging for several languages, illustrating the efficacy of even unintended multilingual exposure.
  4. Observed Language Composition: Automatic classification uncovered non-English tokens across the English pretraining datasets, with web-crawled data showing higher contamination rates. Manually curated sources, exemplified by Wikipedia, contained less non-English text than automatically filtered web data.
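
As a concrete illustration of the detection step in item 1, the sketch below estimates the non-English share of a corpus with an off-the-shelf language identifier (fastText's lid.176 model). This is a minimal sketch rather than the paper's exact pipeline: the file path, line-level granularity, and whitespace token counting are illustrative assumptions.

```python
# Minimal sketch: estimate the non-English fraction of a pretraining corpus
# shard with an off-the-shelf language identifier. The file path and
# line-level granularity are illustrative assumptions, not the paper's pipeline.
import fasttext  # pip install fasttext; lid.176.bin from fasttext.cc

model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

token_counts = {}  # language code -> whitespace-token count
with open("corpus_shard.txt", encoding="utf-8") as f:
    for line in f:
        text = line.strip()
        if not text:
            continue
        labels, _probs = model.predict(text)  # e.g. ('__label__en',), [0.98]
        lang = labels[0].replace("__label__", "")
        token_counts[lang] = token_counts.get(lang, 0) + len(text.split())

total = sum(token_counts.values())
non_english = total - token_counts.get("en", 0)
print(f"non-English share: {non_english / total:.4%} ({non_english} of {total} tokens)")
```

Even a rate below 1% in a web-scale pretraining corpus corresponds to hundreds of millions of non-English tokens, which is the core observation behind finding 1.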

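To illustrate the correlation claim in item 2, this sketch relates (log-scaled) in-language pretraining token counts to target-language scores such as POS tagging accuracy. The token counts and accuracies below are placeholder values, not figures from the paper.

```python
# Minimal sketch: correlate in-language pretraining data volume with
# target-language performance. All numbers are hypothetical placeholders.
import math
from scipy.stats import spearmanr

pretrain_tokens = {"fr": 3.2e8, "de": 2.9e8, "ru": 1.1e8, "ja": 4.0e7, "sw": 5.0e6}
pos_accuracy = {"fr": 0.78, "de": 0.74, "ru": 0.61, "ja": 0.49, "sw": 0.41}

langs = sorted(pretrain_tokens)
x = [math.log10(pretrain_tokens[lang]) for lang in langs]  # log token counts
y = [pos_accuracy[lang] for lang in langs]                 # downstream scores

rho, p_value = spearmanr(x, y)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A strong positive rank correlation in such an analysis is what supports the claim that "monolingual" models quietly benefit from the non-English data they saw during pretraining.
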
Practical and Theoretical Implications

Practically, the paper prompts reconsideration of the "monolingual" pretraining narrative, encouraging more transparency and precision in model evaluation, especially for cross-lingual applications. Theoretically, the findings point to latent multilingual ability in models commonly labeled as monolingual, calling attention to the influence of even modest multilingual exposure.

Future Directions

Future research may benefit from explicitly tailoring pretraining methods to harness and optimize the cross-lingual capabilities induced by such contamination, focusing on aligning tokenizer design with multilingual processing needs. Additionally, attempts to systematically control and verify the language composition of training datasets could lead to more refined and effective models.

Conclusion

This investigation demonstrates that non-English data within English pretraining corpora contributes considerably to cross-lingual model competence. This observation runs counter to prior beliefs about zero-shot transfer, indicating that even minor multilingual exposure during pretraining can heavily influence model performance in other languages. Understanding and leveraging these dynamics could promote further advances in natural language processing across the world's languages.

Authors (2)
  1. Terra Blevins (20 papers)
  2. Luke Zettlemoyer (225 papers)
Citations (75)