Large-Scale Multilingual Dataset Development and Machine Translation Models
The paper introduces a comprehensive multilingual dataset, MADLAD-400, spanning 419 languages and focused on document-level monolingual data derived from CommonCrawl. The dataset was built with rigorous auditing and filtering procedures to address the pervasive noise and content-quality problems associated with web-scale corpora.
Dataset Construction and Auditing
Constructing the dataset entailed mining CommonCrawl with a document-level LangID model covering 498 languages, yielding a raw dataset of 5 trillion tokens. Quality issues at this scale are severe: irrelevant content, misaligned text, and mislabeled languages are widespread. To mitigate these, the authors conducted an extensive manual audit in which reviewers examined per-language samples to flag languages whose data was too noisy or off-target to be usable. They then applied a battery of filters, including document-level refinement and language-specific filter augmentation, to remove irrelevant content, pruning the dataset to 3 trillion clean tokens across 419 languages.
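The audit-then-filter stage can be sketched as a simple pipeline. The classifier below is a toy stand-in (the paper's actual LangID model is a trained classifier over 498 languages), and the confidence and length thresholds are illustrative assumptions, not the paper's values:

```python
def predict_language(text: str) -> tuple[str, float]:
    """Toy stand-in for a document-level LangID classifier.
    The real model in the paper is a trained classifier over 498
    languages; this heuristic exists only to make the sketch runnable."""
    if "bonjour" in text.lower():
        return "fr", 0.95
    return "en", 0.80

def filter_documents(docs, min_confidence=0.9, min_chars=200):
    """Keep documents that pass a length filter and whose predicted
    language is high-confidence, loosely mimicking a document-level
    filtering stage. Thresholds here are illustrative assumptions."""
    kept = []
    for doc in docs:
        if len(doc) < min_chars:
            continue  # drop very short documents (likely boilerplate)
        lang, conf = predict_language(doc)
        if conf >= min_confidence:
            kept.append((lang, doc))
    return kept
```

In a real pipeline, this per-document pass would be followed by the language-specific filters and manual audit decisions the paper describes.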
Multilingual Machine Translation Models
Based on the refined dataset, the authors trained and evaluated multiple machine translation (MT) models of up to 10.7B parameters, trained on 250 billion tokens covering over 450 languages. Evaluated on highly multilingual translation benchmarks, these models proved competitive with models of significantly greater scale. Careful preprocessing allowed the comparatively small models to perform well even on lower-resource languages, underscoring how robust data curation can make smaller model architectures effective.
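Multilingual MT evaluations of this kind typically report character n-gram F-scores such as chrF, which are more robust than word-based metrics for morphologically rich and low-resource languages. The following is a simplified, self-contained approximation of that metric family, not the official sacreBLEU implementation:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a common chrF choice)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in the spirit of chrF.
    Averages n-gram precision/recall for n = 1..max_n, then combines
    them with an F-beta that weights recall (beta=2, as in chrF)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) * 100
```

An identical hypothesis and reference scores 100; entirely disjoint strings score 0. For reportable results, the standard sacreBLEU chrF implementation should be used instead.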
LLMs and Few-Shot Learning
The authors also assessed 8B-parameter LLMs on few-shot translation tasks. Although performance improved with more in-context examples, the LLMs still trailed their supervised counterparts, notably on translation across domains. This reinforces the need for model refinement and better data handling in unsupervised or minimally supervised learning paradigms.
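Few-shot translation evaluation works by prepending k source-target example pairs to the sentence being translated. The template below is an assumption for illustration; the paper's exact prompt format is not specified in this summary:

```python
def build_fewshot_prompt(examples, source_sentence,
                         src_lang="English", tgt_lang="French"):
    """Assemble a k-shot translation prompt from (source, target) pairs.
    The "Lang: text" template is a common convention, assumed here,
    not necessarily the one used in the paper."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The final line is left open for the model to complete.
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)
```

Increasing the number of example pairs is what "more shots" refers to; as the summary notes, gains from additional shots did not close the gap to supervised MT models.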
Evaluation and Challenges
The paper provides a thorough evaluation against benchmarks such as WMT, Flores-200, NTREX, and Gatones to validate the dataset and models. The authors highlight several potential uses of the dataset and examine translation quality across a range of language pairs. Crucially, the research also probes memorization: the models exhibit measurable memorization tendencies, which raises difficult questions for data handling and model design, especially for publicly released datasets.
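One common way to probe memorization, sketched here under assumptions rather than as the paper's exact protocol, is prefix probing: feed the model the opening characters of a training document and check whether its continuation reproduces the true next span verbatim. `model_generate` is a hypothetical callable standing in for a real decoding loop:

```python
def memorization_rate(model_generate, training_docs,
                      prefix_len=50, cont_len=50):
    """Toy prefix-probing memorization test. For each training document,
    prompt the (hypothetical) model with the first `prefix_len` characters
    and count a hit if it emits the next `cont_len` characters verbatim.
    Lengths are illustrative; real studies probe at the token level."""
    hits = 0
    for doc in training_docs:
        if len(doc) < prefix_len + cont_len:
            continue  # document too short to probe
        prefix = doc[:prefix_len]
        true_continuation = doc[prefix_len:prefix_len + cont_len]
        if model_generate(prefix, cont_len) == true_continuation:
            hits += 1
    return hits / max(1, len(training_docs))
```

A nonzero rate on held-out training data signals verbatim memorization, which is one of the concerns the paper raises for publicly released corpora.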
Implications and Future Directions
The outcomes of this dataset and model development have practical implications for the NLP community, particularly for fostering language inclusivity and supporting underrepresented languages in computational systems.
This paper's methodological insights and procedural rigor may stimulate advances in model design aimed at balanced performance across diverse linguistic landscapes. Future work could include enhancing model architectures to better exploit heterogeneous data, mitigating bias in multilingual neural models, and extending curated datasets to additional domains and languages. In essence, such datasets and model capabilities are stepping stones toward a more linguistically inclusive and operationally efficient field of AI.