A New Massive Multilingual Dataset for High-Performance Language Technologies (2403.14009v1)

Published 20 Mar 2024 in cs.CL

Abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Authors (13)
  1. Ona de Gibert
  2. Graeme Nail
  3. Nikolay Arefyev
  4. Marta Bañón
  5. Jelmer van der Linde
  6. Shaoxiong Ji
  7. Jaume Zaragoza-Bernabeu
  8. Mikko Aulamo
  9. Gema Ramírez-Sánchez
  10. Andrey Kutuzov
  11. Sampo Pyysalo
  12. Stephan Oepen
  13. Jörg Tiedemann

Summary

Overview of the HPLT Language Resources

The High Performance Language Technologies (HPLT) project introduces a new dataset for language modeling and machine translation (MT) training, one of the largest publicly available multilingual text corpora. The dataset includes both monolingual and parallel corpora extracted from the web, leveraging web crawls produced by the Internet Archive and CommonCrawl. The project also releases a suite of open-source tools and models built around the dataset to facilitate processing and application of the resources.

Dataset Composition

The HPLT language resources encompass:

  • MonoHPLT: A monolingual dataset covering 75 languages with approximately 5.6 trillion word tokens, de-duplicated at the document level. This part of the dataset emphasizes low- to medium-resourced languages.
  • BiHPLT: An English-centric parallel dataset covering 18 language pairs and more than 96 million aligned sentence pairs, amounting to roughly 1.4 billion English tokens.
  • MultiHPLT: Synthetic datasets created by pivoting the parallel datasets through English, covering 171 language pairs with 157 million sentence pairs; a minimal pivoting sketch follows this list.
  • The project also releases 22 machine translation models used for bilingual document alignment and 9 Bicleaner AI models for sentence pair scoring.
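The MultiHPLT pivoting step can be pictured as a join of two English-centric corpora on their shared English side. The sketch below is a minimal illustration under that assumption, not the actual HPLT pipeline: the function name and in-memory representation are hypothetical, and a production version would normalize the English text and operate at corpus scale rather than on Python lists.

```python
from collections import defaultdict

def pivot_through_english(en_xx, en_yy):
    """Create synthetic xx-yy sentence pairs by joining two
    English-centric corpora on their shared English sentence.

    en_xx, en_yy: iterables of (english, other) sentence tuples.
    """
    # Index the first corpus by its English side.
    by_english = defaultdict(list)
    for en, xx in en_xx:
        by_english[en].append(xx)

    # Join the second corpus against that index.
    pairs = []
    for en, yy in en_yy:
        for xx in by_english.get(en, []):
            pairs.append((xx, yy))
    return pairs

# Toy usage with two tiny English-centric corpora.
en_de = [("Hello world.", "Hallo Welt.")]
en_fr = [("Hello world.", "Bonjour le monde.")]
print(pivot_through_english(en_de, en_fr))
# -> [('Hallo Welt.', 'Bonjour le monde.')]
```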

Highlights and Contributions

The HPLT language resources present several notable contributions:

  1. Extensive Language Coverage: The dataset significantly contributes to the diversity of languages available for language technology development, particularly enhancing resources for low-resourced languages.
  2. Massive Scale: With trillions of word tokens across the monolingual datasets and hundreds of millions of aligned sentence pairs across the parallel corpora, the data volume places the release among the largest ever made publicly available.
  3. Open Tools: Accompanying the dataset, the project releases a range of tools for managing, downloading, and processing large web-crawled corpora, enabling researchers to extend or replicate the dataset compilation process; a streaming sketch follows this list.
  4. Innovative Use of Web Crawls: The dataset incorporates previously unused web crawls from the Internet Archive, providing new text resources that were not available in other web-derived corpora.
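As a concrete picture of what processing such a release involves, the sketch below streams documents out of a zstd-compressed JSONL shard and applies a simple length filter. This is an assumption-laden illustration, not HPLT's own tooling: the shard name is hypothetical, and the 'text' field name is a guess at the schema.

```python
import io
import json
import zstandard as zstd  # third-party: pip install zstandard

def iter_documents(path, min_chars=200):
    """Stream JSON documents from a zstd-compressed JSONL shard,
    yielding only those whose text passes a minimum-length filter.
    The 'text' field name is an assumed schema, not the official one."""
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            doc = json.loads(line)
            if len(doc.get("text", "")) >= min_chars:
                yield doc

# Hypothetical shard name; actual HPLT file names may differ.
# for doc in iter_documents("shard_00.jsonl.zst"):
#     handle(doc)
```

Streaming the decompressed bytes rather than unpacking the whole shard keeps memory use flat, which matters for multi-gigabyte web-crawl files.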

Practical and Theoretical Implications

The availability of the HPLT language resources under a permissive CC0 license opens several avenues for research and development:

  • Training and Evaluation of LLMs: The sheer scale and diversity of the monolingual datasets offer a robust foundation for training LLMs, particularly in incorporating and evaluating low-resourced languages.
  • Advancements in MT: The parallel corpus, especially when considered alongside the synthetic datasets, presents significant resources for training and improving machine translation models across a wide range of language pairs; a score-based filtering sketch follows this list.
  • Research in Data Compilation Techniques: The methodology applied in assembling the datasets, from web crawls to dataset processing and tooling, provides a valuable blueprint for future efforts in compiling large-scale language resources.
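For the MT use case, a typical first step is to threshold the parallel data on a cleanliness score such as the one produced by Bicleaner AI models. The sketch below assumes a tab-separated file of source, target, and score columns; that column layout and the 0.5 threshold are illustrative assumptions, not the official release format.

```python
import csv

def filter_parallel(tsv_path, out_path, threshold=0.5):
    """Keep sentence pairs whose cleanliness score meets a threshold.

    Assumes tab-separated rows of (source, target, score); this layout
    is an assumption for illustration, not the official release format."""
    kept = 0
    with open(tsv_path, encoding="utf-8", newline="") as fin, \
         open(out_path, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for src, tgt, score in reader:
            if float(score) >= threshold:
                writer.writerow([src, tgt])
                kept += 1
    return kept
```

Raising the threshold trades recall for precision: fewer pairs survive, but the survivors are more likely to be genuine translations.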

Future Directions

While the current release of the HPLT language resources marks a significant milestone, future developments are anticipated to expand language coverage further, enhance the dataset with more granular metadata, and extend tools for even more efficient processing. Additionally, the project aims to contribute models and training pipelines, enriching the ecosystem around the dataset.

Concluding Remarks

The HPLT language resources demonstrate the potential of leveraging web-derived data to create extensive, diverse, and accessible datasets for language technology research and development. By making these resources publicly available, the project not only facilitates immediate advancements in language modeling and machine translation but also sets the stage for future innovations in the field.