CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (1911.04944v2)

Published 10 Nov 2019 in cs.CL

Abstract: We show that margin-based bitext mining in a multilingual sentence space can be applied to monolingual corpora of billions of sentences. We are using ten snapshots of a curated common crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences. Using one unified approach for 38 languages, we were able to mine 4.5 billion parallel sentences, out of which 661 million are aligned with English. 20 language pairs have more than 30 million parallel sentences, 112 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages. To evaluate the quality of the mined bitexts, we train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Using our mined bitexts only and no human translated parallel data, we achieve a new state-of-the-art for a single system on the WMT'19 test set for translation between English and German, Russian and Chinese, as well as German/French. In particular, our English/German system outperforms the best single one by close to 4 BLEU points and is almost on par with the best WMT'19 evaluation system, which uses system combination and back-translation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2019 Workshop on Asian Translation (WAT).

Mining High-Quality Parallel Sentences from Monolingual Corpora: The CCMatrix Approach

The paper "CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB" represents a significant advancement in the field of multilingual NLP by demonstrating the scalability of margin-based bitext mining using a multilingual sentence space. This approach is applied to a corpus of monolingual data comprising billions of sentences, resulting in the extraction of a substantial number of parallel sentences across multiple languages.

Methodology

The authors leverage margin-based bitext mining within a multilingual sentence embedding space to identify parallel sentences from vast monolingual corpora. This technique is built upon the LASER (Language-Agnostic SEntence Representations) toolkit, which facilitates the creation of multilingual sentence embeddings. The embeddings are trained in a joint space that allows for the semantic proximity of translations across different languages.
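The intuition behind the joint space can be sketched numerically. Below is a minimal illustration with toy 3-dimensional vectors standing in for real LASER embeddings (which are 1024-dimensional and produced by the LASER encoder, not shown here); the specific numbers are invented for illustration only.

```python
import numpy as np

# Toy stand-ins for LASER sentence embeddings: real LASER vectors are
# 1024-dimensional; these 3-d vectors are illustrative only.
emb_en = np.array([0.9, 0.1, 0.4])    # "The cat sleeps."
emb_fr = np.array([0.85, 0.15, 0.45])  # "Le chat dort." (translation)
emb_de = np.array([0.1, 0.9, 0.2])    # an unrelated German sentence

def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In a well-trained joint space, a sentence and its translation lie
# closer together than unrelated sentence pairs.
print(cos_sim(emb_en, emb_fr) > cos_sim(emb_en, emb_de))
```

This proximity property is what makes nearest-neighbor search in the joint space a viable mining signal.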

A distinguishing facet of their methodology is the margin criterion employed to measure the similarity between sentence embeddings across languages, an approach that mitigates the inconsistencies observed with absolute cosine distance thresholds. Rather than thresholding raw cosine similarity, each candidate pair is scored relative to the average similarity of each sentence to its k nearest neighbors in the other language, which makes the detected alignments robust to regions of the embedding space that are globally denser or sparser.
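The ratio-margin criterion used in LASER-based mining (following Artetxe and Schwenk) can be sketched as follows. This is a minimal NumPy version, not the authors' production code; the toy corpora and the choice of k are assumptions for illustration.

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def margin_score(x, y, cands_x, cands_y, k=4):
    """Ratio margin: cosine of (x, y) divided by the average cosine of
    x and y to their k nearest neighbours in the other language.
    cands_x are candidate embeddings in y's language, cands_y in x's."""
    nn_x = sorted((cos_sim(x, c) for c in cands_x), reverse=True)[:k]
    nn_y = sorted((cos_sim(y, c) for c in cands_y), reverse=True)[:k]
    return cos_sim(x, y) / ((sum(nn_x) + sum(nn_y)) / (2 * k))

# Toy corpora: target embedding i is a noisy copy of source embedding i,
# simulating translation pairs in a shared space.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(8, 32))
tgt_emb = src_emb + 0.05 * rng.normal(size=(8, 32))

aligned = margin_score(src_emb[0], tgt_emb[0], tgt_emb, src_emb)
mismatch = margin_score(src_emb[0], tgt_emb[1], tgt_emb, src_emb)
print(aligned > mismatch)  # the true translation pair scores higher
```

Because the score is a ratio against local neighborhood density, a single global threshold on the margin transfers across language pairs better than a threshold on raw cosine similarity.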

A significant innovation of this work is the deployment of a global mining strategy. Unlike conventional methods that rely on a hierarchical approach entailing preliminary document alignment followed by sentence-level parallelism checks, this work compares all sentences of each monolingual corpus against all sentences of the other languages directly. This circumvents assumptions about document structure and greatly expands the pool of candidate parallel data.
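A toy version of global mining might look like the following brute-force sketch. The real pipeline relies on approximate nearest-neighbor indexes (FAISS) to scale to billions of sentences, whereas this O(n·m) version only illustrates the logic; the threshold value and corpus sizes are assumptions.

```python
import numpy as np

def mine_bitext(src, tgt, k=4, threshold=1.05):
    """Brute-force global mining sketch: score every (src, tgt) pair with
    the ratio margin and keep each source sentence's best target if its
    margin clears the threshold. Production systems replace the full
    pairwise matrix with approximate k-NN search (FAISS)."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T  # all pairwise cosine similarities
    # average similarity to the k nearest neighbours, in both directions
    nn_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # per source row
    nn_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # per target col
    margins = sims / ((nn_src[:, None] + nn_tgt[None, :]) / 2)
    pairs = []
    for i in range(src.shape[0]):
        j = int(np.argmax(margins[i]))
        if margins[i, j] >= threshold:
            pairs.append((i, j))
    return pairs

# Toy corpora: the first three target sentences are noisy copies of the
# first three source sentences; the rest are unrelated.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(6, 32))
tgt_emb = np.vstack([src_emb[:3] + 0.05 * rng.normal(size=(3, 32)),
                     rng.normal(size=(4, 32))])
pairs = mine_bitext(src_emb, tgt_emb)
print(pairs)  # should recover the planted pairs (0,0), (1,1), (2,2)
```

No document alignment is assumed anywhere: every sentence competes against every other, which is exactly what allows mining across corpora with no shared document structure.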

Data and Results

Using multilingual data from ten snapshots of the Common Crawl corpus, comprising 32.7 billion unique sentences in 38 languages, the authors mined 4.5 billion parallel sentences, of which 661 million are aligned with English. Notably, 20 language pairs yield more than 30 million parallel sentences each, demonstrating strong coverage even for pairs that do not involve English.

The CCMatrix dataset uniquely covers many non-English language pairs directly, reflecting the authors' emphasis on supporting multilingual interaction beyond English-centric communication. This makes it especially valuable for less-resourced languages and for direct translation between language pairs that are traditionally pivoted through English.

Evaluation

The quality of the mined bitexts was assessed by training Neural Machine Translation (NMT) systems and benchmarking them on established test sets from TED, WMT, and WAT. Notably, NMT models trained solely on CCMatrix data outperformed state-of-the-art single systems from WMT'19 for several language pairs, with an improvement of close to 4 BLEU points for English-to-German translation. The results also surpassed the best WAT'19 submission for the Russian–Japanese pair, highlighting the dataset's utility for distant and structurally diverse languages.

Implications and Future Directions

The CCMatrix corpus introduces a paradigm shift in bitext mining, demonstrating the feasibility of deriving high-quality multilingual resources from vast web-based monolingual corpora. This approach significantly expands the potential for constructing comprehensive multilingual datasets, enabling improved machine translation capabilities across diverse language pairs.

Given these outcomes, future research may investigate the integration of CCMatrix data with other multilingual resources to further enhance translation systems. Additionally, extending the mining approach to additional languages and refining methods to handle even more extensive corpora are promising endeavors that could further enhance the richness and applicability of multilingual data resources in NLP.

This comprehensive research opens new pathways for leveraging web-crawled monolingual data to strengthen resources for underrepresented languages and facilitate advances in language technology, with global linguistic inclusivity in mind.

Authors (5)
  1. Holger Schwenk (35 papers)
  2. Guillaume Wenzek (12 papers)
  3. Sergey Edunov (26 papers)
  4. Edouard Grave (56 papers)
  5. Armand Joulin (81 papers)
Citations (235)