WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia (1907.05791v2)

Published 10 Jul 2019 in cs.CL

Abstract: We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus of parallel sentences is freely available at https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix. To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.

Essay on "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia"

The paper "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia" presents a robust methodology for extracting parallel sentences across a vast number of language pairs from Wikipedia. The researchers leveraged multilingual sentence embeddings, specifically the LASER toolkit, to mine 135 million parallel sentences across 1620 language pairs without restricting alignments to English.

Methodology and Approach

The researchers employed multilingual sentence embeddings to mine parallel corpora from Wikipedia, and this is the first work to systematically consider all possible language pair combinations across the entire Wikipedia corpus rather than only alignments with English. Using the LASER toolkit, a language-agnostic sentence encoder, they processed around 879 million sentences in 182 languages after removing duplicates and filtering out sentences whose detected language did not match the Wikipedia edition they came from.
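
As an illustration of this step, the snippet below embeds a handful of sentences into LASER's shared space. It is a minimal sketch rather than the authors' pipeline: it assumes the third-party laserembeddings pip package (a wrapper around the released LASER encoder), and the sentences and language codes are purely illustrative.

```python
# Minimal sketch: embedding sentences into LASER's language-agnostic space.
# Assumes the third-party `laserembeddings` package (pip install laserembeddings,
# then `python -m laserembeddings download-models` to fetch the encoder weights).
from laserembeddings import Laser

laser = Laser()

de_sentences = ["Der Eiffelturm steht in Paris.",
                "Berlin ist die Hauptstadt Deutschlands."]
fr_sentences = ["La tour Eiffel se trouve à Paris.",
                "Rome est la capitale de l'Italie."]

# Each call returns an (n_sentences, 1024) NumPy array; embeddings of mutual
# translations end up close together regardless of language.
de_emb = laser.embed_sentences(de_sentences, lang="de")
fr_emb = laser.embed_sentences(fr_sentences, lang="fr")
```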

To mine parallel sentences effectively, the authors adopted a margin-based criterion: a candidate pair is scored by the cosine similarity of its two embeddings, normalized by the average similarity of each sentence to its nearest neighbours in the other language. This relative score was preferred over an absolute similarity threshold because cosine similarities are not directly comparable across language pairs, and the margin yields more consistent alignment quality across diverse languages.
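
The ratio variant of this margin score (following Artetxe and Schwenk's margin-based mining) can be sketched as follows. This is a hypothetical helper for illustration, not the released mining code: the actual system uses FAISS indices to find nearest neighbours at Wikipedia scale, whereas the sketch computes a dense similarity matrix that is only practical for small candidate sets.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores for all source/target candidate pairs.

    score(x, y) = cos(x, y) / (0.5 * (avg cos of x to its k NNs in the target
                                      + avg cos of y to its k NNs in the source))
    """
    # L2-normalise so that dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                        # cos(x, y) for every pair

    k_src = min(k, sim.shape[1])                             # NNs of x among targets
    k_tgt = min(k, sim.shape[0])                             # NNs of y among sources
    nn_src = np.sort(sim, axis=1)[:, -k_src:].mean(axis=1)   # shape (n_src,)
    nn_tgt = np.sort(sim, axis=0)[-k_tgt:, :].mean(axis=0)   # shape (n_tgt,)

    return sim / (0.5 * (nn_src[:, None] + nn_tgt[None, :]))

# Toy usage with random vectors standing in for LASER embeddings.
scores = margin_scores(np.random.randn(5, 1024), np.random.randn(8, 1024))
```

Candidate pairs whose margin score exceeds a tuned threshold are kept as mined parallel sentences; raising the threshold trades recall for precision.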

Numerical Results and Evaluation

The extracted corpus comprises 135 million parallel sentence pairs spanning 85 languages. Notably, only 34 million of these pairs are aligned with English; the large remainder covers language pairs that do not involve English at all, enabling direct translation between distant languages.

To evaluate the quality of the mined data, neural machine translation (NMT) systems were trained solely on the extracted bitexts for 1886 language pairs and then evaluated on the TED corpus. These systems achieved strong BLEU scores for many pairs, including German-English and Czech-French, demonstrating the practical usefulness of the mined data.
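
For the evaluation side, corpus-level BLEU against held-out references is commonly computed with a tool such as sacreBLEU; the summary above does not state which implementation the authors used, and the file names below are hypothetical.

```python
# Sketch: scoring system outputs against references with sacreBLEU.
# `hypotheses.txt` and `references.txt` are hypothetical file names holding one
# detokenised sentence per line (system output and TED reference, respectively).
import sacrebleu

with open("hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("references.txt", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```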

Implications and Future Directions

The implications of this research are significant for several areas of NLP, particularly multilingual machine translation. The availability of such a comprehensive multilingual corpus can substantially improve translation systems, especially for low-resource languages and dialects.

Practically, these findings could lead to more efficient translation systems that translate directly between language pairs without pivoting through English, avoiding the error accumulation of two-step translation. Theoretically, the paper paves the way for further exploration of multilingual embeddings and their use in uncovering linguistic correspondences across languages.

Future research may focus on refining the multilingual embeddings by iteratively retraining them with the mined data to improve performance for low-resource languages. Additionally, applying this mining methodology to other multilingual sources such as Common Crawl could expand the repository of available parallel corpora.

In conclusion, "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia" contributes substantially to the field of multilingual NLP by providing a rich resource that supports the development and enhancement of translation models across a diverse set of languages, including those with limited resources.

Authors (5)
  1. Holger Schwenk (35 papers)
  2. Vishrav Chaudhary (45 papers)
  3. Shuo Sun (91 papers)
  4. Hongyu Gong (44 papers)
  5. Francisco Guzmán (39 papers)
Citations (381)