
The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (2106.03193v1)

Published 6 Jun 2021 in cs.CL and cs.AI

Abstract: One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains. These sentences have been translated in 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.

The Flores-101 Evaluation Benchmark for Machine Translation

The paper introduces the Flores-101 benchmark, a comprehensive evaluation dataset designed to advance research in low-resource and multilingual machine translation (MT). The benchmark comprises 3001 sentences from English Wikipedia, translated into 101 languages by professional translators. Flores-101 aims to provide a high-quality resource covering a wide array of topics and domains, addressing the notable scarcity of evaluation tools for under-resourced languages.

Dataset Overview

All translations in Flores-101 are multilingually aligned: sentence i in one language is the translation of sentence i in every other, which makes the benchmark suitable for evaluating many-to-many translation systems across all 10,100 directions. This allows researchers to compare translation quality across 101 languages, facilitating progress in low-resource MT. Because contiguous sentences are drawn from the same source article and each sentence carries rich metadata (source URL, domain, and topic), the benchmark also supports evaluation tasks beyond sentence-level translation, such as document-level and multimodal translation.
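Because the translations are index-aligned, any language pair can be paired up directly. Below is a minimal sketch of this; the community mirror gsarti/flores_101 on the Hugging Face Hub and its ISO 639-3 language configs are an assumption about hosting, not part of the paper:

```python
# Minimal sketch: pairing aligned FLORES-101 sentences for an arbitrary
# language pair. Assumes the community mirror "gsarti/flores_101" on the
# Hugging Face Hub, whose configs use ISO 639-3 codes ("eng", "swh", ...).
from datasets import load_dataset

src = load_dataset("gsarti/flores_101", "eng", split="devtest")
tgt = load_dataset("gsarti/flores_101", "swh", split="devtest")

# Rows share the same order across language configs, so row i of one
# language is the translation of row i of any other.
pairs = list(zip(src["sentence"], tgt["sentence"]))
print(pairs[0])
```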

Methodology and Dataset Construction

The dataset was constructed in multiple phases. Sentences were first sourced from varied domains, including WikiNews, WikiJunior, and WikiVoyage. Pilot experiments were then used to determine the most effective translation and quality-assurance workflows. The resulting process combined professional translation with checks for translation artifacts, such as copying from machine translation output, and translations that failed quality review were sent back for re-translation, safeguarding the dataset's integrity.
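As an illustration of what such an artifact check might look like, the sketch below flags human translations that are near-verbatim copies of a machine translation baseline. The use of sentence-level chrF and the 95-point threshold are illustrative assumptions, not the paper's exact procedure:

```python
# Illustrative MT-copying check, not the paper's exact pipeline: flag
# human translations with near-verbatim overlap against MT output.
import sacrebleu

def flag_mt_copies(human_translations, mt_outputs, threshold=95.0):
    """Return indices whose sentence-level chrF against the MT output is
    suspiciously high, suggesting the segment may have been copied."""
    flagged = []
    for i, (human, mt) in enumerate(zip(human_translations, mt_outputs)):
        score = sacrebleu.sentence_chrf(human, [mt]).score
        if score >= threshold:  # illustrative cutoff
            flagged.append(i)
    return flagged
```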

Evaluation Methodology

A new metric, SentencePiece BLEU (spBLEU), was proposed to evaluate translation models across all languages without per-language tokenization rules. Instead of language-specific tokenizers, spBLEU segments both hypotheses and references with a single SentencePiece model trained on data covering all Flores-101 languages, then computes standard BLEU over the resulting pieces. This makes the metric uniform and scalable across the diverse scripts and morphologies represented in the benchmark.
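A minimal sketch of this computation follows; the model path is a placeholder, and recent releases of sacrebleu bundle an equivalent tokenizer, so the same score can typically be obtained with tokenize="flores101" directly:

```python
# Sketch of spBLEU: segment hypothesis and reference with a shared
# SentencePiece model, then run standard BLEU with no further tokenization.
import sentencepiece as spm
import sacrebleu

sp = spm.SentencePieceProcessor(model_file="flores101_spm.model")  # placeholder path

def spm_pieces(text):
    return " ".join(sp.encode(text, out_type=str))

def spbleu(hypotheses, references):
    hyp = [spm_pieces(h) for h in hypotheses]
    refs = [[spm_pieces(r) for r in stream] for stream in references]
    return sacrebleu.corpus_bleu(hyp, refs, tokenize="none").score
```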

Experimental Results

The paper evaluates several baseline models on Flores-101, including M2M-124 and OPUS-100. The results highlight the difficulty of translating into low-resource languages, with translation quality varying substantially with the amount of available bitext. They also show that translating into and out of English yields markedly higher quality than translating directly between non-English languages, underscoring the need for stronger many-to-many multilingual translation strategies.
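At this scale, with over 10,000 translation directions, results are usually summarized by direction type. The grouping sketch below is a hypothetical illustration of that reporting convention; the score dictionary and its language codes are not from the paper:

```python
# Hypothetical aggregation of per-direction scores into the three
# direction types discussed above: into English, out of English, and
# between non-English languages.
from statistics import mean

def summarize(scores):
    """scores maps (src_lang, tgt_lang) pairs to spBLEU values."""
    buckets = {"into_en": [], "out_of_en": [], "non_en": []}
    for (src, tgt), score in scores.items():
        if tgt == "en":
            buckets["into_en"].append(score)
        elif src == "en":
            buckets["out_of_en"].append(score)
        else:
            buckets["non_en"].append(score)
    return {name: mean(vals) for name, vals in buckets.items() if vals}
```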

Implications and Future Directions

The release of Flores-101 comes with significant implications for the field of MT. By enabling the evaluation of models on a wide variety of languages and domains, Flores-101 provides a platform for advancing research on regional and global language pairs, including those that are traditionally overlooked. The benchmark's design and the proposed spBLEU metric facilitate consistent and fair comparisons across models, promoting advancements in low-resource language translation.

The dataset is expected to foster significant progress in MT research, encouraging the development of models that can handle the linguistic diversity of the world's languages. Future work may expand Flores-101's language coverage and explore new translation approaches that take fuller advantage of its multilingual alignment.

Flores-101 represents an important step forward in creating comprehensive evaluation resources for MT. By addressing the challenges of low-resource language evaluation, it paves the way for innovations in multilingual translation, ultimately contributing to a more inclusive digital landscape.

Authors (10)
  1. Naman Goyal (37 papers)
  2. Cynthia Gao (9 papers)
  3. Vishrav Chaudhary (45 papers)
  4. Peng-Jen Chen (26 papers)
  5. Guillaume Wenzek (12 papers)
  6. Da Ju (18 papers)
  7. Sanjana Krishnan (2 papers)
  8. Marc'Aurelio Ranzato (53 papers)
  9. Angela Fan (49 papers)
  10. Francisco Guzmán (12 papers)
Citations (481)