
Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings (1811.01136v2)

Published 3 Nov 2018 in cs.CL, cs.AI, and cs.LG

Abstract: Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC mining task and the UN reconstruction task by more than 10 F1 and 30 precision points, respectively. Filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.

Authors (2)
  1. Mikel Artetxe (52 papers)
  2. Holger Schwenk (35 papers)
Citations (192)

Summary

Margin-based Parallel Corpus Mining Using Multilingual Sentence Embeddings

Artetxe and Schwenk propose a novel approach to mining parallel corpora that combines multilingual sentence embeddings with margin-based scoring. The central problem is the sensitivity of machine translation models to training-data quality. The authors argue that classical methods relying on a hard cosine-similarity threshold are insufficient because of scale inconsistencies inherent in that measure, which can lead to suboptimal retrieval of parallel sentence pairs.

Methodology Overview

The proposed technique diverges from conventional nearest-neighbor retrieval by considering the margin between a given sentence pair and its closest competitors. This margin-based approach addresses the limitations of cosine similarity, particularly its susceptibility to scale variations across sentences: some sentences are uniformly similar to many candidates, so a single absolute threshold over- or under-selects. In practical terms, the margin score is the difference (or ratio) between the similarity of a candidate sentence pair and the average similarity of each sentence's k nearest neighbors.
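The ratio variant of this idea can be sketched as follows. This is a minimal dense-matrix illustration, not the paper's implementation (which relies on approximate nearest-neighbor search to scale to large corpora); the function name and the brute-force similarity computation are assumptions for clarity.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin scoring sketch: cos(x, y) divided by the mean
    cosine similarity of each side's k nearest neighbors."""
    # Normalize rows so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # pairwise cosine similarities

    # Average similarity of each sentence's k nearest neighbors
    # on the other side of the corpus.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source row
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target column

    # Ratio margin: pairs whose similarity barely exceeds their
    # neighborhood average are penalized, correcting for sentences
    # that are close to everything.
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2)
```

With orthogonal toy embeddings, the true pairs receive the highest margins while spurious neighbors are suppressed, which is exactly the scale correction the method is after.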

Experimental Results

The method was tested across several benchmarks: the BUCC mining task, UN corpus reconstruction, and filtering for machine translation purposes using the ParaCrawl corpus.

  • BUCC Mining Task: Artetxe and Schwenk report performance gains exceeding existing methods by upwards of 10 F1 points across multiple language pairs, underscoring the robustness and scalability of their margin-based scoring approach.
  • UN Corpus Reconstruction: The margin-based scoring demonstrated precision improvements, achieving more than 80% precision in reconstructing sentence pairs, significantly surpassing previous approaches.
  • ParaCrawl Filtering for NMT: When used to filter the English-German ParaCrawl corpus, the proposed method reaches 31.2 BLEU on newstest2014, more than one BLEU point above the best official filtered version.

Implications and Future Directions

This research holds significant implications for the extraction and alignment of parallel sentences, offering an enhanced method for creating high-quality training datasets for machine translation models. Such a technique is crucial in contexts where large volumes of unstructured multilingual data are available, providing superior filtering capabilities for downstream translation applications.

Theoretically, the margin-based approach could spur further experimentation with sentence-embedding methods, encouraging exploration beyond hard similarity thresholds. Practically, the methodology strengthens multilingual NLP pipelines and could improve machine translation systems across a wider array of languages.

Future developments could involve deepening the exploration of margin-based scoring in various NLP tasks or integrating this method with broader cross-lingual systems. Additionally, extensions to the multilingual encoder to effectively handle languages with limited parallel data may pave the way for more inclusive computational linguistic models, addressing gaps in lesser-studied linguistic domains.

Conclusion

Artetxe and Schwenk's contribution through margin-based parallel corpus mining using multilingual sentence embeddings sets a notable precedent in NLP research, particularly within the machine translation sphere. Through rigorous evaluations and substantial performance improvements reported across critical benchmarks, the paper validates the efficacy of considering margin metrics over traditional similarity measures, offering a potent alternative in the quest for optimizing parallel data extraction.