SimAlign: Word Alignments Without Parallel Data
The paper "SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings" investigates techniques for generating word alignments, which are essential in statistical machine translation (SMT) and useful in neural machine translation (NMT), for example when imposing priors on attention matrices or for tasks like cross-lingual annotation projection. Traditional statistical word aligners such as GIZA++ and eflomal, while effective, depend heavily on parallel training data, and their quality degrades as the amount of available data shrinks.
Key Contributions
The primary innovation of this research is a set of word alignment methods that require no parallel training data. Instead, SimAlign leverages multilingual word embeddings, both static and contextualized, learned solely from monolingual data. This approach is especially promising in low-resource and domain-specific settings where parallel data is scarce or absent.
- Word Embeddings: The paper exploits multilingual embeddings constructed without parallel data or dictionaries. Static word embeddings are generated using fastText, while contextualized variants are derived from models like multilingual BERT (mBERT) and XLM-RoBERTa.
- Methods Utilized:
- Argmax: Aligns a word pair only when each word is the other's most similar candidate in the similarity matrix (a mutual-argmax check); a simple yet effective baseline.
- IterMax: Applies Argmax iteratively, relaxing the mutual-maximum condition in later rounds so that words left unaligned by earlier rounds can still be matched.
- Match: Casts alignment as maximum-weight bipartite matching over the similarity matrix, seeking a globally optimal one-to-one alignment.
- Empirical Results: Evaluations across language pairs (e.g., English-German) show that the proposed methods, built on multilingual pretrained models, outperform strong statistical aligners like eflomal by up to 5 percentage points in F1 score, even when those baselines are trained on abundant parallel data.
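The extraction step shared by these methods can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the paper's implementation: the similarity matrix below is hand-made rather than computed from real embeddings, and the function names (`cosine_sim_matrix`, `argmax_align`, `match_align`) are my own. Argmax is shown as the mutual-argmax intersection, and Match as Hungarian-algorithm bipartite matching; IterMax's iterative refinement is omitted for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_sim_matrix(src_emb, tgt_emb):
    """Cosine similarity between every source/target embedding pair."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return src @ tgt.T

def argmax_align(sim):
    """Keep (i, j) only if j is row i's best column AND i is column j's best row."""
    fwd = {(i, int(j)) for i, j in enumerate(sim.argmax(axis=1))}
    rev = {(int(i), j) for j, i in enumerate(sim.argmax(axis=0))}
    return fwd & rev

def match_align(sim):
    """Maximum-weight bipartite matching over the similarity matrix."""
    rows, cols = linear_sum_assignment(-sim)  # negate: the solver minimizes cost
    return set(zip(rows.tolist(), cols.tolist()))

# Illustrative similarity matrix (rows: source words, columns: target words).
sim = np.array([
    [0.90, 0.10, 0.20],
    [0.15, 0.85, 0.30],
    [0.05, 0.40, 0.75],
])
print(argmax_align(sim))  # {(0, 0), (1, 1), (2, 2)}
print(match_align(sim))   # {(0, 0), (1, 1), (2, 2)}
```

The two methods diverge when the matrix is skewed: if one source word dominates several columns, mutual argmax leaves some words unaligned (higher precision), while Match forces a full one-to-one assignment (higher recall), which matches the precision/recall trade-off the paper reports.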
Implications and Future Directions
The implications of SimAlign are significant. By eliminating the dependency on parallel data, the methods proposed offer a pathway to enhancing machine translation and related tasks in contexts with limited labeled resources. This paves the way for more inclusive and varied language processing applications beyond high-resource languages.
The results suggest potential future directions:
- Exploration of Fertility Models: Investigation into explicit modeling of fertility may improve alignment performance further.
- Integration with Parallel Data: Examining how hybrid approaches that combine parallel data with unsupervised embedding models might enhance results.
Overall, SimAlign marks a significant step forward in word alignment, offering strong precision and recall while adapting to variable-resource environments without relying on parallel data. The accompanying open-source tool (SimAlign) broadens accessibility for further research and practical deployment in natural language processing.