Word Alignment by Fine-tuning Embeddings on Parallel Corpora
The paper, "Word Alignment by Fine-tuning Embeddings on Parallel Corpora," proposes an innovative methodology for word alignment tasks leveraging both pre-trained LLMs (LMs) and parallel corpora. This paper suggests a middle ground between traditional unsupervised learning methods on parallel texts and more recent approaches utilizing contextualized word embeddings from multilingually trained LMs, such as BERT (Devlin et al. 2019) and its multilingual counterparts (Conneau and Lample, 2019).
Approach and Methodologies
The primary contribution of this research is the proposed model, AWESoME (Aligning Word Embedding Spaces of Multilingual Encoders), which significantly enhances word alignment performance. By fine-tuning pre-trained multilingual LMs on parallel text, the authors aim to improve alignment quality using specific alignment-oriented training objectives. The major objectives incorporated in the fine-tuning process include:
- Self-training Objective (SO): This objective refines alignments by treating the model's own predictions as supervision, encouraging words that align across parallel sentences to have closely positioned embeddings (see the sketch after this list).
- Parallel Sentence Identification (PSI): This contrastive objective improves the model by distinguishing parallel from non-parallel sentence pairs, fostering semantic proximity of parallel sentences at the sentence representation level.
- Consistency Optimization (CO): This objective encourages agreement between forward (source-to-target) and backward (target-to-source) alignments, making the predicted alignments more symmetric.
- Translation Language Modeling (TLM): By training on concatenated source and target sentences, the model captures cross-lingual syntactic and semantic correspondences that are crucial for high-quality word alignment.
- Masked Language Modeling (MLM): Although less influential, MLM helps adapt the model to the domain of the task dataset during fine-tuning.
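To make the self-training idea concrete, below is a minimal PyTorch-style sketch rather than the authors' implementation: given contextual embeddings of a sentence pair, the model's own confident bidirectional alignment probabilities serve as pseudo-labels, and the loss pushes those probabilities higher. The function name, the dot-product similarity, and the threshold value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def self_training_loss(src_emb, tgt_emb, threshold=1e-3):
    """Illustrative self-training-style alignment loss (not the paper's exact form).

    src_emb: (m, d) contextual embeddings of the source sentence
    tgt_emb: (n, d) contextual embeddings of the target sentence
    """
    sim = src_emb @ tgt_emb.T          # (m, n) word-to-word similarity matrix
    p_fwd = F.softmax(sim, dim=-1)     # source-to-target alignment probabilities
    p_bwd = F.softmax(sim, dim=0)      # target-to-source alignment probabilities

    # Pseudo-labels: pairs the current model already aligns in both directions.
    with torch.no_grad():
        pseudo = ((p_fwd > threshold) & (p_bwd > threshold)).float()

    # Pull the probabilities of those self-predicted alignments upward.
    eps = 1e-9
    loss = -(pseudo * (torch.log(p_fwd + eps) + torch.log(p_bwd + eps))).sum()
    return loss / pseudo.sum().clamp(min=1.0)
```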
The model extracts word alignments from the contextualized embeddings using two main methods: probability thresholding (with softmax or its sparse variant, α-entmax) and optimal transport. The paper shows that probability thresholding, particularly with softmax, outperforms optimal transport in both alignment quality and computational efficiency.
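The following is a sketch of the probability-thresholding extraction, assuming a simple dot-product similarity between contextual embeddings; the function name and threshold value are placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def extract_alignments(src_emb, tgt_emb, threshold=1e-3):
    """Extract word alignments by bidirectional softmax thresholding.

    src_emb: (m, d) and tgt_emb: (n, d) contextual embeddings of a sentence pair.
    Returns (source_index, target_index) pairs that survive the threshold
    in both the source-to-target and target-to-source directions.
    """
    sim = src_emb @ tgt_emb.T          # (m, n) similarity matrix
    p_fwd = F.softmax(sim, dim=-1)     # normalize over target positions
    p_bwd = F.softmax(sim, dim=0)      # normalize over source positions

    keep = (p_fwd > threshold) & (p_bwd > threshold)
    return [(int(i), int(j)) for i, j in torch.nonzero(keep)]
```

In practice the embeddings would come from a particular hidden layer of the fine-tuned multilingual encoder, and subword-level alignments are merged back to the word level; those details are omitted in this sketch.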
Experimental Evaluation
The authors conducted extensive experiments on five language pairs: German-English, French-English, Romanian-English, Japanese-English, and Chinese-English. The AWESoME model consistently outperformed previous state-of-the-art aligners in these settings. Importantly, it also showed robust zero-shot performance, aligning language pairs it was never fine-tuned on, which makes it deployable without parallel training data for the target pair.
Moreover, the paper explored the model's effectiveness in semi-supervised settings, showing that incorporating even limited supervised alignment signals significantly improves alignment quality. The authors also tested the model on cross-lingual annotation projection, such as projecting Named Entity Recognition (NER) labels, demonstrating its practical value in real-world NLP applications (a simplified sketch follows).
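As a rough illustration of annotation projection, and not the paper's exact procedure: source-side NER labels can be copied onto target tokens along the predicted alignment links, with unaligned target tokens defaulting to "O". The helper below is hypothetical and ignores span-boundary handling.

```python
def project_ner_labels(src_labels, alignments, tgt_len):
    """Copy source NER labels onto target tokens along word-alignment links.

    src_labels: per-token labels on the source side, e.g. ["B-LOC", "I-LOC", "O"]
    alignments: iterable of (src_idx, tgt_idx) pairs from the word aligner
    tgt_len:    number of tokens on the target side
    """
    tgt_labels = ["O"] * tgt_len          # unaligned target tokens stay "O"
    for src_idx, tgt_idx in alignments:
        tgt_labels[tgt_idx] = src_labels[src_idx]
    return tgt_labels

# Example: a two-token location aligned to target positions 1 and 2.
print(project_ner_labels(["B-LOC", "I-LOC", "O"], [(0, 1), (1, 2)], 4))
# -> ['O', 'B-LOC', 'I-LOC', 'O']
```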
Implications and Future Work
The AWESoME model's strong performance and adaptability highlight its potential as a useful tool for NLP tasks reliant on word alignment, such as machine translation, cross-lingual information extraction, and transfer learning of language processing tools. Additionally, the model's zero-shot capabilities expand its applicability to low-resource languages where parallel corpora might not be as readily available.
Future research directions could include exploring more sophisticated training objectives that leverage the growing capabilities of LMs. Analyzing the impact of different underlying LMs on alignment quality and extending the methodology to larger and more diverse sets of language pairs are further avenues for investigation.
In summary, this paper advances the field by combining the representational strength of pre-trained LMs with targeted fine-tuning on parallel corpora, setting a new benchmark for word alignment within cross-lingual NLP research.