The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks (2505.10507v1)

Published 15 May 2025 in cs.CL

Abstract: Translation-based strategies for cross-lingual transfer (XLT), such as translate-train -- training on noisy target language data translated from the source language -- and translate-test -- evaluating on noisy source language data translated from the target language -- are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.

Summary

The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

Translation-based cross-lingual transfer (XLT) has emerged as a practical approach to adapting multilingual language models (mLMs) across languages, especially for token classification tasks. This paper by Ebing and Glavaš revisits the use of word aligners (WAs) for label projection in translation-based XLT, comparing their performance against newer marker-based approaches. The researchers present a detailed analysis of the design choices involved in applying WAs and introduce a novel ensemble method that combines the translate-train and translate-test strategies, further boosting the efficacy of WAs for XLT on token classification tasks.

The research provides a systematic examination of low-level design decisions pertaining to label projection via WAs. Three main components are evaluated: span mapping algorithms, filtering strategies, and pre-tokenization of translated sentences. The findings indicate that these components significantly impact the effectiveness of WAs, particularly for the translate-test approach. For translate-train, the translation-based XLT performance seems relatively robust to these low-level design decisions. Specifically, language-specific pre-tokenization and optimized filtering strategies result in notable improvements in XLT performance.
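To make the label-projection step concrete, here is a minimal, hypothetical sketch (not the authors' code) of how WA-based projection works: word alignments link source tokens to translated tokens, source span labels are carried across those links, a simple filtering heuristic discards conflicting alignments, and BIO consistency is repaired afterward. The function name, filtering rule, and BIO repair are illustrative assumptions.

```python
# Hypothetical sketch of WA-based label projection (not the paper's exact algorithm).
def project_labels(src_labels, alignments, tgt_len):
    """Project BIO tags from source tokens onto translated tokens.

    src_labels: BIO tags for the source tokens, e.g. ["B-PER", "I-PER", "O"]
    alignments: (src_idx, tgt_idx) word-alignment pairs produced by a WA
    tgt_len:    number of tokens in the translated sentence
    """
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in sorted(alignments, key=lambda p: p[1]):
        label = src_labels[src_idx]
        if label != "O" and tgt_labels[tgt_idx] == "O":
            # Toy filtering heuristic: keep the first label assigned to a
            # target token and drop conflicting alignments.
            tgt_labels[tgt_idx] = label
    # Repair BIO consistency: an I- tag opening a span becomes B-.
    for i, lab in enumerate(tgt_labels):
        if lab.startswith("I-") and (i == 0 or tgt_labels[i - 1][2:] != lab[2:]):
            tgt_labels[i] = "B-" + lab[2:]
    return tgt_labels

src = ["B-PER", "I-PER", "O", "O"]        # e.g. "Angela Merkel spoke today"
align = [(0, 1), (1, 2), (2, 0), (3, 3)]  # word order shifts in the translation
print(project_labels(src, align, 4))       # ['O', 'B-PER', 'I-PER', 'O']
```

The paper's point is that exactly these details, how multi-token spans are mapped, which noisy alignments are filtered, and how the translation is pre-tokenized, drive much of the variance in downstream XLT performance.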

In assessing WAs against marker-based methods for translation-based XLT, the paper demonstrates that, when optimized, WAs achieve performance comparable to, and sometimes surpassing, state-of-the-art marker-based methods such as EasyProject and Codec. The researchers further propose an ensembling method that combines the outputs of translate-train and translate-test. This ensemble not only improves performance but also mitigates sensitivity to WA design decisions, yielding a more robust solution for cross-lingual token classification.
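The ensembling idea can be sketched as follows, under the assumption (not stated in this summary) that both strategies ultimately yield, for the same target-language sentence, a per-token probability distribution over the label set; the weighted averaging and argmax decoding here are illustrative, not the paper's exact combination rule.

```python
# Hypothetical sketch of ensembling translate-train and translate-test
# predictions; the averaging scheme is an assumption, not the paper's method.
def ensemble(p_train, p_test, labels, w=0.5):
    """Average two per-token label distributions and decode the argmax label."""
    decoded = []
    for dist_a, dist_b in zip(p_train, p_test):
        merged = [w * a + (1 - w) * b for a, b in zip(dist_a, dist_b)]
        decoded.append(labels[max(range(len(labels)), key=merged.__getitem__)])
    return decoded

labels = ["O", "B-PER", "I-PER"]
p_train = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]  # translate-train model output
p_test = [[0.4, 0.5, 0.1], [0.3, 0.6, 0.1]]   # back-projected translate-test output
print(ensemble(p_train, p_test, labels))       # ['O', 'B-PER']
```

Because the two strategies make partially independent errors (translation noise enters at training time for one and at inference time for the other), averaging them dampens the effect of any single bad WA design choice, which is the robustness result the paper reports.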

The experiments span diverse datasets covering 29 languages, including low-resource languages, underscoring the broad applicability of the findings. The impact of WA model selection is also examined, with AccAlign outperforming other WAs such as Awesome-align on average. The paper further explores the utility of commercial machine translation (MT) systems like Google Translate, finding modest gains over publicly available models such as NLLB, especially in translate-test applications.

From both practical and theoretical perspectives, this work sheds light on the intricacies of translation-based XLT and the nuances that influence its success. Practically, the proposed ensemble approach, coupled with optimized WA configurations, presents a viable path for enhancing token classification in multilingual contexts. Theoretically, the work informs the ongoing discourse on the balance between translation quality and alignment precision in cross-lingual NLP.

In summary, the research contributes significant insights into the potential of WAs for token-level XLT, challenging assumptions that newer marker-based methods are inherently superior. This invites further exploration into the development of more refined alignment techniques and encourages the community to consider nuanced design factors when working with translation-based cross-lingual models. Future research could explore the integration of these findings into broader multilingual NLP applications, potentially elevating the state-of-the-art in cross-lingual understanding and generation tasks.
