Introduction
Cross-lingual transfer learning is a powerful tool for extending applications to low-resource languages without labeled data. Though multilingual LLMs facilitate zero-shot learning, their performance on granular tasks like Named Entity Recognition (NER) or Event Argument Extraction (EAE) is generally subpar compared to supervised models fine-tuned on labeled data. Traditional label projection techniques translate labeled data from a resource-rich language and align it to the low-resource language data. Alignment-based methods preserve translation quality but lag in accuracy, while marker-based methods lead in accuracy but compromise translation fidelity by injecting markers before translation.
Constrained Decoding for Label Projection
A novel approach, Constraint Decoding for Cross-lingual Label Projection (CODEC), addresses quality degradation while maintaining the accuracy benefits of marker-based projection. CODEC translates training data sans markers before inserting them in a second pass via a custom decoding algorithm. This maintains high translation quality, vital for accurate projection. CODEC's constrained decoding algorithm ensures that only likely marker positions and hypotheses with the correct number of labels are considered.
Experimental Results
CODEC was evaluated across 20 languages for NER and EAE tasks. It showed marked improvements over state-of-the-art methods. For example, in the NER task, CODEC outperformed the established EasyProject method by a wide margin, with particularly significant uplifts in underperforming languages like chiShona. In the EAE task, CODEC demonstrated notable success, outperforming alignment-based projection methods across Arabic and Chinese, showcasing the efficacy of its constrained decoding strategy.
Algorithm Efficiency
Further efficiency is achieved by approximating the multi-projection issue and pruning unlikely marker positions to expedite decoding. CODEC operates effectively even with large sequence lengths and numerous labeled spans. Its design is optimized to remove branches early in the search and uses heuristics to aggressively reduce decoding time with minimal performance impact.
Conclusion
Prompted by multilingual label projection’s challenges, CODEC presents a versatile, constraint-based decoding methodology that balances translation quality with the precision of label projection. It improves cross-lingual tasks by translating without markers initially, preserving text integrity, and performing constrained decoding to insert markers later. Experiments show that it outperforms existing methods, setting a new bar for cross-lingual label projection. The paper’s insights and proposed technique, CODEC, embody a significant advancement in multilingual NLP, potentially revolutionizing data augmentation for low-resource languages.