Bitext Mining for Low-Resource Languages via Contrastive Learning (2208.11194v1)
Published 23 Aug 2022 in cs.CL
Abstract: Mining high-quality bitexts for low-resource languages is challenging. This paper shows that sentence representations from LLMs fine-tuned with multiple negatives ranking loss, a contrastive objective, help retrieve clean bitexts. Experiments show that parallel data mined with our approach substantially outperform the previous state-of-the-art method on the low-resource languages Khmer and Pashto.
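To make the contrastive objective named in the abstract concrete, here is a minimal sketch of multiple negatives ranking loss in its common formulation: each source sentence is scored against every target sentence in the batch, and the loss is the cross-entropy of picking its own translation. This is an illustration, not the paper's implementation. The function names, the toy 2-dimensional embeddings, and the `scale=20.0` default are assumptions made for the example; the abstract does not state the paper's encoder or hyperparameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(src, tgt, scale=20.0):
    """Mean cross-entropy over a batch of (source, target) embedding pairs.

    src[i] and tgt[i] are assumed to be translations of each other; every
    other tgt[j] in the batch serves as an in-batch negative for src[i].
    `scale` is an assumed temperature-like multiplier on the similarities.
    """
    assert len(src) == len(tgt), "batch must contain aligned pairs"
    total = 0.0
    for i, s in enumerate(src):
        logits = [scale * cosine(s, t) for t in tgt]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the true translation (index i).
        total += log_sum_exp - logits[i]
    return total / len(src)


# Toy embeddings (hypothetical, for illustration only).
src = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [aligned[1], aligned[0]]

loss_aligned = multiple_negatives_ranking_loss(src, aligned)
loss_shuffled = multiple_negatives_ranking_loss(src, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each source embedding is closest to its own translation and large when the pairing is wrong, which is the property that lets nearest-neighbour search over the fine-tuned embeddings surface candidate bitexts.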