Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity (2402.02633v1)

Published 4 Feb 2024 in cs.CL and cs.LG

Abstract: Fine-tuning and testing a multilingual LLM is expensive and challenging for low-resource languages (LRLs). While previous studies have predicted the performance of NLP tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors: the size of the fine-tuning corpus, the domain similarity between fine-tuning and testing corpora, and the language similarity between source and target languages. We employ classical regression models to assess how these factors impact the model's performance. Our results indicate that domain similarity has the most critical impact on predicting the performance of Machine Translation models.
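As a rough illustration of the approach the abstract describes, the sketch below fits a classical regression model over the three factors to predict an MT quality score. This is a minimal, hypothetical example using scikit-learn's LinearRegression: the feature encodings (log corpus size, similarity scores in [0, 1]) and every numeric value are invented for illustration, and the paper's actual features, data, and choice of regressor may differ.

```python
# Minimal sketch (not the authors' code): predicting an MT quality score
# from the three factors studied in the paper with a classical regression
# model. All feature values and scores below are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [log fine-tuning corpus size, domain similarity, language similarity]
X = np.array([
    [np.log(5_000),  0.82, 0.61],
    [np.log(20_000), 0.35, 0.61],
    [np.log(50_000), 0.90, 0.44],
    [np.log(8_000),  0.12, 0.30],
    [np.log(40_000), 0.75, 0.55],
])
# Hypothetical translation-quality scores for the corresponding fine-tuned models
y = np.array([18.4, 9.7, 24.1, 4.2, 21.0])

model = LinearRegression().fit(X, y)

# The fitted coefficients indicate each factor's influence on predicted quality
print("coefficients:", dict(zip(
    ["log corpus size", "domain similarity", "language similarity"],
    model.coef_.round(3),
)))
# Predict performance for an unseen (corpus size, domain, language) setting
print("predicted score:",
      model.predict([[np.log(10_000), 0.60, 0.50]]).round(2))
```

An ordinary least-squares fit like this only shows how the three factors enter as predictors; inspecting the relative coefficient magnitudes is one simple way to ask which factor, such as domain similarity, matters most.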

Authors (9)
  1. Eric Khiu (3 papers)
  2. Hasti Toossi (2 papers)
  3. David Anugraha (9 papers)
  4. Jinyu Liu (32 papers)
  5. Jiaxu Li (20 papers)
  6. Juan Armando Parra Flores (1 paper)
  7. Leandro Acros Roman (1 paper)
  8. A. Seza Doğruöz (20 papers)
  9. En-Shiun Annie Lee (17 papers)
Citations (2)