
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages (2311.09828v3)

Published 16 Nov 2023 in cs.CL

Abstract: Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).

AfriMTE and AfriCOMET: Enhancing Machine Translation Metrics for African Languages

The paper "AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages" presents a thorough exploration into adapting the COMET metric to adequately assess machine translation (MT) quality in under-resourced African languages. The work introduces AfriMTE, a dataset dedicated to MT adequacy evaluation for African languages, and develops AfriCOMET, a state-of-the-art evaluation metric specifically designed for these languages.

Methodological Innovations

The paper begins by highlighting the limitations of current MT evaluation metrics such as BLEU, which rely on surface n-gram matching and therefore correlate weakly with human judgments of meaning. Embedding-based learned metrics like COMET correlate better with human judgments, but their application to African languages is hampered by scarce human-rated evaluation data, complex annotation guidelines such as MQM, and the limited language coverage of the multilingual encoders they build on.
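
To make this weakness concrete, here is a small illustration (using the sacrebleu package; the example sentences are invented, not drawn from the paper) of how an n-gram metric can rank a meaning-distorting hypothesis above an adequate paraphrase:

```python
# Requires: pip install sacrebleu
import sacrebleu

reference = ["The children are playing in the yard."]

# An adequate paraphrase that shares few n-grams with the reference.
paraphrase = "The kids are playing outside in the garden."
# A hypothesis that copies n-grams but distorts the meaning.
distorted = "The children are playing in the yard of chaos."

for name, hyp in [("paraphrase", paraphrase), ("n-gram copy", distorted)]:
    score = sacrebleu.sentence_bleu(hyp, reference).score
    print(f"{name}: BLEU = {score:.1f}")
# BLEU rewards surface overlap, so the distorted hypothesis scores higher.
```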

To overcome these hurdles, the researchers created AfriMTE, covering 13 typologically diverse African languages. The dataset was annotated using a simplified Multidimensional Quality Metrics (MQM) framework tailored for non-expert evaluators, with adaptations that harmonize MQM error annotation with Direct Assessment (DA) scoring. This simplification is crucial for obtaining reliable human assessments where expert evaluators are scarce.
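
As a rough illustration of how span-level error annotations can be collapsed into a single DA-style score, consider the following sketch; the severity categories and penalty weights here are illustrative assumptions, not the paper's actual scheme:

```python
# A minimal sketch of mapping annotated error severities to a 0-100
# adequacy score, in the spirit of simplified-MQM/DA harmonization.
# Categories and weights below are assumptions for illustration only.
PENALTIES = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def da_style_score(errors: list[str], max_score: float = 100.0) -> float:
    """Subtract severity penalties from a perfect score, floored at 0."""
    penalty = sum(PENALTIES[severity] for severity in errors)
    return max(0.0, max_score - penalty)

# One minor and one major error -> 94.0 under these assumed weights.
print(da_style_score(["minor", "major"]))
```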

Numerical Results and Model Development

AfriCOMET, built on AfroXLM-R (an African-centric multilingual encoder), achieves a Spearman rank correlation of 0.441 with human judgments, positioning it as a benchmark for future MT evaluation systems. The model was trained via transfer learning from DA data in well-resourced languages and validated on the African-language evaluation sets, demonstrating its efficacy in this setting.
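
Meta-evaluation of a metric amounts to measuring rank agreement between its segment scores and human judgments. A minimal sketch with scipy (the scores are invented for illustration):

```python
# Requires: pip install scipy
from scipy.stats import spearmanr

human_scores  = [78, 52, 90, 33, 64]                # e.g., DA adequacy ratings
metric_scores = [0.71, 0.52, 0.88, 0.41, 0.37]      # metric's segment scores

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")  # rho = 0.700 here
```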

The research further shows that the AfriMTE dataset can strengthen domain-specific MT evaluation, with improved performance over existing metrics such as COMET-22 on certain domain-specific tasks. The paper also examines how performance varies when incorporating African-language-enhanced pre-trained encoders, underscoring the impact of pre-training diversity and language coverage on evaluation metrics.
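
In practice, COMET-style checkpoints are typically applied through the unbabel-comet library. The sketch below assumes a hypothetical checkpoint identifier; consult the AfriCOMET release for the exact model name:

```python
# Requires: pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# "masakhane/africomet-stl" is an assumed identifier, not confirmed here.
model_path = download_model("masakhane/africomet-stl")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Ẹ káàárọ̀, báwo ni o ṣe wà?",      # Yoruba source (illustrative)
    "mt":  "Good morning, how are you?",        # system output
    "ref": "Good morning, how are you doing?",  # human reference
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level quality estimate
```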

Implications and Future Directions

Practically, the development of AfriMTE and AfriCOMET provides the resources and frameworks necessary for more accurate and culturally relevant MT evaluations in African languages. This advancement holds potential for enhancing communication and information dissemination in Africa, where linguistic diversity is vast.

Theoretically, the paper underscores the importance of tailoring AI and NLP tools to fit specific linguistic characteristics and end-user abilities, highlighting a broader trend towards inclusivity in AI development.

Future work could explore expanding the dataset to include more African languages and refining annotation guidelines further to reduce subjectivity. Additionally, cross-linguistic transferability of the developed models could be assessed on other under-resourced language groups, potentially generalizing the findings beyond the African continent.

In conclusion, this research marks a significant step in filling the linguistic gap in MT evaluation, providing a scalable approach for other low-resource scenarios. The open release of resources will undoubtedly catalyze ongoing research and technology development in under-represented linguistic communities worldwide.

Authors (58)
  1. Jiayi Wang (74 papers)
  2. David Ifeoluwa Adelani (59 papers)
  3. Sweta Agrawal (35 papers)
  4. Ricardo Rei (34 papers)
  5. Eleftheria Briakou (21 papers)
  6. Marine Carpuat (56 papers)
  7. Marek Masiak (2 papers)
  8. Xuanli He (43 papers)
  9. Sofia Bourhim (2 papers)
  10. Andiswa Bukula (8 papers)
  11. Muhidin Mohamed (4 papers)
  12. Temitayo Olatoye (1 paper)
  13. Christine Mwase (3 papers)
  14. Wangui Kimotho (1 paper)
  15. Foutse Yuehgoh (5 papers)
  16. Anuoluwapo Aremu (16 papers)
  17. Jessica Ojo (6 papers)
  18. Shamsuddeen Hassan Muhammad (42 papers)
  19. Salomey Osei (21 papers)
  20. Abdul-Hakeem Omotayo (6 papers)