
Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval (2505.19356v2)

Published 25 May 2025 in cs.IR, cs.AI, cs.CL, and cs.LG

Abstract: Neural retrieval methods using transformer-based pre-trained LLMs have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.

Summary

  • The paper introduces optimized, language-specific text embedding models and benchmarks for Amharic passage retrieval.
  • The RoBERTa-Base-Amharic-Embed model achieved a 17.6% relative improvement in MRR@10 compared to the leading multilingual baseline.
  • A ColBERT model obtained the highest MRR@10 score (0.843), illustrating the effectiveness of token-level interaction for Amharic retrieval.

Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval

The paper presents a focused study of neural information retrieval (IR) methods for low-resource, morphologically rich languages, with Amharic as the target. The work is significant because comprehensive IR models tailored to such languages are scarce: morphological complexity and limited data availability pose considerable challenges.

The authors introduce a series of Amharic-specific dense retrieval models built on pre-trained Amharic BERT and RoBERTa backbones. The most notable is the RoBERTa-Base-Amharic-Embed model, which achieves a 17.6% relative improvement in Mean Reciprocal Rank at rank 10 (MRR@10) and a 9.86% gain in Recall@10 over the best-performing multilingual baseline, Arctic Embed 2.0. Moreover, smaller variants such as RoBERTa-Medium-Amharic-Embed remain competitive while being over 13x smaller, which is beneficial for deployment in resource-constrained environments.
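To illustrate how such dense (bi-encoder) retrieval works in practice, the sketch below encodes a query and candidate passages independently and ranks passages by cosine similarity. It assumes the released checkpoints can be loaded with the sentence-transformers library; the model path is a placeholder rather than the official identifier (see the paper's repository for the actual checkpoint names).

```python
# Minimal bi-encoder retrieval sketch (assumes sentence-transformers compatibility).
# The model path below is a placeholder, not the paper's official model identifier.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("path/to/roberta-base-amharic-embed")  # placeholder

query = "..."               # an Amharic query
passages = ["...", "..."]   # candidate Amharic passages

# Encode query and passages into dense vectors, then rank passages by cosine similarity.
q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
p_emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(q_emb, p_emb)           # shape: (1, num_passages)
ranking = scores[0].argsort(descending=True)  # passage indices, best first
```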

A ColBERT-based late interaction model was also developed, achieving the highest MRR@10 score (0.843) among all tested models, indicating the efficacy of token-level interactions in improving retrieval accuracy. The authors benchmark their Amharic models against both sparse and dense retrievers, elucidating the benefits and limitations of different retrieval paradigms in low-resource settings.
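For context, the following is a hedged sketch of the ColBERT-style late-interaction (MaxSim) scoring rule referred to above: each query token embedding is matched to its most similar passage token embedding, and the per-token maxima are summed. This shows the standard ColBERT scoring mechanism, not code from the paper's release.

```python
# Illustrative ColBERT-style MaxSim scoring over token-level embeddings.
import torch

def maxsim_score(query_embs: torch.Tensor, passage_embs: torch.Tensor) -> torch.Tensor:
    """query_embs: (num_query_tokens, dim); passage_embs: (num_passage_tokens, dim).
    Embeddings are assumed L2-normalized, so dot products equal cosine similarities."""
    sim = query_embs @ passage_embs.T       # (num_query_tokens, num_passage_tokens)
    # For each query token, take its best-matching passage token, then sum over query tokens.
    return sim.max(dim=1).values.sum()
```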

Key numerical results support the contributions of the research (a short sketch of how MRR@10 and Recall@10 are computed follows the list):

  • RoBERTa-Base-Amharic-Embed: Achieves 0.775 MRR@10 and 0.913 Recall@10, a substantial improvement over established multilingual baselines.
  • ColBERT Model: Attains 0.843 MRR@10, illustrating the potential of token-level interaction models in capturing nuanced semantics.
  • Efficiency: Smaller models like the RoBERTa-Medium-Amharic-Embed (42M parameters) outperform larger multilingual models, underscoring the importance of language-specific adjustments rather than brute-force scaling.
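For reference, the sketch below shows one common way to compute the reported MRR@10 and Recall@10 from a ranked result list. It assumes standard relevance judgments (possibly a single relevant passage per query) and is not taken from the paper's evaluation code.

```python
# Hedged sketch of MRR@10 and Recall@10 computation for a single query.
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    # Reciprocal rank of the first relevant passage within the top k, else 0.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    # Fraction of relevant passages retrieved within the top k.
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / max(len(relevant_ids), 1)

# Example: the single relevant passage is retrieved at rank 3.
print(mrr_at_k(["p7", "p2", "p5"], {"p5"}))     # 0.333...
print(recall_at_k(["p7", "p2", "p5"], {"p5"}))  # 1.0
```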

The implications are multifaceted. Practically, the models and benchmarks enable more accurate and robust retrieval systems for Amharic, improving applications such as question answering and fact verification that depend on precise document retrieval. Theoretically, the research underscores the importance of language-specific adaptation for retrieval in linguistically complex, under-resourced settings. The paper also shows that improving tokenization to better handle morphological variability can significantly affect retrieval effectiveness, as analyzed through subword fertility metrics.
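The subword fertility metric mentioned above can be illustrated with a short sketch: fertility is typically measured as the average number of subword tokens produced per whitespace-delimited word, with lower values indicating a tokenizer better matched to the language. The tokenizer identifier below is only an example, not one of the paper's models.

```python
# Illustrative subword fertility computation: average subword tokens per word.
from transformers import AutoTokenizer

def subword_fertility(tokenizer, texts):
    total_words = sum(len(t.split()) for t in texts)
    total_subwords = sum(len(tokenizer.tokenize(t)) for t in texts)
    return total_subwords / max(total_words, 1)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # example tokenizer
print(subword_fertility(tokenizer, ["ሰላም ለዓለም"]))  # lower fertility ≈ better fit to Amharic
```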

Looking ahead, further advances may include refining tokenization strategies to improve semantic coherence and training on more diverse datasets for better domain adaptation. Creating more comprehensive evaluation frameworks with diverse, expertly annotated datasets would also deepen understanding of retrieval performance across varied contexts and languages.
