Leveraging Domain Adaptation and Data Augmentation to Improve Qur'anic IR in English and Arabic (2312.02803v1)
Abstract: In this work, we approach the problem of Qur'anic information retrieval (IR) in Arabic and English. Using the latest state-of-the-art methods in neural IR, we investigate what helps to tackle this task more effectively. Training retrieval models requires large amounts of data, which are difficult to obtain in-domain. Therefore, we commence with training on a large amount of general-domain data and then continue training on in-domain data. To handle the lack of in-domain data, we employed a data augmentation technique, which considerably improved results in the MRR@10 and NDCG@5 metrics, setting the state of the art in Qur'anic IR for both English and Arabic. The absence of an Islamic corpus and a domain-specific model for the IR task in English motivated us to address this lack of resources and take preliminary steps toward compiling an Islamic corpus and pre-training a domain-specific language model (LM), which helped to improve the performance of the retrieval models that use the domain-specific LM as the shared backbone. We examined several Arabic LMs to select one that deals efficiently with the Qur'anic IR task. Besides transferring successful experiments from English to Arabic, we conducted additional experiments with the retrieval task in Arabic to mitigate the scarcity of general-domain datasets used to train the retrieval models. Handling the Qur'anic IR task in both English and Arabic allowed us to enhance the comparison and share valuable insights across models and languages.
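The two evaluation metrics named in the abstract, MRR@10 and NDCG@5, can be illustrated with a minimal sketch (illustrative function names; in practice an evaluation library such as pytrec_eval is typically used):

```python
import math

def mrr_at_10(ranked_ids, relevant_ids):
    # Reciprocal rank of the first relevant document within the top 10, else 0.
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_5(ranked_ids, relevance):
    # relevance: dict mapping doc_id -> graded (or binary) relevance gain.
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:5]))
    ideal_gains = sorted(relevance.values(), reverse=True)[:5]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant verse "v7" is retrieved at rank 2.
ranking = ["v3", "v7", "v1", "v9", "v2"]
print(mrr_at_10(ranking, {"v7"}))          # 0.5
print(ndcg_at_5(ranking, {"v7": 1}))       # 1/log2(3) ≈ 0.631
```

Both scores average over queries in a full evaluation; the snippet shows the per-query computation only.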