MedArabiQ: Benchmarking LLMs on Arabic Medical Tasks
The paper, "MedArabiQ: Benchmarking LLMs on Arabic Medical Tasks," introduces the MedArabiQ dataset—a novel benchmark designed to evaluate the performance of LLMs within the context of Arabic healthcare applications. The paper emerges in response to gaps identified in existing benchmarks that inadequately address the nuances of the Arabic medical domain, highlighting the need for dedicated resources that facilitate robust and equitable AI performance across diverse language contexts.
Dataset Design and Methodology
The MedArabiQ benchmark comprises seven datasets covering a range of medical tasks, including multiple-choice questions, fill-in-the-blank exercises, and patient-doctor dialogue simulations. The datasets span multiple medical specialties, such as cardiology, oncology, and neurology, and draw on two primary sources: past examinations from Arabic-speaking medical institutions and the AraMed dataset, derived from the Altibbi online health platform.
Preparing these datasets involved several meticulous steps, including manual digitization and verification to ensure accuracy and quality. Beyond the original datasets, the authors introduced variations, such as grammatical error injection and language-model-guided paraphrasing, to evaluate LLM adaptability and linguistic robustness. This comprehensive approach aims to ensure that the benchmark reflects realistic clinical scenarios and healthcare-specific linguistic challenges.
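To make this setup concrete, the following minimal sketch shows one way such items and their robustness variants could be represented. The field names, the example question, and the variant keys are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """One MedArabiQ-style question with optional robustness variants."""
    question: str               # original Arabic question stem
    choices: list[str]          # answer options; empty for open-ended items
    answer: str                 # gold answer
    specialty: str              # e.g. "cardiology"
    variants: dict[str, str] = field(default_factory=dict)

# Hypothetical item; the question text is invented for illustration.
item = BenchmarkItem(
    question="ما هو خط العلاج الأول لارتفاع ضغط الدم لدى مريض السكري؟",
    choices=["أ", "ب", "ج", "د"],
    answer="أ",
    specialty="cardiology",
)

# Variant texts would come from a rule-based error injector or an LLM
# paraphraser and are stored alongside the original for robustness testing.
item.variants["grammar_error"] = "..."  # stem with injected grammatical errors
item.variants["paraphrase"] = "..."     # LLM-guided rewording of the stem
```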
Evaluation and Results
The authors conducted extensive evaluations of five state-of-the-art LLMs: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA 3.1, and Qwen 2.5. The models were assessed across the benchmark tasks using accuracy for closed-ended tasks and BERTScore for open-ended question answering. No single model consistently outperformed the others across all benchmarks. Proprietary models such as Gemini 1.5 Pro and Claude 3.5 Sonnet achieved higher accuracy, particularly on multiple-choice and other structured tasks, while open-source models showed more variable performance, suggesting room for improvement in structured medical knowledge retrieval and reasoning.
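As a minimal sketch of these two metrics, assuming the open-source `bert-score` package (the paper's exact scoring models and preprocessing may differ):

```python
from bert_score import score

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy for closed-ended tasks (e.g. multiple choice)."""
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return correct / len(gold)

def bertscore_f1(predictions: list[str], references: list[str]) -> float:
    """Mean BERTScore F1 between model answers and reference answers."""
    # lang="ar" selects a multilingual encoder appropriate for Arabic text.
    _, _, f1 = score(predictions, references, lang="ar", verbose=False)
    return f1.mean().item()

# Toy usage; a real evaluation would loop over each model and benchmark split.
print(accuracy(["ب", "أ"], ["ب", "ج"]))  # 0.5
```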
Implications and Future Directions
The findings underscore the need for further refinement and development of benchmarks tailored to non-English languages, especially ones as linguistically and dialectally diverse as Arabic. By establishing MedArabiQ, the authors provide a critical foundation not only for evaluating current LLM capabilities but also for guiding future model development and adaptation strategies aimed at improving AI-driven solutions in the multilingual medical landscape.
Moreover, the research highlights pivotal issues such as potential data contamination and model bias, presenting an evaluation framework that probes LLM susceptibility to cognitive biases and suggesting mitigation strategies such as few-shot prompting (sketched below). Such insights are essential for guiding the ethical deployment of LLMs in sensitive fields such as healthcare, where fairness and accuracy are paramount.
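As an illustration of the few-shot idea, here is a minimal, hypothetical prompt builder; the instruction text and exemplars are assumptions, not the paper's actual prompts:

```python
# Worked examples with correct answers; content is placeholder.
FEW_SHOT_EXEMPLARS = [
    {"question": "...", "answer": "..."},
    {"question": "...", "answer": "..."},
]

def build_prompt(question: str) -> str:
    """Prepend labeled exemplars so the model anchors on the task format
    and the evidence rather than on a bias cue embedded in the question."""
    parts = ["Answer the following Arabic medical questions."]
    for ex in FEW_SHOT_EXEMPLARS:
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```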
Conclusion
Ultimately, "MedArabiQ: Benchmarking LLMs on Arabic Medical Tasks" is a salient contribution to the field of medical NLP. It paves the way for more nuanced AI research by stressing the importance of including diverse languages and cultural contexts in model training and evaluation. The future trajectory of LLM applications in healthcare will inevitably benefit from these precise and culturally aware benchmarks, ensuring broader AI accessibility and equity in diverse linguistic domains.