
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities (2412.07769v1)

Published 10 Dec 2024 in cs.CV

Abstract: This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model (LMM) with a unified architecture that integrates text and visual modalities, enabling advanced image understanding and medical applications. BiMediX2 leverages the Llama3.1 architecture and integrates text and visual capabilities to facilitate seamless interactions in both English and Arabic, supporting text-based inputs and multi-turn conversations involving medical images. The model is trained on an extensive bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions for both text and image modalities, mixed in Arabic and English. We also propose the first bilingual GPT-4o based medical LMM benchmark named BiMed-MBench. BiMediX2 is benchmarked on both text-based and image-based tasks, achieving state-of-the-art performance across several medical benchmarks. It outperforms recent state-of-the-art models in medical LLM evaluation benchmarks. Our model also sets a new benchmark in multimodal medical evaluations with over 9% improvement in English and over 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations and excels in various medical Visual Question Answering, Report Generation, and Report Summarization tasks. The project page, including source code and the trained model, is available at https://github.com/mbzuai-oryx/BiMediX2.

Overview of BiMediX2: A Bilingual Bio-Medical Expert LMM for Multimodal Medical Applications

The paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical Expert Large Multimodal Model (LMM) that integrates text and visual modalities for diverse medical tasks. The model addresses a bias in existing medical AI systems, which predominantly favor English and thereby underserve non-English-speaking regions, particularly those requiring Arabic language support. BiMediX2 builds on the Llama3.1 architecture to enable seamless interactions in both languages, improving accessibility for diverse populations.

Key Contributions

  1. Bilingual and Multimodal Framework: BiMediX2 uses a unified architecture integrating text and visual data, enabling tasks such as medical image understanding and multilingual text-based interactions. The model is founded on Llama3.1 and trained using an extensive dataset, BiMed-V, comprising over 1.6 million bilingual instructions across various medical modalities.
  2. Benchmarking and Evaluation: The authors introduce BiMed-MBench, a new GPT-4o based bilingual benchmark comprising 286 medical queries across modalities, verified for correctness by medical experts. BiMediX2 surpasses previous state-of-the-art models, achieving gains of over 9% in English and over 20% in Arabic evaluations, particularly in tasks such as Visual Question Answering (VQA), Report Generation, and Report Summarization.
  3. Arabization of Medical LMMs: By addressing this linguistic gap, the model sets a precedent with considerable improvements in Arabic medical evaluations, which is crucial for regions where Arabic is widely spoken but current AI models offer limited support.
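The unified text-and-vision architecture in point 1 is, per the paper, built on Llama3.1 with integrated visual capabilities. A common way to realize such an integration in LLaVA-style LMMs is a small projector that maps vision-encoder patch embeddings into the LLM's token embedding space, so image patches become "soft tokens" in the language stream. The pure-Python sketch below illustrates only this generic mechanism; the two-layer MLP design, the dimensions, and every function name are illustrative assumptions, not BiMediX2's actual configuration.

```python
import random

def make_linear(d_in, d_out, seed):
    """Build a toy linear layer y = Wx + b with small random weights."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    b = [0.0] * d_out
    def apply(x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + bi
                for row, bi in zip(w, b)]
    return apply

def relu(v):
    # Stand-in nonlinearity; real projectors typically use GELU.
    return [max(0.0, x) for x in v]

def project_patches(patch_feats, d_vis, d_llm, d_hidden=32):
    """Map each vision-encoder patch embedding (dim d_vis) into the LLM's
    token embedding space (dim d_llm) via a two-layer MLP, LLaVA-style.
    The projected vectors would be prepended to the text token embeddings."""
    f1 = make_linear(d_vis, d_hidden, seed=0)
    f2 = make_linear(d_hidden, d_llm, seed=1)
    return [f2(relu(f1(p))) for p in patch_feats]

# Toy usage: 4 image patches with 8-dim features projected into a
# 16-dim (assumed) LLM embedding space.
patches = [[0.1 * i] * 8 for i in range(4)]
visual_tokens = project_patches(patches, d_vis=8, d_llm=16)
```

After projection, the visual tokens and the (English or Arabic) text tokens share one embedding space, which is what allows a single decoder to handle multi-turn conversations about medical images.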

Experimental Results

BiMediX2's efficacy is underscored by its top performance across various benchmarks. Most notably, it surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations. BiMediX2 70B achieves an average score of nearly 84.6% across medical LLM benchmarks, indicating a robust understanding of clinical scenarios and a significant improvement over competing models. The model also demonstrates strong medical image analysis, outperforming other models in both the English and Arabic evaluations of BiMed-MBench.

Implications for AI in Healthcare

The implications of BiMediX2 span both practical and theoretical domains in AI. Practically, it offers a template for developing inclusive machine learning systems that address linguistic and modality diversity. Theoretically, it raises questions about how to integrate multimodal and bilingual capabilities within LLMs, posing new challenges and opportunities for research on model architectures and dataset construction.

Conclusion and Future Directions

BiMediX2 represents a noteworthy development in bilingual, multimodal medical AI, aligning with the global need for inclusive healthcare solutions. Future research may expand on this foundation by enhancing safety and ethical considerations, particularly concerning model hallucinations and stereotypes. Continuing to refine bilingual and multimodal integration will be imperative for further innovations in AI-driven medical assistance. The deployment of this model, coupled with open access to its weights, will likely stimulate advancements in addressing the diverse linguistic and modality needs of global healthcare applications.

Authors (11)
  1. Sahal Shaji Mullappilly (9 papers)
  2. Mohammed Irfan Kurpath (3 papers)
  3. Sara Pieri (5 papers)
  4. Saeed Yahya Alseiari (1 paper)
  5. Shanavas Cholakkal (1 paper)
  6. Khaled Aldahmani (1 paper)
  7. Fahad Khan (24 papers)
  8. Rao Anwer (4 papers)
  9. Salman Khan (244 papers)
  10. Timothy Baldwin (125 papers)
  11. Hisham Cholakkal (78 papers)