BiMediX: Bilingual Medical Mixture of Experts LLM (2402.13253v2)

Published 20 Feb 2024 in cs.CL

Abstract: In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations. We also introduce a comprehensive evaluation benchmark for Arabic medical LLMs. Furthermore, we introduce BiMed1.3M, an extensive Arabic-English bilingual instruction set covering 1.3 Million diverse medical interactions, resulting in over 632 million healthcare specialized tokens for instruction tuning. Our BiMed1.3M dataset includes 250k synthesized multi-turn doctor-patient chats and maintains a 1:2 Arabic-to-English ratio. Our model outperforms state-of-the-art Med42 and Meditron by average absolute gains of 2.5% and 4.1%, respectively, computed across multiple medical evaluation benchmarks in English, while operating at 8-times faster inference. Moreover, our BiMediX outperforms the generic Arabic-English bilingual LLM, Jais-30B, by average absolute gains of 10% on our Arabic medical benchmark and 15% on bilingual evaluations across multiple datasets. Our project page with source code and trained model is available at https://github.com/mbzuai-oryx/BiMediX .

Authors (7)
  1. Sara Pieri (5 papers)
  2. Sahal Shaji Mullappilly (9 papers)
  3. Fahad Shahbaz Khan (225 papers)
  4. Rao Muhammad Anwer (67 papers)
  5. Salman Khan (244 papers)
  6. Timothy Baldwin (125 papers)
  7. Hisham Cholakkal (78 papers)

Summary

Overview of "BiMediX: Bilingual Medical Mixture of Experts LLM"

The paper introduces BiMediX, a bilingual medical LLM designed to handle medical inquiries seamlessly in both English and Arabic. The central challenge it addresses is the absence of an LLM capable of effectively managing medical conversations across these two languages, particularly the multi-turn interactions that are crucial in medical consultations. BiMediX supports a diverse range of medical interactions, including multi-turn chats, multiple-choice question answering (MCQA), and open-ended question answering (QA).

The research introduces a semi-automated English-to-Arabic translation pipeline that combines machine translation with human refinement, ensuring the translations maintain high fidelity to the original text. Additionally, the paper presents BiMed1.3M, a comprehensive bilingual instruction-tuning dataset comprising over 1.3 million diverse medical interactions. This resource helps bridge the gap in Arabic medical language processing, which has traditionally been constrained by a lack of resources; a minimal sketch of such a semi-automated translation flow is given below.

Key Contributions and Numerical Results

BiMediX shows substantial advances over leading medical LLMs such as Med42 and Meditron, with average absolute improvements of 2.5% and 4.1%, respectively, across multiple English medical evaluation benchmarks. Remarkably, it achieves these gains while offering 8-times faster inference, demonstrating both efficiency and superior performance. Furthermore, BiMediX outperforms the generic bilingual Arabic-English LLM Jais-30B by an average absolute gain of 10% on the paper's Arabic medical benchmark and by 15% on bilingual evaluations across multiple datasets.

The novel BiMed1.3M dataset is crucial to this performance leap. It comprises over 632 million healthcare-specialized tokens, includes roughly 250,000 synthesized multi-turn doctor-patient chats, and maintains a 1:2 Arabic-to-English ratio. The dataset, combined with parameter-efficient tuning of a mixture-of-experts architecture, underpins BiMediX's bilingual capabilities and substantial performance gains.

Theoretical and Practical Implications

A bilingual medical LLM capable of seamless interaction in Arabic and English has substantial practical implications. It improves the accessibility and accuracy of medical diagnosis support and consultation for Arabic-speaking populations, addressing an important gap given the linguistic and resource constraints faced by previous models. The mixture-of-experts architecture also brings practical efficiency, delivering strong performance with lower computational overhead at inference time, which is critical for real-time use in medical settings.

Theoretically, the paper advances LLM capabilities by exploring the utility of bilingual datasets and domain-specific translation pipelines. It reflects on methodologies for overcoming constraints in language-specific resources and provides a benchmark for evaluating bilingual medical LLMs, which could guide future efforts to add support for additional languages.

Future Directions

The developments and contributions discussed suggest several avenues for future research. Extending multilingual capability by incorporating additional languages and examining the scalability of BiMediX's architecture across a broader set of language pairs are promising directions. Further exploration of domain-specific fine-tuning methodologies could enhance the model's practical applications in diverse real-world scenarios, including specialized medical fields beyond the current dataset's scope.

The paper by Pieri et al. offers a substantial leap in bilingual LLM development, bridging critical gaps in medical AI applications for non-English speaking populations and setting the groundwork for further advancements in multilingual medical AI technologies.
