
Towards Building Multilingual Language Model for Medicine (2402.13963v4)

Published 21 Feb 2024 in cs.CL

Abstract: The development of open-source, multilingual medical LLMs can benefit a wide, linguistically diverse audience from different regions. To promote this domain, we present contributions from the following: First, we construct a multilingual medical corpus, containing approximately 25.5B tokens encompassing 6 main languages, termed as MMedC, enabling auto-regressive domain adaptation for general LLMs; Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multi-choice question-answering benchmark with rationale, termed as MMedBench; Third, we have assessed a number of open-source LLMs on our benchmark, along with those further auto-regressive trained on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, in this work, we present a large-scale corpus, a benchmark and a series of models to support the development of multilingual medical LLMs.

Towards a Multilingual LLM for the Medical Domain

Introduction

The development of LLMs has significantly advanced NLP applications in the medical domain. Despite notable successes, the predominant focus of LLMs on English has limited their usefulness in linguistically diverse regions. The paper introduces MMedC, a large-scale multilingual medical corpus, and MMedBench, a benchmark for evaluating LLMs on medical question answering across six primary languages. Through extensive evaluation, the paper presents MMedLM 2, a model that leverages MMedC for further pretraining and attains performance rivaling GPT-4 in multilingual medical contexts.

Dataset Construction and Metrics

MMedC: A Multilingual Medical Corpus

MMedC comprises approximately 25.5 billion tokens spanning six languages, drawn from a variety of sources:

  • Filtering medical content from a large-scale multilingual corpus
  • Including texts from medical textbooks and reputable medical websites
  • Incorporating existing medical corpora

Together, these sources give the corpus the breadth needed to adapt general-purpose LLMs to the medical domain across languages.
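The first source above, filtering medical content out of a general multilingual corpus, can be sketched as a keyword-based heuristic. The paper's actual filtering rules are not reproduced here, so the keyword lists and the hit threshold below are illustrative assumptions only:

```python
import re

# Hypothetical per-language medical keyword lists; the real pipeline
# would use much larger, curated lists per language.
MEDICAL_KEYWORDS = {
    "en": ["patient", "diagnosis", "therapy", "clinical"],
    "es": ["paciente", "diagnóstico", "terapia", "clínico"],
}

def is_medical(text: str, lang: str, min_hits: int = 2) -> bool:
    """Keep a document if it mentions enough medical terms (assumed threshold)."""
    words = MEDICAL_KEYWORDS.get(lang, [])
    hits = sum(
        1 for w in words
        if re.search(r"\b" + re.escape(w) + r"\b", text, re.IGNORECASE)
    )
    return hits >= min_hits

def filter_corpus(docs: list[str], lang: str) -> list[str]:
    """Retain only documents classified as medical."""
    return [d for d in docs if is_medical(d, lang)]
```

In practice such a heuristic pass would typically be combined with language identification and deduplication before the retained text enters the training corpus.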

MMedBench: Benchmarking Multilingual Medical Understanding

MMedBench fills the gap for a comprehensive evaluation tool by aggregating medical question-answering datasets across the six languages and augmenting them with rationales, offering a new lens through which to assess LLMs. Standard QA pairs are supplemented with detailed rationales generated by GPT-4, followed by careful human verification to ensure quality and correctness.
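A benchmark entry of this kind, a multiple-choice question plus a verified rationale, can be sketched as a small data structure. The field names below are hypothetical and may differ from the released MMedBench format:

```python
from dataclasses import dataclass

@dataclass
class MMedBenchItem:
    """Illustrative schema for one rationale-augmented QA item."""
    question: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "..."}
    answer: str              # gold option key, e.g. "A"
    rationale: str           # GPT-4-generated, human-verified explanation
    language: str            # one of the six benchmark languages

def to_prompt(item: MMedBenchItem) -> str:
    """Render an item as an evaluation prompt (format is an assumption)."""
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(item.options.items()))
    return (
        f"{item.question}\n{opts}\n"
        "Answer with the option letter and explain your reasoning."
    )
```

Keeping the rationale as a separate field lets the same items serve both answer-accuracy scoring and rationale-generation evaluation.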

Model Evaluation and Insights

The evaluation on MMedBench yielded several clear findings. Consistent with expectations, models further trained on MMedC outperformed their counterparts across metrics under zero-shot, parameter-efficient fine-tuning (PEFT), and full fine-tuning settings. Notably, MMedLM 2 demonstrated strong proficiency in multilingual medical question answering and rationale generation, approaching the performance of GPT-4.
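Across all three settings, the multiple-choice portion of such an evaluation ultimately reduces to per-language accuracy. A minimal sketch of that scoring step (not the authors' evaluation code) might look like:

```python
from collections import defaultdict

def accuracy_by_language(predictions: list[str],
                         references: list[str],
                         languages: list[str]) -> dict[str, float]:
    """Compute multiple-choice accuracy per language.

    `predictions` and `references` hold option letters (e.g. "A");
    comparison is case-insensitive and whitespace-tolerant.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for pred, ref, lang in zip(predictions, references, languages):
        total[lang] += 1
        if pred.strip().upper() == ref.strip().upper():
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}
```

Reporting accuracy per language, rather than a single pooled score, is what makes uneven cross-lingual performance visible.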

Theoretical and Practical Implications

Enhancing Multilingual Medical AI Research

The creation of MMedC and MMedBench supports research on generalist medical artificial intelligence (GMAI) and retrieval-augmented generation, enabling the development of LLMs that are robust across languages and grounded in comprehensive medical knowledge.

Broader Clinical and Educational Outreach

The practical implications are substantial: such models promise to reduce language barriers in healthcare, adapt to cultural nuances, and broaden access to medical education worldwide. This opens avenues for deploying LLMs in diverse clinical settings and for more equitable access to quality healthcare information.

Future Directions and Challenges

Despite these achievements, the paper acknowledges limitations in the corpus's language coverage and in the computational scale of the final model. Future work aims to extend language coverage, scale up model architectures, and mitigate hallucination. Continued development of MMedC and MMedBench is intended to support LLMs that are both linguistically inclusive and deeply grounded in medical knowledge.

Data and Resources Availability

To support transparency and further research, the authors have publicly released the datasets, codebase, and trained models, encouraging collaborative work on multilingual medical natural language processing.

Authors (8)
  1. Pengcheng Qiu
  2. Chaoyi Wu
  3. Xiaoman Zhang
  4. Weixiong Lin
  5. Haicheng Wang
  6. Ya Zhang
  7. Yanfeng Wang
  8. Weidi Xie