Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People (2403.03640v6)

Published 6 Mar 2024 in cs.CL and cs.AI

Abstract: Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. In particular, Apollo-7B is the state-of-the-art multilingual medical LLM among models of up to 70B parameters. Additionally, these lightweight models can be used to improve the multilingual medical capabilities of larger models without fine-tuning, in a proxy-tuning fashion. We will open-source the training corpora, code, model weights, and evaluation benchmark.

Democratizing Medical AI with Apollo: Multilingual LLMs for Global Healthcare

Introduction to Apollo LLMs

The Apollo project represents a significant stride toward democratizing medical AI by developing lightweight multilingual medical LLMs that aim to make medical knowledge accessible to 6 billion people worldwide. By focusing on the six most widely spoken languages—English, Chinese, Hindi, Spanish, French, and Arabic—Apollo seeks to bridge the language divide in healthcare information and services. The initiative rests on two key resources: ApolloCorpora, a multilingual medical dataset, and XMedBench, a benchmark for evaluating multilingual medical LLMs.

Building ApolloCorpora: A Multilingual Medical Dataset

The ApolloCorpora dataset has been meticulously assembled to include high-quality, language-specific medical texts. Sources include medical books, papers, encyclopedias, doctor-patient dialogues, exams, and clinical guidelines, ensuring a rich and diverse corpus. This dataset not only encompasses the vast spectrum of medical knowledge across different languages but also respects the localized nuances and cultural specifics embedded within each language's medical discourse.
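
As a concrete illustration, the snippet below sketches how one might pull a language-specific slice of the released corpus with the Hugging Face datasets library. The repository id ("FreedomIntelligence/ApolloCorpus") and the "language" and "text" field names are assumptions for illustration, not confirmed by the paper; consult the official release for the actual layout.

```python
# Hedged sketch: loading a language-specific slice of ApolloCorpora.
# The repo id and field names ("language", "text") are assumptions; adjust
# them to match the released corpus layout.
from datasets import load_dataset

corpus = load_dataset("FreedomIntelligence/ApolloCorpus", split="train")

# Keep only the French medical texts, e.g. for continued pretraining.
french_subset = corpus.filter(lambda record: record.get("language") == "fr")

print(f"{len(french_subset)} French records")
print(french_subset[0]["text"][:200])
```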

The Apollo LLMs: Breaking New Ground in Multilingual Medical AI

The Apollo models, ranging from 0.5B to 7B parameters, demonstrate strong performance, outperforming models of equivalent size on the multilingual medical benchmark, XMedBench. The Apollo-7B model, in particular, sets a new standard as the state-of-the-art multilingual medical LLM among models of up to 70B parameters. This focus on lightweight models marks a pivotal step toward embedding advanced medical AI capabilities directly into healthcare systems, especially in regions with limited access to medical resources.
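
For readers who want to try a released checkpoint, the sketch below shows one plausible way to query an Apollo model with Hugging Face transformers. The model id and the raw-prompt style are assumptions for illustration; the official repository documents the exact checkpoint names and prompt format.

```python
# Hedged sketch: generating an answer from an Apollo checkpoint.
# "FreedomIntelligence/Apollo-7B" is an assumed model id; check the release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FreedomIntelligence/Apollo-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# A Spanish medical question, since Apollo targets multilingual coverage.
prompt = "¿Cuáles son los síntomas más comunes de la anemia ferropénica?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```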

The XMedBench: A Benchmark for Progress

The XMedBench serves as a platform to evaluate the medical knowledge and linguistic capabilities of LLMs across different languages. It assesses models through multiple-choice questions, a format well suited to examining a model's grasp of complex medical concepts and its ability to reason and infer. Results from XMedBench highlight the Apollo series' superior performance, underscoring the models' effectiveness in bridging the gap between AI and medical knowledge across languages.
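
To make the evaluation protocol concrete, here is a minimal accuracy harness in the spirit of XMedBench's multiple-choice format. It is a sketch, not the benchmark's actual code: the question schema and the ask_model callable are hypothetical stand-ins for whichever LLM interface is being evaluated.

```python
# Hedged sketch of multiple-choice evaluation: the model is asked to pick one
# option letter, and accuracy is the fraction of gold letters recovered.
import re

def extract_choice(generation: str) -> str | None:
    """Return the first standalone option letter (A-E) found in the output."""
    match = re.search(r"\b([A-E])\b", generation.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(questions, ask_model) -> float:
    """questions: dicts with 'prompt' (question + options) and 'answer' (gold letter)."""
    correct = 0
    for q in questions:
        prediction = extract_choice(ask_model(q["prompt"]))
        correct += int(prediction == q["answer"])
    return correct / len(questions)

# Usage with a dummy model that always answers "A".
sample = [{"prompt": "Which vitamin deficiency causes scurvy?\nA. Vitamin C\nB. Vitamin D",
           "answer": "A"}]
print(mcq_accuracy(sample, ask_model=lambda prompt: "A"))  # -> 1.0
```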

Practical Implications and Future Horizons

The Apollo project brings to the fore the potential impact of multilingual medical LLMs in transforming global healthcare. By making medical knowledge more accessible across linguistic divides, Apollo contributes significantly toward the democratization of medical AI. Moreover, the adoption of models like Apollo in healthcare systems worldwide could enhance the quality of care and patient outcomes, especially in under-resourced regions.

The project also opens new avenues for future research in AI and healthcare, such as optimizing dataset sampling, refining Proxy Tuning methods, and exploring the combination of different LLMs for enhanced multilingual capabilities. The open-sourcing of the ApolloCorpora and the Apollo models invites the global research community to contribute to these endeavors, fostering innovation and collaboration in the pursuit of making healthcare more accessible and equitable across the globe.
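
Since the paper highlights proxy tuning as the mechanism for transferring Apollo's medical capabilities to larger models without fine-tuning them, a minimal sketch of the decode-time logit arithmetic may help. It assumes the large base model, the small untuned base, and the small tuned expert (e.g., an Apollo model) share the same tokenizer and vocabulary; the single greedy decoding step is an illustrative simplification, not the paper's implementation.

```python
# Hedged sketch of proxy tuning at decode time: the large base model's logits
# are shifted by the difference between a small tuned expert and its untuned
# base, i.e. logits_large + (logits_expert - logits_small_base).
import torch

def proxy_tuned_next_token(prompt, tokenizer, large_base, small_base, small_expert):
    """One greedy decoding step guided by the small expert/base pair."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        l_large  = large_base(ids).logits[:, -1, :]
        l_small  = small_base(ids).logits[:, -1, :]
        l_expert = small_expert(ids).logits[:, -1, :]
    guided = l_large + (l_expert - l_small)  # steer without touching large weights
    return tokenizer.decode(guided.argmax(dim=-1))
```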

Conclusion

The Apollo project represents a monumental step toward democratizing medical AI through the development of multilingual medical LLMs. By making medical knowledge accessible in the world's most widely spoken languages, Apollo has the potential to revolutionize global healthcare, making it more inclusive and effective. As we look to the future, the continued exploration and improvement of multilingual medical AI hold the promise of a more informed and healthy global population.

Authors (12)
  1. Xidong Wang
  2. Nuo Chen
  3. Junyin Chen
  4. Yan Hu
  5. Yidong Wang
  6. Xiangbo Wu
  7. Anningzhe Gao
  8. Xiang Wan
  9. Haizhou Li
  10. Benyou Wang
  11. Guorui Zhen
  12. Chunxian Zhang
Citations (18)