ApolloCorpora: Multilingual Medical Text Dataset
- ApolloCorpora is a curated multilingual medical text dataset of approximately 2.5 billion tokens across six major languages, built for medical AI research.
- It aggregates high-quality data from medical books, research papers, encyclopedia entries, doctor–patient dialogues, and clinical guidelines to support inclusive AI development.
- Its design facilitates the creation of lightweight medical LLMs and domain adaptation, while safeguarding indigenous medical knowledge and local expertise.
ApolloCorpora is a specialized multilingual medical text dataset assembled to train the Apollo family of lightweight medical LLMs (0.5B–7B parameters) in six of the world’s most widely spoken languages: English, Chinese (Mandarin), Hindi, Spanish, French, and Arabic. Collectively, these languages cover approximately 6.1 billion speakers across 132 countries, targeting regions historically under-served by English-centric medical AI. The dataset’s design prioritizes open access, local medical expertise, and linguistic inclusivity in order to democratize medical AI, facilitate domain adaptation for larger models, and preserve culturally embedded medical knowledge (Wang et al., 2024).
1. Scope, Motivation, and Linguistic Coverage
ApolloCorpora addresses the predominance of English-centric medical corpora by sourcing high-quality medical texts natively in six major world languages: English, Chinese, Hindi, Spanish, French, and Arabic. The dataset’s objectives are to:
- Democratize medical AI by providing extensive, open-source corpora in local languages for both on-device and low-resource deployment.
- Preserve indigenous medical knowledge—such as traditional Chinese medicine (TCM) patterns, region-specific pharmacological references, and culturally appropriate communication—by assembling material in local languages rather than relying on machine-translated English resources.
- Enable the development of relatively small, multilingual medical LLMs that support proxy-tuning for larger models, providing privacy-preserving domain adaptation without exposing confidential medical data.
Language selection was based on total speaker counts as reported by Wikipedia, ensuring maximal reach and relevance across global healthcare contexts (Wang et al., 2024).
2. Data Sources, Collection, and Licensing
ApolloCorpora aggregates two primary groups of documents:
A. High-Quality Medical Materials (Pre-training Core)
- Medical books: 2,312 English titles from the Pile (filtered with UMLS for ≥4% medical vocabulary; see the filtering sketch at the end of this section); 90 standard Chinese medical textbooks.
- Peer-reviewed papers (abstracts): 878,241 English PubMed abstracts; 177,261 Chinese Medical Association abstracts; French biomedical articles from CLEAR and MORFITT; Spanish MESINESP-2021 abstracts.
- Encyclopedia entries: 36,107 UMLS-filtered English Wikipedia medical pages; French encyclopedia subset from CLEAR; Hindi HHD disease/symptom corpus.
- Clinical guidelines: English guidelines from NICE, SPOR, and sources cited by Meditron.
- Medical exam Q&A: English (MedQA-USMLE, MedMCQA, PubMedQA), Chinese (CMExam, CMB, MedQA-MCMLE), French and Spanish (FrenchMedMCQA, HEAD-QA).
- Doctor–patient dialogues: Chinese (HuatuoGPT / Huatuo_26M), English (GPT-3.5-generated dialogues based on PMC-Patients), Arabic (MAQA Q&A).
B. Supplementary Data (Instruction/RLHF and Generalization)
- Web content: UMLS-filtered English C4, Chinese Wudao, Spanish CoWeSe.
- General instruction data: ShareGPT, Alpaca, WizardV2-Chinese, Belebele, AI2-ARC, Capybara.
- Mathematics/code: MathInstruct, Python-Alpaca, Leetcode-ZH-11K.
All sources are either open-source or public-domain, with the dataset team screening for open-source compliance. No private EHRs or proprietary data are included, and stringent data leakage checks were applied. No registration is required to access the corpus (Wang et al., 2024).
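The UMLS-based vocabulary filtering mentioned for the English books, Wikipedia pages, and C4 web content can be illustrated with a short sketch. This is a minimal illustration rather than the authors' exact pipeline: the term-list file name and the naive word-level matching are assumptions, and only the ≥4% threshold is taken from the source descriptions above.

```python
import re

def load_medical_vocab(path: str = "umls_terms.txt") -> set[str]:
    """Load a plain-text list of UMLS terms, one lowercase term per line (hypothetical file)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def medical_vocab_ratio(text: str, vocab: set[str]) -> float:
    """Fraction of word tokens that appear in the medical vocabulary."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

def keep_document(text: str, vocab: set[str], threshold: float = 0.04) -> bool:
    """Retain a document only if at least 4% of its words are medical terms."""
    return medical_vocab_ratio(text, vocab) >= threshold
```

Multi-word UMLS concepts would require phrase-level matching, which this sketch omits for brevity.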
3. Dataset Composition and Statistics
ApolloCorpora comprises approximately 2.5 billion tokens. The token distribution across languages and major content types is summarized below; all values are in millions of tokens.
| Category | EN | ZH | ES | FR | AR | HI |
|---|---|---|---|---|---|---|
| Books | 296.7 | 117.1 | – | – | – | – |
| Papers | 252.9 | 45.6 | 46.0 | 4.5 | – | – |
| Encyclopedias | 221.1 | – | – | 4.6 | – | 0.5 |
| Dialogues | 92.1 | 46.6 | – | – | 10.4 | – |
| Exams | 42.1 | 35.3 | 0.5 | 0.1 | – | – |
| Guidelines | 29.6 | – | – | – | – | – |
| Web | 499.9 | 329.3 | 57.5 | – | – | – |
| General SFT | 194.5 | 69.4 | 18.4 | 20.0 | 18.7 | 43.9 |
| Math | 18.9 | 3.7 | – | – | – | – |
| Code | 9.2 | 7.2 | – | – | – | – |
| Total/Language | 1,682 | 654 | 122 | 29 | 29 | 44 |
Coverage spans internal medicine, pediatrics, infectious diseases, traditional medicine, pharmacology, exam preparation, patient education, and doctor–patient dialogue. All collected text is used for continued pre-training. Instruction-style subsets (QA pairs) are derived for downstream training phases. Formal train/val/test splits are not used in pre-training; evaluation is conducted via external benchmarks (XMedBench) with careful filtering to prevent any overlap (Wang et al., 2024).
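As a quick sanity check on the table above, the per-language totals can be summed to confirm the ≈2.5B-token figure and to show each language's relative share (English accounts for roughly two thirds of the corpus). Values are taken directly from the Total/Language row:

```python
# Per-language totals from the "Total/Language" row, in millions of tokens.
totals_m = {"EN": 1682, "ZH": 654, "ES": 122, "FR": 29, "AR": 29, "HI": 44}

grand_total_m = sum(totals_m.values())  # 2,560 M tokens, i.e. ~2.5B
for lang, tokens in totals_m.items():
    print(f"{lang}: {tokens:>5} M tokens ({100 * tokens / grand_total_m:.1f}%)")
```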
4. Preprocessing, Curation, and Ethical Considerations
A comprehensive preprocessing pipeline standardizes and curates the ApolloCorpora content (illustrative sketches of the normalization, QA-rewriting, and data-leakage-filtering steps follow this list):
- Deduplication & normalization: Unicode normalization, standardized punctuation, and cross-source duplicate removal.
- QA rewriting: All pre-training segments are automatically recast into question–answer pairs via ChatGPT (gpt-3.5-turbo-16k) using language- and block-size–specific prompts (e.g., 2,048 tokens for EN/ES/FR/HI; 256 for ZH; 128 for AR).
- Data-leakage filtering: Any QA pair or exam item with ≥64 contiguous character overlap with benchmark evaluation sets (e.g., XMedBench, held-out MedQA items) is removed (0.52% filtered).
- Medical-expert curation: Medical practitioners and students oversee source selection and filtering, excluding irrelevant or non-medical passages.
- Anonymization: Datasets such as PMC-Patients and MAQA are pre-anonymized, ensuring no patient-identifiable data enters the corpus.
- Licensing and access: All sources are under open-source or public domain licensing, and ApolloCorpora is released with an Apache 2.0–style license (Wang et al., 2024).
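The normalization and duplicate-removal step can be sketched as follows. This is a minimal illustration, assuming Unicode NFKC normalization and exact hash-based deduplication rather than the authors' exact recipe:

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """Unicode-normalize, standardize a few punctuation variants, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("“", '"').replace("”", '"').replace("’", "'")
    return " ".join(text.split())

def deduplicate(docs):
    """Drop exact duplicates across sources by hashing the normalized text."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```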
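The QA-rewriting step can be sketched as below. The block sizes come from the list above; the OpenAI Python SDK (v1.x), tiktoken tokenization, and the prompt wording are assumptions, not the paper's actual language-specific prompts:

```python
import tiktoken
from openai import OpenAI

# Language-specific block sizes (in tokens) used before rewriting.
BLOCK_TOKENS = {"en": 2048, "es": 2048, "fr": 2048, "hi": 2048, "zh": 256, "ar": 128}

enc = tiktoken.get_encoding("cl100k_base")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_by_tokens(text: str, lang: str):
    """Split a document into language-specific block sizes before rewriting."""
    size = BLOCK_TOKENS[lang]
    ids = enc.encode(text)
    for i in range(0, len(ids), size):
        yield enc.decode(ids[i:i + size])

def rewrite_as_qa(block: str, lang: str) -> str:
    """Ask gpt-3.5-turbo-16k to recast a pre-training block into question-answer pairs."""
    prompt = (
        f"Rewrite the following {lang} medical text as question-answer pairs, "
        f"preserving all factual content:\n\n{block}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```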
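The data-leakage filter can be sketched as a 64-character sliding-window check against the benchmark items, with each window hashed to keep the index compact. The ≥64-character threshold comes from the description above; the indexing strategy is an assumption:

```python
import hashlib

WINDOW = 64  # minimum contiguous-character overlap treated as leakage

def windows(text: str, n: int = WINDOW):
    """Yield every n-character substring of `text`."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

def build_benchmark_index(benchmark_items, n: int = WINDOW) -> set:
    """Hash every 64-character window of the evaluation items into a set."""
    return {
        hashlib.md5(w.encode("utf-8")).hexdigest()
        for item in benchmark_items
        for w in windows(item, n)
    }

def leaks(example: str, index: set, n: int = WINDOW) -> bool:
    """True if the training example shares any 64-character span with a benchmark item."""
    return any(hashlib.md5(w.encode("utf-8")).hexdigest() in index for w in windows(example, n))

# Usage: drop flagged items before continued pre-training (≈0.52% of the corpus was filtered).
# clean = [x for x in training_examples if not leaks(x, benchmark_index)]
```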
5. Benchmark Integration and Research Applications
ApolloCorpora serves as the training corpus for multilingual medical LLMs that are evaluated on XMedBench, a benchmark covering diverse languages and exam formats (MedQA-USMLE, MedMCQA, HEAD-QA, FrenchMedMCQA, CMMLU). Stringent overlap filtering ensures that no exact test items from these benchmarks are present in the training set.
Core research and deployment tasks enabled by ApolloCorpora include:
- Medical question answering: Multiple-choice and open-ended QA.
- Dialogue generation: Simulation of doctor–patient interactions.
- Medical summarization: Condensing guidelines, patient notes, or clinical references.
- Knowledge retrieval: Reformulating source material into question–answer form for efficient knowledge extraction.
- Downstream use cases: Offline triage/chatbots in low-connectivity regions, generation of patient education materials, AI-assisted exam preparation for medical trainees, and proxy-tuning for large LLMs (enabling rapid adaptation without direct exposure to private medical data) (Wang et al., 2024).
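The proxy-tuning use case mentioned above relies on simple logit arithmetic at decoding time: a large base model's next-token logits are shifted by the difference between a small medically tuned expert (e.g., an Apollo model) and its untuned counterpart, so the large model is adapted without ever seeing the medical training data. Below is a minimal sketch of that idea with placeholder tensors rather than real model calls:

```python
import torch

def proxy_tuned_logits(base_logits: torch.Tensor,
                       expert_logits: torch.Tensor,
                       anti_expert_logits: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Shift the large model's next-token logits by the (expert - anti-expert) offset."""
    return base_logits + alpha * (expert_logits - anti_expert_logits)

# Per decoding step: obtain logits from the three models for the same context,
# combine them, then sample or take the argmax as usual.
vocab_size = 32000
base = torch.randn(vocab_size)          # large, untuned base model
expert = torch.randn(vocab_size)        # small model tuned on ApolloCorpora
anti_expert = torch.randn(vocab_size)   # the same small model before tuning
next_token = torch.argmax(proxy_tuned_logits(base, expert, anti_expert))
```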
6. Accessibility, Licensing, and Reproducibility
ApolloCorpora, along with model weights, evaluation scripts, and codebases, is open-sourced for both academic and commercial use and distributed through the project's public repositories.
No registration or permission is required. The team provides pipelines to reproduce both the pre-training data curation and the model development stages. All included datasets are under permissive open-source or public-domain licenses. The ApolloCorpora corpus itself is released under an Apache 2.0–style license, encouraging wide academic and commercial adoption (Wang et al., 2024).
In summary, ApolloCorpora represents a rigorously curated, multilingual medical text corpus (≈2.5B tokens) designed to foster inclusive, ethically grounded, and technically robust research in multilingual medical AI (Wang et al., 2024).