
ApolloCorpora: Multilingual Medical Text Dataset

Updated 4 January 2026
  • ApolloCorpora is a curated multilingual medical text dataset featuring 2.5 billion tokens across six major languages for medical AI research.
  • It aggregates high-quality data from medical books, research papers, encyclopedic entries, dialogues, and guidelines to support inclusive AI development.
  • Its design facilitates the creation of lightweight medical LLMs and domain adaptation, while safeguarding indigenous medical knowledge and local expertise.

ApolloCorpora is a specialized multilingual medical text dataset assembled to train the Apollo family of lightweight medical LLMs (0.5B–7B parameters) in six of the world’s most-spoken languages: English, Chinese (Mandarin), Hindi, Spanish, French, and Arabic. Collectively, these languages cover approximately 6.1 billion speakers across 132 countries, targeting regions historically under-served by English-centric medical AI. ApolloCorpora’s design prioritizes open access, local medical expertise, and linguistic inclusivity to democratize medical AI, facilitate domain adaptation for large models, and preserve culturally embedded medical knowledge (Wang et al., 2024).

1. Scope, Motivation, and Linguistic Coverage

ApolloCorpora addresses the predominance of English-centric medical corpora by sourcing high-quality medical texts natively in six major world languages: English, Chinese, Hindi, Spanish, French, and Arabic. The dataset’s objectives are to:

  • Democratize medical AI by providing extensive, open-source corpora in local languages for both on-device and low-resource deployment.
  • Preserve indigenous medical knowledge—such as traditional Chinese medicine (TCM) patterns, region-specific pharmacological references, and culturally appropriate communication—by assembling material in local languages rather than relying on machine-translated English resources.
  • Enable the development of relatively small, multilingual medical LLMs that support proxy-tuning for larger models, providing privacy-preserving domain adaptation without exposing confidential medical data.
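
For context, proxy tuning amounts to simple decoding-time logit arithmetic: a large untuned model's next-token logits are shifted by the difference between a small domain-tuned model (such as an Apollo model trained on ApolloCorpora) and its untuned counterpart. The following is a minimal sketch of that idea with illustrative names, not the authors' implementation:

```python
import torch

def proxy_tuned_logits(base_logits: torch.Tensor,
                       expert_logits: torch.Tensor,
                       antiexpert_logits: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Steer a large base model toward the medical domain at decoding time by
    adding the logit offset of a small tuned 'expert' relative to its untuned
    'anti-expert'. All three tensors share the same vocabulary dimension."""
    return base_logits + alpha * (expert_logits - antiexpert_logits)

# Example: greedy next-token choice under the proxy-tuned distribution.
# next_token = torch.argmax(proxy_tuned_logits(base, expert, antiexpert), dim=-1)
```

Because only output logits are combined at inference time, neither the large model's weights nor any confidential fine-tuning data need to be exposed, which is the privacy-preserving property noted above.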

Language selection was based on total speaker counts as identified by Wikipedia, ensuring maximal reach and relevance across global healthcare contexts (Wang et al., 2024).

2. Data Sources, Collection, and Licensing

ApolloCorpora aggregates two primary groups of documents:

A. High-Quality Medical Materials (Pre-training Core)

  • Medical books: 2,312 English titles from the Pile (UMLS-filtered, ≥4% medical vocabulary; see the filtering sketch after this list); 90 Chinese standard medical textbooks.
  • Peer-reviewed papers (abstracts): 878,241 English PubMed abstracts; 177,261 Chinese Medical Association abstracts; French biomedical articles from CLEAR and MORFITT; Spanish MESINESP-2021 abstracts.
  • Encyclopedia entries: 36,107 UMLS-filtered English Wikipedia medical pages; French encyclopedia subset from CLEAR; Hindi HHD disease/symptom corpus.
  • Clinical guidelines: English guidelines from NICE, SPOR, and sources cited by Meditron.
  • Medical exam Q&A: English (MedQA-USMLE, MedMCQA, PubMedQA), Chinese (CMExam, CMB, MedQA-MCMLE), French and Spanish (FrenchMedMCQA, HEAD-QA).
  • Doctor–patient dialogues: Chinese (HuatuoGPT / Huatuo_26M), English (GPT-3.5-generated based on PMC-Patients), Arabic (MAQA Q&A).
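
As a rough illustration of the UMLS vocabulary filter applied to the Pile books above (and to web text in the supplementary group below), a document is retained when at least 4% of its tokens match a UMLS-derived medical term list. The tokenizer and term set here are simplified placeholders rather than the authors' exact pipeline:

```python
import re

def medical_token_ratio(text: str, umls_terms: set[str]) -> float:
    """Fraction of word tokens that appear in a UMLS-derived medical vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in umls_terms for token in tokens) / len(tokens)

def keep_document(text: str, umls_terms: set[str], threshold: float = 0.04) -> bool:
    """Apply the >=4% medical-vocabulary criterion described above."""
    return medical_token_ratio(text, umls_terms) >= threshold
```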

B. Supplementary Data (Instruction/RLHF and Generalization)

  • Web content: UMLS-filtered English C4, Chinese Wudao, Spanish CoWeSe.
  • General instruction data: ShareGPT, Alpaca, WizardV2-Chinese, Belebele, AI2-ARC, Capybara.
  • Mathematics/code: MathInstruct, Python-Alpaca, Leetcode-ZH-11K.

All sources are either open-source or public-domain, with the dataset team screening for open-source compliance. No private EHRs or proprietary data are included, and stringent data leakage checks were applied. No registration is required to access the corpus (Wang et al., 2024).

3. Dataset Composition and Statistics

ApolloCorpora comprises approximately 2.5 billion tokens. The token distribution across languages and major content types is summarized below.

Token counts per language and content category (millions of tokens):

Category         EN       ZH       ES      FR      AR      HI
Books            296.7    117.1    –       –       –       –
Papers           252.9    45.6     46.0    4.5     –       –
Encyclopedias    221.1    –        –       4.6     –       0.5
Dialogues        92.1     46.6     –       –       10.4    –
Exams            42.1     35.3     0.5     0.1     –       –
Guidelines       29.6     –        –       –       –       –
Web              499.9    329.3    57.5    –       –       –
General SFT      194.5    69.4     18.4    20.0    18.7    43.9
Math             18.9     3.7      –       –       –       –
Code             9.2      7.2      –       –       –       –
Total            1,682    654      122     29      29      44

Coverage spans internal medicine, pediatrics, infectious diseases, traditional medicine, pharmacology, exam preparation, patient education, and doctor–patient dialogue. All collected text is used for continued pre-training. Instruction-style subsets (QA pairs) are derived for downstream training phases. Formal train/val/test splits are not used in pre-training; evaluation is conducted via external benchmarks (XMedBench) with careful filtering to prevent any overlap (Wang et al., 2024).

4. Preprocessing, Curation, and Ethical Considerations

A comprehensive preprocessing pipeline standardizes and curates the ApolloCorpora content:

  • Deduplication & normalization: Unicode normalization, standardized punctuation, and cross-source duplicate removal.
  • QA rewriting: All pre-training segments are automatically recast into question–answer pairs via ChatGPT (gpt-3.5-turbo-16k) using language- and block-size–specific prompts (e.g., 2,048 tokens for EN/ES/FR/HI; 256 for ZH; 128 for AR).
  • Data-leakage filtering: Any QA pair or exam item sharing ≥64 contiguous characters with a benchmark evaluation set (e.g., XMedBench, held-out MedQA items) is removed; this filtered out 0.52% of items (see the sketch after this list).
  • Medical-expert curation: Medical practitioners and students oversee source selection and filtering, excluding irrelevant or non-medical passages.
  • Anonymization: Datasets such as PMC-Patients and MAQA are pre-anonymized, ensuring no patient-identifiable data enters the corpus.
  • Licensing and access: All sources are under open-source or public domain licensing, and ApolloCorpora is released with an Apache 2.0–style license (Wang et al., 2024).
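
Below is a minimal sketch of two of the steps above, with illustrative helper names rather than the released pipeline: chunking documents into the language-specific block sizes used for QA rewriting, and the 64-character contiguous-overlap check used for data-leakage filtering.

```python
from difflib import SequenceMatcher

# Language-specific block sizes (in tokens) reported for the QA-rewriting prompts.
BLOCK_SIZES = {"en": 2048, "es": 2048, "fr": 2048, "hi": 2048, "zh": 256, "ar": 128}

def split_into_blocks(tokens: list[str], lang: str) -> list[list[str]]:
    """Chunk a tokenized document into fixed-size blocks before QA rewriting."""
    size = BLOCK_SIZES[lang]
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def leaks_benchmark(sample: str, benchmark_items: list[str], min_overlap: int = 64) -> bool:
    """Return True if the sample shares >= min_overlap contiguous characters
    with any benchmark item; such samples are dropped from training."""
    for item in benchmark_items:
        matcher = SequenceMatcher(None, sample, item)
        match = matcher.find_longest_match(0, len(sample), 0, len(item))
        if match.size >= min_overlap:
            return True
    return False
```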

5. Benchmark Integration and Research Applications

ApolloCorpora forms the basis for training multilingual medical LLMs, which are evaluated on XMedBench—a companion benchmark covering diverse languages and exam formats (MedQA-USMLE, MedMCQA, HEAD-QA, FrenchMedMCQA, CMMLU). No exact test items from these benchmarks appear in the training set, owing to the stringent overlap filtering described above.

Core research and deployment tasks enabled by ApolloCorpora include:

  • Medical question answering: Multiple-choice and open-ended QA.
  • Dialogue generation: Simulation of doctor–patient interactions.
  • Medical summarization: Condensing guidelines, patient notes, or clinical references.
  • Knowledge retrieval: Reformulation for efficient extraction.
  • Downstream use cases: Offline triage/chatbots in low-connectivity regions, generation of patient education materials, AI-assisted exam preparation for medical trainees, and proxy-tuning for large LLMs (enabling rapid adaptation without direct exposure to private medical data) (Wang et al., 2024).

6. Accessibility, Licensing, and Reproducibility

ApolloCorpora, along with model weights, evaluation scripts, and codebases, is open-sourced for both academic and commercial use and distributed through the project's public code and data repositories.

No registration or permission is required. The team provides pipelines to reproduce both pre-training data curation and model development stages. All included datasets are under permissive open-source or public-domain licenses. The ApolloCorpora corpus is licensed under an Apache 2.0–style license, encouraging wide academic and commercial adoption (Wang et al., 2024).

In summary, ApolloCorpora represents a rigorously curated, multilingual medical text corpus (≈2.5B tokens) designed to foster inclusive, ethically grounded, and technically robust research in multilingual medical AI (Wang et al., 2024).

References

Wang, X., et al. (2024). Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People. arXiv:2403.03640.
