
Chinese Pharmacist Licensing Examination

Updated 2 December 2025
  • The Chinese Pharmacist Licensing Examination is a comprehensive assessment that tests both clinical knowledge and ethical practice through four well-structured units.
  • The exam uses a balanced mix of single-choice and multi-choice questions, including specialized modules for traditional Chinese medicine.
  • It plays a pivotal role in benchmarking LLM performance, revealing critical insights into AI accuracy and reasoning in pharmacy and healthcare.

The Chinese Pharmacist Licensing Examination (hereafter NPLE) is a high-stakes, standardized assessment used to certify the clinical and theoretical competencies of pharmacists in China and related jurisdictions. The examination has emerged as a prominent benchmark both for the development of domain-aligned LLMs and for cross-linguistic, cross-systemic evaluation of healthcare knowledge reasoning, explainability, and clinical safety in AI systems (Wang et al., 25 Nov 2025, Luo et al., 2024, Li et al., 2023, Kong et al., 2 Jun 2025).

1. Exam Structure and Content Organization

The NPLE is formally structured into four primary units, each systematically sampling core domains of pharmaceutical expertise (Wang et al., 25 Nov 2025):

  • Unit 1: Pharmaceutical Foundations—factual recall across pharmacology, medicinal chemistry, pharmaceutics.
  • Unit 2: Legislation & Ethics—interpretation and application of Chinese drug law, professional ethics, regulatory requirements.
  • Unit 3: Dispensing & Prescription Review—case-based judgment addressing drug interactions and errors.
  • Unit 4: Clinical Integration—synthesis across disciplines using complex clinical-vignette scenarios.

The main question bank for contemporary LLM evaluations consists of 2,306 text-only items drawn from the 2017–2021 exams, comprising 2,114 single-choice questions and 192 strictly scored MCQs (i.e., every correct option must be selected, with no partial credit). The distribution by unit is nearly balanced: Unit 1 has 600 questions (26.0 %), Unit 2 has 598 (25.9 %), Unit 3 has 508 (22.0 %), and Unit 4 has 600 (26.0 %). Items containing tables or images are excluded from LLM assessment datasets.
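
For concreteness, the item pool can be represented with a simple schema; the sketch below (Python, with hypothetical field names rather than any released dataset format) reproduces the unit distribution and single-/multi-choice split quoted above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass
class ExamItem:
    """One text-only NPLE item; field names are hypothetical, not a released schema."""
    unit: int                # 1..4, per the unit breakdown above
    stem: str                # question text
    options: List[str]       # option texts, labelled A-E on the exam
    answer: FrozenSet[str]   # keyed option letters; a single letter for single-choice
    multi: bool = False      # True for the strictly scored multi-choice items

# Unit counts and the single-/multi-choice split quoted above.
UNIT_COUNTS = {1: 600, 2: 598, 3: 508, 4: 600}
SINGLE_CHOICE, MULTI_CHOICE = 2114, 192

assert sum(UNIT_COUNTS.values()) == SINGLE_CHOICE + MULTI_CHOICE == 2306
for unit, n in UNIT_COUNTS.items():
    print(f"Unit {unit}: {n} items ({n / 2306:.1%})")
```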

The traditional Chinese medicine (TCM) section, as benchmarked in MTCMB, uses a complementary structure with twelve 100-question sub-disciplines and integrated 600-question full-exam modules covering herbal pharmacology, formula identification, dosage calculation, and safety/contraindication (Kong et al., 2 Jun 2025).

2. Scoring Criteria and Passing Thresholds

Each single-choice or MCQ item is worth one point, with MCQs requiring an exact match to the answer key for credit (no partial points). Although the official NPLE passing threshold is not published, passing is widely understood to require roughly 60%–70% correct responses in practice (Wang et al., 25 Nov 2025), in line with pharmacist licensure examinations in other jurisdictions.

In LLM benchmarking studies, only absolute accuracy (i.e., proportion of exactly correct answers) is reported; human candidate pass rates are not utilized as calibrators.
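
Concretely, this scoring rule reduces to strict set equality between selected and keyed option letters. The minimal sketch below is illustrative (the item layout is assumed, not taken from any official scoring script) and shows the one-point, no-partial-credit rule together with the resulting absolute accuracy.

```python
def score_item(predicted: set, key: set) -> int:
    """One point only for an exact match with the answer key; no partial credit on MCQs."""
    return int(predicted == key)

def accuracy(predictions, keys) -> float:
    """Absolute accuracy: proportion of exactly correct answers."""
    return sum(score_item(p, k) for p, k in zip(predictions, keys)) / len(keys)

# Toy example: the second (multi-choice) item misses one keyed option and earns 0 points.
preds = [{"B"}, {"A", "C"}, {"D"}]
keys  = [{"B"}, {"A", "C", "E"}, {"D"}]
print(f"{accuracy(preds, keys):.3f}")   # 0.667
```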

3. Content Domain Distribution and Representative Question Types

The NPLE features a comprehensive sampling of pharmaceutical domains, typically clustering topics as follows (Luo et al., 2024):

  • Pharmacotherapy: clinical pharmacology and drug regimen adjustment
  • Pharmacy Practice & Management: prescription standards, regulatory compliance
  • Dispensing & Clinical Pharmacy: practical scenario-based dispensing, safety
  • Pharmacology & Pharmaceutical Chemistry: mechanisms, synthesis pathways, chemoinformatics
  • Biopharmaceutics: ADME, bioavailability parameters

EMPEC, a pan-profession benchmark, mirrors this structure with approximately 1,950 balanced items per domain over 9,767 validated MCQs (Luo et al., 2024). Common stem structures include clinical vignettes, law/regulation interpretation, reaction schemes (in LaTeX), and scenario-based dispensing decisions.

Free-text explanation datasets (e.g. ExplainCPE) annotate each multiple-choice item with gold-standard rationales and reference domain-reasoning chains, enabling deeper analysis of AI model interpretability and error modes (Li et al., 2023).

4. Statistical Evaluation and Benchmarking Methodologies

Official LLM evaluation protocols for NPLE-derived datasets employ the following statistical tests (Wang et al., 25 Nov 2025):

  • Pearson’s Chi-squared Test ($\chi^2$): Compares overall model accuracy across the 2,306 items using $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected counts under the null hypothesis.
  • Fisher’s Exact Test: Applied to unit-wise and year-wise MCQ accuracy (where cell counts are low).
  • Four-sample Equality of Proportions Test: Assesses intra-model accuracy variation across the four units, followed by Bonferroni-adjusted pairwise comparisons (see the sketch below).
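
Assuming each model's results are tabulated as correct/incorrect counts per unit, these tests map directly onto standard SciPy routines. The counts in the sketch below are illustrative placeholders, not figures from the cited study.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Illustrative correct counts per unit for one model (placeholders, not reported results).
totals    = np.array([600, 598, 508, 600])   # items per unit, as quoted above
correct   = np.array([540, 520, 430, 500])
incorrect = totals - correct
table = np.vstack([correct, incorrect])      # 2 x 4 contingency table

# Four-sample equality of proportions: Pearson chi-squared test on the 2 x 4 table.
chi2, p, dof, _ = chi2_contingency(table)
print(f"unit-accuracy homogeneity: chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# Bonferroni-adjusted pairwise comparisons via Fisher's exact test on 2 x 2 sub-tables.
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
alpha = 0.05 / len(pairs)
for i, j in pairs:
    _, p_pair = fisher_exact(table[:, [i, j]])
    flag = "significant" if p_pair < alpha else "n.s."
    print(f"Unit {i + 1} vs Unit {j + 1}: p = {p_pair:.4g} ({flag})")
```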

Automatic metrics for explanation benchmarks include accuracy, ROUGE-1/2/L (measuring overlap with gold rationales), and split statistics (e.g., positive/negative, logic, calculation, scenario analysis) (Li et al., 2023).
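
A character-level illustration of ROUGE-1/2/L is sketched below; ExplainCPE's exact tokenization and preprocessing are not reproduced here, so the character-level treatment of Chinese rationales (and the toy strings) is an assumption for illustration only.

```python
from collections import Counter

def rouge_n(gold: str, pred: str, n: int = 1) -> float:
    """Character-level ROUGE-N F1: clipped n-gram overlap between gold and predicted text."""
    g = [gold[i:i + n] for i in range(len(gold) - n + 1)]
    p = [pred[i:i + n] for i in range(len(pred) - n + 1)]
    overlap = sum((Counter(g) & Counter(p)).values())
    if not g or not p or not overlap:
        return 0.0
    recall, precision = overlap / len(g), overlap / len(p)
    return 2 * precision * recall / (precision + recall)

def rouge_l(gold: str, pred: str) -> float:
    """Character-level ROUGE-L F1 based on the longest common subsequence."""
    m, k = len(gold), len(pred)
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            dp[i + 1][j + 1] = dp[i][j] + 1 if gold[i] == pred[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][k]
    if not lcs:
        return 0.0
    recall, precision = lcs / m, lcs / k
    return 2 * precision * recall / (precision + recall)

# Toy gold rationale vs. model explanation (illustrative strings, not dataset items).
gold = "该药经肝脏代谢，与酶抑制剂合用时需减量"
pred = "该药主要经肝脏代谢，合用酶抑制剂时应减少剂量"
print(rouge_n(gold, pred, 1), rouge_n(gold, pred, 2), rouge_l(gold, pred))
```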

5. LLM Performance and Error Profiling

Recent studies consistently show that generalized LLMs achieve high, yet non-expert, accuracy on NPLE items. Detailed results include (Wang et al., 25 Nov 2025, Luo et al., 2024, Kong et al., 2 Jun 2025, Li et al., 2023):

Representative pharmacist MCQ accuracy (NPLE or EMPEC) by model/system:

  • DeepSeek-R1: 90.0 % (all units, 2017–2021)
  • ChatGPT-4o: 76.1 %
  • GPT-4-turbo: 75.9 % (EMPEC, Taiwan)
  • Llama-3-70B-Instruct: 72.0 %
  • Doubao-1.5-Pro (TCM, MTCMB): >90 % (TCM knowledge MCQs)
  • Medical-domain LLMs (e.g., HuatuoGPT2): 15.9 %–51.7 % (lagging behind generalists)
  • Random baseline: 25 %

At unit and sub-domain resolution, DeepSeek-R1 surpasses ChatGPT-4o by as much as 28% in pharmacological foundations. In the TCM/Herbalism tracks (MTCMB), leading LLMs obtain >90% for knowledge recall, but only 80–86% on safety and <50% on inference-intensive synthesis (Kong et al., 2 Jun 2025).

Frequent error categories include selection rationale misalignment (correct answer but faulty justification), negative cue misparsing, calculation errors, and scenario misclassification (Li et al., 2023). Multi-choice accuracy is persistently lower than single-choice accuracy (by roughly 35–40 percentage points), reflecting ongoing deficits in combinatorial and multi-step reasoning. Model fine-tuning (e.g., Qwen1.5-7B-SFT) yields modest absolute gains (8%), though fine-tuned medical models do not consistently outperform general LLMs (Luo et al., 2024).

6. Implications for Education, AI Assessment, and Exam Design

Current evidence indicates that strong generalist reasoning LLMs already achieve substantial accuracy on factual recall and clinical synthesis, while narrowly specialized medical models do not consistently keep pace, suggesting that professional, high-stakes non-English exams are not automatically better served by bespoke architectures than by well-aligned generalist ones (Wang et al., 25 Nov 2025). However, both general and medical-specific models struggle with MCQs, especially where combinatorial, contextually nuanced, or safety-critical judgment is required (Kong et al., 2 Jun 2025, Luo et al., 2024).

There are several broader implications:

  • Question Style as Diagnostic: Steep MCQ accuracy deficits reveal model limitations in multi-step reasoning; integrating MCQs that probe synthesis is diagnostically valuable for both candidate and model evaluation.
  • Human Oversight as Prerequisite: Model errors in regulatory or clinical safety contexts highlight the need for human curation, particularly in legal and ethical modules, cautioning against full AI automation in pharmacy certification.
  • Data for Formative Feedback: Corpora such as ExplainCPE and MTCMB, with gold explanations, underpin the development of explainable AI and adaptive, feedback-rich tutoring systems for pharmacy education (Li et al., 2023, Kong et al., 2 Jun 2025).
  • Domain Gap in Medical LLMs: Medical LLMs with narrow instruction tuning underperform on pharmacology MCQs relative to large generalist LLMs, implying a need to integrate balanced multiple-choice training data and advanced symbol processing (Luo et al., 2024).

7. Derived Benchmarks and Datasets

Several public benchmarks and datasets have emerged from the NPLE framework:

  • ExplainCPE: 7,000+ pharmacist MCQs with expert-reviewed explanations, annotated by logical/clinical domain, with accuracy and ROUGE metrics for explanation evaluation (Li et al., 2023).
  • MTCMB (TCM): Multi-task Chinese Medicine benchmark subsuming licensing-exam MCQs and real-world case diagnostics, targeting discrete knowledge, reasoning, safety, and synthesis capacities (Kong et al., 2 Jun 2025).
  • EMPEC: Broad healthcare-profession MCQ corpus (157,803 items) with a stand-alone pharmacist subset of 9,767 MCQs, enabling cross-profession and cross-year assessment (Luo et al., 2024).

These benchmarks have been pivotal in profiling the error modes, reasoning chains, and clinical safety boundaries in LLM evaluation for pharmacy and wider healthcare domains.


The Chinese Pharmacist Licensing Examination thus serves a dual function—as a regulatory gatekeeper and as a stress-test for current and emerging LLMs in clinical language understanding, domain reasoning, safety-critical recommendation, and professional certification contexts (Wang et al., 25 Nov 2025, Kong et al., 2 Jun 2025, Luo et al., 2024, Li et al., 2023).
