TCM-5CEval: TCM LLM Benchmark
- TCM-5CEval is a benchmark that systematically assesses LLMs in traditional Chinese medicine across five key dimensions including foundational theory and clinical reasoning.
- It employs diverse question formats and rigorous permutation-based consistency tests to evaluate both objective knowledge and subjective interpretation.
- The framework measures performance in areas like classical text interpretation, materia medica, and non-pharmacological therapy to ensure comprehensive TCM competency.
TCM-5CEval is an advanced benchmark designed to systematically evaluate the comprehensive clinical and research competency of LLMs in Traditional Chinese Medicine (TCM). Developed in response to gaps in previous TCM-specific assessments, TCM-5CEval provides a five-dimensional framework covering the breadth of foundational theory, clinical reasoning, classical text interpretation, materia medica, and non-pharmacological therapy, supported by rigorous permutation-based consistency protocols and stratified question types. It serves both as a diagnostic instrument for model robustness and as a standardized research asset for benchmarking and analysis within the TCM–AI intersection (Huang et al., 17 Nov 2025).
1. Motivation and Historical Context
Foundational efforts such as TCM-3CEval sought to characterize LLM capabilities in core TCM knowledge, classical literacy, and clinical decision-making, but they lacked fine-grained coverage of clinical therapeutics and omitted critical areas such as Chinese Materia Medica (CMM) and non-pharmacological practices (acupuncture, Tuina) (Huang et al., 17 Nov 2025). These gaps left TCM's rich content incompletely covered; moreover, prior benchmarks relied primarily on objective items and did not interrogate model consistency under perturbations. TCM-5CEval extends these efforts by (1) splitting clinical therapeutics into drug-based and non-drug tracks, (2) introducing subjective (open-ended) items across all dimensions, and (3) incorporating strict permutation-based tests to quantify inference robustness.
2. Benchmark Composition and Content Dimensions
TCM-5CEval is organized into five sub-datasets, each validated by domain experts and drawn from authoritative "13th" and "14th" Five-Year Plan TCM textbooks. All sub-datasets comprise single-choice, multiple-choice, and open-ended questions stratified by difficulty (Easy/Medium/Hard) (Huang et al., 17 Nov 2025).
| Dimension | Single-choice | Multiple-choice | Open-ended |
|---|---|---|---|
| TCM-Exam (Core Knowledge) | 122 | 78 | 70 |
| TCM-LitQA (Classical Literacy) | 275 | 188 | 165 |
| TCM-MRCD (Clinical Decision) | 230 | 156 | 148 |
| TCM-CMM (Materia Medica) | 273 | 121 | 91 |
| TCM-ClinNPT (Non-pharm. Therapy) | 166 | 96 | 81 |
- TCM-Exam: Fundamental theories (Yin-Yang, Five Elements, Zang-Fu, diagnostics).
- TCM-LitQA: Interpretive tasks on canonical TCM texts (e.g., Huangdi Neijing, Shanghan Lun).
- TCM-MRCD: Syndrome differentiation, diagnostic reasoning, and prescription formulation.
- TCM-CMM: Herb properties, compatibilities, contraindications, and pharmaceutics.
- TCM-ClinNPT: Acupoint selection, Tuina, and other non-drug interventions.
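To make the composition above concrete, the following is a minimal sketch of how one benchmark item could be represented for automated evaluation; the class and field names (`TCMItem`, `dimension`, `question_type`, `difficulty`) are illustrative assumptions, not the released data schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Dimension(Enum):
    EXAM = "TCM-Exam"         # core knowledge
    LITQA = "TCM-LitQA"       # classical literacy
    MRCD = "TCM-MRCD"         # clinical decision
    CMM = "TCM-CMM"           # materia medica
    CLINNPT = "TCM-ClinNPT"   # non-pharmacological therapy


@dataclass
class TCMItem:
    """Hypothetical schema for a single TCM-5CEval question."""
    dimension: Dimension
    question_type: str                                 # "single", "multiple", or "open"
    difficulty: str                                    # "Easy", "Medium", or "Hard"
    stem: str                                          # question text
    options: List[str] = field(default_factory=list)  # empty for open-ended items
    answer: Optional[str] = None                       # gold label(s) or reference answer
```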
3. Evaluation Methodology and Metrics
TCM-5CEval employs a unified suite of metrics. For single- and multiple-choice items, performance is measured as the fraction of correctly answered questions:

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\hat{y}_i = y_i\right]$$

For open-ended items, a weighted combination of BERTScore (precision and recall) and macro-averaged recall is used to compare generated answers against expert references.

Consistency is stringently defined via permutation-based testing: each choice item is re-presented under cyclic permutations of its options, and a response is counted as "consistent correct" only if the model selects the same correct answer in every variant:

$$\mathrm{Cons} = \frac{1}{N}\sum_{i=1}^{N}\prod_{\pi \in \Pi_i}\mathbb{1}\!\left[\hat{y}_i^{(\pi)} = y_i\right]$$

Consistency degradation is quantified as

$$\Delta_{\mathrm{cons}} = \mathrm{Acc} - \mathrm{Cons},$$

where $\mathrm{Acc}$ is the standard accuracy and $\mathrm{Cons}$ the strict consistency rate.
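A minimal sketch of the consistency metrics follows, assuming each choice item has already been answered once per cyclic permutation of its options; the data layout and helper names are illustrative rather than the benchmark's official scoring code.

```python
from typing import Dict, List


def standard_accuracy(correct_flags: Dict[str, List[bool]]) -> float:
    """Accuracy on the canonical (unpermuted) ordering of each item."""
    return sum(flags[0] for flags in correct_flags.values()) / len(correct_flags)


def strict_consistency(correct_flags: Dict[str, List[bool]]) -> float:
    """An item counts only if the model is correct under *every* permutation."""
    return sum(all(flags) for flags in correct_flags.values()) / len(correct_flags)


def consistency_degradation(correct_flags: Dict[str, List[bool]]) -> float:
    """Gap between standard accuracy and the strict consistency rate."""
    return standard_accuracy(correct_flags) - strict_consistency(correct_flags)


# Toy example: "q1" is correct under all four cyclic orderings,
# "q2" only under the original ordering (positional bias).
flags = {"q1": [True, True, True, True], "q2": [True, False, False, False]}
print(consistency_degradation(flags))  # 1.0 - 0.5 = 0.5
```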
4. Experimental Protocol and Results
Fifteen LLMs spanning major contemporary architectures were benchmarked under standardized zero-shot prompting with deterministic (temperature = 0) decoding and rigorous output formatting. Machine-readable outputs enabled fully automated scoring (Huang et al., 17 Nov 2025).
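As an illustration of the kind of harness this protocol implies, the sketch below builds a zero-shot prompt, queries a model deterministically, and parses a machine-readable answer; `query_model` is a placeholder for whatever inference API is used, and the expected reply format (a bare option letter) is an assumption.

```python
import re
from typing import List


def query_model(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for the actual inference call (API or local model)."""
    raise NotImplementedError


def build_prompt(stem: str, options: List[str]) -> str:
    # Zero-shot prompt that constrains the model to a machine-readable reply.
    lettered = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return f"{stem}\n{lettered}\nAnswer with the option letter only."


def parse_choice(raw_output: str) -> str:
    """Extract the first option letter from the model's reply."""
    match = re.search(r"\b([A-E])\b", raw_output.strip().upper())
    return match.group(1) if match else ""


def score_item(stem: str, options: List[str], gold: str) -> bool:
    reply = query_model(build_prompt(stem, options), temperature=0.0)  # deterministic decoding
    return parse_choice(reply) == gold
```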
High-performing models included deepseek_r1, Kimi_K2_Instruct_0905, gemini_2_5_pro, and grok_4_0709. Models generally achieved their strongest results on the core-knowledge dimension (TCM-Exam) and their weakest on classical interpretation (TCM-LitQA).
Selected quantitative results (fractional accuracy/score):
| Model | Exam | LitQA | MRCD | CMM | ClinNPT |
|---|---|---|---|---|---|
| deepseek_r1 | 0.798 | 0.731 | 0.733 | 0.746 | 0.640 |
| Kimi_K2_Instruct_0905 | 0.847 | 0.696 | 0.746 | 0.749 | 0.595 |
| gemini_2_5_pro | 0.779 | 0.620 | 0.724 | 0.726 | 0.612 |
| grok_4_0709 | 0.730 | 0.593 | 0.684 | 0.680 | 0.642 |
All models exhibited sharp decreases under permutation-based consistency testing; for example, gemini_2_5_pro on TCM-Exam and deepseek_r1 on TCM-ClinNPT both showed large gaps between standard accuracy and the strict consistency rate.
5. Model Robustness and Consistency Testing
Rigorous permutation-based consistency protocols highlighted systemic fragility in LLM inference stability. Models frequently exhibited positional bias, responding correctly only to specific option orders. This phenomenon persists across architectures and dimensions, representing a critical bottleneck for reliable TCM-AI deployment.
Key findings:
- Largest performance degradation on subjective and classical content (TCM-LitQA, TCM-ClinNPT).
- Even top models failed to maintain robust outputs across all permutations, exposing a lack of true semantic understanding and reasoning invariance.
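As a sketch of how such a permutation protocol can be constructed (not the benchmark's released tooling), the snippet below generates every cyclic reordering of an item's options and remaps the gold answer's position, so a positionally biased model is exposed when it fails to follow the relocated correct option.

```python
from typing import List, Tuple


def cyclic_variants(options: List[str], gold_index: int) -> List[Tuple[List[str], int]]:
    """Return every cyclic rotation of the options with the remapped gold index."""
    n = len(options)
    variants = []
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # The option originally at gold_index now sits at (gold_index - shift) mod n.
        variants.append((rotated, (gold_index - shift) % n))
    return variants


# Example: four options, correct answer originally at position 2 ("C").
for opts, gold in cyclic_variants(["A-text", "B-text", "C-text", "D-text"], gold_index=2):
    print(opts, "-> gold at index", gold)
```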
6. Contextualization with Related TCM Benchmarks
Within the broader TCM-AI empirical landscape, TCM-5CEval sets itself apart from earlier and parallel efforts:
- MTCMB: A multi-task framework incorporating TCM-5CEval as a core evaluation component, with additional modules for language understanding, safety, and prescription generation. MTCMB results corroborate TCM-5CEval trends: LLMs excel in knowledge recall but struggle on diagnostic reasoning and multi-label syndrome tasks, with few-shot and CoT prompting mitigating but not closing gaps (Kong et al., 2 Jun 2025).
- Manual 5C Evaluations: Earlier frameworks applied expert-scored subjective protocols (safety, consistency, explainability, compliance, coherence), an approach that leaves permutation-invariant robustness largely unaddressed (Liu et al., 13 Feb 2025).
7. Implications, Limitations, and Future Directions
TCM-5CEval provides a detailed lens onto LLM capabilities in both foundational and applied TCM. Distinct performance specializations (e.g., deepseek_r1 on classical literacy, Kimi_K2 on materia medica) suggest training set bias and uneven coverage. Weaknesses in reasoning stability, syndrome differentiation, and classical interpretation pinpoint fundamental obstacles to clinical translation (Huang et al., 17 Nov 2025).
Identified future directions include:
- Integration of curated classical literature, rich CMM/ClinNPT corpora, and clinical electronic medical records.
- Development of TCM-specific disambiguation algorithms and multi-modal (e.g., tongue, pulse, imaging) modules.
- Construction of robust TCM knowledge graphs for explicit symbolic augmentation.
- Inclusion of real-world, interactive, and multi-turn consultation scenarios with tight feedback loops.
TCM-5CEval’s open release on MedBench positions it as a standardized instrument for comparative TCM-AI research, facilitating advances in culturally informed, safe, and robust deep medical LLMs.