
TCM-5CEval: TCM LLM Benchmark

Updated 24 November 2025
  • TCM-5CEval is a benchmark that systematically assesses LLMs in traditional Chinese medicine across five key dimensions including foundational theory and clinical reasoning.
  • It employs diverse question formats and rigorous permutation-based consistency tests to evaluate both objective knowledge and subjective interpretation.
  • The framework measures performance in areas like classical text interpretation, materia medica, and non-pharmacological therapy to ensure comprehensive TCM competency.

TCM-5CEval is an advanced benchmark designed to systematically evaluate the comprehensive clinical and research competency of LLMs in Traditional Chinese Medicine (TCM). Developed in response to gaps in previous TCM-specific assessments, TCM-5CEval provides a five-dimensional framework covering the breadth of foundational theory, clinical reasoning, classical text interpretation, materia medica, and non-pharmacological therapy, supported by rigorous permutation-based consistency protocols and stratified question types. It serves both as a diagnostic instrument for model robustness and as a standardized research asset for benchmarking and analysis within the TCM–AI intersection (Huang et al., 17 Nov 2025).

1. Motivation and Historical Context

Foundational efforts such as TCM-3CEval sought to characterize LLM capabilities in core TCM knowledge, classical literacy, and clinical decision-making, but lacked fine-grained coverage of clinical therapeutics and omitted critical areas such as Chinese Materia Medica (CMM) and non-pharmacological practices (acupuncture, Tuina) (Huang et al., 17 Nov 2025). As a result, prior benchmarks covered TCM's rich content incompletely, relied primarily on objective items, and did not interrogate model consistency under perturbations. TCM-5CEval extends these efforts by (1) splitting clinical therapeutics into drug-based and non-drug tracks, (2) introducing subjective (open-ended) items across all dimensions, and (3) incorporating strict permutation-based tests to quantify inference robustness.

2. Benchmark Composition and Content Dimensions

TCM-5CEval is organized into five sub-datasets, each validated by domain experts and drawn from authoritative "13th" and "14th" Five-Year Plan TCM textbooks. All sub-datasets comprise single-choice, multiple-choice, and open-ended questions stratified by difficulty (Easy/Medium/Hard) (Huang et al., 17 Nov 2025).

| Dimension | Single-choice | Multiple-choice | Open-ended |
|---|---|---|---|
| TCM-Exam (Core Knowledge) | 122 | 78 | 70 |
| TCM-LitQA (Classical Literacy) | 275 | 188 | 165 |
| TCM-MRCD (Clinical Decision) | 230 | 156 | 148 |
| TCM-CMM (Materia Medica) | 273 | 121 | 91 |
| TCM-ClinNPT (Non-pharm. Therapy) | 166 | 96 | 81 |

  • TCM-Exam: Fundamental theories (Yin-Yang, Five Elements, Zang-Fu, diagnostics).
  • TCM-LitQA: Interpretive tasks on canonical TCM texts (e.g., Huangdi Neijing, Shanghan Lun).
  • TCM-MRCD: Syndrome differentiation, diagnostic reasoning, and prescription formulation.
  • TCM-CMM: Herb properties, compatibilities, contraindications, and pharmaceutics.
  • TCM-ClinNPT: Acupoint selection, Tuina, and other non-drug interventions.

3. Evaluation Methodology and Metrics

TCM-5CEval employs a unified suite of metrics. For single-/multiple-choice:

\mathrm{Acc} = \frac{\text{Number of Correct Responses}}{\text{Total Questions}}

For open-ended items, a weighted combination of BERTScore (precision/recall) and macro-averaged recall is used:

\mathrm{Score}_{\mathrm{OE}} = \lambda\,\mathrm{BERTScore} + (1-\lambda)\,\mathrm{MacroRecall}, \quad \lambda = 0.5
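
As a quick illustration, both metrics reduce to simple arithmetic. Below is a minimal sketch (hypothetical helper names; the benchmark's actual scoring harness is not reproduced in the source), assuming per-item BERTScore and macro-averaged recall values are already computed:

```python
def accuracy(num_correct: int, num_total: int) -> float:
    """Acc = correct responses / total questions (single-/multiple-choice items)."""
    return num_correct / num_total


def open_ended_score(bertscore: float, macro_recall: float, lam: float = 0.5) -> float:
    """Score_OE = lam * BERTScore + (1 - lam) * MacroRecall, with lam = 0.5."""
    return lam * bertscore + (1.0 - lam) * macro_recall
```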

Consistency is stringently defined via permutation-based testing: each choice item is re-ordered with k = 5 cyclic permutations; a response is counted as "consistent correct" only if the model outputs the same correct answer in all variants:

\mathrm{ConsistentAcc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl(\forall j,\; \hat{y}_{i,j} = y_i^{*}\bigr)

Consistency degradation is quantified as

\Delta S = S_{\text{orig}} - S_{\text{perm}}

where S_orig is the standard accuracy and S_perm is the strict consistency rate.
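
A minimal sketch of the strict consistency computation under the definitions above: `query_model` is a hypothetical callable returning the index of the option the model selects for a given ordering, and each item pairs an option list with the text of its correct answer, so correctness is tracked by content rather than by letter position.

```python
from typing import Callable, Sequence


def cyclic_permutations(options: Sequence[str], k: int = 5) -> list[list[str]]:
    """Generate k cyclic rotations of the option list (k = 5 per the protocol)."""
    n = len(options)
    return [[options[(i + shift) % n] for i in range(n)] for shift in range(k)]


def consistent_accuracy(
    items: Sequence[tuple[list[str], str]],
    query_model: Callable[[list[str]], int],
) -> float:
    """Fraction of items answered correctly under *all* k option orderings."""
    n_consistent = 0
    for options, correct_text in items:
        if all(
            perm[query_model(perm)] == correct_text
            for perm in cyclic_permutations(options)
        ):
            n_consistent += 1
    return n_consistent / len(items)
```

ΔS then follows by subtracting this strict rate from the standard (original-order) accuracy.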

4. Experimental Protocol and Results

Fifteen LLMs spanning major contemporary architectures were benchmarked under standardized zero-shot, deterministic (T = 0) prompting with rigorous output formatting. Machine-readable outputs enabled full automation of scoring (Huang et al., 17 Nov 2025).
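
The exact prompt wording is not given in the source; a plausible zero-shot template for a single-choice item in this style (hypothetical, but illustrating how a strict machine-readable answer format enables automated scoring) might look like:

```python
import re

# Hypothetical prompt template; the benchmark's actual wording is not published here.
PROMPT_TEMPLATE = """You are answering a Traditional Chinese Medicine exam question.
Question: {question}
Options:
{options}
Reply with only the letter(s) of the correct option(s), e.g. "A" or "A,C"."""


def parse_choice(raw_reply: str) -> set[str]:
    """Extract option letters from an (expected) machine-readable reply."""
    return set(re.findall(r"\b([A-E])\b", raw_reply.upper()))
```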

High-performing models included deepseek_r1, Kimi_K2_Instruct_0905, gemini_2_5_pro, and grok_4_0709. All models achieved their strongest results on the core-knowledge dimension (TCM-Exam) and their weakest on classical text interpretation (TCM-LitQA).

Selected quantitative results (fractional accuracy/score):

| Model | Exam | LitQA | MRCD | CMM | ClinNPT |
|---|---|---|---|---|---|
| deepseek_r1 | 0.798 | 0.731 | 0.733 | 0.746 | 0.640 |
| Kimi_K2_Instruct_0905 | 0.847 | 0.696 | 0.746 | 0.749 | 0.595 |
| gemini_2_5_pro | 0.779 | 0.620 | 0.724 | 0.726 | 0.612 |
| grok_4_0709 | 0.730 | 0.593 | 0.684 | 0.680 | 0.642 |

All models exhibited sharp decreases in consistency under permutation-based re-ordering, e.g., gemini_2_5_pro on TCM-Exam: S_orig = 0.920, S_perm = 0.844 (ΔS = 0.076); deepseek_r1 on TCM-ClinNPT: 0.787 → 0.470 (ΔS = 0.317).

5. Model Robustness and Consistency Testing

Rigorous permutation-based consistency protocols highlighted systemic fragility in LLM inference stability. Models frequently exhibited positional bias, responding correctly only to specific option orders. This phenomenon persists across architectures and dimensions, representing a critical bottleneck for reliable TCM-AI deployment.

Key findings:

  • Largest performance degradation on subjective and classical content (TCM-LitQA, TCM-ClinNPT).
  • Even top models failed to maintain robust outputs across all permutations, exposing a lack of true semantic understanding and reasoning invariance.

6. Relation to Prior and Parallel Benchmarks

Within the broader TCM-AI empirical landscape, TCM-5CEval sets itself apart from earlier and parallel efforts:

  • MTCMB: A multi-task framework incorporating TCM-5CEval as a core evaluation component, with additional modules for language understanding, safety, and prescription generation. MTCMB results corroborate TCM-5CEval trends: LLMs excel in knowledge recall but struggle on diagnostic reasoning and multi-label syndrome tasks, with few-shot and CoT prompting mitigating but not closing gaps (Kong et al., 2 Jun 2025).
  • Manual 5C Evaluations: Previous frameworks integrated subjective scoring protocols (safety, consistency, explainability, compliance, coherence) scored by domain experts; permutation-invariant robustness, by contrast, remained under-addressed (Liu et al., 13 Feb 2025).

7. Implications, Limitations, and Future Directions

TCM-5CEval provides a detailed lens onto LLM capabilities in both foundational and applied TCM. Distinct performance specializations (e.g., deepseek_r1 on classical literacy, Kimi_K2 on materia medica) suggest training set bias and uneven coverage. Weaknesses in reasoning stability, syndrome differentiation, and classical interpretation pinpoint fundamental obstacles to clinical translation (Huang et al., 17 Nov 2025).

Identified future directions include:

  • Integration of curated classical literature, rich CMM/ClinNPT corpora, and clinical electronic medical records.
  • Development of TCM-specific disambiguation algorithms and multi-modal (e.g., tongue, pulse, imaging) modules.
  • Construction of robust TCM knowledge graphs for explicit symbolic augmentation.
  • Inclusion of real-world, interactive, and multi-turn consultation scenarios with tight feedback loops.

TCM-5CEval’s open release on MedBench positions it as a standardized instrument for comparative TCM-AI research, facilitating advances in culturally informed, safe, and robust medical LLMs.
