
TCM Vision Benchmark for Multimodal TCM Diagnostics

Updated 17 October 2025
  • TCM Vision Benchmark is a suite of evaluation frameworks and datasets tailored to assess AI models’ abilities in interpreting visual, textual, and sensory data in TCM diagnostics.
  • It integrates data from tongue images, herbal depictions, and pulse recordings using both automated and curated annotation pipelines for culturally authentic evaluation.
  • By standardizing diagnostic tasks and metrics such as accuracy and the Ladder-Score, these benchmarks lay the groundwork for advancing AI support in TCM education and clinical practice.

The term "TCM Vision Benchmark" designates a set of recent evaluation frameworks and multimodal datasets specifically crafted to assess LLMs and multimodal LLMs (MLLMs) in their ability to interpret, integrate, and reason with visual, textual, and sensory-rich data in the context of Traditional Chinese Medicine (TCM). These benchmarks address the unique challenges posed by the multimodal nature of TCM diagnosis—including the observation of physical features (e.g., tongue, herbal appearance, pulse)—and the need for domain-specific, culturally grounded evaluation criteria. TCM Vision Benchmarks are critical for the objective and standardized assessment of AI systems that aim to achieve competence in TCM education, clinical support, and research.

1. Conceptual Definition and Rationale

TCM Vision Benchmarks provide structured protocols and datasets for evaluating the visual recognition, diagnostic reasoning, and multimodal integration capabilities of AI models in Traditional Chinese Medicine. Unlike general biomedical datasets, these benchmarks are constructed using authoritative TCM materials, national licensing examinations, diagnostic atlases, and multimodal clinical data. Their rationale stems from the recognition that TCM diagnostics and education inherently involve complex sensory modalities (visual inspection, pulse analysis, herb identification), which conventional LLMs and unimodal question-answering systems inadequately address. By incorporating images, audio, and physiological signals, the benchmarks enable the comprehensive evaluation of models in tasks such as Medicinal Recognition, Visual Diagnosis, multimodal question answering, and cross-modal reasoning.

2. Principal Datasets and Evaluation Protocols

Major TCM Vision Benchmarks include:

  • ShizhenGPT Visual Benchmark (Chen et al., 20 Aug 2025): Constructed from images extracted from seven TCM diagnostic atlases, this benchmark covers Medicinal Recognition (herb and decoction-piece identification) and Visual Diagnosis (tongue, palm, and eye images, diagnostic scenes). Questions are formatted as multiple-choice (one image, four candidate captions, one correct); a minimal evaluation loop for this format is sketched after the table below. The benchmark is systematically separated from training sets and uses up-to-date official exam materials.
  • TCM-Ladder (Xie et al., 29 May 2025): The first standardized multimodal QA dataset for TCM, incorporating 52,000+ questions spanning text, images, audio, and video. Subfields include fundamental theory, diagnostics, herbal formulas, surgery, pharmacognosy, pediatrics, and clinical dialogues. Images consist of over 6,000 herbal medicine photos and nearly 1,400 tongue images, complemented by diagnostic audio/video data. Evaluative tasks include single/multi-choice QA, fill-in-the-blank, diagnostic dialogue, and visual comprehension.
  • TCM-3CEval (Huang et al., 10 Mar 2025): A triaxial benchmark focused on three core dimensions—core knowledge mastery (TCM-Exam), classical text understanding (TCM-LitQA), and clinical decision-making (TCM-MRCD). Evaluates models’ ability to answer multidimensional questions and process real clinical cases but is largely text-based, with ongoing plans for multimodal expansion.
  • TCM Vision Benchmarks in Retrieval-Augmented Systems (Liu et al., 13 Feb 2025): Implements structured Q&A benchmarks using tree-organized retrieval over SPO-T (subject–predicate–object–text) knowledge extracted from classic TCM texts. Evaluates LLM performance on licensing exam and classics course exam datasets (MLE and CCE), focusing primarily on consistency, safety, recall, and explainability in generated answers.
  • TCC-Bench (Xu et al., 16 May 2025): Although primarily focused on traditional Chinese culture, TCC-Bench evaluates MLLMs’ visual reasoning with images representing artifacts, clinical scenes, and culturally relevant objects, intersecting with the needs of TCM visual diagnostics.
Benchmark    | Modalities Included                 | Task Types
ShizhenGPT   | Image, text, audio, signals         | Medicinal & visual diagnosis
TCM-Ladder   | Image, text, audio, video           | Multimodal QA, dialogue
TCM-3CEval   | Text (multimodal expansion planned) | Core theory, clinical cases
TCC-Bench    | Image, text (CN/EN)                 | Visual QA, reasoning
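
To make the multiple-choice protocol concrete, the sketch below shows a minimal scoring loop in Python. The item layout and the `model.answer` method are hypothetical interfaces assumed for illustration, not the published ShizhenGPT evaluation harness.

```python
import random

def evaluate_multiple_choice(model, items, seed=0):
    """Score a model on one-image / four-caption items.

    Each item is assumed to look like:
        {"image": "path/to/img.jpg",
         "choices": [...],   # four candidate captions
         "answer": 2}        # index of the correct caption
    `model.answer(image, choices)` is a hypothetical method returning
    the index of the caption the model selects.
    """
    rng = random.Random(seed)
    correct = 0
    for item in items:
        # Shuffle choices so position bias cannot inflate accuracy.
        order = list(range(len(item["choices"])))
        rng.shuffle(order)
        shuffled = [item["choices"][i] for i in order]
        pred = model.answer(item["image"], shuffled)
        if order[pred] == item["answer"]:
            correct += 1
    # Standard accuracy: correct responses / total responses.
    return correct / len(items)
```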

3. Methodologies for Data Construction and Evaluation

Construction of these benchmarks employs both automated and manual approaches:

  • Data Collection: Images are sourced from specialist TCM atlases, national museum archives, clinical records, and field photography. Audio and physiological data derive from real patient records, diagnostic sound recordings, and pulse sensors.
  • Annotation Pipelines: Automated annotation via LLMs (e.g., GPT-4o in text-only mode for TCC-Bench) is followed by human curation for quality, cultural authenticity, and avoidance of data leakage. Options and explanations in multiple-choice questions are designed to require genuine understanding and recognition rather than surface pattern matching.
  • Evaluation Metrics: Standard accuracy (correct responses / total responses), recall accuracy on knowledge-recall tasks, and manual scoring across dimensions (safety, consistency, explainability, compliance, self-consistency). TCM-Ladder introduces the Ladder-Score:

\text{Ladder-Score} = a \cdot \text{TermScore} + B \cdot \text{SemanticScore}

where a = 0.4 and B = 0.6; TermScore evaluates TCM-specific terminology usage, and SemanticScore covers logical consistency and comprehensiveness.
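
As a rough illustration, a direct implementation of this weighted composite is sketched below. It assumes TermScore and SemanticScore have already been computed (e.g., by a terminology matcher and a semantic grader) and are normalized to [0, 1]; the function name and normalization are assumptions for exposition.

```python
def ladder_score(term_score: float, semantic_score: float,
                 a: float = 0.4, b: float = 0.6) -> float:
    """Weighted composite of terminology and semantic quality.

    Both inputs are assumed normalized to [0, 1]; the default weights
    follow the values reported for TCM-Ladder (a = 0.4, B = 0.6 in the
    paper's notation).
    """
    return a * term_score + b * semantic_score

# Example: strong terminology use, moderate semantic coherence.
print(ladder_score(0.9, 0.7))  # 0.4*0.9 + 0.6*0.7 = 0.78
```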

  • Technical Integration: Visual encoders (e.g., Qwen2.5-VL), signal adapters for audio and physiological data, and fusion via MLP adapters align multimodal inputs to the LLM embedding space. Typical mapping for vision:

y = \text{MLP}(\text{Concat}(x_1, x_2, x_3, x_4))

where x_i are patch embeddings of adjacent image regions.
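
A minimal PyTorch sketch of such a patch-merging adapter is given below, assuming a ViT-style grid of patch embeddings and a 2x2 neighborhood merge; the dimensions, layer depth, and class name are illustrative assumptions rather than the published ShizhenGPT configuration.

```python
import torch
import torch.nn as nn

class PatchMergeAdapter(nn.Module):
    """Fuse 2x2 neighborhoods of patch embeddings into LLM tokens.

    Concatenates four adjacent patch embeddings (x1..x4) and maps the
    result into the LLM embedding space with a small MLP, i.e.
    y = MLP(Concat(x1, x2, x3, x4)).
    """
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, H, W, vision_dim); H and W assumed even.
        b, h, w, d = patches.shape
        # Regroup so each 2x2 neighborhood becomes one 4*d vector.
        x = patches.reshape(b, h // 2, 2, w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 2) * (w // 2), 4 * d)
        return self.mlp(x)  # (batch, merged_tokens, llm_dim)

# Usage: a 16x16 grid of 1024-d patches -> 64 tokens in a 4096-d space.
adapter = PatchMergeAdapter(vision_dim=1024, llm_dim=4096)
tokens = adapter(torch.randn(1, 16, 16, 1024))  # shape: (1, 64, 4096)
```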

4. Model Performance and Comparative Analysis

Experiments across these benchmarks illustrate distinct performance hierarchies:

  • Domain-Specific Multimodal LLMs (ShizhenGPT, Ladder-base): Outperform general models in visual TCM reasoning, medicinal recognition, and diagnostic accuracy; for example, ShizhenGPT-32B reaches accuracy in the mid-to-high 60% range across visual subtasks (Chen et al., 20 Aug 2025). Models trained with large-scale multimodal TCM corpora demonstrate improved unified perception across image, sound, pulse, and text.
  • General-Purpose Models (GPT-4o, Claude, Gemini-2.0 Flash): Display lower accuracy and higher misinterpretation rates when faced with domain-specific and culturally sensitive diagnostic images (Xu et al., 16 May 2025). Performance often drops in culturally nuanced subfields and complex multimodal clinical reasoning.
  • Linguistic and Cultural Alignment: Models pre-trained on Chinese corpora and TCM-specific data (e.g., DeepSeek, InternLM2.5, PULSE) outperform international counterparts, especially in classical text understanding and syndrome diagnosis (Huang et al., 10 Mar 2025).
  • Few-Shot and Chain-of-Thought (CoT) Prompting: Limited efficacy in improving visual recognition performance. In some cases, CoT prompts induced hallucinations and did not enhance accuracy, indicating the need for refined reasoning protocols in TCM vision tasks (Xu et al., 16 May 2025).

5. Evaluation Dimensions and Practical Significance

TCM Vision Benchmarks assess model outputs over multifaceted criteria:

  • Safety: Alignment with clinical safety standards, avoidance of contraindicated suggestions in responses (Liu et al., 13 Feb 2025).
  • Consistency and Self-consistency: Logical, semantic, and factual agreement within and across multi-turn or multimodal answers (Liu et al., 13 Feb 2025, Xie et al., 29 May 2025).
  • Explainability: Explicit reference to domain knowledge sources (e.g., SPO-T elements, visual evidence) supporting model decisions.
  • Compliance: Adherence to task requirements, with minimal digression or extraneous information.
  • Domain-specific Reasoning: Ability to perform syndrome differentiation, identify herbs, interpret images such as tongue photographs, and produce coherent diagnostic plans.

The combination of automatic scoring and manual expert evaluation ensures that models are judged not just on generic AI QA abilities but by their competence in rigorous clinical and educational contexts.

6. Impact on Research, Clinical Practice, and Future Directions

TCM Vision Benchmarks play a pivotal role in:

  • Standardized Model Assessment: Providing a unified protocol for fair and reproducible evaluation of TCM LLMs/MLLMs across multimodal diagnostic scenarios.
  • Advancing AI Integration in TCM: Enabling models to effectively assist clinicians, support education, and institutionalize evidence-based TCM practices (Liu et al., 13 Feb 2025).
  • Benchmark Availability and Expansion: Resources—datasets, leaderboards, code—are publicly accessible (e.g., https://tcmladder.com, http://github.com/Morty-Xu/TCC-Bench (Xie et al., 29 May 2025, Xu et al., 16 May 2025)), with provisions for continuous update based on contributions and real-world feedback.
  • Challenges and Open Problems: Addressing data scarcity, image quality, annotation labor, and expansion to underrepresented subfields (e.g., comprehensive pulse diagnostics, olfactory inspection). Future directions include enhanced automation in culturally sensitive data generation, multimodal fusion refinement, and deeper integration of clinical records (Huang et al., 10 Mar 2025, Xie et al., 29 May 2025).

A plausible implication is that mastery of these benchmarks may establish operational baselines for practical deployment of multimodal AI systems in TCM clinical environments, educational settings, and research workflows, bridging the gap between classical medical knowledge and contemporary AI technology.

7. Technical Innovations and Limitations

Technical advancements reflected in TCM Vision Benchmarks include:

  • Hierarchical Knowledge Organization: Tree-structured knowledge bases and retrieval-augmented generation techniques improve answer retrieval and self-reflective consistency (Liu et al., 13 Feb 2025); a simplified retrieval sketch follows this list.
  • Modality-Specific Adapters: Vision and signal encoders with adapter layers enable integrated processing of multimodal TCM data (Chen et al., 20 Aug 2025).
  • Robust Evaluation Metrics: Ladder-Score and manual scoring criteria tailored to TCM terminology and reasoning structures (Xie et al., 29 May 2025).
  • Resource Requirements: Competitive performance on these benchmarks requires substantial computational resources, extensive pre-training on large-scale multimodal corpora (e.g., ShizhenGPT uses 100GB+ of text and 200GB+ of multimodal data), and rigorous human curation for annotation and validation.
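
To illustrate the flavor of tree-organized retrieval over SPO-T knowledge (first bullet above), here is a simplified sketch. The data model, exact-match pruning, and helper names are assumptions for exposition, not the published system, which would use semantic matching and richer indexing.

```python
from dataclasses import dataclass, field

@dataclass
class SPOTNode:
    """A tree node holding subject-predicate-object-text (SPO-T) facts.

    Facts are grouped under topical nodes (e.g., herbs -> indications),
    so retrieval can descend relevant subtrees instead of scanning a
    flat fact list.
    """
    topic: str
    facts: list = field(default_factory=list)      # (subject, predicate, object, text) tuples
    children: list = field(default_factory=list)   # child SPOTNode objects

def retrieve(node: SPOTNode, query_terms: set, max_depth: int = 3) -> list:
    """Collect facts whose subject or object matches a query term."""
    hits = [f for f in node.facts if query_terms & {f[0], f[2]}]
    if max_depth > 0:
        for child in node.children:
            # Prune subtrees whose topic is absent from the query; a real
            # system would match semantically rather than by exact string.
            if child.topic in query_terms:
                hits += retrieve(child, query_terms, max_depth - 1)
    return hits

# Usage: a toy tree with a single herb fact.
root = SPOTNode("TCM", children=[
    SPOTNode("herbs", facts=[
        ("ginseng", "tonifies", "qi", "Ginseng greatly tonifies primal qi."),
    ]),
])
print(retrieve(root, {"herbs", "ginseng"}))
```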

Current limitations include the labor intensity of human verification, incomplete coverage of the vast and nuanced scope of TCM knowledge, and challenges in producing models that handle cross-modal ambiguity (e.g., polysemous TCM terms or visual features not explicitly represented in training data).


In summary, the TCM Vision Benchmark suite represents a fundamental advance in the evaluation of AI models for Traditional Chinese Medicine, emphasizing multimodal integration, culturally precise reasoning, and rigorous domain-specific assessment protocols. These benchmarks furnish researchers and practitioners with standardized resources to systematically measure and improve the capabilities of AI systems in complex TCM contexts.
