ShizhenGPT: Multimodal TCM Model
- ShizhenGPT is a multimodal large language model that integrates text, images, audio, and physiological signals to replicate holistic Traditional Chinese Medicine diagnostics.
- It leverages a vast, curated corpus—including over 100GB of TCM text and 200GB+ of multimodal data—and a specialized Qwen2.5-based architecture to ensure comprehensive domain encoding.
- Its advanced pretraining and instruction tuning strategies enable precise, unified diagnostic simulation, supporting both clinical decision-making and interdisciplinary research.
ShizhenGPT is the first multimodal LLM specifically designed for Traditional Chinese Medicine (TCM), incorporating extensive domain knowledge and sensory modalities to address two principal challenges: the scarcity of high-quality TCM data and the need for unified multimodal reasoning in TCM diagnostics, which traditionally involve looking, listening, smelling, and pulse-taking. ShizhenGPT achieves this by leveraging a large, curated corpus of text, images, audio, and physiological signals, alongside a specialized architectural backbone and instruction-tuning strategies. The model demonstrates superior performance in TCM-relevant evaluations and provides a platform for advanced diagnostic support, knowledge synthesis, and interdisciplinary research.
1. Data Collection and Multimodal Corpus Construction
ShizhenGPT is trained on the largest TCM-specific data resources assembled to date, integrating both textual and multimodal datasets:
- TCM Text Corpus: Over 5,000 classical texts were digitized and cleaned using MinerU, producing 3.8GB of book text. A 30K-term lexicon directed the extraction of web material from Common Crawl (2017–2023) and WeChat, which were rigorously filtered and deduplicated, yielding a high-quality online corpus of 21.2GB (5B tokens).
- TCM Image–Text Corpus: Approximately 17.6GB of image–text pairs were mined from book illustrations (51K images). An additional 5M web-sourced images were screened by a CLIP-based relevance classifier (a filtering sketch follows this list), yielding 140.7GB of verified TCM-relevant image–text pairs. Alignment was further strengthened with 40.6GB of synthetic captions generated by a multimodal LLM.
- Audio and Physiological Signal Data: Doctor–patient dialogues from Huatuo‑26M contributed 58K paired audio segments via TTS synthesis. Pulse diagnosis datasets, heart sounds, ECG, and olfactory sensor readings (CUHKSZ-Odors) were incorporated, with signals standardized via waveform conversion and sensor matrix representation.
- Aggregate Corpus Size: Collectively, over 100GB of clean TCM text and 200GB+ of multimodal data—including 1.2M images, 200 hours of audio, and numerous physiological signal records—form the foundation for comprehensive multimodal learning.
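The web-image filtering stage can be approximated with an off-the-shelf CLIP model that scores each image against TCM-related text prompts. This is a minimal sketch: the openai/clip-vit-base-patch32 checkpoint, the prompt wording, and the threshold are illustrative assumptions, not the classifier actually used in the pipeline.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's classifier and prompts are not specified here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

LABELS = [
    "a photo related to traditional Chinese medicine",
    "an unrelated photo",
]

def is_tcm_relevant(image_path: str, threshold: float = 0.6) -> bool:
    """Score one image against the two prompts and keep it if the
    TCM-related prompt wins with probability above the threshold."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=LABELS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # shape (1, 2)
    return probs[0, 0].item() >= threshold
```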
This data curation strategy ensures coverage of both classical theory and modern clinical realities, enabling machine representation of the “four diagnostic methods” central to TCM practice.
2. Model Architecture and Pretraining Strategies
ShizhenGPT is implemented on the Qwen2.5 backbone (offered in both 7B and 32B parameter scales), explicitly augmented for multimodal capability:
- Vision Encoder: Initialized from Qwen2.5-VL with 2D-RoPE and window attention, followed by a two-layer MLP adapter that aligns visual patch groups to the LLM’s embedding space.
- Signal Encoder: Based on Whisper‑large‑v3; audio and non-audio signals (e.g., pulse, smell) are resampled, transformed into mel-spectrograms, and mapped into the embedding space by a simple MLP (a preprocessing sketch follows this list).
- Unified Modality Embedding: All modalities are projected into the same token-level embedding space, facilitating integrated reasoning and enabling the model to process composite TCM evidence seamlessly.
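A minimal sketch of the signal pathway, assuming a Whisper-style log-mel frontend (16 kHz sampling, 25 ms window, 10 ms hop, 128 mel bins) and a two-layer MLP adapter; the encoder and LLM hidden sizes used below are illustrative assumptions, not the released configuration.

```python
import torch
import torchaudio

TARGET_SR = 16_000  # assumed Whisper-style sampling rate

# 25 ms window and 10 ms hop at 16 kHz -> n_fft=400, hop_length=160
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=TARGET_SR, n_fft=400, hop_length=160, n_mels=128
)

def preprocess_signal(waveform: torch.Tensor, sr: int) -> torch.Tensor:
    """Resample any 1-D signal (audio, pulse, ...) and convert it to a
    log-mel spectrogram, Whisper-frontend style."""
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    mel = mel_transform(waveform)      # (n_mels, frames)
    return torch.log(mel + 1e-6)       # log compression for numerical stability

class SignalAdapter(torch.nn.Module):
    """Two-layer MLP mapping pooled encoder features into the LLM embedding
    space; the 1280/3584 dimensions are placeholder values."""
    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(encoder_dim, llm_dim),
            torch.nn.GELU(),
            torch.nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)
```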
Pretraining Schedule
- Text-Only Pretraining (Stage 1): Utilizes 11.92B tokens (6.29B TCM-specific, 5.63B general) with 4096-token sequence packing; a packing sketch follows this list. The training loss is computed on all tokens except <|endoftext|> delimiters. Training proceeds for one epoch, learning rate η = 5×10⁻⁵, batch size 256, warmup ratio 0.005.
- Multimodal Pretraining (Stage 2): Uses 1.86B tokens of image–text data, 70M tokens of audio–text data, plus 1.75B tokens resampled from Stage 1. Maintains the same sequence length and learning rate, but reduces the batch size to 128.
- Instruction Tuning: Draws on modality-specific instruction sets—83K text, 65K vision (GPT‑4o-generated), plus audio and physiological signal tasks. Fine-tuning is run for three epochs (learning rate 5×10⁻⁶, batch size 128, warmup ratio 0.04) and applies loss exclusively on response tokens.
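A rough sketch of the Stage 1 sequence packing and loss masking: documents are concatenated with <|endoftext|> delimiters into fixed 4096-token chunks, and the delimiter positions are excluded from the loss. The delimiter token id is a placeholder to be read from the actual tokenizer.

```python
from typing import Iterable, List, Tuple

MAX_LEN = 4096          # packed sequence length used during pretraining
EOT_ID = 151643         # placeholder id for <|endoftext|>; look it up in the tokenizer
IGNORE_INDEX = -100     # label value ignored by the cross-entropy loss

def pack_documents(docs: Iterable[List[int]]) -> List[Tuple[List[int], List[int]]]:
    """Greedily pack tokenized documents into 4096-token sequences and build
    labels in which <|endoftext|> delimiters carry no loss."""
    sequences, buffer = [], []
    for doc in docs:
        buffer.extend(doc + [EOT_ID])
        while len(buffer) >= MAX_LEN:
            chunk, buffer = buffer[:MAX_LEN], buffer[MAX_LEN:]
            labels = [IGNORE_INDEX if t == EOT_ID else t for t in chunk]
            sequences.append((chunk, labels))
    return sequences
```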
This architecture supports both deep domain encoding and versatile multimodal reasoning, directly mirroring the real-world complexity of TCM diagnostics.
3. Evaluation: Benchmarks and Performance Assessment
ShizhenGPT is evaluated on both textual and multimodal benchmarks tailored for TCM:
- National TCM Exams: Recent licensing and graduate entrance examinations (2024–2025) for Pharmacist, Physician, and Assistant Physician roles provide rigorous, temporally distinct benchmarks for textual knowledge and reasoning.
- Visual Benchmark: A novel TCM Vision Benchmark (7,204 multiple-choice tasks) incorporates medicinal recognition (herb, material, decoction pieces), tongue/palm inspection, eye diagnosis, holistic syndromic analysis, and Tuina gesture recognition, drawn from authoritative TCM atlases.
- Signal-Based Evaluation: Pulse-based pregnancy detection tasks (80% accuracy), heart sound classification, and cough/ECG diagnostic tasks offer further multimodal validation on physiological signals.
In comparative analysis, ShizhenGPT (both 7B and 32B versions) consistently surpasses comparable-scale TCM LLMs and matches or exceeds larger general-purpose models (e.g., DeepSeek-R1, Doubao-1.5-Pro). The model achieves state-of-the-art performance in visual understanding among TCM multimodal LLMs and demonstrates robust, unified perception capabilities.
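As an illustration of how the multiple-choice benchmarks above can be scored, the sketch below extracts an option letter from each free-form model response and computes accuracy; the letter-extraction heuristic is an assumption for illustration, not the paper's exact evaluation protocol.

```python
import re

def score_multiple_choice(predictions: list[str], gold: list[str]) -> float:
    """Accuracy over A/B/C/D-style answers; takes the first standalone
    option letter found in each model response."""
    correct = 0
    for pred, answer in zip(predictions, gold):
        match = re.search(r"\b([A-D])\b", pred.upper())
        if match and match.group(1) == answer.strip().upper():
            correct += 1
    return correct / len(gold)

# Toy usage: 4 of 5 responses match the gold letters.
print(score_multiple_choice(["A", "The answer is B.", "C", "D", "A"],
                            ["A", "B", "C", "A", "A"]))  # -> 0.8
```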
4. Multimodal Reasoning and Unified Diagnostic Simulation
ShizhenGPT’s architecture enables holistic diagnostic simulation aligned with core TCM principles:
- Modality Integration: The model can process and fuse evidence from text, vision, audio, and physiological signals within a single inference pipeline by mapping all features into token embeddings (see the fusion sketch after this list).
- Technical Implementation: For audio and non-audio signals, input data are resampled (typically to 16kHz for waveform), transformed to mel-spectrograms (window size 25ms, hop size 10ms), and projected via MLP adapters.
- Diagnostic Reasoning Cases: Model outputs illustrate integrated analysis: tongue images inform syndrome discrimination, audio input (cough or heart sounds) complements subjective complaint, and pulse signals contribute direct physiological context. These fused modalities support reasoning chains that emulate TCM clinical workflow.
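A simplified sketch of the fusion step, assuming every modality has already been projected into a shared embedding space. Real pipelines typically splice modality tokens at placeholder positions inside the prompt rather than simply prepending them, so treat this as schematic.

```python
import torch

def fuse_modalities(text_emb: torch.Tensor,
                    image_tokens: torch.Tensor,
                    signal_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate projected image/signal token embeddings with text token
    embeddings so a single LLM forward pass sees all evidence.
    All inputs are (batch, seq_len, hidden) with matching hidden sizes."""
    return torch.cat([image_tokens, signal_tokens, text_emb], dim=1)

# Illustrative shapes only: 64 image tokens, 32 signal tokens, 128 text tokens,
# all already projected into an assumed 3584-dim shared embedding space.
fused = fuse_modalities(torch.randn(1, 128, 3584),
                        torch.randn(1, 64, 3584),
                        torch.randn(1, 32, 3584))
print(fused.shape)  # torch.Size([1, 224, 3584])
```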
This capability extends conventional NLP-based LLMs into a domain where machine learning can respond to the full spectrum of human sensory diagnostics—a necessity for TCM’s “four diagnostic methods.”
5. Methodological Contributions and Technical Specifics
ShizhenGPT establishes several leading methodological precedents:
- Data Filtering and Augmentation: Two-stage filtering (classifier scoring followed by semantic deduplication) improves corpus quality in both the textual and image domains; a deduplication sketch follows this list. Synthetic image–text pairing (using multimodal LLM alignment) expands data diversity and task coverage.
- Instruction Data Synthesis: Modality-specific instructions are generated and verified via advanced LLMs (GPT-4o, DeepSeek-V3), ensuring both coverage and reliability.
- Loss Function: Pretraining is governed by the standard autoregressive cross-entropy loss over tokens, $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$, computed over packed sequences of length T = 4096 with loss masked on <|endoftext|> delimiters.
- Training Hyperparameters:
- Text pretraining: batch size = 256, learning rate = 5×10⁻⁵, warmup ratio = 0.005.
- Multimodal pretraining: batch size = 128, learning rate = 5×10⁻⁵.
- Instruction tuning: batch size = 128, learning rate = 5×10⁻⁶, 3 epochs, warmup ratio = 0.04.
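The semantic-deduplication stage mentioned above can be sketched as embedding passages and greedily dropping near-duplicates above a cosine-similarity threshold. The sentence-transformers encoder and the 0.9 threshold are assumptions for illustration; the paper only names the step, not its exact implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_dedup(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Drop near-duplicate passages whose cosine similarity to an
    already-kept passage exceeds the threshold (greedy, O(n^2) for clarity)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder choice
    embeddings = model.encode(texts, normalize_embeddings=True)  # unit-norm vectors
    kept_idx: list[int] = []
    for i, emb in enumerate(embeddings):
        # Dot product of normalized vectors equals cosine similarity.
        if all(float(np.dot(emb, embeddings[j])) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [texts[i] for i in kept_idx]
```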
This combination of large-scale multitask data, advanced encoding, and finely tuned optimization supports the depth and breadth required for TCM multimodal reasoning.
6. Impact and Availability
ShizhenGPT demonstrates notable advancements that address previously unresolved challenges in TCM-AI integration:
- Holistic Diagnostic AI: By simultaneously ingesting text, vision, and clinical signals, ShizhenGPT provides machine-based support for the diagnostic process unique to TCM, paving the way for both real-time clinical assistance and enriched educational tools.
- Benchmark Leadership: The model leads both TCM textual and visual benchmarks, establishing a new standard in multimodal medical reasoning and challenging larger general-purpose proprietary systems.
- Resource Availability: Datasets, models, and code for ShizhenGPT are published and freely accessible, facilitating both reproducibility and collaborative, interdisciplinary research initiatives.
A plausible implication is the emergence of AI systems capable of holistic perception and reasoning in domains previously inaccessible to conventional, single-modality LLMs.
7. Future Directions
ShizhenGPT lays a foundation for further scientific exploration at the intersection of medicine and AI:
- Multimodal Expansion: Ongoing improvements in signal classification (e.g., pulse, smell) and image understanding may enhance diagnostic accuracy and coverage.
- Generalization to Other Domains: The architectural blueprint—multimodal encoders fused with a strong LLM backbone and large, curated datasets—offers a model for adapting holistic reasoning to other sensory-driven medical or scientific fields.
- Collaborative Research: The open-source nature encourages ideation and hybridization with additional domain-specific insights, fostering progress in multimodal and integrative AI.
In sum, ShizhenGPT represents a paradigm shift for TCM and multimodal medicine-AI research, offering a comprehensive solution to the challenges of data scarcity and sensory complexity in traditional diagnostics (Chen et al., 20 Aug 2025).