TCM Embedding Space (TCM-ES)

Updated 20 July 2025

TCM Embedding Space (TCM-ES) is a low-dimensional vector space that encodes the mapping from complex symptom patterns to herbal therapies, learned from large-scale empirical data.
It employs a Transformer-based autoencoder with contrastive learning to capture inter-symptom dependencies and derive interpretable latent dimensions linked to biological functions.
Validated with extensive clinical records, TCM-ES bridges traditional TCM diagnostics with modern biomedical insights, enabling drug repurposing and predictive knowledge graphs.

The TCM Embedding Space (TCM-ES) is a quantitatively derived, low-dimensional vector space that encodes the empirical mappings between complex symptom patterns and herbal therapies in Traditional Chinese Medicine (TCM). Constructed using large-scale TCM formula records and validated with extensive clinical data, TCM-ES provides a unified and interpretable representation that not only quantifies TCM diagnostic and treatment principles but also facilitates integration with modern biomedical entities such as diseases, proteins, and drugs. The latent dimensions of the embedding have been shown to correspond to key biological functions, and the structure enables discovery of latent disease relationships, quantification of efficacy, and the development of comprehensive TCM-informative knowledge graphs (Li et al., 15 Jul 2025).

1. Construction and Theoretical Foundations

TCM-ES is constructed using a Transformer-based autoencoder trained on a corpus of 84,491 ancient and classical TCM formula records. Each record consists of a symptom pattern (complex, patient-specific condition) and a corresponding herbal prescription. The encoder transforms the symptom pattern into a low-dimensional latent space via multiple Transformer layers with multi-head attention, capturing both inter-symptom dependencies and symptom-herb correspondences. The decoder reconstructs the prescription based on this representation. Additionally, a contrastive learning objective is imposed, such that matched symptom–herb pairs are embedded closer together than unmatched pairs.

Mathematically, the process can be formalized as:

Symptom pattern encoding: $z = f_{\text{enc}}(\text{symptom pattern})$
Prescription decoding: $\hat{y} = f_{\text{dec}}(z)$
Contrastive objective: a loss term that pulls matched symptom–herb pairs together in the embedding space and pushes unmatched pairs apart
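
The source does not state the exact form of the contrastive loss. As a minimal sketch, assuming an InfoNCE-style objective over cosine similarities (the function name `info_nce_loss`, the `temperature` parameter, and the toy vectors are all illustrative assumptions, not taken from the source), the idea that matched pairs should score higher than unmatched ones could look like this:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce_loss(symptom_vecs, herb_vecs, temperature=0.1):
    """Mean InfoNCE loss; index i in both lists is assumed to be a matched pair.

    This is an assumed loss form for illustration, not the published objective.
    """
    s = [normalize(v) for v in symptom_vecs]
    h = [normalize(v) for v in herb_vecs]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of symptom i to every prescription in the batch.
        logits = [dot(s_i, h_j) / temperature for h_j in h]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matched prescription as the target.
        total += log_sum_exp - logits[i]
    return total / len(s)

# Toy 2-D latent vectors: aligned pairs versus deliberately swapped pairs.
symptoms = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce_loss(symptoms, [[0.9, 0.1], [0.1, 0.9]])
shuffled = info_nce_loss(symptoms, [[0.1, 0.9], [0.9, 0.1]])
print(matched < shuffled)
```

The loss is small when each symptom vector lies closest to its own prescription and grows when the pairing is scrambled, which is the behaviour the contrastive term is meant to enforce.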

Principal component analysis (PCA) of the resulting 256-dimensional embedding shows that the first three principal components (PCs) capture over 72.8% of the variance, indicating a compact latent structure that reflects real-world clinical co-occurrences and compatibilities.
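
As a worked restatement, let $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{256}$ denote the eigenvalues of the covariance matrix of the embedding (this notation is introduced here for illustration and is not taken from the source). The reported figure then corresponds to the explained-variance ratio of the first three components:

$\frac{\lambda_1 + \lambda_2 + \lambda_3}{\sum_{i=1}^{256} \lambda_i} > 0.728$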

2. Quantitative Characterization and Embedding Structure

Within TCM-ES, each entity—symptom, syndrome (complex symptom pattern), herb, or formula—is represented as a vector in a continuous space. The learned geometry is such that:

Distance between symptom vectors is negatively correlated with their co-occurrence frequency in real clinical data.
Herb or herb pair distances align with clinical co-prescription frequencies.
The embedding coordinates are interpretable; for instance, clusters of diseases or herbs correspond to known TCM syndromes and formula families, despite syndrome labels being masked during training.
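
The first property in the list above can be checked with a simple correlation test. The following is a minimal sketch on made-up data: the symptom names, 2-D vectors, and co-occurrence counts are hypothetical placeholders, and the choice of cosine distance and Pearson correlation is an assumption, since the source does not name the metric or the statistic used.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical symptom embeddings (toy values, not from TCM-ES).
embeddings = {
    "symptom_a": [1.0, 0.1],
    "symptom_b": [0.9, 0.3],
    "symptom_c": [0.1, 1.0],
    "symptom_d": [0.2, 0.9],
}
# Hypothetical co-occurrence counts for each unordered symptom pair.
cooccurrence = {
    ("symptom_a", "symptom_b"): 120,
    ("symptom_a", "symptom_c"): 8,
    ("symptom_a", "symptom_d"): 10,
    ("symptom_b", "symptom_c"): 12,
    ("symptom_b", "symptom_d"): 15,
    ("symptom_c", "symptom_d"): 95,
}

pairs = list(combinations(embeddings, 2))
distances = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
counts = [cooccurrence[p] for p in pairs]
r = pearson(distances, counts)
print(r < 0)  # pairs that co-occur often sit closer together in this toy example
```

The same pattern applies to herb pairs, with co-prescription frequency in place of symptom co-occurrence.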

Importantly, the axes of greatest variance (PC1–PC3) correlate with specific biological functions (see §4 below). Individual neurons within the bottleneck layer exhibit selectivity for certain TCM syndromes, forming quantitative fingerprints of diagnostic categories.

3. Clinical and Biomedical Integration

Validation with over 18,000 hospital-based and general TCM clinical cases, as well as 150 long COVID cases, demonstrates that formulas embedded closer to a patient’s symptom vector are empirically associated with better clinical outcomes. The embedding was extended via “correspondence alignment” to modern entities:

Diseases are mapped based on associated TCM symptoms.
Herbal compounds inherit the embeddings of their herbs.
Target proteins and FDA-approved drugs are projected via known molecular interactions or shared clinical indications.
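
The source does not spell out how correspondence alignment is computed. One simple reading, shown here purely as an assumption, is that a modern entity takes the average of the embeddings of the TCM entities it is linked to: a disease averages its associated symptom vectors, and a compound averages (or, with a single source, copies) its herb vectors. All names and values below are hypothetical placeholders.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical embeddings already learned for TCM entities (toy values).
symptom_emb = {"symptom_a": [1.0, 0.0], "symptom_b": [0.5, 0.5]}
herb_emb = {"herb_a": [0.25, 0.75]}

# Hypothetical correspondence tables linking modern entities to TCM entities.
disease_symptoms = {"disease_x": ["symptom_a", "symptom_b"]}
compound_herbs = {"compound_1": ["herb_a"]}

# Assumed alignment rule: average the embeddings of the linked TCM entities.
disease_emb = {
    disease: mean_vector([symptom_emb[s] for s in symptoms])
    for disease, symptoms in disease_symptoms.items()
}
compound_emb = {
    compound: mean_vector([herb_emb[h] for h in herbs])
    for compound, herbs in compound_herbs.items()
}

print(disease_emb["disease_x"])    # [0.75, 0.25]
print(compound_emb["compound_1"])  # [0.25, 0.75]
```

Proteins and drugs could be placed by the same rule over their known interactions or shared indications, though the weighting actually used is not given in the source.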

Proximity in TCM-ES not only reproduces observed co-treatment or co-diagnosis patterns, but also tracks with genetic relationships on the human protein–protein interaction (PPI) network, suggesting the embedding captures latent biological connectivity.

4. Biological Associations of Latent Dimensions

Analysis of the principal components of the TCM-ES shows strong associations with key biological processes:

PC1: Metabolism (including reactive oxygen species and fatty acid metabolism)
PC2: Immune regulation (e.g., immune response modulation, T cell activation)
PC3: Homeostatic mechanisms (such as lipid homeostasis and organism-level equilibrium)

The structure of the embedding—emerging solely from ancient empirical records without molecular input—aligns closely with essential biological functions, indicating that the core axes of TCM-ES reflect the physiological underpinnings of TCM theory.

5. Discovery of Latent Disease Relationships and Efficacy Assessment

TCM-ES uncovers hidden relationships among diseases through clustering in the vector space. Construction of a k-nearest neighbor (KNN) network of disease embeddings reveals that diseases assigned to similar MeSH categories cluster together—even where symptomatic or genetic overlap is not immediately apparent.
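
A KNN network of this kind can be built directly from the embedding vectors. The sketch below assumes Euclidean distance and uses hypothetical disease names and 2-D toy coordinates; the distance metric and the value of k used in the original study are not stated in the source.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_graph(embeddings, k):
    """Map each node to the names of its k nearest neighbours."""
    graph = {}
    for name, vec in embeddings.items():
        neighbours = sorted(
            (euclidean(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name
        )
        graph[name] = [other for _, other in neighbours[:k]]
    return graph

# Hypothetical disease embeddings forming two well-separated groups.
disease_emb = {
    "disease_a": [0.0, 0.0],
    "disease_b": [0.1, 0.0],
    "disease_c": [5.0, 5.0],
    "disease_d": [5.2, 5.0],
}

graph = knn_graph(disease_emb, k=1)
print(graph["disease_a"])  # ['disease_b']
print(graph["disease_c"])  # ['disease_d']
```

Comparing the resulting neighbourhoods with external category labels such as MeSH is then a matter of counting how often linked diseases share a category.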

A bidirectional z-score (BZS) is defined for drug–disease pairs as:

$\text{BZS}(x, y) = \frac{Z_{x-y}(x, y) + Z_{y-x}(x, y)}{2}$

where $Z_{x-y}(x, y)$ and $Z_{y-x}(x, y)$ quantify the specificity and closeness, respectively, of the drug and disease in TCM-ES. Lower BZS values indicate better mutual matching and higher predicted efficacy; they were found to correlate with clinical improvement.
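
The source gives the averaging formula but not the precise definition of the two z-scores. The sketch below therefore rests on an explicit assumption: $Z_{x-y}$ standardizes the drug–disease distance against the distances from that drug to all diseases, and $Z_{y-x}$ standardizes it against the distances from that disease to all drugs. The helper names, the Euclidean metric, and the toy embeddings are likewise hypothetical.

```python
import math
from statistics import mean, pstdev

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def z_score(value, reference):
    """Standardize value against a reference distribution of distances."""
    sd = pstdev(reference)
    return (value - mean(reference)) / sd if sd > 0 else 0.0

def bzs(drug, disease, drug_emb, disease_emb):
    """Bidirectional z-score under the assumed definitions of the two z terms."""
    d_xy = euclidean(drug_emb[drug], disease_emb[disease])
    # Assumed Z_{x-y}: this drug's distance to every disease as the reference.
    from_drug = [euclidean(drug_emb[drug], v) for v in disease_emb.values()]
    # Assumed Z_{y-x}: this disease's distance to every drug as the reference.
    from_disease = [euclidean(v, disease_emb[disease]) for v in drug_emb.values()]
    return 0.5 * (z_score(d_xy, from_drug) + z_score(d_xy, from_disease))

# Hypothetical toy embeddings.
drug_emb = {"drug_a": [0.0, 0.0], "drug_b": [4.0, 4.0], "drug_c": [8.0, 0.0]}
disease_emb = {
    "disease_a": [0.5, 0.0],
    "disease_b": [4.0, 3.5],
    "disease_c": [8.0, 1.0],
}

close_pair = bzs("drug_a", "disease_a", drug_emb, disease_emb)
far_pair = bzs("drug_a", "disease_c", drug_emb, disease_emb)
print(close_pair < far_pair)  # the nearby pair receives the lower score
```

Averaging the two directions rewards pairs that are close from both sides: a drug that is near many diseases, or a disease that is near many drugs, does not score well on proximity alone.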

6. Knowledge Graph Construction and Translational Potential

Utilizing the TCM-ES, a comprehensive TCM knowledge graph was established, encompassing nodes such as diseases, herbs/formulas, herbal compounds, targets, and drugs, all embedded within the same latent space. Edges connect disease nodes to their nearest neighbors—including proteins, compounds, and even other formula nodes—enabling prediction of new therapeutic associations.

For example, in the case of Rheumatoid Arthritis, the graph distinguished between Cold and Heat subtypes and identified candidate drugs, proteins, and compounds, including both established and novel associations supported by independent biological evidence.

This embedding-based graph framework provides a platform for hypothesis generation, drug repurposing, and translational research bridging traditional TCM and contemporary biomedical science.

7. Methodological Context and Broader Impact

TCM-ES complements and extends prior TCM informatics work that leveraged deep learning, network pharmacology, and graph neural networks for representing chemical, biological, and compatibility features of herbal medicines (Zeng et al., 18 Nov 2024, Wang et al., 17 Aug 2024). Unlike approaches exclusively focused on molecular or symbolic relationships, TCM-ES is derived from empirical clinical mapping of symptoms to therapies using large-scale unsupervised learning, enabling universal quantification and integration with external biomedical data.

The interpretable, continuous, cross-domain structure of TCM-ES represents a significant advance in standardizing, analyzing, and translating TCM principles. As such, it serves as a foundation for systematic exploration of TCM practice, efficacy measurement of disease-drug pairs, and the identification of new translational opportunities for both drug development and precision medicine (Li et al., 15 Jul 2025).