TongueAtlas-4K: High-Res Tongue Imaging
- TongueAtlas-4K is a comprehensive resource comprising 4,000 expert-labeled clinical tongue images, together with a conceptually related high-resolution 4D MRI atlas methodology for motion and anatomical studies.
- The dataset employs rigorous annotation protocols and ISO standards to ensure high label fidelity across 22 diagnostic features organized hierarchically.
- Its high-resolution imaging and advanced motion analysis techniques enable precise quantification of tongue attributes for enhanced diagnostic automation and reproducibility.
TongueAtlas-4K denotes a family of comprehensive, high-fidelity resources for tongue imaging and analysis in diagnostic medicine and biomedical research. In the current literature, the term refers specifically to a large-scale, expert-annotated dataset of 4,000 clinical tongue photographs for multi-label visual diagnosis, together with a conceptual extension to high-resolution ("4K") 4D MRI atlases for quantitative motion and anatomical studies. These resources advance the automation, reproducibility, and precision of tongue attribute quantification, particularly in Traditional Chinese Medicine (TCM) diagnostics and in biomechanical studies of lingual function.
1. Dataset Composition and Label Taxonomy
The TongueAtlas-4K diagnostic dataset comprises 4,000 expert-curated tongue photographs, each independently annotated for 22 distinct diagnostic features. Labels are organized hierarchically across four clinical dimensions:
- Tongue Color (5 labels): pale, light-red, red, dark-red, blue-purple
- Tongue Shape (7 labels): tender, tough, thin, enlarged, spots/thorns, cracks, teeth marks
- Tongue Coating Property (7 labels): none, peeled, thin, thick, moist, dry, rotten/greasy
- Tongue Coating Color (3 labels): white, yellow, gray-black
Each label’s prevalence varies significantly, reflecting real-world class imbalance—e.g., “white coating” occurs in 78.38% of cases, whereas “dark-red tongue” and “gray-black coating” are rare (2.15% and 3.35%, respectively). Labels within each dimension can co-occur, so cumulative per-dimension frequencies can exceed 100% (Kong et al., 13 Nov 2025).
| Dimension | # Labels | Most Frequent (%) | Rarest (%) |
|---|---|---|---|
| Tongue color | 5 | Light-red (52.8) | Dark-red (2.15) |
| Tongue shape | 7 | Teeth marks (53.7) | Tender (4.6) |
| Coating property | 7 | Thin (67.6) | Peeled (3.5) |
| Coating color | 3 | White (78.4) | Gray-black (3.4) |
This taxonomy, codified via ISO 23961-1:2021, supports multi-label classification and aligns with international TCM diagnostic standards.
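Because labels co-occur both within and across dimensions, the natural target representation is a 22-slot multi-hot vector. A minimal sketch of such an encoding follows; the dictionary keys and function names are illustrative, not the dataset's actual schema. Note that a plain flat name index would be ambiguous, since "thin" appears under both tongue shape and coating property, so labels are keyed by (dimension, name) pairs.

```python
# Hypothetical multi-hot encoder for the 22-label taxonomy above.
# Dimension and label names follow the ISO 23961-1:2021 taxonomy listed
# in the text; the schema itself is an illustrative assumption.

TAXONOMY = {
    "tongue_color": ["pale", "light-red", "red", "dark-red", "blue-purple"],
    "tongue_shape": ["tender", "tough", "thin", "enlarged",
                     "spots/thorns", "cracks", "teeth marks"],
    "coating_property": ["none", "peeled", "thin", "thick",
                         "moist", "dry", "rotten/greasy"],
    "coating_color": ["white", "yellow", "gray-black"],
}

# Key by (dimension, label): "thin" occurs in two dimensions, so bare
# label names would collide in a flat index.
LABEL_INDEX = {(dim, name): i for i, (dim, name) in enumerate(
    (dim, n) for dim, labels in TAXONOMY.items() for n in labels)}

def encode(labels):
    """Map a set of (dimension, label) pairs to a 22-dim multi-hot vector."""
    vec = [0] * len(LABEL_INDEX)
    for key in labels:
        vec[LABEL_INDEX[key]] = 1
    return vec

v = encode({("tongue_color", "light-red"), ("tongue_shape", "teeth marks"),
            ("coating_property", "thin"), ("coating_color", "white")})
```

This fixed ordering lets per-label prevalence and co-occurrence statistics be computed directly by summing vectors over the dataset.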
2. Annotation Protocol and Label Fidelity
Annotation involved ten classically trained TCM tongue-diagnosis practitioners. Each image received independent labels from all annotators according to ISO 23961-1:2021 terminologies (official English and Chinese definitions). A cross-review phase involved blinded label exchange for peer validation; discrepancies initiated a dual-expert audit, with final resolution by a senior TCM expert. While the pipeline emphasizes consensus and protocol rigor, explicit inter-annotator agreement statistics (e.g., Cohen’s κ) are not reported. The stated aim is maximized label fidelity via multi-stage human consensus (Kong et al., 13 Nov 2025).
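Since inter-annotator agreement statistics are not reported, researchers re-annotating subsets could quantify agreement themselves. A small sketch of Fleiss' kappa, which handles a fixed panel of raters (here, ten) per item, is shown below; the function name and call pattern are illustrative, not part of the dataset's tooling.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] = number of raters who assigned item i to category j;
    every row must sum to the same rater count (10 for TongueAtlas-4K's
    annotation panel). Returns 1.0 for perfect agreement, ~0 for
    chance-level agreement, negative for systematic disagreement.
    """
    N = len(counts)                       # number of items
    n = sum(counts[0])                    # raters per item
    k = len(counts[0])                    # number of categories
    # Marginal category proportions across all ratings.
    p_j = [sum(counts[i][j] for i in range(N)) / (N * n) for j in range(k)]
    # Per-item observed agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                  # mean observed agreement
    P_e = sum(p * p for p in p_j)         # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

For example, three items rated unanimously by ten raters yield kappa = 1, while an even 5/5 split on every item yields a negative kappa.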
3. Image Acquisition, Processing, and Augmentation
The underlying clinical images were sourced from two independent centers. Device type, image resolution, and camera models are not reported. Image preprocessing included haze removal (to mitigate specular reflection and moisture fog) and reflectance normalization to reduce lighting variability.
Region-of-interest extraction proceeded via a DeepLabV3+ semantic segmentation model, with manual mask refinement in ITK-SNAP to optimize anatomical accuracy. All released images are tongue-segmented, background-free photographs.
For downstream model training, data augmentation (including RandAugment and random erasing) was used only during classifier fine-tuning to upsample minority (rare-label) classes, and is not part of the canonical dataset (Kong et al., 13 Nov 2025).
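To make the fine-tuning-only role of augmentation concrete, here is a dependency-free sketch of random erasing (the actual pipeline presumably uses library implementations such as torchvision's; this stand-in operates on a plain H×W nested list). The function name and parameters are illustrative.

```python
import random

def random_erase(img, scale=(0.02, 0.2), value=0, rng=None):
    """Blank out one random rectangle covering scale[0]..scale[1] of the
    image area — a minimal stand-in for the random-erasing augmentation
    applied only during classifier fine-tuning, never to the released
    dataset itself. `img` is an H x W nested list; returns a new list."""
    rng = rng or random.Random()
    h, w = len(img), len(img[0])
    area = h * w * rng.uniform(*scale)
    eh = max(1, min(h, int(area ** 0.5)))     # erased-region height
    ew = max(1, min(w, int(area / eh)))       # erased-region width
    top = rng.randrange(h - eh + 1)
    left = rng.randrange(w - ew + 1)
    out = [row[:] for row in img]             # copy; input left untouched
    for r in range(top, top + eh):
        for c in range(left, left + ew):
            out[r][c] = value
    return out
```

In practice such transforms would be sampled more aggressively for images carrying rare labels, implementing the minority-class upsampling described above.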
4. Dataset Properties, Imbalance, and Unlabeled Pretraining Pool
The label distribution spans several orders of magnitude, with pronounced skew toward common presentations (e.g., thin coating, light-red color) and strong underrepresentation for rarer pathologies. This imbalance must be explicitly addressed in modeling, for example, via asymmetric loss or boosting ensembles.
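The asymmetric loss mentioned above (in the style of Ridnik et al.'s ASL for multi-label classification) down-weights easy negatives so that the overwhelming mass of absent rare labels does not dominate training. A minimal per-sample sketch on probabilities, with illustrative default hyperparameters, follows:

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=1.0, gamma_neg=4.0, clip=0.05):
    """Asymmetric loss (ASL-style) for one multi-label sample.

    probs:   per-label predicted probabilities in [0, 1]
    targets: per-label ground truth (0/1)
    Negatives are focused harder than positives (gamma_neg > gamma_pos),
    and negatives with probability below the margin `clip` are shifted to
    zero contribution — countering the heavy skew toward absent rare labels.
    Hyperparameter values here are illustrative defaults, not the paper's.
    """
    loss = 0.0
    for p, t in zip(probs, targets):
        if t:   # positive label: mild focusing on hard positives
            loss -= (1 - p) ** gamma_pos * math.log(max(p, 1e-8))
        else:   # negative label: probability shifting + strong focusing
            pm = max(p - clip, 0.0)
            loss -= pm ** gamma_neg * math.log(max(1 - pm, 1e-8))
    return loss
```

Confident correct predictions incur near-zero loss, while a missed positive on a rare label is penalized almost at full cross-entropy weight.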
No demographic or clinical metadata—subject age, sex, comorbidity—accompanies the released dataset. Image acquisition metadata (e.g., device make/model, camera settings) is similarly absent, complicating certain analyses of confounders or device effects.
An auxiliary set of 15,905 unlabeled tongue images is published for self-supervised pretraining (e.g., with masked autoencoders), supporting transfer learning and improved representation fidelity in settings with annotation scarcity (Kong et al., 13 Nov 2025).
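The masking step at the core of masked-autoencoder pretraining can be sketched in a few lines; this shows only the patch-index sampling (the encoder/decoder are omitted), with an assumed 75% mask ratio and 14×14 patch grid typical of MAE-style recipes, since the paper specifies the approach only at the level of "masked autoencoders".

```python
import random

def sample_mask(num_patches, mask_ratio=0.75, rng=None):
    """Choose which patch indices to hide for masked-autoencoder-style
    pretraining on the unlabeled image pool. Returns (masked, visible)
    index lists; only visible patches would be fed to the encoder, and
    the decoder would be trained to reconstruct the masked ones.
    mask_ratio=0.75 is an assumed, MAE-typical value."""
    rng = rng or random.Random()
    n_mask = int(num_patches * mask_ratio)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    return sorted(idx[:n_mask]), sorted(idx[n_mask:])

# e.g., a 224x224 image split into 16x16 patches -> 14*14 = 196 patches
masked, visible = sample_mask(196, rng=random.Random(0))
```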
5. Benchmarks, Metrics, and Machine Learning Baselines
TongueAtlas-4K establishes a benchmark task suite for multi-label tongue diagnosis. Baseline performance was assessed using a 10% held-out test set (averaged across five runs) for a selection of modern computer vision models, using standard metrics for multi-label classification: Precision, Recall (Sensitivity), F1-score (including Macro-, Micro-, and Example-F1), and Macro PR-AUC.
| Method | Macro-F1 | Macro Precision | Macro Recall | Macro PR-AUC |
|---|---|---|---|---|
| LGAN | 0.397 | 0.505 | 0.369 | 0.492 |
| YOLO12-CLS | 0.290 | 0.379 | 0.275 | 0.401 |
| Faster R-CNN | 0.381 | 0.485 | 0.339 | 0.493 |
| IFRCNet | 0.246 | 0.315 | 0.246 | 0.492 |
| DenseNet121 | 0.403 | 0.487 | 0.364 | 0.351 |
| C-GMVAE | 0.346 | 0.459 | 0.305 | 0.526 |
Macro-F1 scores remain modest (≤0.403), reflecting class imbalance and the inherent subtlety of the task. LGAN and DenseNet121 perform best by Macro-F1; C-GMVAE offers the highest Macro PR-AUC. Class-frequency skew and inter-label co-occurrence pose persistent challenges (Kong et al., 13 Nov 2025).
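Macro-F1, the headline metric in the table above, averages per-label F1 with equal weight, so a rare label like "gray-black coating" counts as much as "white coating" — which is exactly why scores stay modest under heavy imbalance. A dependency-free sketch (scikit-learn's `f1_score` with `average="macro"` computes the same quantity):

```python
def macro_f1(y_true, y_pred):
    """Macro-F1 over L labels from binary multi-hot predictions.

    y_true, y_pred: N x L nested lists of 0/1. Per-label F1 is computed
    from that label's TP/FP/FN and then averaged with equal weight per
    label, so rare labels influence the score as much as common ones.
    Labels with no positives and no predictions score 0 here (one common
    convention; implementations differ on this edge case)."""
    n_labels = len(y_true[0])
    f1s = []
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_labels
```

Micro-F1, by contrast, pools TP/FP/FN across all labels before computing F1, which is why it can look far healthier than Macro-F1 on skewed label distributions.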
6. Recommendations, Caveats, and Intended Applications
Researchers must address severe label imbalance and lack of subject-level metadata. Specialized techniques such as asymmetric loss, targeted augmentation, and boosting are recommended for rare labels. The lack of standardized imaging conditions and demographic annotation constrains certain analyses; future work could extend the dataset protocol to cover these confounders.
Use cases include multi-label classification research, benchmarking new computer vision and machine learning frameworks in TCM, and self-supervised pretraining on unlabeled tongue data. While the dataset’s ontology is tongue-specific, its architecture and imbalance characteristics suit it as a tractable surrogate for broader multi-label, small-cohort medical image tasks (Kong et al., 13 Nov 2025).
7. High-Resolution (4K) Tongue Motion Atlas in MRI: Conceptual Extension
In parallel with the photographic dataset, a high-resolution statistical multimodal atlas of 4D tongue motion for speech (e.g., “Speech Map” approach) provides the methodological precedent for constructing an anatomical and functional “TongueAtlas-4K” using cine- and tagged-MRI (Woo et al., 2017).
This concept involves the following stages:
- Anatomical Reference Construction: Group-wise diffeomorphic registration (ANTs/SyN) of cine-MRI stacks generates an unbiased template with mappings between subject and atlas spaces, scalable to ≤0.6 mm isotropic “4K” resolution.
- Motion Estimation: The Phase Vector Incompressible Registration Algorithm (PVIRA) processes CSPAMM tagged MRI to extract subject-space motion fields φ(x, t), enforcing incompressibility within tissue masks.
- Atlas-Space Transformation: Individual motion fields are conjugated into template space (φ_atlas = ψ ∘ φ_subject ∘ ψ⁻¹, where ψ is the subject-to-atlas diffeomorphism).
- Quantitative Motion/Strain Analysis: Displacement u, deformation gradient F = I + ∂u/∂X, and Lagrangian strain E = ½(FᵀF − I) allow computation of principal strains and kinematic quantities, with scalar summaries such as mean displacement and maximum shear mapped across populations.
- Low-Rank Variability Modeling: Principal Component Analysis (PCA) on high-dimensional deformation fields (vectors in ℝ^(3N), where N is the number of mask voxels) yields the dominant axes of normal and pathological variability for speech motor tasks.
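The strain-analysis step above can be sketched numerically: given a displacement field on a regular voxel grid, central differences give the deformation gradient F = I + ∂u/∂X, from which the Green-Lagrange strain follows. This is a numpy illustration under the stated definitions, not the PVIRA implementation.

```python
import numpy as np

def lagrangian_strain(u, spacing=1.0):
    """Deformation gradient and Green-Lagrange strain on a regular grid.

    u: displacement field of shape (3, X, Y, Z), u[i] = i-th component.
    Builds F = I + dU/dX by central differences, then
    E = 0.5 * (F^T F - I) at every voxel. A sketch of the strain-analysis
    step described above, not the atlas pipeline's actual code."""
    # grads[i, j] = dU_i / dX_j, each of shape (X, Y, Z)
    grads = np.stack([np.stack(np.gradient(u[i], spacing), axis=0)
                      for i in range(3)], axis=0)
    eye = np.eye(3)[:, :, None, None, None]
    F = eye + grads
    # (F^T F)_ij = sum_k F_ki F_kj, voxel-wise
    E = 0.5 * (np.einsum('ki...,kj...->ij...', F, F) - eye)
    return F, E

# Zero displacement: F is the identity, strain vanishes everywhere.
u0 = np.zeros((3, 4, 4, 4))
F0, E0 = lagrangian_strain(u0)
```

Principal strains then come from the voxel-wise eigendecomposition of E, and flattening the per-voxel fields into ℝ^(3N) vectors sets up the PCA step directly.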
Application domains range from speech science to surgical simulation, neurodegenerative diagnostics, and as initialization data for biomechanical finite-element modeling. The protocol operationalizes full 4D (3D+time) motion fields in a fully Lagrangian, group-comparable frame (Woo et al., 2017).
A plausible implication is that the standards and protocols developed for photographic diagnostic datasets and 4D MRI-based motion atlases are converging, enabling cross-modal analysis and integrative research in tongue function, pathology, and computer-assisted diagnosis.