TongueAtlas-4K: High-Res Tongue Imaging
- TongueAtlas-4K is a comprehensive resource comprising 4,000 expert-labeled clinical tongue images, together with a conceptually related high-resolution 4D MRI atlas methodology for motion and anatomical studies.
- The dataset employs rigorous annotation protocols and ISO standards to ensure high label fidelity across 22 diagnostic features organized hierarchically.
- Its high-resolution imaging and advanced motion analysis techniques enable precise quantification of tongue attributes for enhanced diagnostic automation and reproducibility.
TongueAtlas-4K denotes a family of comprehensive, high-fidelity resources for tongue imaging and analysis in diagnostic medicine and biomedical research. In the current literature, the term refers specifically to a large-scale, expert-annotated dataset of 4,000 clinical tongue photographs for multi-label visual diagnosis, together with a conceptual extension to high-resolution ("4K") 4D MRI atlases for quantitative motion and anatomical studies. These resources advance the automation, reproducibility, and precision of tongue attribute quantification, particularly in Traditional Chinese Medicine (TCM) diagnostics and in biomechanical studies of lingual function.
1. Dataset Composition and Label Taxonomy
The TongueAtlas-4K diagnostic dataset comprises 4,000 expert-curated tongue photographs, each independently annotated for 22 distinct diagnostic features. Labels are organized hierarchically across four clinical dimensions:
- Tongue Color (5 labels): pale, light-red, red, dark-red, blue-purple
- Tongue Shape (7 labels): tender, tough, thin, enlarged, spots/thorns, cracks, teeth marks
- Tongue Coating Property (7 labels): none, peeled, thin, thick, moist, dry, rotten/greasy
- Tongue Coating Color (3 labels): white, yellow, gray-black
Each label’s prevalence varies significantly, reflecting real-world class imbalance—e.g., “white coating” occurs in 78.38% of cases, whereas “dark-red tongue” and “gray-black coating” are rare (2.15% and 3.35%, respectively). Labels within each dimension can co-occur, so cumulative per-dimension frequencies can exceed 100% (Kong et al., 13 Nov 2025).
| Dimension | # Labels | Most Frequent (%) | Rarest (%) |
|---|---|---|---|
| Tongue color | 5 | Light-red (52.8) | Dark-red (2.15) |
| Tongue shape | 7 | Teeth marks (53.7) | Tender (4.6) |
| Coating property | 7 | Thin (67.6) | Peeled (3.5) |
| Coating color | 3 | White (78.4) | Gray-black (3.4) |
This taxonomy, codified via ISO 23961-1:2021, supports multi-label classification and aligns with international TCM diagnostic standards.
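Because labels co-occur both within and across dimensions, the natural target representation is a 22-slot multi-hot vector. A minimal sketch of such an encoding follows; the dictionary keys and function names are illustrative, not the dataset's actual schema. Note that a plain flat name index would be ambiguous, since "thin" appears under both tongue shape and coating property, so labels are keyed by (dimension, name) pairs.

```python
# Hypothetical multi-hot encoder for the 22-label taxonomy above.
# Dimension and label names follow the ISO 23961-1:2021 taxonomy listed
# in the text; the schema itself is an illustrative assumption.

TAXONOMY = {
    "tongue_color": ["pale", "light-red", "red", "dark-red", "blue-purple"],
    "tongue_shape": ["tender", "tough", "thin", "enlarged",
                     "spots/thorns", "cracks", "teeth marks"],
    "coating_property": ["none", "peeled", "thin", "thick",
                         "moist", "dry", "rotten/greasy"],
    "coating_color": ["white", "yellow", "gray-black"],
}

# Key by (dimension, label): "thin" occurs in two dimensions, so bare
# label names would collide in a flat index.
LABEL_INDEX = {(dim, name): i for i, (dim, name) in enumerate(
    (dim, n) for dim, labels in TAXONOMY.items() for n in labels)}

def encode(labels):
    """Map a set of (dimension, label) pairs to a 22-dim multi-hot vector."""
    vec = [0] * len(LABEL_INDEX)
    for key in labels:
        vec[LABEL_INDEX[key]] = 1
    return vec

v = encode({("tongue_color", "light-red"), ("tongue_shape", "teeth marks"),
            ("coating_property", "thin"), ("coating_color", "white")})
```

This fixed ordering lets per-label prevalence and co-occurrence statistics be computed directly by summing vectors over the dataset.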
2. Annotation Protocol and Label Fidelity
Annotation involved ten classically trained TCM tongue-diagnosis practitioners. Each image received independent labels from all annotators according to ISO 23961-1:2021 terminologies (official English and Chinese definitions). A cross-review phase involved blinded label exchange for peer validation; discrepancies initiated a dual-expert audit, with final resolution by a senior TCM expert. While the pipeline emphasizes consensus and protocol rigor, explicit inter-annotator agreement statistics (e.g., Cohen’s κ) are not reported. The stated aim is maximized label fidelity via multi-stage human consensus (Kong et al., 13 Nov 2025).
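Since inter-annotator agreement statistics are not reported, researchers re-annotating subsets could quantify agreement themselves. A small sketch of Fleiss' kappa, which handles a fixed panel of raters (here, ten) per item, is shown below; the function name and call pattern are illustrative, not part of the dataset's tooling.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] = number of raters who assigned item i to category j;
    every row must sum to the same rater count (10 for TongueAtlas-4K's
    annotation panel). Returns 1.0 for perfect agreement, ~0 for
    chance-level agreement, negative for systematic disagreement.
    """
    N = len(counts)                       # number of items
    n = sum(counts[0])                    # raters per item
    k = len(counts[0])                    # number of categories
    # Marginal category proportions across all ratings.
    p_j = [sum(counts[i][j] for i in range(N)) / (N * n) for j in range(k)]
    # Per-item observed agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                  # mean observed agreement
    P_e = sum(p * p for p in p_j)         # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

For example, three items rated unanimously by ten raters yield kappa = 1, while an even 5/5 split on every item yields a negative kappa.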
3. Image Acquisition, Processing, and Augmentation
The underlying clinical images were sourced from two independent centers. Device type, image resolution, and camera models are not reported. Image preprocessing included haze removal (to mitigate specular reflection and moisture fog) and reflectance normalization to reduce lighting variability.
Region-of-interest extraction proceeded via a DeepLabV3+ semantic segmentation model, with manual mask refinement in ITK-SNAP to optimize anatomical accuracy. All released images are tongue-segmented, background-free photographs.
For downstream model training, data augmentation (including RandAugment and random erasing) was used only during classifier fine-tuning to upsample minority (rare-label) classes, and is not part of the canonical dataset (Kong et al., 13 Nov 2025).
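To make the fine-tuning-only role of augmentation concrete, here is a dependency-free sketch of random erasing (the actual pipeline presumably uses library implementations such as torchvision's; this stand-in operates on a plain H×W nested list). The function name and parameters are illustrative.

```python
import random

def random_erase(img, scale=(0.02, 0.2), value=0, rng=None):
    """Blank out one random rectangle covering scale[0]..scale[1] of the
    image area — a minimal stand-in for the random-erasing augmentation
    applied only during classifier fine-tuning, never to the released
    dataset itself. `img` is an H x W nested list; returns a new list."""
    rng = rng or random.Random()
    h, w = len(img), len(img[0])
    area = h * w * rng.uniform(*scale)
    eh = max(1, min(h, int(area ** 0.5)))     # erased-region height
    ew = max(1, min(w, int(area / eh)))       # erased-region width
    top = rng.randrange(h - eh + 1)
    left = rng.randrange(w - ew + 1)
    out = [row[:] for row in img]             # copy; input left untouched
    for r in range(top, top + eh):
        for c in range(left, left + ew):
            out[r][c] = value
    return out
```

In practice such transforms would be sampled more aggressively for images carrying rare labels, implementing the minority-class upsampling described above.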
4. Dataset Properties, Imbalance, and Unlabeled Pretraining Pool
The label distribution spans several orders of magnitude, with pronounced skew toward common presentations (e.g., thin coating, light-red color) and strong underrepresentation for rarer pathologies. This imbalance must be explicitly addressed in modeling, for example, via asymmetric loss or boosting ensembles.
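The asymmetric loss mentioned above (in the style of Ridnik et al.'s ASL for multi-label classification) down-weights easy negatives so that the overwhelming mass of absent rare labels does not dominate training. A minimal per-sample sketch on probabilities, with illustrative default hyperparameters, follows:

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=1.0, gamma_neg=4.0, clip=0.05):
    """Asymmetric loss (ASL-style) for one multi-label sample.

    probs:   per-label predicted probabilities in [0, 1]
    targets: per-label ground truth (0/1)
    Negatives are focused harder than positives (gamma_neg > gamma_pos),
    and negatives with probability below the margin `clip` are shifted to
    zero contribution — countering the heavy skew toward absent rare labels.
    Hyperparameter values here are illustrative defaults, not the paper's.
    """
    loss = 0.0
    for p, t in zip(probs, targets):
        if t:   # positive label: mild focusing on hard positives
            loss -= (1 - p) ** gamma_pos * math.log(max(p, 1e-8))
        else:   # negative label: probability shifting + strong focusing
            pm = max(p - clip, 0.0)
            loss -= pm ** gamma_neg * math.log(max(1 - pm, 1e-8))
    return loss
```

Confident correct predictions incur near-zero loss, while a missed positive on a rare label is penalized almost at full cross-entropy weight.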
No demographic or clinical metadata—subject age, sex, comorbidity—accompanies the released dataset. Image acquisition metadata (e.g., device make/model, camera settings) is similarly absent, complicating certain analyses of confounders or device effects.
An auxiliary set of 15,905 unlabeled tongue images is published for self-supervised pretraining (e.g., with masked autoencoders), supporting transfer learning and improved representation fidelity in settings with annotation scarcity (Kong et al., 13 Nov 2025).
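The masking step at the core of masked-autoencoder pretraining can be sketched in a few lines; this shows only the patch-index sampling (the encoder/decoder are omitted), with an assumed 75% mask ratio and 14×14 patch grid typical of MAE-style recipes, since the paper specifies the approach only at the level of "masked autoencoders".

```python
import random

def sample_mask(num_patches, mask_ratio=0.75, rng=None):
    """Choose which patch indices to hide for masked-autoencoder-style
    pretraining on the unlabeled image pool. Returns (masked, visible)
    index lists; only visible patches would be fed to the encoder, and
    the decoder would be trained to reconstruct the masked ones.
    mask_ratio=0.75 is an assumed, MAE-typical value."""
    rng = rng or random.Random()
    n_mask = int(num_patches * mask_ratio)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    return sorted(idx[:n_mask]), sorted(idx[n_mask:])

# e.g., a 224x224 image split into 16x16 patches -> 14*14 = 196 patches
masked, visible = sample_mask(196, rng=random.Random(0))
```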
5. Benchmarks, Metrics, and Machine Learning Baselines
TongueAtlas-4K establishes a benchmark task suite for multi-label tongue diagnosis. Baseline performance was assessed using a 10% held-out test set (averaged across five runs) for a selection of modern computer vision models, using standard metrics for multi-label classification: Precision, Recall (Sensitivity), F1-score (including Macro-, Micro-, and Example-F1), and Macro PR-AUC.
| Method | Macro-F1 | Macro Precision | Macro Recall | Macro PR-AUC |
|---|---|---|---|---|
| LGAN | 0.397 | 0.505 | 0.369 | 0.492 |
| YOLO12-CLS | 0.290 | 0.379 | 0.275 | 0.401 |
| Faster R-CNN | 0.381 | 0.485 | 0.339 | 0.493 |
| IFRCNet | 0.246 | 0.315 | 0.246 | 0.492 |
| DenseNet121 | 0.403 | 0.487 | 0.364 | 0.351 |
| C-GMVAE | 0.346 | 0.459 | 0.305 | 0.526 |
Macro-F1 scores remain modest (≤0.403), reflecting class imbalance and the inherent subtlety of the task. LGAN and DenseNet121 perform best by Macro-F1; C-GMVAE offers the highest Macro PR-AUC. Class-frequency skew and inter-label co-occurrence pose persistent challenges (Kong et al., 13 Nov 2025).
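Macro-F1, the headline metric in the table above, averages per-label F1 with equal weight, so a rare label like "gray-black coating" counts as much as "white coating" — which is exactly why scores stay modest under heavy imbalance. A dependency-free sketch (scikit-learn's `f1_score` with `average="macro"` computes the same quantity):

```python
def macro_f1(y_true, y_pred):
    """Macro-F1 over L labels from binary multi-hot predictions.

    y_true, y_pred: N x L nested lists of 0/1. Per-label F1 is computed
    from that label's TP/FP/FN and then averaged with equal weight per
    label, so rare labels influence the score as much as common ones.
    Labels with no positives and no predictions score 0 here (one common
    convention; implementations differ on this edge case)."""
    n_labels = len(y_true[0])
    f1s = []
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_labels
```

Micro-F1, by contrast, pools TP/FP/FN across all labels before computing F1, which is why it can look far healthier than Macro-F1 on skewed label distributions.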
6. Recommendations, Caveats, and Intended Applications
Researchers must address severe label imbalance and lack of subject-level metadata. Specialized techniques such as asymmetric loss, targeted augmentation, and boosting are recommended for rare labels. The lack of standardized imaging conditions and demographic annotation constrains certain analyses; future work could extend the dataset protocol to cover these confounders.
Use cases include multi-label classification research, benchmarking new computer vision and machine learning frameworks in TCM, and self-supervised pretraining on unlabeled tongue data. While the dataset’s ontology is tongue-specific, its architecture and imbalance characteristics suit it as a tractable surrogate for broader multi-label, small-cohort medical image tasks (Kong et al., 13 Nov 2025).
7. High-Resolution (4K) Tongue Motion Atlas in MRI: Conceptual Extension
In parallel with the photographic dataset, a high-resolution statistical multimodal atlas of 4D tongue motion for speech (e.g., “Speech Map” approach) provides the methodological precedent for constructing an anatomical and functional “TongueAtlas-4K” using cine- and tagged-MRI (Woo et al., 2017).
This concept involves the following stages:
- Anatomical Reference Construction: Group-wise diffeomorphic registration (ANTs/SyN) of cine-MRI stacks generates an unbiased template with mappings between subject and atlas spaces, scalable to ≤0.6 mm isotropic “4K” resolution.
- Motion Estimation: The Phase Vector Incompressible Registration Algorithm (PVIRA) processes CSPAMM tagged MRI to extract subject-space motion fields φ(x, t), enforcing incompressibility within tissue masks.
- Atlas-Space Transformation: Individual motion fields are conjugated into template space (φ_atlas = ψ ∘ φ_subject ∘ ψ⁻¹, where ψ is the subject-to-atlas diffeomorphism).
- Quantitative Motion/Strain Analysis: Displacement u, deformation gradient F = I + ∂u/∂X, and Lagrangian strain E = ½(FᵀF − I) allow computation of principal strains and kinematic quantities, with scalar summaries such as mean displacement and maximum shear mapped across populations.
- Low-Rank Variability Modeling: Principal Component Analysis (PCA) on high-dimensional deformation fields (vectors in ℝ^(3N), where N is the number of mask voxels) yields the dominant axes of normal and pathological variability for speech motor tasks.
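The strain-analysis step above can be sketched numerically: given a displacement field on a regular voxel grid, central differences give the deformation gradient F = I + ∂u/∂X, from which the Green-Lagrange strain follows. This is a numpy illustration under the stated definitions, not the PVIRA implementation.

```python
import numpy as np

def lagrangian_strain(u, spacing=1.0):
    """Deformation gradient and Green-Lagrange strain on a regular grid.

    u: displacement field of shape (3, X, Y, Z), u[i] = i-th component.
    Builds F = I + dU/dX by central differences, then
    E = 0.5 * (F^T F - I) at every voxel. A sketch of the strain-analysis
    step described above, not the atlas pipeline's actual code."""
    # grads[i, j] = dU_i / dX_j, each of shape (X, Y, Z)
    grads = np.stack([np.stack(np.gradient(u[i], spacing), axis=0)
                      for i in range(3)], axis=0)
    eye = np.eye(3)[:, :, None, None, None]
    F = eye + grads
    # (F^T F)_ij = sum_k F_ki F_kj, voxel-wise
    E = 0.5 * (np.einsum('ki...,kj...->ij...', F, F) - eye)
    return F, E

# Zero displacement: F is the identity, strain vanishes everywhere.
u0 = np.zeros((3, 4, 4, 4))
F0, E0 = lagrangian_strain(u0)
```

Principal strains then come from the voxel-wise eigendecomposition of E, and flattening the per-voxel fields into ℝ^(3N) vectors sets up the PCA step directly.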
Application domains range from speech science to surgical simulation, neurodegenerative diagnostics, and as initialization data for biomechanical finite-element modeling. The protocol operationalizes full 4D (3D+time) motion fields in a fully Lagrangian, group-comparable frame (Woo et al., 2017).
A plausible implication is that the standards and protocols developed for photographic diagnostic datasets and 4D MRI-based motion atlases are converging, enabling cross-modal analysis and integrative research in tongue function, pathology, and computer-assisted diagnosis.