Medical Image Classification Models
- Medical image classification models are deep learning frameworks that assign diagnostic labels to images such as X-rays, CT scans, and histopathology slides.
- They leverage advanced architectures including CNNs, Vision Transformers, and state-space models to capture both local textures and global context.
- Innovative techniques like self-supervised contrastive learning, prompt-driven domain generalization, and federated learning enhance their robustness and real-world applicability.
Medical image classification models are computational frameworks, typically based on deep learning, which assign diagnostic categories to medical images such as radiographs, histopathology slides, fundus images, or computed tomography scans. Over the last decade, these models have advanced from specialized convolutional neural networks (CNNs) to transformer architectures, self-supervised pretraining frameworks, foundation models, and hybrid approaches that unify local and global context. The field is shaped by challenges intrinsic to medical imaging—scarcity of labeled data, high inter- and intra-class variability, domain shift from acquisition artifacts, and the critical need for robustness and explainability. The following sections summarize methodological trends, technical mechanisms, and key empirical findings from recent literature.
1. Deep Learning Architectures for Medical Image Classification
The evolution of deep medical image classifiers spans several architectural paradigms:
Convolutional Neural Networks (CNNs):
CNNs have been the primary architecture due to their inductive bias for translation invariance and locality. Early designs such as VGGNet and AlexNet laid the groundwork, while ResNet introduced skip connections to enable deeper models and mitigate vanishing-gradient problems. DenseNet uses dense connectivity to further promote feature reuse and parameter efficiency (Ali, 2023). Advanced variants, such as ResNet+ (Chaddad et al., 2 Jul 2025), incorporate modified downsampling (ResNet-D) and Convolutional Block Attention Modules (CBAM) into bottleneck layers to preserve information and enhance discriminative focus.
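For concreteness, the following is a minimal PyTorch sketch of a CBAM-style module (channel attention followed by spatial attention) of the kind inserted into bottleneck layers; the class names and hyperparameters are illustrative assumptions, not the exact ResNet+ implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))                  # pooled descriptors through a shared MLP
        mx = self.mlp(x.amax(dim=(2, 3)))
        scale = torch.sigmoid(avg + mx)[:, :, None, None]   # per-channel gate
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                   # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale                                    # per-location gate

class CBAM(nn.Module):
    """Channel-then-spatial attention applied to a bottleneck's output features."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```

In a ResNet-style bottleneck, such a module would typically refine the residual branch's features before they are added back to the skip connection.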
Vision Transformers (ViTs):
Transformers, originally developed for NLP, now underpin many medical imaging models due to their capacity to model long-range relationships using self-attention. ViT-based backbones, such as those benchmarked in (Wu et al., 24 Jan 2025, Mansoori et al., 26 May 2025), process images as sequences of non-overlapping patches, enabling global context modeling but also requiring large-scale pretraining to reach optimal performance. Studies consistently find that ViT-based models, especially under end-to-end fine-tuning on medical datasets, outperform CNN counterparts given sufficient data and compute.
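As an illustration of the patch-sequence view, here is a minimal PyTorch sketch of the patch-embedding step that turns an image into a token sequence for a ViT backbone (standard ViT behavior, with the usual 16x16 patches assumed):

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.num_patches = (img_size // patch_size) ** 2    # 196 tokens for a 224x224 input

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        x = self.proj(x)                                    # (B, D, 14, 14)
        return x.flatten(2).transpose(1, 2)                 # (B, 196, D) token sequence
```

Self-attention over this token sequence gives ViTs a global receptive field from the first layer, at quadratic cost in the number of tokens.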
State Space Models (SSMs):
MedMamba (Yue et al., 6 Mar 2024) introduces the Vision Mamba architecture, fusing grouped convolutional operations with state space modeling via the novel SS-Conv-SSM block. This dual-branch design efficiently balances extraction of local texture (via CNN branch) with long-range context (via SSM branch), offering linear computational complexity for dependency modeling—a marked advance over ViT’s quadratic scaling.
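The dual-branch idea can be sketched structurally as follows; the SSM branch is abstracted here as any sequence module of linear complexity (a stand-in, not MedMamba's actual SS-Conv-SSM implementation), while the channel split, grouped convolution, and channel shuffle follow the description above.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class DualBranchBlock(nn.Module):
    """Channel split: a grouped-conv branch for local texture and a sequence-model
    branch (stand-in for an SSM) for long-range context; outputs are re-mixed."""
    def __init__(self, channels, seq_module):
        super().__init__()
        half = channels // 2                       # assumes channels divisible by 8
        self.conv_branch = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half // 4),
            nn.BatchNorm2d(half),
            nn.GELU(),
        )
        self.seq_module = seq_module               # any (B, L, C) -> (B, L, C) module

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                   # split channels into two branches
        a = self.conv_branch(a)                    # local-texture branch
        bsz, c, h, w = b.shape
        tokens = self.seq_module(b.flatten(2).transpose(1, 2))   # global-context branch
        b = tokens.transpose(1, 2).reshape(bsz, c, h, w)
        return channel_shuffle(torch.cat([a, b], dim=1))          # fuse and shuffle channels
```

With `seq_module` set to a Mamba-style SSM block from an external library, the global branch scales linearly in the number of spatial tokens.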
Kolmogorov-Arnold Networks (KANs):
MedKAN leverages KANs to replace fixed convolutions with learnable nonlinear transformations, captured by Local Information KAN (LIK) and Global Information KAN (GIK) modules (Yang et al., 25 Feb 2025). The architecture achieves superior representation of both fine morphological details and global tissue context, surpassing conventional CNNs and ViTs on several benchmark datasets.
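As a rough illustration of the KAN idea (learnable univariate functions on every input-output edge in place of fixed weights plus activations), here is a minimal sketch using a Gaussian basis; this is a generic KAN-style layer, not the LIK/GIK modules of MedKAN.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Each input-output edge applies a learnable univariate function,
    parameterized as coefficients over a fixed Gaussian basis."""
    def __init__(self, in_features, out_features, num_basis=8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, num_basis))
        self.coef = nn.Parameter(0.1 * torch.randn(out_features, in_features, num_basis))

    def forward(self, x):                                        # x: (B, in_features)
        phi = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)  # basis values, (B, in, K)
        return torch.einsum("bik,oik->bo", phi, self.coef)       # sum of learned edge functions
```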
2. Self-Supervised, Prompt-Driven, and Contrastive Learning
Medical imaging is typified by limited high-quality annotations, promoting reliance on unsupervised or semi-supervised representation learning:
Self-Supervised Contrastive Learning:
A potent paradigm involves contrastive pretraining, where models are trained to bring positive pairs (different augmentations of the same instance) closer in embedding space while pushing negatives apart. SimCLR-based NT-Xent loss is central:
$$\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\, \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$
where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity and $\tau$ is a temperature parameter (Azizi et al., 2021).
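A compact PyTorch sketch of this loss, assuming a batch of 2N embeddings arranged so that rows i and i+N are the two augmented views of the same instance:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.1):
    """NT-Xent over 2N embeddings; z[i] and z[i + N] are a positive pair."""
    z = F.normalize(z, dim=1)                              # cosine similarity via dot product
    n = z.shape[0] // 2
    sim = z @ z.t() / temperature
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-similarity (k != i)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # -log softmax at the positive index
```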
Multi-Instance Contrastive Learning (MICLe):
MICLe advances over SimCLR by explicitly constructing positive pairs from different images of the same patient or pathology, enforcing invariance to acquisition settings and viewpoint. This yields substantial gains in top-1 accuracy for dermatology classification and in mean AUC for chest X-ray tasks (Azizi et al., 2021).
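A minimal sketch of how MICLe-style positives could be sampled, grouping images by a patient (or lesion) identifier so that two distinct images of the same patient form a positive pair; the record format is an illustrative assumption.

```python
import random
from collections import defaultdict

def sample_micle_pairs(records, num_pairs, seed=0):
    """records: iterable of (image_path, patient_id).
    Returns positive pairs built from different images of the same patient."""
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for path, pid in records:
        by_patient[pid].append(path)
    eligible = [paths for paths in by_patient.values() if len(paths) >= 2]
    return [tuple(rng.sample(rng.choice(eligible), 2)) for _ in range(num_pairs)]
```

Each returned pair is then treated exactly like an augmentation pair in the contrastive objective above.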
Prompt-Driven Latent Domain Generalization (PLDG):
PLDG (Yan et al., 5 Jan 2024) automatically discovers pseudo-domain labels via unsupervised clustering of shallow ViT features, then uses learnable domain-specific prompts to guide network attention. Cross-domain knowledge is further promoted via a prompt generator (low-rank factorization) and a domain mixup strategy, leading to strong out-of-distribution (OOD) robustness without relying on explicit domain annotations.
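A hedged sketch of the pseudo-domain discovery step: cluster style-like statistics of shallow ViT features with k-means to obtain domain labels without annotation (the exact feature statistics and clustering configuration in PLDG may differ).

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_domain_labels(shallow_tokens, num_domains=4):
    """shallow_tokens: (N, L, C) patch tokens from an early ViT block.
    Uses per-image channel-wise mean/std as a style descriptor, then k-means."""
    mean = shallow_tokens.mean(axis=1)                 # (N, C)
    std = shallow_tokens.std(axis=1)                   # (N, C)
    style = np.concatenate([mean, std], axis=1)        # (N, 2C) style statistics
    return KMeans(n_clusters=num_domains, n_init=10).fit_predict(style)  # pseudo-domain ids
```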
Contrastive Knowledge Distillation:
The CRCKD approach (Xing et al., 2021) addresses high intra-class variance and class imbalance by coupling class-guided contrastive distillation and a Categorical Relation Preserving (CRP) loss. CRP uses class centroids to anchor relational knowledge, offering resilience to label distributions skewed by data scarcity.
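One plausible reading of centroid-anchored relational distillation is sketched below: teacher and student each describe a sample by its softened similarity to the teacher's class centroids, and the student matches the teacher's relation distribution. This is an illustration of the general idea, not the exact CRCKD formulation.

```python
import torch
import torch.nn.functional as F

def centroid_relation_loss(student_feats, teacher_feats, class_centroids, tau=1.0):
    """KL between softened sample-to-centroid similarity distributions of student
    and teacher; assumes both feature spaces share the centroid dimensionality."""
    centroids = F.normalize(class_centroids, dim=1)
    s_rel = F.normalize(student_feats, dim=1) @ centroids.t()   # (B, num_classes)
    t_rel = F.normalize(teacher_feats, dim=1) @ centroids.t()
    return F.kl_div(F.log_softmax(s_rel / tau, dim=1),
                    F.softmax(t_rel.detach() / tau, dim=1),
                    reduction="batchmean")
```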
3. Foundation Models and Multimodal Strategies
Large-scale pre-trained networks (CNNs and ViTs) originally trained on natural images—termed "foundation models"—are increasingly repurposed for medical imaging. Benchmark studies (Wu et al., 24 Jan 2025, Mansoori et al., 26 May 2025) show that fine-tuned ViTs, including variants such as DINOv2 and AIMv2, routinely outperform smaller CNNs and domain-pretrained models, even with limited labeled medical data. Fine-tuning all model layers (vs. linear probing) delivers best performance, but linear classification on frozen embeddings can still produce competitive AUCs, especially when using CLIP or similar architectures (Khoiwal et al., 12 Dec 2024).
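A minimal sketch of the linear-probing baseline: extract frozen embeddings from a pretrained backbone and fit a linear classifier on top. Loading the backbone through the timm library, and the specific DINOv2 model name, are assumptions for illustration.

```python
import timm
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_embeddings(model, loader, device=None):
    """Run a frozen backbone over a DataLoader of (image, label) batches."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model.eval().to(device)
    feats, labels = [], []
    for images, y in loader:
        feats.append(model(images.to(device)).cpu())   # (B, D) pooled embeddings
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def linear_probe(train_loader, test_loader, model_name="vit_base_patch14_dinov2"):
    """Freeze the backbone, fit a linear classifier on its embeddings, report accuracy."""
    backbone = timm.create_model(model_name, pretrained=True, num_classes=0)  # features only
    X_tr, y_tr = extract_embeddings(backbone, train_loader)
    X_te, y_te = extract_embeddings(backbone, test_loader)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
```

End-to-end fine-tuning replaces the frozen extraction with gradient updates through all backbone layers, which is what the benchmark studies report as the stronger but costlier option.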
Vision-Language Models (VLMs):
Multimodal models like CLIP enable image-text alignment and have been applied to few-shot medical classification using GPT-4 generated shape/texture descriptors (Byra et al., 2023). The VLM’s class decision relies on semantic similarity between image and text embeddings:
$$\hat{y}(x) = \arg\max_{c} \; \frac{1}{|D_c|} \sum_{d \in D_c} \phi(x, d)$$
where $D_c$ is the set of descriptors for class $c$ and $\phi(\cdot,\cdot)$ is the VLM similarity function. Descriptor pruning based on few-shot data ($k$-shot selection) markedly improves classification performance and AUC stability.
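A sketch of descriptor-based scoring with a CLIP-style model, using the open_clip library as an assumption; the class names and descriptors below are illustrative examples rather than the GPT-4 generated descriptors of the cited work.

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

descriptors = {                                   # illustrative shape/texture descriptors
    "melanoma": ["a skin lesion with irregular borders", "a lesion with uneven dark pigmentation"],
    "benign nevus": ["a round skin lesion with smooth borders", "a lesion with uniform color"],
}

@torch.no_grad()
def classify(image_path):
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    img_emb = F.normalize(model.encode_image(image), dim=-1)
    scores = {}
    for cls, descs in descriptors.items():
        txt_emb = F.normalize(model.encode_text(tokenizer(descs)), dim=-1)
        scores[cls] = (img_emb @ txt_emb.t()).mean().item()   # average similarity over D_c
    return max(scores, key=scores.get)
```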
Segmentation Foundation Models:
Segmentation models like SAM can be adapted for classification by using their encoders as frozen feature extractors. The Spatially Localized Channel Attention (SLCA) mechanism aggregates segmentation maps with spatial embeddings, emphasizing diagnostic regions and enhancing classification outcomes, even with minimal labeled data (Gu et al., 9 May 2025).
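A structural sketch of classifying from a frozen segmentation encoder: pool the encoder's feature map with spatial weights derived from a (soft) segmentation output, then apply a lightweight head. This is a generic mask-guided pooling illustration, not the exact SLCA mechanism, and it assumes the encoder returns a (B, C, h, w) feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedClassifier(nn.Module):
    """Frozen encoder features are pooled with weights from a segmentation map,
    so diagnostically relevant regions dominate the classification decision."""
    def __init__(self, frozen_encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = frozen_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, images, seg_logits):                    # seg_logits: (B, 1, H, W)
        with torch.no_grad():
            feats = self.encoder(images)                      # (B, C, h, w), frozen
        weights = torch.sigmoid(seg_logits)
        weights = F.interpolate(weights, size=feats.shape[-2:], mode="bilinear")
        weights = weights / (weights.sum(dim=(2, 3), keepdim=True) + 1e-6)
        pooled = (feats * weights).sum(dim=(2, 3))            # mask-weighted pooling, (B, C)
        return self.head(pooled)
```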
4. Robustness to Domain Shift and Generalization
Types of Domain Shift:
Domain shift impairs model generalization, typically through covariate shift (changes in the input distribution $P(x)$ from factors such as scanner or protocol) or concept shift (changes in the label relationship $P(y \mid x)$ from annotation noise or guideline updates) (Matta et al., 18 Mar 2024).
Data and Representation-Level Approaches:
To address these, methods utilize data homogenization (e.g., stain normalization, frequency alignment), advanced data augmentation, adversarial representation learning, and meta-learning. For example, meta-learning with algorithms such as MAML partitions data into meta-train/val splits to simulate domain adaptation (Matta et al., 18 Mar 2024).
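A hedged sketch of the episode construction behind such meta-learning: at every step, one or more source domains are held out as meta-validation to simulate the deployment shift, after which the usual MAML inner/outer updates are applied (omitted here).

```python
import random

def make_meta_episode(domain_datasets, num_meta_val=1, rng=random):
    """domain_datasets: dict mapping domain name -> dataset.
    Held-out domains act as meta-val to simulate unseen acquisition conditions."""
    domains = list(domain_datasets)
    rng.shuffle(domains)
    meta_val = [domain_datasets[d] for d in domains[:num_meta_val]]
    meta_train = [domain_datasets[d] for d in domains[num_meta_val:]]
    return meta_train, meta_val   # inner loop adapts on meta_train, outer loss uses meta_val
```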
Self-Supervised and Contrastive Pretraining:
Strong gains in OOD robustness arise from self-supervised frameworks that decouple feature learning from label noise and distributional quirks. MICLe and PLDG (see above) are leading strategies in this context. Empirically, these techniques enable models to achieve high AUCs across sites and protocols, and remain label-efficient under data scarcity (Azizi et al., 2021, Yan et al., 5 Jan 2024).
Test-Time Adaptation and Explainability:
Concept Bottleneck Models (CBMs), which predict interpretable latent concepts as an intermediate step, enable explainable medical image classification. Training-free, test-time adaptation strategies can enhance CBMs' OOD accuracy by identifying and masking confusing concepts and amplifying under-activated ones, using only a few (fewer than eight) labeled images per class (He et al., 22 Jun 2025). This preserves both classification fidelity and interpretability, crucial for clinical deployment.
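A rough sketch of this mechanism under stated assumptions: from a handful of labeled images, estimate how well each concept separates the classes, zero out concepts that barely do (the "confusing" ones), and rescale informative but under-activated concepts before the concept-to-label layer. The thresholds and heuristics below are illustrative, not the cited method's exact procedure.

```python
import torch

def adapt_concept_weights(concept_scores, labels, num_classes, low_act_thresh=0.1):
    """concept_scores: (N, K) concept activations for a few labeled images; labels: (N,).
    Assumes at least one labeled image per class. Returns a per-concept weight vector."""
    class_means = torch.stack([concept_scores[labels == c].mean(dim=0)
                               for c in range(num_classes)])                    # (C, K)
    separation = class_means.max(dim=0).values - class_means.min(dim=0).values  # (K,)
    weights = torch.ones(concept_scores.shape[1])
    weights[separation < separation.median()] = 0.0          # mask confusing concepts
    under_activated = concept_scores.mean(dim=0) < low_act_thresh
    weights[under_activated & (weights > 0)] *= 2.0          # amplify under-activated concepts
    return weights   # applied elementwise to concept scores at test time
```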
5. Computational Efficiency, Federated Learning, and Practical Deployment
Efficient Architectures:
Lightweight ConvNeXt-Tiny architectures, enhanced with dual global pooling (GAP and GMP), lightweight attention modules (e.g., SEVector), and feature smoothing losses, demonstrate high accuracy under CPU-only conditions and converge rapidly in resource-constrained settings (Xia et al., 15 Aug 2025). Grouped convolutions, channel shuffling, and modules with linear computational complexity (e.g., Mamba-based SSMs) further support deployment on limited hardware (Yue et al., 6 Mar 2024).
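A sketch of the dual-pooling head described above, combining global average and max pooling with a small squeeze-and-excitation-style channel gate; the `SEVector` naming and exact design of the cited work may differ.

```python
import torch
import torch.nn as nn

class DualPoolSEHead(nn.Module):
    """Classification head: GAP + GMP features, fused and reweighted by a small
    squeeze-and-excitation-style gate before the final linear layer."""
    def __init__(self, channels, num_classes, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, (2 * channels) // reduction),
            nn.ReLU(inplace=True),
            nn.Linear((2 * channels) // reduction, 2 * channels),
            nn.Sigmoid(),
        )
        self.fc = nn.Linear(2 * channels, num_classes)

    def forward(self, feats):                      # feats: (B, C, H, W) backbone features
        gap = feats.mean(dim=(2, 3))               # global average pooling
        gmp = feats.amax(dim=(2, 3))               # global max pooling
        fused = torch.cat([gap, gmp], dim=1)       # (B, 2C)
        return self.fc(fused * self.gate(fused))   # channel-gated features -> logits
```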
Federated Learning:
Federated strategies such as FedAvg and FedProx aggregate model updates from distributed clients, preserving patient privacy while benefiting from institution-spanning data diversity (Wu et al., 29 May 2025). Combining deep feature extractors with traditional ML classifiers (e.g., SVMs) further improves generalization to unseen domains.
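A minimal sketch of FedAvg-style aggregation, averaging client weights in proportion to local dataset size; server/client orchestration, privacy mechanisms, and FedProx's proximal term are omitted.

```python
import copy

def fedavg_aggregate(client_state_dicts, client_sizes):
    """Weighted average of client model parameters, proportional to local data size."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_state_dicts[0])
    for key in global_state:
        global_state[key] = sum(
            (n / total) * sd[key].float() for sd, n in zip(client_state_dicts, client_sizes)
        )
    return global_state   # load into the global model via model.load_state_dict(...)
```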
Knowledge Distillation and Weight Selection:
To bridge the gap between large foundation models and deployable small architectures, dual-model weight selection and self-knowledge distillation (SKD) achieve performance and robustness approaching those of larger models without incurring their resource costs. Dual-model weight selection initializes the student from distinct, complementary subsets of teacher model weights, and SKD regularizes learning by transferring "soft targets" from an EMA-updated auxiliary model to the main lightweight learner. This enables adaptation to low-data and low-resource scenarios without significant computational overhead (Tsutsumi et al., 28 Aug 2025).
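A sketch of the self-knowledge-distillation ingredient: an EMA-updated auxiliary copy of the student supplies softened targets that regularize training (the dual-model weight-selection step is not shown, and the hyperparameters are illustrative).

```python
import copy
import torch
import torch.nn.functional as F

class EMATeacher:
    """Exponential-moving-average copy of the student used as a soft-target teacher."""
    def __init__(self, student, decay=0.999):
        self.model = copy.deepcopy(student).eval()
        self.decay = decay

    @torch.no_grad()
    def update(self, student):
        for ema_p, p in zip(self.model.parameters(), student.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1 - self.decay)

def skd_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    """Cross-entropy on hard labels plus KL to the EMA teacher's softened outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                  F.softmax(teacher_logits.detach() / tau, dim=1),
                  reduction="batchmean") * tau * tau
    return alpha * ce + (1 - alpha) * kd
```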
6. Evaluation Metrics, Explainability, and Future Outlook
Metrics:
Standard metrics include accuracy, precision, recall, F1-score, and AUC; for ordinal tasks, quadratically weighted Cohen's kappa is also used (Mansoori et al., 26 May 2025). Balanced multiclass accuracy (BMA) and AUC are particularly important for imbalanced medical datasets.
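These metrics map directly onto scikit-learn, as sketched below for a multiclass problem (one-vs-rest AUC; `weights="quadratic"` gives the ordinal kappa variant).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, cohen_kappa_score,
                             f1_score, roc_auc_score)

def report_metrics(y_true, y_prob):
    """y_true: (N,) integer labels; y_prob: (N, C) predicted class probabilities."""
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),            # BMA
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "auc_ovr": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
        "quadratic_kappa": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
    }
```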
Explainable AI (XAI):
Explainability is central for clinical acceptance. Techniques such as Class Activation Mapping (CAM), Grad-CAM, Grad-CAM++, and prototype-based transformers (e.g., ProtoPFormer, X-Pruner) localize salient regions or prototype features supporting a decision (Dao et al., 4 Jun 2025). Explainability also extends to zero-shot and few-shot VLM approaches, where text descriptors align predictions with human-interpretable knowledge.
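A compact sketch of Grad-CAM for a CNN classifier, using PyTorch hooks on a chosen convolutional layer; this is the generic algorithm rather than any particular cited implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Returns an (H, W) heatmap: ReLU of the gradient-weighted sum of the target
    layer's activation maps, upsampled to the input resolution."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(image)                               # image: (1, 3, H, W)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True) # channel importance weights
        cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear")
        return (cam / (cam.max() + 1e-8))[0, 0].detach()
    finally:
        h1.remove()
        h2.remove()
```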
Future Directions:
The integration of multimodal data (image, text, omics), hybrid models (e.g., CNN-transformer or SSM hybrids), advances in unsupervised and prompt-driven domain generalization, and the adoption of explainability standards will define future progress. Foundation models pre-trained on diverse, richly annotated clinical corpora and continual/federated learning strategies are essential for robust, adaptable diagnostic AI (Matta et al., 18 Mar 2024, Dao et al., 4 Jun 2025).
In summary, medical image classification models have evolved into complex, robust systems that leverage large-scale pretraining (self-supervised and multimodal), sophisticated architectural hybrids, and advanced generalization frameworks. Key research demonstrates that transfer learning from natural image models, contrastive and domain-aware representation learning, and explicit attention to explainability collectively enable models to achieve strong diagnostic accuracy, resilience to distribution shift, and practical applicability in clinical workflows. Future research is likely to focus on data-efficient self-supervision, unified foundation models, robust out-of-domain adaptation, and nuanced, explainable decision-making.