
AGGRNet: Enhanced Medical Image Classification

Updated 22 November 2025
  • AGGRNet is a neural network framework for medical image classification that separates informative and non-informative regions using adaptive dual attention.
  • The paper demonstrates that two-stage feature extraction and contrastive cross-attention yield up to a 5.5% accuracy improvement on datasets like Kvasir.
  • Its modular Feature Extraction and Aggregation (FEA) unit and YOLOv11-based backbone enable effective localization and fine-grained analysis across diverse medical imaging tasks.

AGGRNet is a neural network framework for enhanced medical image classification, specifically designed to address challenges in distinguishing subtle inter-class differences and managing intra-class variability in complex medical imaging tasks. Built upon a YOLOv11-based backbone, AGGRNet introduces a two-stage feature selection and aggregation strategy. The core innovation lies in explicitly separating “informative” and “non-informative” regions via adaptive dual attention, followed by contrastive cross-attention for context-aware feature integration. Empirical results demonstrate consistent improvements over state-of-the-art approaches, with gains of approximately 5.5% in accuracy on the Kvasir dataset for gastrointestinal tract image classification (Makwe et al., 15 Nov 2025).

1. Architectural Overview

AGGRNet consists of three principal components: a lightweight backbone network, a Feature Extraction and Aggregation (FEA) module repeatedly inserted at multiple depths, and a classification head. The backbone leverages YOLOv11’s C3K2 blocks for feature extraction, the SPPF (Spatial Pyramid Pooling-Fast) module for computationally efficient multiscale pooling, and a C2PCA block for improved channel attention. The FEA module is strategically positioned at three separate locations within the backbone, operating on intermediate feature maps to decompose them into diagnostically relevant (“informative”) and less relevant (“non-informative”) content, subsequently re-aggregating them with a global context via a contrastive attention mechanism.

2. Selective Feature Extraction and Aggregation

The FEA module is subdivided into:

  • Feature Extraction Module (FEM): Utilizes both spatial and channel attention to produce an attention score map $S$. A learnable threshold $\tau$ (initialized to 0.5) binarizes $S$ to generate masks $W_{\mathrm{info}}$ and $W_{\mathrm{ninfo}}$, segmenting feature maps into $X_{\mathrm{info}}$ (informative) and $X_{\mathrm{ninfo}}$ (non-informative) components. The mechanisms are defined as:

$$S = \sigma(\mathrm{SA}(X) + \mathrm{CA}(X))$$

$$W_{\mathrm{info}}[i,j,k] = \begin{cases} 1 & \text{if } S[i,j,k] \geq \tau \\ 0 & \text{otherwise} \end{cases}$$

$$W_{\mathrm{ninfo}} = 1 - W_{\mathrm{info}}$$

$$X_{\mathrm{info}} = W_{\mathrm{info}} \odot X; \quad X_{\mathrm{ninfo}} = W_{\mathrm{ninfo}} \odot X$$

  • Feature Aggregation Module (FAM): Forms query, key, and value tensors from the two components for contrastive cross-attention:

$$Q = X_{\mathrm{info}} + X_{\mathrm{ninfo}}$$

$$K = X_{\mathrm{info}} - X_{\mathrm{ninfo}}$$

$$V = X_{\mathrm{info}}$$

Attention is then calculated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

The aggregated features $X_{\mathrm{agg}}$ are fused via a residual summation:

$$X_{\mathrm{out}} = X + X_{\mathrm{agg}}$$

This selective extraction and subsequent attention-driven aggregation enable focused modeling of clinical regions, discriminating pathological versus normal tissue and reducing false activations.
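
A minimal PyTorch sketch of one plausible realization of the FEA module is given below; the specific spatial and channel attention layers, the flattening of spatial positions into attention tokens, and the handling of the learnable threshold are assumptions for illustration rather than the authors' reference implementation:

import torch
import torch.nn as nn

class FEM(nn.Module):
    """Feature Extraction Module: splits X into informative / non-informative parts."""
    def __init__(self, channels, tau_init=0.5):
        super().__init__()
        # Spatial attention from pooled channel statistics (assumed design).
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Channel attention, squeeze-and-excitation style (assumed design).
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
        )
        self.tau = nn.Parameter(torch.tensor(tau_init))  # learnable threshold

    def forward(self, x):
        sa = self.spatial(torch.cat([x.mean(1, keepdim=True),
                                     x.amax(1, keepdim=True)], dim=1))
        ca = self.channel(x)
        s = torch.sigmoid(sa + ca)               # S = sigma(SA(X) + CA(X))
        # Hard binarization as in the equations above; learning tau in practice
        # would need a differentiable relaxation (e.g. a straight-through trick).
        w_info = (s >= self.tau).float()
        return w_info * x, (1.0 - w_info) * x    # X_info, X_ninfo

class FAM(nn.Module):
    """Feature Aggregation Module: contrastive cross-attention over spatial tokens."""
    def forward(self, x_info, x_ninfo):
        b, c, h, w = x_info.shape
        q = (x_info + x_ninfo).flatten(2).transpose(1, 2)  # (B, HW, C)
        k = (x_info - x_ninfo).flatten(2).transpose(1, 2)
        v = x_info.flatten(2).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)  # X_agg

class FEA(nn.Module):
    """FEM + FAM with a residual connection: X_out = X + X_agg."""
    def __init__(self, channels):
        super().__init__()
        self.fem, self.fam = FEM(channels), FAM()

    def forward(self, x):
        x_info, x_ninfo = self.fem(x)
        return x + self.fam(x_info, x_ninfo)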

3. Detailed Workflow and Forward Pass

At inference and training time, AGGRNet proceeds through the following high-level stages:

  1. Initial convolutional stem and early backbone processing (C3K2, etc.).
  2. Intermediate stage(s) before each FEA module insertion, where the input feature map undergoes FEM-based partitioning and FAM-based aggregation, yielding $X_{\mathrm{out}} = X + X_{\mathrm{agg}}$.
  3. Successive backbone stages repeat the FEA procedure at multiple depths.
  4. Final backbone output is pooled with SPPF.
  5. Global average pooling, fully connected classification, and softmax for class probability estimation.

A simplified pseudocode representation for a single FEA insertion is as follows:

def AGGRNet_Forward(image):
    # Convolutional stem and first backbone stage (C3K2 blocks)
    X0 = Backbone_Stem(image)
    X1 = Backbone_Stage1(X0)

    # FEA insertion: split into informative / non-informative parts,
    # aggregate via contrastive cross-attention, add back residually
    X1_info, X1_ninfo = FEM(X1)
    X1_agg = FAM(X1_info, X1_ninfo)
    X1_out = X1 + X1_agg

    X2 = Backbone_Stage2(X1_out)
    # ... (the FEA insertion is repeated at the remaining depths;
    #      X_prev denotes the output of the last such stage)

    # Final backbone stage, SPPF pooling, and classification head
    X_final = SPPF(Backbone_FinalStage(X_prev))
    pooled = GlobalAvgPool(X_final)
    logits = FC(pooled)
    probs = softmax(logits)
    return probs

4. Training Protocols and Datasets

AGGRNet is evaluated on five prominent medical imaging datasets: LIMUC (ulcerative colitis severity, 11,276 images), ISIC 2018 (skin lesion classification, 10,015 images), Kvasir (gastrointestinal imaging, 4,000 images across 8 classes), PathMNIST (histopathology, 107,180 images), and RetinaMNIST (diabetic retinopathy, 1,600 images). Input images are resized to $224 \times 224$ and normalized per standard practice. Optimization employs SGD (lr = 0.01, momentum = 0.937, weight decay = $5 \times 10^{-4}$) with ImageNet-pretrained YOLOv11 weights, and training is conducted end-to-end. Training uses cross-entropy loss for classification; for the ordinal LIMUC task, Quadratic Weighted Kappa (QWK) and mean absolute error (MAE) are reported as auxiliary metrics. Performance is reported using accuracy, Macro-F1, precision, recall, and AUC where appropriate.
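
A minimal sketch of this training configuration in PyTorch is shown below; the data pipeline, the ImageNet normalization statistics, and the AGGRNet constructor are placeholders assumed for illustration:

from torch import nn, optim
from torchvision import transforms

# Preprocessing: resize to 224x224 and normalize (ImageNet statistics assumed).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = AGGRNet(num_classes=8)   # hypothetical constructor, e.g. 8 classes for Kvasir
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.937, weight_decay=5e-4)

def train_one_epoch(loader):
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()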

5. Empirical Results and Benchmark Comparisons

Key quantitative results across benchmarks are summarized in the following table:

Dataset                  Baseline Model     Baseline Acc. / Macro-F1   AGGRNet Acc. / Macro-F1   Δ Acc. / Macro-F1
Kvasir (8-cls)           HiFuse-Small       0.8612 / 0.8613            0.9160 / 0.9160           +0.0548 / +0.0547
LIMUC (MES 0–3)          CDW-CE (Inc-v3)    0.788 / 0.726              0.810 / 0.739             +0.022 / +0.013
ISIC 2018 (7-cls)        HiFuse-Base        0.8585 / 0.7532            0.8710 / 0.7710           +0.0125 / +0.0178
PathMNIST (9-cls)        HiFuse-Base        0.911                      0.926                     +0.015
RetinaMNIST (5-grades)   HiFuse-Base        0.528                      0.537                     +0.009

AGGRNet yields a maximum improvement of approximately 5.5% accuracy on the Kvasir dataset, outperforming existing attention-based and fusion benchmarks. For ordinal grading on LIMUC, improvements are observed not only in accuracy and Macro-F1, but also in QWK (0.881 vs. 0.868) and MAE (0.194 vs. 0.215).
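
For reference, the ordinal metrics reported for LIMUC can be computed with scikit-learn as sketched below, assuming predictions and ground truth are integer Mayo Endoscopic Scores in {0, 1, 2, 3}:

from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def ordinal_metrics(y_true, y_pred):
    # Quadratic Weighted Kappa penalizes large ordinal disagreements more heavily.
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    # Mean absolute error over the ordinal grade indices.
    mae = mean_absolute_error(y_true, y_pred)
    return qwk, mae

# Example: ordinal_metrics([0, 1, 2, 3, 2], [0, 1, 3, 3, 2])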

6. Component Analysis and Ablation Studies

Several ablation experiments illuminate the architectural design:

  • Channel Attention – C2PCA vs. C2PSA: Replacing C2PSA with C2PCA (YOLOv11+C2PCA backbone) increases accuracy from 0.775 to 0.793. This indicates that channel attention (C2PCA) more effectively complements the global FEA module.
  • FEA Insertions: Performance monotonically increases with additional FEA insertions: a single FEA module yields accuracy of 0.791, two modules yield 0.793, and three achieve 0.807. Adding SPPF as final pooling further increases accuracy to 0.810.
  • Mechanistic Justification: Selective extraction limits activation on non-relevant regions, reducing false positives. The contrastive cross-attention ($Q = X_{\mathrm{info}} + X_{\mathrm{ninfo}}$, $K = X_{\mathrm{info}} - X_{\mathrm{ninfo}}$) assigns higher attention scores to regions where $\|X_{\mathrm{info}}\| \gg \|X_{\mathrm{ninfo}}\|$, mathematically favoring clinically salient regions (see the worked expansion after this list).
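
To make this explicit, consider the pre-softmax score for a query and key taken at the same spatial position (an illustrative simplification of the full attention map). Because the cross terms cancel,

$$q \cdot k = (X_{\mathrm{info}} + X_{\mathrm{ninfo}}) \cdot (X_{\mathrm{info}} - X_{\mathrm{ninfo}}) = \|X_{\mathrm{info}}\|^2 - \|X_{\mathrm{ninfo}}\|^2$$

so the score is large and positive exactly where the informative component dominates and negative where the non-informative component does, which the softmax converts into correspondingly higher attention weights on informative regions.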

Empirical heatmaps via Grad-CAM confirm this behavior, showing better localization and higher activation on pathological structures compared to broader, less focused attention in baseline models.
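
A minimal sketch of how such Grad-CAM heatmaps can be produced with manual PyTorch hooks is given below; model and target_layer are placeholders (e.g. the last backbone block before pooling), not identifiers from the paper:

import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    feats, grads = {}, {}

    def fwd_hook(module, inputs, output):
        feats["v"] = output.detach()

    def bwd_hook(module, grad_input, grad_output):
        grads["v"] = grad_output[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    logits = model(image)                 # image: (1, 3, 224, 224)
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    # Weight each channel by its average gradient, sum over channels, rectify.
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * feats["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)       # heatmap normalized to [0, 1]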

7. Broader Context, Limitations, and Significance

AGGRNet’s selective extraction and contrastive feature integration methodology demonstrates generalizability across multiple medical image domains with various class counts, image types, and annotation regimes. The modularity of the FEA unit allows integration with other advanced backbones and pooling strategies. A plausible implication is that adaptive thresholding and explicit modeling of non-informative regions will be increasingly important in future deep medical image classifiers, especially for fine-grained and ordinal tasks. The strengths documented in both ordinal disease grading and multiclass subtype differentiation position AGGRNet as a competitive reference for subsequent developments in medical image analysis (Makwe et al., 15 Nov 2025).
