Small Lesions-Aware Bidirectional Fusion Network
- The paper introduces MMCAF-Net, a novel architecture that uses bidirectional and multiscale fusion to significantly improve small-lesion detection in lung cancer diagnosis.
- It integrates volumetric PET/CT scans with structured EHR data using advanced attention mechanisms, achieving 10–15% AUROC improvements over baseline models.
- The model effectively addresses class imbalance and captures fine-scale lesion features with oversampling and bidirectional feedback, leading to state-of-the-art performance.
The Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network (MMCAF-Net) is an advanced deep learning architecture specifically developed to address the problem of small-lesion misdiagnosis in lung disease classification. By integrating volumetric PET/CT scans with structured electronic health records (EHR), MMCAF-Net resolves cross-modal, cross-scale representation challenges inherent to multimodal medical data. The network employs bidirectional, multiscale feature propagation, multi-scale cross-attention, and small-lesion-aware convolutional attention mechanisms. Quantitative evaluation on the Lung-PET-CT-Dx dataset demonstrates that MMCAF-Net achieves new state-of-the-art performance in binary classification of lung cancer subtypes, particularly in demanding settings with class imbalance and subtle lesion manifestations (Yu et al., 6 Aug 2025).
1. Architectural Overview
MMCAF-Net is structured in three principal modules:
- Vision Encoder: A 3D CNN backbone (PENet) augmented with a four-level feature pyramid, where each level incorporates an Efficient 3D Multi-Scale Convolutional Attention (E3D-MSCA) module. Bidirectional Feedback Propagation Units (BFPU) enable information flow both top-down and bottom-up between adjacent pyramid levels.
- Tabular Encoder: A Kolmogorov–Arnold Network (KAN) processes EHR/tabular data, mapping each patient’s feature vector into three scales of embeddings by successive linear projections, forming an inverted pyramid structure.
- Fusion and Classifier: At each scale, a Multi-Scale Cross-Attention (MSCA) module fuses the respective vision and tabular embeddings bidirectionally. Three cross-attended features are integrated by a Bidirectional Scale Fusion (BSF) block, producing a representation for final binary classification via a fully connected layer.
The bidirectional fusion mechanisms operate both within the vision encoder (via BFPU) and across vision-tabular modalities (via alternating query-key-value roles in MSCA blocks).
2. Multiscale Vision Encoding and Attention Mechanisms
MMCAF-Net extracts lesion-sensitive features from volumetric inputs through the E3D-MSCA, applied at each level of the vision encoder’s feature pyramid:
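- 3D Channel Attention Block (CAB): Computes global average and max-pooling along the spatial (D, H, W) axes, followed by shared MLP projections and sigmoid gating to create a channel-wise attention mask:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}_{D,H,W}(F)) + \mathrm{MLP}(\mathrm{MaxPool}_{D,H,W}(F))\big)$$
with $F \in \mathbb{R}^{C \times D \times H \times W}$ and $M_c(F) \in \mathbb{R}^{C \times 1 \times 1 \times 1}$.
- 3D Spatial Attention Block (SAB): Averages and maximizes feature activations over channels, concatenates the results, and processes them via a 7×7×7 convolution and sigmoid to form a spatial mask:
$$M_s(F) = \sigma\big(\mathrm{Conv}^{7 \times 7 \times 7}([\mathrm{AvgPool}_{C}(F);\ \mathrm{MaxPool}_{C}(F)])\big)$$
with $M_s(F) \in \mathbb{R}^{1 \times D \times H \times W}$.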
- 3D Depth-Wise Convolution Fusion Block (DCFB): Parallel depth-wise convolutions with varying kernel sizes (e.g., 3×3×3, 5×5×5, 7×7×7) are fused with a 1×1×1 convolution, yielding a multi-scale representation.
- Bidirectional Feedback Propagation Unit (BFPU): For adjacent scales $i$ and $i+1$, features are passed both top-down and bottom-up, propagating small-activation patterns bidirectionally between fine and coarse scales.
This hierarchical approach targets subtle, high-frequency lesion features and facilitates robust, scale-aware encoding.
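The channel and spatial gating described above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the reduction ratio of 4, the ReLU inside the shared MLP, the random weights, and applying the channel mask before the spatial mask are all assumptions, and `conv3d_same` is a hypothetical helper written here only to keep the example dependency-free.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (C, D, H, W); w1: (C_r, C); w2: (C, C_r). Returns gated x and the mask."""
    avg = x.mean(axis=(1, 2, 3))          # global average pool over D, H, W
    mx = x.max(axis=(1, 2, 3))            # global max pool over D, H, W

    def mlp(v):                           # shared two-layer MLP (ReLU is an assumption)
        return w2 @ np.maximum(w1 @ v, 0.0)

    mask = sigmoid(mlp(avg) + mlp(mx))    # one gate per channel, shape (C,)
    return x * mask[:, None, None, None], mask

def conv3d_same(x, w):
    """Zero-padded cross-correlation. x: (C_in, D, H, W); w: (C_in, k, k, k) -> (D, H, W)."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    _, D, H, W = x.shape
    out = np.zeros((D, H, W))
    for a in range(k):
        for b in range(k):
            for c in range(k):
                patch = xp[:, a:a + D, b:b + H, c:c + W]
                out += np.tensordot(w[:, a, b, c], patch, axes=(0, 0))
    return out

def spatial_attention(x, w):
    """x: (C, D, H, W); w: (2, 7, 7, 7). Returns gated x and the mask."""
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, D, H, W)
    mask = sigmoid(conv3d_same(stacked, w))              # one gate per voxel
    return x * mask[None], mask

rng = np.random.default_rng(0)
C, D, H, W = 8, 4, 6, 6                                  # toy sizes, not the paper's
x = rng.normal(size=(C, D, H, W))
w1 = rng.normal(size=(C // 4, C)) * 0.1
w2 = rng.normal(size=(C, C // 4)) * 0.1
w_sp = rng.normal(size=(2, 7, 7, 7)) * 0.01

y, m_c = channel_attention(x, w1, w2)
z, m_s = spatial_attention(y, w_sp)
print(m_c.shape, m_s.shape, z.shape)
```

Both masks lie strictly between 0 and 1, so each block rescales features rather than replacing them; the output keeps the input shape `(C, D, H, W)`.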
3. Multimodal Fusion via Multi-Scale Cross-Attention
After spatial-scale-specific features are derived from the vision and tabular encoders, MMCAF-Net fuses these at each of three semantic scales using the MSCA module:
- Cross-Attention: For image feature $F_v$ and tabular embedding $F_t$, features are projected to queries ($Q$), keys ($K$), and values ($V$), and processed by multi-head, scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $d_k$ is the key dimension. The operation is performed in both directions (image→tabular, tabular→image), and the outputs are summed.
- Bidirectional Scale Fusion (BSF): The cross-attended outputs are linearly aligned to a common dimension, an importance weight is computed for each scale, and the weighted features are fused.
Final fusion concatenates the fused features from all scales before classification.
The MSCA and BSF mechanisms enable effective feature alignment and joint decision-making across imaging and EHR modalities, resolving dimensionality mismatches.
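A single-head NumPy sketch of the bidirectional cross-attention at one scale is given below. It is illustrative only: the token counts, the embedding width `d`, the random projection matrices, and the mean-pooling used to bring the two directions to a common shape before summing are assumptions, and the multi-head split is omitted for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_src, kv_src, wq, wk, wv):
    """Scaled dot-product attention: queries from one modality, keys/values from the other."""
    q, k, v = query_src @ wq, kv_src @ wk, kv_src @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 16                                         # hypothetical embedding width
f_v = rng.normal(size=(10, d))                 # 10 image tokens (assumed)
f_t = rng.normal(size=(3, d))                  # 3 tabular tokens (assumed)

def proj():
    return rng.normal(size=(d, d)) / np.sqrt(d)

# Image queries attend to tabular keys/values, and the reverse.
img_out, a_vt = attention(f_v, f_t, proj(), proj(), proj())
tab_out, a_tv = attention(f_t, f_v, proj(), proj(), proj())

# Pool each direction to a vector so the two outputs can be summed (assumption).
fused = img_out.mean(axis=0) + tab_out.mean(axis=0)
print(a_vt.shape, a_tv.shape, fused.shape)
```

Each row of an attention matrix sums to 1, so every query token receives a convex combination of the other modality's value vectors.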
4. Small-Lesion Sensitivity and Class Imbalance Handling
Small-lesion detection is systematically embedded in MMCAF-Net through architectural and training design:
- The E3D-MSCA module emphasizes multi-scale, high-frequency volumetric features associated with small lesions.
- The BFPU ensures that fine-scale (high-resolution) information propagates upward and downward, maintaining sensitivity to subtle lesion cues at all feature hierarchy levels.
- Cross-attention at the finest scale enhances the integration of small-lesion indications between imaging and EHR.
- To address class imbalance, particularly for squamous cell carcinoma (which typically presents with small lesions), the minority class in the training data is randomly oversampled (squamous: 34 to 198 cases). No custom or focal loss is applied beyond this standard class-balancing step.
These methods collectively increase discriminatory power for challenging, low-prevalence lesion instances.
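Random oversampling of the minority class can be sketched with the standard library alone. The sample identifiers and the fixed seed are placeholders; only the 34 and 198 counts come from the text above, and sampling with replacement from the original cases is an assumption about how the duplication is done.

```python
import random

def random_oversample(samples, target_size, seed=0):
    """Keep every original sample and draw extras with replacement up to target_size."""
    if target_size < len(samples):
        raise ValueError("target_size must be at least the number of samples")
    rng = random.Random(seed)
    extras = [rng.choice(samples) for _ in range(target_size - len(samples))]
    return list(samples) + extras

# Hypothetical case identifiers standing in for the 34 squamous training cases.
squamous = [f"squamous_{i:03d}" for i in range(34)]
balanced = random_oversample(squamous, 198)
print(len(balanced), len(set(balanced)))  # 198 34
```

Oversampling should be applied to the training split only, so that duplicated cases never appear in validation or test data.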
5. Training Protocol and Data Processing
MMCAF-Net is trained and evaluated on the Lung-PET-CT-Dx dataset (355 subjects, CT/PET and EHR):
- Splits: Training utilizes 251 adenocarcinoma and 198 oversampled squamous carcinoma cases; validation and testing have 12 and 15 instances per class, respectively, maintaining original ratios.
- Preprocessing: Each CT/PET volume is reduced to 12 slices (192×192 pixels), with normalization and data augmentation (random rotation, sharpening, intensity normalization). Tabular categorical features are one-hot or embedded; continuous features are standardized.
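- Optimization: Training uses SGD with learning rate 1e-4, weight decay 1e-2, momentum 0.9, batch size 4 (4 patients × 12 slices), for 50 epochs. The binary cross-entropy loss is used:
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $y_i \in \{0, 1\}$ is the label and $\hat{y}_i$ is the predicted probability for sample $i$.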
- Algorithmic Flow: Training and inference protocols are precisely described in Algorithm 1 and Algorithm 2 of the paper.
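To make the optimization settings concrete, the sketch below evaluates the binary cross-entropy on a toy batch and applies one SGD step with the stated learning rate, momentum, and weight decay. The update convention (weight decay added to the gradient, then a momentum buffer, as in common deep learning libraries) is an assumption, as are the toy labels, probabilities, weights, and gradients.

```python
import math

def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def sgd_step(w, grad, buf, lr=1e-4, momentum=0.9, weight_decay=1e-2):
    """One SGD update with L2 weight decay folded into the gradient (assumed convention)."""
    new_w, new_buf = [], []
    for wi, gi, bi in zip(w, grad, buf):
        g = gi + weight_decay * wi        # add the weight-decay term
        b = momentum * bi + g             # update the momentum buffer
        new_buf.append(b)
        new_w.append(wi - lr * b)         # parameter step
    return new_w, new_buf

# Toy batch of 4 predictions, matching the batch size of 4 patients.
loss = bce([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
w, buf = sgd_step(w=[1.0, -2.0], grad=[0.5, 0.5], buf=[0.0, 0.0])
print(round(loss, 4), w, buf)
```

With a learning rate of 1e-4 each step moves the weights only slightly, which is why the momentum buffer matters for accumulating a consistent direction over iterations.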
6. Empirical Evaluation
Quantitative results on the Lung-PET-CT-Dx test set demonstrate that MMCAF-Net achieves strong performance compared to six contemporary multimodal methods:
| Method | AUROC | ACC | F1 | Spec | Sens | PPV | NPV |
|---|---|---|---|---|---|---|---|
| PECon | 0.786 | 0.744 | 0.645 | 0.786 | 0.667 | 0.625 | 0.815 |
| MedFuse | 0.786 | 0.721 | 0.571 | 0.821 | 0.533 | 0.615 | 0.767 |
| DrFuse | 0.613 | 0.676 | 0.353 | 0.800 | 0.333 | 0.375 | 0.769 |
| MMTM | 0.802 | 0.698 | 0.581 | 0.750 | 0.600 | 0.562 | 0.778 |
| PEfusion | 0.740 | 0.721 | 0.600 | 0.786 | 0.600 | 0.600 | 0.786 |
| DAFT | 0.727 | 0.729 | 0.667 | 0.786 | 0.650 | 0.684 | 0.759 |
| MMCAF-Net | 0.786 | 0.791 | 0.690 | 0.857 | 0.667 | 0.714 | 0.828 |
Key ablations demonstrate that the E3D-MSCA yields 10–15% improvements in AUROC and F1 over baseline PENet. The MSCA fusion mechanism outperforms cross-attention, late-fusion, and CLIP-aligned fusion alternatives by 5–12% in accuracy and 3–7% in AUROC.
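The threshold-dependent columns of the table (ACC, F1, Spec, Sens, PPV, NPV) all follow from a single confusion matrix. The sketch below shows the definitions; the counts are hypothetical and were chosen only because they reproduce the rounded MMCAF-Net row, since the source does not report a confusion matrix. AUROC is not included because it depends on the full ranking of scores rather than on one threshold.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "Spec": tn / (tn + fp),
        "Sens": tp / (tp + fn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

# Hypothetical counts (not taken from the paper).
metrics = binary_metrics(tp=10, fn=5, tn=24, fp=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# ACC: 0.791
# F1: 0.690
# Spec: 0.857
# Sens: 0.667
# PPV: 0.714
# NPV: 0.828
```

On a test set this small, a single reclassified case shifts sensitivity by several percentage points, which is worth keeping in mind when comparing rows.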
7. Algorithmic Synopsis
The algorithmic foundations of MMCAF-Net are implemented as follows:
- Training: Each iteration applies vision encoding (feature pyramid, E3D-MSCA, BFPU), tabular encoding (KAN), multi-scale cross-attention and bidirectional scale fusion, followed by aggregation and binary classification. The total loss is minimized with SGD.
- Inference: Testing runs the same forward pass as training, with dropout disabled and no parameter updates.
A step-by-step pseudocode for both training and inference procedures is provided in the primary manuscript.
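The difference between the training and inference passes can be illustrated with a toy classifier head. This is a sketch under stated assumptions: the dropout rate of 0.5, its placement directly before the final linear layer, and the random weights are hypothetical, and the fused feature vector is a stand-in for the output of the fusion stage.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: active only in training, identity at inference."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)           # rescale so the expected value is unchanged

def classifier_head(features, w, b, p_drop, training, rng):
    """Dropout, linear layer, then sigmoid to give a class probability."""
    h = dropout(features, p_drop, training, rng)
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

rng = np.random.default_rng(0)
fused = rng.normal(size=32)               # stand-in for the fused multimodal feature
w, b = rng.normal(size=32) * 0.1, 0.0

p_train = classifier_head(fused, w, b, p_drop=0.5, training=True, rng=rng)
p_eval_1 = classifier_head(fused, w, b, p_drop=0.5, training=False, rng=rng)
p_eval_2 = classifier_head(fused, w, b, p_drop=0.5, training=False, rng=rng)
print(p_train, p_eval_1, p_eval_2)
```

The two inference calls return the same probability because no random mask is drawn, whereas the training call depends on the sampled mask.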
Summary
MMCAF-Net provides an effective, modular approach for combining volumetric imaging data with structured EHR, with particular focus on small-lesion sensitivity and scale-resolved information flow. Its technical contributions, including multiscale convolutional attention, cross-modal bidirectional attention, and oversampling-based class balancing, yield gains in diagnostic accuracy, specificity, and negative predictive value on the Lung-PET-CT-Dx dataset (Yu et al., 6 Aug 2025).