
Small Lesions-Aware Bidirectional Fusion Network

Updated 9 February 2026
  • The paper introduces MMCAF-Net, a novel architecture that uses bidirectional and multiscale fusion to significantly improve small-lesion detection in lung cancer diagnosis.
  • It integrates volumetric PET/CT scans with structured EHR data using advanced attention mechanisms, achieving 10–15% AUROC improvements over baseline models.
  • The model effectively addresses class imbalance and captures fine-scale lesion features with oversampling and bidirectional feedback, leading to state-of-the-art performance.

The Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network (MMCAF-Net) is an advanced deep learning architecture specifically developed to address the problem of small-lesion misdiagnosis in lung disease classification. By integrating volumetric PET/CT scans with structured electronic health records (EHR), MMCAF-Net resolves cross-modal, cross-scale representation challenges inherent to multimodal medical data. The network employs bidirectional, multiscale feature propagation, multi-scale cross-attention, and small-lesion-aware convolutional attention mechanisms. Quantitative evaluation on the Lung-PET-CT-Dx dataset demonstrates that MMCAF-Net achieves new state-of-the-art performance in binary classification of lung cancer subtypes, particularly in demanding settings with class imbalance and subtle lesion manifestations (Yu et al., 6 Aug 2025).

1. Architectural Overview

MMCAF-Net is structured in three principal modules:

  • Vision Encoder: A 3D CNN backbone (PENet) augmented with a four-level feature pyramid, where each level incorporates an Efficient 3D Multi-Scale Convolutional Attention (E3D-MSCA) module. Bidirectional Feedback Propagation Units (BFPU) enable information flow both top-down and bottom-up between adjacent pyramid levels.
  • Tabular Encoder: A Kolmogorov–Arnold Network (KAN) processes EHR/tabular data, mapping each patient’s feature vector into three scales of embeddings by successive linear projections, forming an inverted pyramid structure.
  • Fusion and Classifier: At each scale, a Multi-Scale Cross-Attention (MSCA) module fuses the respective vision and tabular embeddings bidirectionally. Three cross-attended features are integrated by a Bidirectional Scale Fusion (BSF) block, producing a representation for final binary classification via a fully connected layer.

The bidirectional fusion mechanisms operate both within the vision encoder (via BFPU) and across vision-tabular modalities (via alternating query-key-value roles in MSCA blocks).

2. Multiscale Vision Encoding and Attention Mechanisms

MMCAF-Net extracts lesion-sensitive features from volumetric inputs through the E3D-MSCA, applied at each level of the vision encoder’s feature pyramid:

  • 3D Channel Attention Block (CAB): Computes global average and max-pooling along the spatial (D, H, W) axes, followed by shared MLP projections and sigmoid gating to create a channel-wise attention mask:

$$M_c = \sigma\big( W_2\,\text{ReLU}(W_1\,\psi(x)) + W_2\,\text{ReLU}(W_1\,\phi(x)) \big)$$

with $x' = M_c \odot x$.

  • 3D Spatial Attention Block (SAB): Takes the mean and the maximum of the feature activations across channels, concatenates the two maps, and passes them through a 7×7×7 convolution and a sigmoid to form a spatial mask:

$$M_s = \sigma\big(\text{Conv}_{7\times7\times7}([\rho(x');\,\mu(x')])\big)$$

with $x'' = M_s \odot x'$.

  • 3D Depth-Wise Convolution Fusion Block (DCFB): Parallel depth-wise convolutions with varying kernel sizes (e.g., 3×3×3, 5×5×5, 7×7×7) are fused with a 1×1×1 convolution, yielding a multi-scale representation.
  • Bidirectional Feedback Propagation Unit (BFPU): For adjacent scales $F_a$ and $F_b$,

$$F_\text{mid} = \sigma\big(\text{Conv}_{3\times3\times3}(F_a) \odot \text{Conv}_{3\times3\times3}(F_b)\big), \quad F_\text{out} = [F_a + F_\text{mid} \odot F_a,\, F_b + F_\text{mid} \odot F_b]$$

propagating small-activation patterns bidirectionally between fine and coarse scales.
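The BFPU gating rule above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: both inputs are assumed to have already been resampled to the same shape (the text does not say how adjacent pyramid levels are aligned), the weights are random, the two outputs are returned as a pair rather than concatenated, and the helper names `conv3d_same` and `bfpu` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3d_same(x, w):
    """Naive stride-1, zero-padded 3D convolution (cross-correlation).
    x: (C_in, D, H, W); w: (C_out, C_in, k, k, k) with odd k."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (p, p)))
    _, D, H, W = x.shape
    out = np.zeros((w.shape[0], D, H, W))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                # Shifted view of the padded input for kernel offset (i, j, l)
                patch = xp[:, i:i + D, j:j + H, l:l + W]
                out += np.einsum("oc,cdhw->odhw", w[:, :, i, j, l], patch)
    return out

def bfpu(f_a, f_b, w_a, w_b):
    """F_mid = sigmoid(conv(F_a) * conv(F_b)); each input is re-weighted residually."""
    gate = sigmoid(conv3d_same(f_a, w_a) * conv3d_same(f_b, w_b))
    return f_a + gate * f_a, f_b + gate * f_b

# Toy shapes and random weights (assumptions for illustration only)
rng = np.random.default_rng(0)
C, D, H, W = 4, 3, 6, 6
f_a = rng.standard_normal((C, D, H, W))
f_b = rng.standard_normal((C, D, H, W))
w_a = 0.1 * rng.standard_normal((C, C, 3, 3, 3))
w_b = 0.1 * rng.standard_normal((C, C, 3, 3, 3))
out_a, out_b = bfpu(f_a, f_b, w_a, w_b)
```

Because the gate lies in (0, 1), each output element is the input scaled by a factor between 1 and 2, so locations where both scales agree are amplified rather than replaced.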

This hierarchical approach targets subtle, high-frequency lesion features and facilitates robust, scale-aware encoding.
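As an illustration of the channel-attention gate $M_c$ described in the list above, the following NumPy sketch applies average and max pooling over (D, H, W), a shared two-layer MLP, and a sigmoid. It is a sketch under assumptions: the hidden width (`C // 2` here) and the random weights are not specified in the text, and the function name `channel_attention` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (C, D, H, W); w1: (C_hidden, C); w2: (C, C_hidden).
    Returns the gated features x' = M_c * x and the mask M_c."""
    avg = x.mean(axis=(1, 2, 3))  # global average pooling, shape (C,)
    mx = x.max(axis=(1, 2, 3))    # global max pooling, shape (C,)

    def shared_mlp(v):
        # Same W1 and W2 are applied to both pooled descriptors
        return w2 @ np.maximum(w1 @ v, 0.0)

    m_c = sigmoid(shared_mlp(avg) + shared_mlp(mx))
    return m_c[:, None, None, None] * x, m_c

# Toy input and random weights (assumptions for illustration only)
rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((C, 4, 6, 6))
w1 = 0.5 * rng.standard_normal((C // 2, C))
w2 = 0.5 * rng.standard_normal((C, C // 2))
x_prime, m_c = channel_attention(x, w1, w2)
```

The spatial mask $M_s$ follows the same pattern with pooling over the channel axis instead, followed by the 7×7×7 convolution.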

3. Multimodal Fusion via Multi-Scale Cross-Attention

After spatial-scale-specific features are derived from the vision and tabular encoders, MMCAF-Net fuses these at each of three semantic scales using the MSCA module:

  • Cross-Attention: For vision image feature $I_s$ and tabular embedding $T_s$, features are projected to queries ($Q$), keys ($K$), and values ($V$), and processed by multi-head, scaled dot-product attention:

$$A_h = \text{Softmax}\left( \frac{Q_h K_h^\top}{\sqrt{C}} \right), \quad O_h = A_h V_h$$

The operation is performed in both directions (image → tabular, tabular → image), and the outputs are summed.

  • Bidirectional Scale Fusion (BSF): The cross-attended outputs $U_s$ and $O'_s$ are linearly aligned (written $\bar{U}_s$ and $\bar{O}_s$ below), importance weights are computed:

$$\alpha_s = \text{Softmax}( \bar{U}_s \odot \bar{O}_s )$$

and the two are fused:

$$F_s^\text{fuse} = \alpha_s \odot \bar{U}_s + (1-\alpha_s) \odot \bar{O}_s$$

Final fusion concatenates the fused features from all scales before classification.
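The bidirectional cross-attention step can be sketched as follows. This is a hedged illustration, not the paper's code: the scale factor uses the per-head width as $C$, the token counts and random projection weights are made up, and, since the text does not say how outputs of different lengths are summed, each direction is mean-pooled over its tokens before the sum. The function name `multi_head_attention` is hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q_src, kv_src, wq, wk, wv, heads):
    """q_src: (N_q, E); kv_src: (N_kv, E); wq, wk, wv: (E, E)."""
    n_q, e = q_src.shape
    c = e // heads  # per-head channel width (assumed to be the C in the formula)

    def split(t):
        # (N, E) -> (heads, N, c)
        return t.reshape(t.shape[0], heads, c).transpose(1, 0, 2)

    q, k, v = split(q_src @ wq), split(kv_src @ wk), split(kv_src @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))  # (heads, N_q, N_kv)
    out = attn @ v                                         # (heads, N_q, c)
    return out.transpose(1, 0, 2).reshape(n_q, e), attn

rng = np.random.default_rng(0)
E, heads = 8, 2
img = rng.standard_normal((5, E))  # 5 image tokens (hypothetical)
tab = rng.standard_normal((3, E))  # 3 tabular tokens (hypothetical)

def random_weights():
    return [0.3 * rng.standard_normal((E, E)) for _ in range(3)]

# Image tokens query the tabular tokens, then the roles are swapped
img_q_tab, attn_it = multi_head_attention(img, tab, *random_weights(), heads=heads)
tab_q_img, attn_ti = multi_head_attention(tab, img, *random_weights(), heads=heads)

# Assumption: pool over tokens so the two directions can be summed
fused = img_q_tab.mean(axis=0) + tab_q_img.mean(axis=0)
```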

The MSCA and BSF mechanisms enable effective feature alignment and joint decision-making across imaging and EHR modalities, resolving dimensionality mismatches.
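The BSF weighting and the final concatenation can be sketched in a few lines of NumPy. The softmax axis (taken over the feature dimension here), the per-scale feature sizes, and the name `bsf_fuse` are assumptions made for illustration; the aligned inputs are random stand-ins for $\bar{U}_s$ and $\bar{O}_s$.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bsf_fuse(u_bar, o_bar):
    """alpha = softmax(u_bar * o_bar); fused = alpha * u_bar + (1 - alpha) * o_bar."""
    alpha = softmax(u_bar * o_bar, axis=-1)
    return alpha * u_bar + (1.0 - alpha) * o_bar, alpha

rng = np.random.default_rng(0)
dims = (16, 8, 4)  # hypothetical feature sizes for the three scales
fused_scales = []
for d in dims:
    u_bar = rng.standard_normal(d)
    o_bar = rng.standard_normal(d)
    fused_s, alpha_s = bsf_fuse(u_bar, o_bar)
    fused_scales.append(fused_s)

# Final fusion: concatenate the per-scale fused features before the classifier
final_feature = np.concatenate(fused_scales)
```

Since every weight lies in [0, 1], each fused element is a convex combination of the two inputs, so it always falls between them.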

4. Small-Lesion Sensitivity and Class Imbalance Handling

Small-lesion detection is systematically embedded in MMCAF-Net through architectural and training design:

  • The E3D-MSCA module emphasizes multi-scale, high-frequency volumetric features associated with small lesions.
  • The BFPU ensures that fine-scale (high-resolution) information propagates upward and downward, maintaining sensitivity to subtle lesion cues at all feature hierarchy levels.
  • Cross-attention at the finest scale enhances the integration of small-lesion indications between imaging and EHR.
  • To address class imbalance, particularly for squamous cell carcinoma (which typically presents with small lesions), the minority class is randomly oversampled in the training set (squamous cases: 34 to 198). No custom or focal loss is applied beyond this standard class-balancing step.

These methods collectively increase discriminatory power for challenging, low-prevalence lesion instances.
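Random oversampling of the minority class can be sketched with the standard library. The case identifiers and the function name `random_oversample` are hypothetical, and the sketch assumes the 198 training cases consist of the 34 originals plus duplicates drawn with replacement, which the text does not spell out.

```python
import random

def random_oversample(samples, target_size, seed=0):
    """Keep every original sample and draw duplicates with replacement
    until the list reaches target_size."""
    if target_size < len(samples):
        raise ValueError("target_size must be at least len(samples)")
    rng = random.Random(seed)
    extra = [rng.choice(samples) for _ in range(target_size - len(samples))]
    return list(samples) + extra

squamous = [f"sq_{i:03d}" for i in range(34)]  # hypothetical case IDs
balanced = random_oversample(squamous, 198)
```

Oversampling should be applied to the training split only, so that duplicated cases never appear in validation or test data.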

5. Training Protocol and Data Processing

MMCAF-Net is trained and evaluated on the Lung-PET-CT-Dx dataset (355 subjects, CT/PET and EHR):

  • Splits: Training utilizes 251 adenocarcinoma and 198 oversampled squamous carcinoma cases; validation and testing have 12 and 15 instances per class, respectively, maintaining original ratios.
  • Preprocessing: Each CT/PET volume is reduced to 12 slices (192×192 pixels), with normalization and data augmentation (random rotation, sharpening, intensity normalization). Tabular categorical features are one-hot or embedded; continuous features are standardized.
  • Optimization: Training uses SGD with learning rate 1e-4, weight decay 1e-2, momentum 0.9, batch size 4 (4 patients × 12 slices), for 50 epochs. The binary cross-entropy loss is used:

$$\mathcal{L}_\text{cls}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1-y_i) \log (1-\hat{y}_i) \right]$$

  • Algorithmic Flow: Training and inference protocols are precisely described in Algorithm 1 and Algorithm 2 of the paper.
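The binary cross-entropy loss given above can be written directly from the formula. This is a plain-Python sketch; the clipping constant `eps` is an added safeguard against `log(0)` and is not mentioned in the source, and the example labels and probabilities are made up.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); assumption
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Made-up labels and predicted probabilities for four patients
loss = binary_cross_entropy([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

A prediction of 0.5 for every sample gives a loss of $\log 2$, which is a useful reference point when reading training curves.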

6. Empirical Evaluation

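On the Lung-PET-CT-Dx test set, MMCAF-Net is compared with six contemporary multimodal methods. In the figures reported below it attains the highest ACC, F1, Spec, PPV, and NPV, ties PECon for the best Sens (0.667), and matches PECon and MedFuse in AUROC (0.786), while MMTM reports a higher AUROC (0.802):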

Method      AUROC   ACC     F1      Spec    Sens    PPV     NPV
PECon       0.786   0.744   0.645   0.786   0.667   0.625   0.815
MedFuse     0.786   0.721   0.571   0.821   0.533   0.615   0.767
Drfuse      0.613   0.676   0.353   0.800   0.333   0.375   0.769
MMTM        0.802   0.698   0.581   0.750   0.600   0.562   0.778
PEfusion    0.740   0.721   0.600   0.786   0.600   0.600   0.786
daft        0.727   0.729   0.667   0.786   0.650   0.684   0.759
MMCAF-Net   0.786   0.791   0.690   0.857   0.667   0.714   0.828
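For readers less familiar with the threshold-based columns, the sketch below computes ACC, F1, Spec, Sens, PPV, and NPV from confusion-matrix counts. The counts used (TP=10, FP=4, TN=24, FN=5) are hypothetical: the source does not report a confusion matrix, and they were chosen only because they reproduce the MMCAF-Net row after rounding. AUROC is omitted because it depends on the full score ranking, not on a single confusion matrix.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "Spec": tn / (tn + fp),   # true-negative rate
        "Sens": tp / (tp + fn),   # true-positive rate (recall)
        "PPV": tp / (tp + fp),    # precision
        "NPV": tn / (tn + fn),
    }

# Hypothetical counts consistent with the reported MMCAF-Net row
row = {k: round(v, 3) for k, v in binary_metrics(tp=10, fp=4, tn=24, fn=5).items()}
```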

Key ablations demonstrate that the E3D-MSCA yields 10–15% improvements in AUROC and F1 over baseline PENet. The MSCA fusion mechanism outperforms cross-attention, late-fusion, and CLIP-aligned fusion alternatives by 5–12% in accuracy and 3–7% in AUROC.

7. Algorithmic Synopsis

The algorithmic foundations of MMCAF-Net are implemented as follows:

  • Training: Each iteration applies vision encoding (feature pyramid, E3D-MSCA, BFPU), tabular encoding (KAN), multi-scale cross-attention and bidirectional scale fusion, followed by aggregation and binary classification. The total loss is minimized with SGD.
  • Inference: Testing uses the same forward pass as training, with dropout disabled and no parameter updates.

A step-by-step pseudocode for both training and inference procedures is provided in the primary manuscript.
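As a small companion to the training description, the following sketch shows one SGD update with momentum and weight decay, using the hyperparameters listed in Section 5 (learning rate 1e-4, momentum 0.9, weight decay 1e-2). The source does not state which SGD variant is used; this sketch assumes the common formulation in which weight decay is added to the gradient as an L2 term before the momentum buffer is updated. The parameter and gradient values are made up.

```python
def sgd_momentum_step(params, grads, velocity,
                      lr=1e-4, momentum=0.9, weight_decay=1e-2):
    """One update: g <- g + wd*w; v <- momentum*v + g; w <- w - lr*v."""
    new_params, new_velocity = [], []
    for w, g, v in zip(params, grads, velocity):
        g = g + weight_decay * w   # L2 weight decay folded into the gradient
        v = momentum * v + g       # momentum buffer
        new_velocity.append(v)
        new_params.append(w - lr * v)
    return new_params, new_velocity

# Toy scalar parameters and gradients (illustrative values only)
params, velocity = [1.0, -2.0], [0.0, 0.0]
grads = [0.5, -0.5]
params, velocity = sgd_momentum_step(params, grads, velocity)
```

In a full training loop the gradients would come from backpropagating the binary cross-entropy loss through the fusion network, and the velocity list would persist across iterations.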

Summary

MMCAF-Net provides an effective, modular approach for combining volumetric imaging data with structured EHR, with particular focus on small-lesion sensitivity and scale-resolved information flow. Its technical contributions, including multiscale convolutional attention, cross-modal bidirectional attention, and random oversampling paired with a standard binary cross-entropy loss, yield gains in diagnostic accuracy, specificity, and negative predictive value for challenging clinical applications (Yu et al., 6 Aug 2025).
