MFTC-Net: Multi-Aperture 3D Segmentation
- MFTC-Net is a neural architecture that fuses multi-aperture Swin Transformers with parallel 3D convolutions to capture both global and local anatomical features without image down-sampling.
- Its specialized 3D fusion blocks integrate feature maps through channel and spatial recalibration, effectively combining long-range context with detailed edge preservation.
- Extensive evaluation on the Synapse dataset shows that MFTC-Net achieves superior Dice (89.73±0.04%) and HD95 (7.31±0.02) metrics with roughly 40M parameters, outperforming previous models.
The Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) is a neural architecture designed for precise 3D medical image segmentation. Integrating Swin Transformer “apertures” with parallel 3D convolutional blocks, MFTC-Net introduces a multi-aperture strategy that captures global and local anatomical features without down-sampling the original image intensity grid. A series of 3D fusion blocks facilitates context-aware feature integration, yielding improved segmentation accuracy and boundary preservation with notable parameter efficiency. The approach demonstrates state-of-the-art (SOTA) results on the Synapse multi-organ dataset, accompanied by detailed ablation and architectural analyses (Shabani et al., 2024).
1. Network Architecture and Multi-Aperture Strategy
MFTC-Net’s encoder employs four parallel Swin Transformer branches (denoted T₁–T₄). Each branch, or “aperture”, operates on a different spatial window of the same 3D input patch while maintaining the original voxel grid resolution. For a patch at 1 mm isotropic resolution:
- T₁ processes the uncropped region, and
- T₂, T₃, and T₄ each process a central sub-region of the patch, with a different window size per branch.
A corresponding set of 3D convolutional blocks (C₁–C₄) operates in parallel, one per aperture. Because intensities are never down-sampled anywhere in the network, the high-frequency anatomical detail needed for accurate boundary delineation is preserved at every feature scale.
Each stage i (i = 1, …, 4) yields a pair of feature maps, one from the Swin Transformer Tᵢ and one from the convolutional block Cᵢ, which are merged by a 3D Fusion Block before progressing to the decoder pathway. The decoder itself employs symmetric 3D convolutional layers and skip-like connections to reconstruct the segmentation at full resolution.
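The aperture idea described in this section, cropping centered windows from one patch without any resampling, can be illustrated with a small sketch. The patch shape and window sizes below are hypothetical values chosen only for illustration, and the helper names `central_crop_bounds` and `crop3d` are not taken from the original work.

```python
def central_crop_bounds(shape, window):
    """Return per-axis (start, stop) index pairs for a centered window."""
    bounds = []
    for size, win in zip(shape, window):
        if win > size:
            raise ValueError("window larger than patch")
        start = (size - win) // 2
        bounds.append((start, start + win))
    return tuple(bounds)


def crop3d(volume, bounds):
    """Crop a nested-list volume indexed as volume[z][y][x]; voxels are copied, never resampled."""
    (z0, z1), (y0, y1), (x0, x1) = bounds
    return [[row[x0:x1] for row in plane[y0:y1]] for plane in volume[z0:z1]]


# Hypothetical patch shape and aperture windows (assumptions, not the published sizes).
PATCH_SHAPE = (8, 8, 8)
APERTURE_WINDOWS = {"T1": (8, 8, 8), "T2": (6, 6, 6), "T3": (4, 4, 4), "T4": (2, 2, 2)}

# Toy patch whose voxel value encodes its own (z, y, x) position.
patch = [[[z * 100 + y * 10 + x for x in range(8)] for y in range(8)] for z in range(8)]

apertures = {
    name: crop3d(patch, central_crop_bounds(PATCH_SHAPE, win))
    for name, win in APERTURE_WINDOWS.items()
}
```

Every aperture keeps the voxel spacing of the input; only the field of view changes, which is the property the section attributes to the multi-aperture design.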
2. Multi-Aperture 3D Fusion Block
At each aperture, a specialized 3D Fusion Block generates the fused feature representation by applying channel and spatial recalibrations and element-wise modulation. Its components are:
- a Squeeze-and-Excitation (SE) operation for channel-wise feature weighting,
- a Hadamard (element-wise) product that accentuates joint activations,
- spatial and channel attention mechanisms,
- a convolution that projects the concatenated output back to the required channel dimension, and
- a final non-linearity applied to the projected output.
This fusion paradigm integrates long-range dependency modeling of Transformers with the local sensitivity of convolutional kernels.
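One way the ingredients named above could be composed is sketched below on toy feature maps, each stored as a list of channels holding flattened voxel values. The order of operations, the pointwise (per-voxel) projection, the ReLU, and the simplified SE gate without a learned bottleneck are all assumptions made for illustration. Spatial attention is omitted, and the block should not be read as the published formulation.

```python
import math


def squeeze_excite(channels):
    """Toy SE gate: scale each channel by sigmoid(mean activation).
    A real SE block passes the pooled vector through a small learned MLP first."""
    gated = []
    for ch in channels:
        gate = 1.0 / (1.0 + math.exp(-sum(ch) / len(ch)))
        gated.append([gate * v for v in ch])
    return gated


def hadamard(a, b):
    """Element-wise product of two equally shaped feature maps."""
    return [[x * y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def pointwise_project(channels, weights):
    """Per-voxel linear mix of input channels (the effect of a pointwise convolution)."""
    n_vox = len(channels[0])
    return [
        [sum(w * ch[i] for w, ch in zip(row, channels)) for i in range(n_vox)]
        for row in weights
    ]


def relu(channels):
    return [[max(0.0, v) for v in ch] for ch in channels]


def fuse(f_trans, f_conv, weights):
    """Assumed composition: SE-gate both inputs, form their Hadamard product,
    concatenate the three maps along channels, project, then apply ReLU."""
    t, c = squeeze_excite(f_trans), squeeze_excite(f_conv)
    stacked = t + c + hadamard(t, c)  # channel concatenation (list concatenation)
    return relu(pointwise_project(stacked, weights))


# Toy inputs: 2 channels x 4 voxels per branch; weights would be learned in practice.
f_trans = [[1.0, -2.0, 0.5, 0.0], [0.2, 0.4, -0.6, 0.8]]
f_conv = [[0.5, 0.5, -1.0, 2.0], [1.0, 0.0, 0.0, -1.0]]
weights = [[1.0 / 6] * 6, [0.5, 0.0, 0.0, 0.5, 0.0, 0.0]]
fused = fuse(f_trans, f_conv, weights)
```

The projection maps the six concatenated channels back to two, mirroring the role the text assigns to the convolution that restores the required channel dimension.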
3. Layerwise Structure and Parameter Profile
The following summarizes the hierarchy and parameter allocation of the MFTC-Net (values are approximate):
| Stage | Channels | Parameters (M) |
|---|---|---|
| Patch Embedding | 48 | 0.8 |
| Swin T₁ (2 layers) | 48 | 3.4 |
| Downsample Conv | 96 | 0.5 |
| Swin T₂ (2 layers) | 96 | 6.8 |
| Downsample Conv | 192 | 1.2 |
| Swin T₃ (2 layers) | 192 | 13.6 |
| Downsample Conv | 384 | 2.4 |
| Swin T₄ (2 layers) | 384 | 13.6 |
| Fusion Blocks (all 4) | match inputs | 1.5 |
| Decoder (upconv/Conv3D) | 48–192 | 2.1 |
| Final Conv | #organs | 0.1 |
| Total | | ~40 |
The architecture is parameter-efficient: at approximately 40M parameters, it has roughly half the parameter count of 3D TransUNet (81M). The network fits within a 12 GB GPU memory budget at the reported patch size.
4. Optimization, Training Protocols, and Loss Formulation
MFTC-Net is optimized using a composite loss with two parts. The region term is the sum of Dice and cross-entropy losses. The distance term combines, via an element-wise product, the normalized signed distance map of each class with the corresponding binary surface indicator. Including the distance transform term penalizes misalignment at class surfaces, transferring a morphometric prior to the segmentation.
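The region part of this loss, the sum of Dice and cross-entropy, can be sketched for a single binary class as follows. The function names, the smoothing constants, and the flattened-list representation are illustrative assumptions. The distance-transform term is left out because its exact form and weighting are not given in the text above.

```python
import math


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|), with eps for numerical stability."""
    inter = sum(p * g for p, g in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def cross_entropy(probs, target, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped away from 0 and 1."""
    total = 0.0
    for p, g in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(probs)


def region_loss(probs, target):
    """Sum of Dice and cross-entropy, as described for the region term."""
    return dice_loss(probs, target) + cross_entropy(probs, target)


# Toy example: four voxels, ground truth marks the first two as foreground.
target = [1, 1, 0, 0]
good = [0.9, 0.8, 0.1, 0.2]
bad = [0.2, 0.1, 0.9, 0.8]
print(region_loss(good, target) < region_loss(bad, target))
```

A multi-class version would average the Dice term over classes and use categorical cross-entropy; the binary form is used here only to keep the sketch short.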
Training uses Adam optimization with weight decay, standard 3D augmentation (affine, elastic, and intensity perturbations), and 5-fold cross-validation over the 30-case Synapse benchmark (18 training and 12 testing cases per fold, 300 epochs).
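The reported 18/12 division differs from a strict 5-fold partition of 30 cases, which would give 24 training and 6 testing cases per fold. The sketch below therefore assumes five repeated random 18/12 splits; this is one plausible reading, and the procedure actually used may differ. The case identifiers, the seed, and the helper name `repeated_splits` are hypothetical.

```python
import random


def repeated_splits(case_ids, n_train, n_rounds, seed=0):
    """Draw n_rounds independent random train/test splits of the same cases."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_rounds):
        shuffled = list(case_ids)
        rng.shuffle(shuffled)
        train = sorted(shuffled[:n_train])
        test = sorted(shuffled[n_train:])
        splits.append((train, test))
    return splits


cases = list(range(1, 31))  # hypothetical identifiers for the 30 cases
splits = repeated_splits(cases, n_train=18, n_rounds=5)

for fold, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold}: {len(train)} train / {len(test)} test")
```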
5. Quantitative Performance and Comparative Evaluation
On the Synapse multi-organ dataset, MFTC-Net achieves the following summary metrics (mean ± std):
- Dice coefficient: 89.73 ± 0.04%,
- HD95: 7.31 ± 0.02.

Both results exceed those reported for UNETR and 3D TransUNet (HD95 is not reported for the latter), with a lower parameter count than either alternative. The method shows organ-wise improvements, particularly on spleen, kidneys, and aorta, while matching or exceeding prior results on liver, gallbladder, pancreas, and stomach.
Ablation analyses document the effect of the loss function and aperture depth: adding multi-aperture branches incrementally increases Dice, from the single-aperture configuration up to the full model with all four apertures and fusion. Dice+CE+DistLoss consistently outperforms other loss combinations, culminating in the SOTA result.
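For reference, the two metrics reported in this section can be computed as in the following sketch, which works on small sets of voxel coordinates. HD95 conventions vary between toolkits; this version takes the larger of the two directed 95th-percentile distances and uses linear-interpolation percentiles, both of which are assumptions rather than the evaluation code of the original work. Distances are in voxel units, and surface extraction is assumed to have been done beforehand.

```python
import math


def dice_coefficient(pred, truth):
    """Dice overlap between two sets of foreground voxel coordinates."""
    if not pred and not truth:
        return 1.0
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def directed_distances(src, dst):
    """Distance from each point in src to its nearest point in dst (dst must be non-empty)."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hd95(surface_a, surface_b):
    """Larger of the two directed 95th-percentile surface distances."""
    return max(
        percentile(directed_distances(surface_a, surface_b), 95),
        percentile(directed_distances(surface_b, surface_a), 95),
    )


# Toy example: two voxel sets that share one of their two voxels.
pred = {(0, 0, 0), (0, 0, 1)}
truth = {(0, 0, 1), (0, 0, 2)}
print(dice_coefficient(pred, truth))
print(hd95(pred, truth))
```

Because HD95 discards the largest 5% of surface distances, it is less sensitive to isolated outlier voxels than the maximum Hausdorff distance while still reflecting boundary quality.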
6. Computational Characteristics and Ablation Analysis
Inference on a single input volume completes in approximately 0.8–1.2 seconds on an RTX 3080 GPU. Ablations detailed in the original work include:
- Loss ablation: The highest Dice score (89.73%) is achieved with the Dice+CE+DistLoss hybrid.
- Multi-aperture ablation: Each additional aperture improves Dice, supporting the efficacy of multi-scale global context aggregation.
- Visualization: Qualitatively, MFTC-Net outputs cleaner, sharper boundaries and reduces disconnected false positives compared to Transformer or CNN-only baselines.
7. Design Principles and Segmentation Significance
MFTC-Net’s multi-aperture input strategy, in which intensities are never resampled, preserves critical organ surface geometry. Its fusion blocks enforce robust channel and spatial weighting by unifying the long-range context of Swin Transformers and the edge-focused sharpness of 3D convolutions. The morphometric (distance transform) loss adds a surface-aware penalty, directly optimizing for conformal boundary prediction as reflected in improved HD95 values.
Parameter efficiency is obtained through shallow Swin Transformer blocks and compact convolutional modules, demonstrating SOTA accuracy despite curtailed capacity. This suggests broader applicability to 3D medical segmentation tasks where high-resolution boundary preservation and computational efficiency are paramount.