CAFCT-Net: Hybrid CT Analysis

Updated 5 February 2026

CAFCT-Net names two independent models: a CNN-Transformer fusion network for liver tumor segmentation and an attention-based MIL network for cardiac CT analysis.
The liver variant employs Attentional Feature Fusion, Atrous Spatial Pyramid Pooling, and Attention Gates to improve feature extraction and segmentation accuracy.
The cardiac variant fuses geometric and textural features via a 3D-VCAE and attention-based MIL to detect functionally significant coronary stenosis.

CAFCT-Net refers to two independent models with distinct architectures and biomedical applications. One is a CNN–Transformer hybrid network for liver tumor segmentation in CT, described in "CAFCT-Net: A CNN-Transformer Hybrid Network with Contextual and Attentional Feature Fusion for Liver Tumor Segmentation" (Kang et al., 2024). The other, described in Zreik et al. (2019), is a multi-view, attention-based multiple instance learning (MIL) architecture for detecting functionally significant coronary artery stenosis in cardiac CT angiography (CCTA). Both networks rely on feature fusion and attention mechanisms, but they differ in clinical task, input modality, and architectural principles.

1. Overview of CAFCT-Net Variants

CAFCT-Net for liver tumor segmentation (Kang et al., 2024) is a dual-encoder, single-decoder model synthesizing local and global information via a CNN branch and a Transformer branch, respectively. CAFCT-Net in cardiac CT classification (Zreik et al., 2019) combines variational and convolutional autoencoders (CAEs) for structured feature extraction of coronary vessels and myocardium, followed by MIL pooling with attention for patient-level disease determination.

CAFCT-Net Reference	Domain	Core Architecture	Primary Task
(Kang et al., 2024)	Abdominal CT	CNN–Transformer hybrid, U-shape	Liver tumor semantic segmentation
(Zreik et al., 2019)	Cardiac CT Angio	3D-VCAE + 1D-CAE + CNN/CAE + MIL	Patient-level detection of functionally significant stenosis

2. CAFCT-Net for Liver Tumor Segmentation

CAFCT-Net (Kang et al., 2024) targets precise semantic segmentation of hepatic tumors on the LiTS CT benchmark. The architecture employs:

Dual Encoders:
- CNN branch: Downsampling with Conv+BN+ReLU and max pooling, doubling channel dimensions across stages.
- Transformer branch: Patchifying input slices (patch size 16), linear embedding, and sequential Transformer encoder blocks.
Attentional Feature Fusion (AFF):
- Canonical squeeze-and-excitation mechanism: concatenation of CNN and Transformer features, global average pooling, a two-layer gating MLP with reduction ratio $r = 16$, sigmoid excitation, and channel-wise feature reweighting.
Atrous Spatial Pyramid Pooling (ASPP):
- Parallel convolutions capturing multi-scale features via dilation rates $6$, $12$, and $18$, together with 1x1 convolutions and global average pooling, concatenated and depth-reduced to maintain channel count.
Attention Gates (AGs):
- Skip connections are modulated by AGs, which combine their inputs additively, apply a ReLU nonlinearity, and map the result through a sigmoid to produce attention coefficients. These coefficients are multiplied element-wise with the encoder features.
Decoder:
- Progressive upsampling, concatenation with AG-modulated skip connections, convolutional refinement, and sigmoid output for pixel-wise segmentation.
Loss Function:
- Combined binary cross-entropy (BCE) and Dice loss: $L = \lambda L_{CE} + (1-\lambda) L_{Dice}$ with $\lambda = 0.5$.
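The combined loss above can be illustrated with a minimal pure-Python sketch. The soft Dice formulation, the smoothing constant `eps`, and the function name `bce_dice_loss` are assumptions made for illustration; the source only states the weighted sum and $\lambda = 0.5$.

```python
import math

def bce_dice_loss(probs, targets, lam=0.5, eps=1e-7):
    """Weighted sum lam * BCE + (1 - lam) * Dice over flat lists of pixels.

    probs: predicted foreground probabilities in [0, 1].
    targets: ground-truth labels (0 or 1).
    The soft Dice form and eps are assumptions, not taken from the paper.
    """
    n = len(probs)
    # Clamp probabilities so that log() stays finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
               for p, t in zip(clipped, targets)) / n
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return lam * bce + (1.0 - lam) * dice

# A maximally uncertain prediction on a half-foreground mask.
print(bce_dice_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

With $\lambda = 0.5$ both terms contribute equally, so the BCE term keeps per-pixel gradients well behaved while the Dice term counteracts the foreground/background imbalance.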

Data and Optimization

Dataset: LiTS, 200 contrast-enhanced CT volumes (intensities windowed to [−200, 250] Hounsfield Units, slices resized to 512×512).
Training Parameters: SGD optimizer (momentum $0.9$, decay $1\mathrm{e}{-4}$), learning rate $0.001$ with step decay, batch size $8$, $100$ epochs.
Augmentation: Flips, rotations ($\pm15^\circ$), scaling ($0.9$–$1.1$).
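As a small illustration of the Hounsfield range quoted above, the sketch below clips intensities to [−200, 250] HU. The subsequent linear rescaling to [0, 1] and the function name `window_hu` are assumptions; the source does not say how intensities are normalized after windowing.

```python
def window_hu(values, lo=-200.0, hi=250.0):
    """Clip Hounsfield Units to [lo, hi], then rescale linearly to [0, 1].

    The rescaling step is an assumed normalization, not stated in the source.
    """
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]

# Air, window floor, soft tissue, window ceiling, dense bone.
print(window_hu([-1000.0, -200.0, 25.0, 250.0, 1200.0]))
# -> [0.0, 0.0, 0.5, 1.0, 1.0]
```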

Performance

CAFCT-Net achieves $82.82\%$ mIoU and $90.38\%$ Dice, outperforming U-Net, H-DenseUNet, DeepLabv3+, Attention U-Net, and PVTFormer. It delivers sharper boundary localization, suppression of false positives, and better retention of irregular tumor morphology.

Method	mIoU (%)	Dice (%)
U-Net	59.43	67.50
H-DenseUNet	69.54	82.40
DeepLabv3+	72.33	82.97
Attention U-Net	76.33	84.06
PVTFormer	78.46	86.78
CAFCT-Net	82.82	90.38

Ablative analysis indicates a $2.0\%$ Dice decrease when omitting AFF, $1.5\%$ loss for ASPP removal, and $1.8\%$ loss for exclusion of AGs.
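For reference, the two reported metrics can be computed from a pair of binary masks as in the sketch below. It scores a single flattened mask; how the paper averages over classes or cases to obtain mIoU is not specified here, and the convention of scoring two empty masks as 1.0 is an assumption.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp + fp + fn == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou

dice, iou = dice_and_iou([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(dice, iou)
```

For a single mask the two are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. That ordering is consistent with every row of the table.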

3. CAFCT-Net for Cardiac CT Angiography

CAFCT-Net (Zreik et al., 2019) addresses non-invasive detection of functionally significant stenosis ($\text{FFR} \leq 0.8$) by fusing geometric and textural features from both the coronary artery tree and the left ventricular (LV) myocardium.

Coronary Artery Processing:
- Stage I: 3D variational CAE (3D-VCAE) encodes $40\times40\times5$ centerline-normal MPR patches into $16$-d latent vectors per patch.
- Stage II: Per-channel 1D CAEs summarize each artery into a fixed $1024$-dim vector via channel-wise sequence encoding.
Myocardium Processing:
- Myocardium segmentation via a multi-scale, multi-view 3D CNN.
- Patches within the myocardium mask are clustered; a 3D CAE encodes the structural clusters, and the final feature is the concatenation of the mean and variance of the encodings ($256$-dim).
Attention-based Multiple Instance Learning:
- For each patient, per-artery and global myocardium features are fused via small fully connected subnetworks, then aggregated using attention-based MIL pooling as in Ilse et al.
- Final output is a patient-level FFR ≤ 0.8 probability.
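The myocardium descriptor described in the list above (mean and variance of the cluster encodings, concatenated) can be sketched as follows. The use of population variance and the function name `myocardium_descriptor` are assumptions. A 128-dimensional encoding would yield the stated 256 dimensions, but that encoding size is inferred and not stated in the source.

```python
from statistics import fmean, pvariance

def myocardium_descriptor(encodings):
    """Concatenate the per-dimension mean and variance across cluster encodings.

    encodings: list of equal-length vectors, one per myocardium cluster.
    Population variance is an assumption; the source does not specify it.
    """
    dims = list(zip(*encodings))  # transpose: one tuple per latent dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means + variances

# Two toy clusters with 2-d encodings give a 4-d descriptor.
print(myocardium_descriptor([[1.0, 2.0], [3.0, 6.0]]))
# -> [2.0, 4.0, 1.0, 4.0]
```

The descriptor length is fixed regardless of how many clusters a patient has, which is what allows it to feed a fully connected subnetwork.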

Optimization and Evaluation

Data: CCTA scans of 126 patients; centerlines and masks are extracted fully automatically, with no hand-crafted features or manual corrections.
Training: Adam optimizer ($1\mathrm{e}{-4}$), stratified 5-fold split, 200,000 iterations, ensemble of last 10 snapshots.
Losses: CAE pretraining with VAE or MSE loss; MIL classifier with cross-entropy.

Configuration	AUC ($\pm$ std)	Sensitivity	Specificity (at sensitivity 0.70)
Coronary + myocardium	0.74 ± 0.01	0.70	0.70
Coronary only	0.62 ± 0.01	0.70	0.46
Myocardium only	0.66 ± 0.03	0.70	0.55
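The table reports specificity at a fixed sensitivity of 0.70. A minimal sketch of how such an operating point can be read off a set of scores is given below; the function name, the toy scores, and the rule of taking the highest threshold that reaches the target sensitivity are illustrative assumptions, not the authors' evaluation code.

```python
def specificity_at_sensitivity(scores, labels, target_sens=0.70):
    """Return (threshold, sensitivity, specificity) at the highest threshold
    whose sensitivity reaches target_sens. A case is called positive when
    its score is >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        sens = tp / positives
        if sens >= target_sens:
            return thr, sens, 1.0 - fp / negatives
    raise ValueError("target sensitivity not reachable")

# Toy patient-level probabilities and FFR-based labels (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(specificity_at_sensitivity(scores, labels))
# -> (0.6, 0.75, 0.75)
```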

Flow-based FFR-CT methods reach higher absolute performance (AUC $\sim 0.90$), but at the cost of increased manual pre-processing.

4. Core Modules and Innovations

Attentional Feature Fusion (AFF)

AFF concatenates local (CNN) and global (Transformer) features, applies global pooling, and modulates channels via a two-layer squeeze-and-excitation gate. Ablation shows that it is critical for harmonizing the two encoder representations: removing it lowers Dice by $2.0\%$.
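The squeeze-and-excitation style gating can be sketched in plain Python as below. The function name `aff_fuse`, the toy weights, and the tiny channel count are illustrative assumptions; in the model the weights are learned and the reduction ratio is 16, and any channel reduction applied after reweighting is not covered here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aff_fuse(cnn_feats, trans_feats, w1, w2):
    """Squeeze-and-excitation style fusion of two feature sets.

    cnn_feats, trans_feats: lists of channels, each a flat list of spatial values.
    w1: (C / r) x C weight matrix, w2: C x (C / r) weight matrix, where C is
    the total channel count after concatenation (biases omitted for brevity).
    Returns (reweighted channels, per-channel gates).
    """
    fused = cnn_feats + trans_feats                      # channel concatenation
    squeeze = [sum(ch) / len(ch) for ch in fused]        # global average pooling
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeeze))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    return [[g * v for v in ch] for g, ch in zip(gates, fused)], gates

# Toy example: 2 + 2 channels, reduction ratio 2 (hypothetical values).
w1 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
out, gates = aff_fuse([[1.0, 3.0], [2.0, 2.0]], [[4.0, 0.0], [0.0, 8.0]], w1, w2)
print(gates)
```

Each gate lies in (0, 1), so the mechanism can only attenuate channels relative to one another; it does not mix spatial positions.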

Atrous Spatial Pyramid Pooling (ASPP)

Used only in (Kang et al., 2024), ASPP adds multi-scale context to the decoder inputs, preserving boundary cues and capturing large-scale lesion structure. Ablation experiments attribute a $\sim1.5\%$ Dice improvement in large tumor delineation to this module.
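To make the multi-scale effect concrete, the sketch below computes the effective extent of a dilated kernel, $k_{\text{eff}} = k + (k-1)(d-1)$, for the three dilation rates used in the model. The kernel size of 3 is an assumption (the usual ASPP choice); the source lists only the rates.

```python
def effective_kernel(k, d):
    """Spatial extent covered by a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def same_padding(k, d):
    """Padding that preserves spatial size for stride 1 and odd k."""
    return d * (k - 1) // 2

# Assumed 3 x 3 kernels with the dilation rates quoted in the text.
for d in (6, 12, 18):
    print(d, effective_kernel(3, d), same_padding(3, d))
# -> 6 13 6
# -> 12 25 12
# -> 18 37 18
```

Under that assumption the three branches see 13, 25, and 37 pixel windows with the same nine weights each, which is how ASPP widens context without adding parameters per branch.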

Attention Gates (AGs)

AGs gate the skip connections between encoder and decoder with a learnable, spatially varying mask. They suppress irrelevant context and preserve salient regions, with ablation indicating a contribution of $\sim1.8\%$ Dice.
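A per-pixel sketch of additive attention gating is shown below. It follows the commonly used additive attention gate formulation, in which skip features are combined with a gating signal from the decoder; the function name, the weight matrices, and the assumption that both inputs are already at the same resolution are illustrative and not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_gate(x, g, w_x, w_g, psi, bias=0.0):
    """Additive attention gate at a single pixel.

    x: skip-connection feature vector; g: gating vector at the same location.
    w_x, w_g: projection matrices with one row per intermediate channel.
    psi: weights mapping the intermediate vector to a scalar.
    Returns (gated skip features, attention coefficient alpha).
    """
    # Additive combination followed by ReLU.
    inter = [max(0.0, dot(rx, x) + dot(rg, g)) for rx, rg in zip(w_x, w_g)]
    # Sigmoid maps the scalar response to an attention coefficient in (0, 1).
    alpha = sigmoid(dot(psi, inter) + bias)
    return [alpha * v for v in x], alpha

identity = [[1.0, 0.0], [0.0, 1.0]]
gated, alpha = attention_gate([1.0, 2.0], [0.5, 0.5], identity, identity, [1.0, 1.0])
print(alpha, gated)
```

Because alpha is a single scalar per location, the gate rescales all channels of the skip feature at that pixel by the same factor, which is what makes the mask spatially varying.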

Attention-based MIL Pooling

In the cardiac CT variant, attention scores dynamically weight per-artery instances in patient-level classification, optimizing the contribution of each coronary segment and facilitating learning in the presence of variable vessel count.
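The pooling step can be sketched as follows, using the attention-based MIL form of Ilse et al., $a_k = \operatorname{softmax}_k\big(w^\top \tanh(V h_k)\big)$ and $z = \sum_k a_k h_k$. The toy matrices `V` and `w`, the instance vectors, and the function name are assumptions for illustration; the gated variant of the same paper is not shown.

```python
import math

def attention_mil_pool(instances, V, w):
    """Attention-based MIL pooling over a variable number of instances.

    instances: list of K feature vectors h_k (K may differ per patient).
    V: matrix with one row per hidden unit; w: vector over hidden units.
    Returns (bag embedding z, attention weights a_k).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Numerically stable softmax over instances.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[i] for a, h in zip(weights, instances)) for i in range(dim)]
    return bag, weights

# Three toy per-artery feature vectors for one patient (hypothetical values).
arteries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, a = attention_mil_pool(arteries, [[1.0, 0.0], [0.0, 1.0]], [2.0, -1.0])
print(a, z)
```

Since the softmax normalizes over however many instances are present, the same parameters handle patients with different numbers of extracted arteries.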

5. Limitations and Future Directions

CAFCT-Net for liver tumor segmentation incurs higher computational and memory requirements because of its dual-encoder design and attention mechanisms. Performance on very small lesions is suboptimal, which is attributed to pixel-averaged loss functions that put small structures at a disadvantage. Future directions include integration of lightweight Transformer modules, dynamic selection of contextual fusion modules, and volumetric extension to enforce 3D spatial consistency (Kang et al., 2024).

For the cardiac CT application, current limitations are that the modules cannot be fully retrained end to end and that vessels are not paired with the myocardium segments they supply. Proposed improvements include joint optimization of all modules and anatomically localized pairing using a 17-segment myocardial bullseye model (Zreik et al., 2019).

6. Significance and Comparative Analysis

In both forms, CAFCT-Net demonstrates the importance of contextual, multi-view, and attentional fusion in complex medical image analysis. In liver tumor segmentation, the combination of CNN and Transformer encoders surpasses the compared networks in both mean IoU and Dice. In cardiac CT, fusing coronary artery and myocardium features with attention-based pooling outperforms the single-branch alternatives (AUC 0.74 versus 0.62 and 0.66).

In summary, the CAFCT-Net nomenclature encompasses architectures sharing a multi-module, attention-centric design philosophy but instantiated for distinct clinical and imaging contexts, each illustrating the benefits and challenges of deep multimodal feature fusion in biomedical artificial intelligence (Kang et al., 2024, Zreik et al., 2019).

References (2)

Kang et al. (2024). CAFCT-Net: A CNN-Transformer Hybrid Network with Contextual and Attentional Feature Fusion for Liver Tumor Segmentation.

Zreik et al. (2019). Combined analysis of coronary arteries and the left ventricular myocardium in cardiac CT angiography for detection of patients with functionally significant stenosis.
