Transformer-Enhanced 2D Segmentation Models
- Transformer-enhanced 2D segmentation models are hybrid architectures that combine the global context modeling of transformers with the spatial specificity of CNNs.
- These models employ multi-scale feature fusion, efficient tokenization, and dynamic attention mechanisms to address limitations like restricted receptive fields and attention collapse.
- They have demonstrated significant accuracy improvements in medical and biomedical image segmentation through strategies such as class-aware attention, adversarial feedback, and efficient architecture design.
Transformer-Enhanced 2D Segmentation Models encompass a range of architectures that integrate the global context modeling capability of transformers with the spatially-local strengths of convolutional neural networks (CNNs) for high-accuracy 2D image segmentation. These models address key limitations of vanilla CNNs and transformers—namely, the restricted receptive field and attention collapse, respectively—yielding state-of-the-art results across medical, biomedical, and surgical image segmentation tasks. Transformer enhancement refers both to hybrid designs, in which transformer blocks are interleaved or fused with CNN layers, and to principled tokenization, attention, and fusion mechanisms devised for the unique challenges of 2D segmentation.
1. Foundational Architectures and Hybridization Strategies
Early transformer-enhanced segmentation models adopted U-shaped encoder–decoder topologies inspired by U-Net, typically replacing or augmenting convolutional bottlenecks with transformer blocks. A canonical example is CASTformer (You et al., 2022), whose generator network (CATformer) utilizes a four-stage hybrid CNN + Transformer encoder and an “All-MLP” decoder. Key architectural elements include:
- Multi-scale feature extraction via feature pyramids (as in CASTformer), with each scale processed by transformer blocks after patchification.
- Dual-branch designs, including parallel ResNet and transformer encoders as in ParaTransCNN (Sun et al., 27 Jan 2024) and PCT-Fusion (Tiwari, 10 Jan 2024), with channel-attention–based fusion.
- Hierarchical transformers with explicit patch merging and windowed self-attention (e.g., Swin Transformer–style stages in MISSFormer (Huang et al., 2021), UKAST (Sapkota et al., 6 Nov 2025), and BEFUnet (Manzari et al., 13 Feb 2024)).
- Plug-and-play transformer blocks—ConvFormer (Lin et al., 2023) replaces vanilla ViT attention and MLP blocks with convolutional pooling, convolutional self-attention (CSA), and a convolutional feed-forward network (CFFN).
- Axial and spatial fusion—AFTer-UNet (Yan et al., 2021) models both intra-slice and inter-slice dependencies via axial fusion transformers.
Recent models often combine several of these strategies to address domain-specific constraints such as boundary precision (BEFUnet), surgical instrument discrimination (TAFE (Yuan et al., 23 Oct 2024)), or parameter efficiency (TransUKAN (Wu et al., 23 Sep 2024), UKAST).
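To make the hybridization pattern concrete, the following is a minimal PyTorch sketch of a parallel CNN + transformer encoder stage with concatenation-based fusion, in the spirit of the dual-branch designs above. It is an illustrative simplification: all module names, dimensions, and hyperparameters are assumptions, not the published implementations.

```python
import torch
import torch.nn as nn

class TransformerBranch(nn.Module):
    """Patchify a feature map and apply standard transformer encoder layers."""
    def __init__(self, in_ch, dim=256, depth=2, heads=8, patch=2):
        super().__init__()
        self.patchify = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patchify(x)                       # (B, dim, H/p, W/p)
        b, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)         # (B, HW, dim)
        seq = self.encoder(seq)                         # global self-attention
        return seq.transpose(1, 2).reshape(b, c, h, w)  # back to a 2D map

class HybridStage(nn.Module):
    """One encoder stage: a CNN branch for local detail and a transformer
    branch for global context, concatenated and projected to one feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.trans = TransformerBranch(in_ch, dim=out_ch, patch=2)
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, x):
        local_feat = self.cnn(x)
        global_feat = self.trans(x)
        return self.fuse(torch.cat([local_feat, global_feat], dim=1))
```

Stacking several such stages at decreasing resolution and increasing width yields the U-shaped encoder; the per-stage outputs serve as skip connections to the decoder.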
2. Transformer Enhancement Mechanisms
Enhancement mechanisms in hybrid models operate at several granularities:
- Class-Aware and Local Attention: CASTformer’s Class-Aware Transformer (CAT) adaptively samples “interesting” image locations at each stage by iteratively updating the sampling coordinates, then fuses the sampled tokens with positional embeddings before applying transformer attention (You et al., 2022). BEFUnet’s LCAF module restricts cross-attention to spatially co-located patches, substantially reducing the computational cost compared to global cross-attention (Manzari et al., 13 Feb 2024).
- Explicit Asymmetric and Feature-Prior Modules: TAFE introduces Asymmetric Feature Enhancement (AFE), with separate branches for anatomical and instrument features, cascaded strip/convolution operations, and gated fusion (Yuan et al., 23 Oct 2024).
- Efficient Tokenization and Patch Embedding: ConvFormer avoids learned positional embeddings by adopting convolution + pooling for patch representation, enabling CSA blocks to maintain local spatial bias while allowing adaptively global attention (Lin et al., 2023). FCT (Tragakis et al., 2022) uses depthwise convolutions for patch embedding and K/Q/V projections, obviating the need for position encodings.
- Feed-Forward Network Redesign: MISSFormer’s Enhanced Transformer Block uses recursive skip-connections and Mix-FFN with convolutional augmentations for better locality (Huang et al., 2021). TransUKAN and UKAST replace MLPs and QKV projections with rational KAN schemes (EfficientKAN and GR-KAN), achieving parameter reductions with robust nonlinear modeling (Wu et al., 23 Sep 2024, Sapkota et al., 6 Nov 2025). A convolutional feed-forward sketch in this spirit appears after this list.
- Dynamic Deformable Operations: TEC-Net’s CNN branch employs Dynamic Deformable Convolutions (DDConv) for adaptive feature extraction, while the transformer branch builds cross-dimensional attention via SW-ACAM modules (Sun et al., 2023).
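As an illustration of the feed-forward redesign idea, the sketch below shows a Mix-FFN-style block in the spirit of MISSFormer's convolutionally augmented FFN (not the published layer): a depthwise 3×3 convolution between the two linear projections gives the FFN a local spatial bias, which also helps the model dispense with explicit positional encodings. Shapes, names, and hyperparameters are assumptions.

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """Feed-forward block with a depthwise 3x3 convolution between the two
    linear projections, giving the token mixer a local spatial bias."""
    def __init__(self, dim, hidden_dim, h, w):
        super().__init__()
        self.h, self.w = h, w
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depthwise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):                 # x: (B, N, dim) with N == h * w
        b, n, _ = x.shape
        x = self.fc1(x)                   # expand channels
        x = x.transpose(1, 2).reshape(b, -1, self.h, self.w)
        x = self.dwconv(x)                # local mixing on the 2D grid
        x = x.flatten(2).transpose(1, 2)  # back to a token sequence
        return self.fc2(self.act(x))
```

In a hybrid transformer block, a module like this would replace the vanilla MLP after self-attention, typically wrapped in a residual connection with layer normalization.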
3. Multi-Scale and Contextual Fusion Paradigms
Fusing local and global information is central to transformer-enhanced segmentation models:
- Multi-scale feature fusion is typically implemented via concatenation or gating following decoder upsampling, as in CATformer, ParaTransCNN, and SeUNet-Trans (Pham et al., 2023).
- Context bridges: MISSFormer’s Enhanced Transformer Context Bridge (ETCB) concatenates and processes multi-resolution embeddings, modeling dependencies across both scale and location (Huang et al., 2021).
- Channel attention modules: ParaTransCNN merges CNN and transformer features via squeeze-and-excitation attention at each scale (Sun et al., 27 Jan 2024); see the fusion sketch after this list.
- Dot-product and attention-guided fusion: PCT-Fusion (Tiwari, 10 Jan 2024) multiplies projected transformer and CNN features at each scale, applies channel and spatial attention, then aggregates outputs with residual connections.
- Double-level class-token fusion: BEFUnet’s DLF fuses shallow and deep features by cross-attention over class tokens derived via global average pooling (Manzari et al., 13 Feb 2024).
- Adversarial discriminator feedback: CASTformer employs a transformer-based discriminator that judges the realism of pixel-wise masked images, incentivizing the generator to produce masks aligned with anatomical and semantic texture (You et al., 2022).
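A minimal sketch of the channel-attention fusion pattern referenced above: a squeeze-and-excitation gate over concatenated CNN and transformer features, in the spirit of ParaTransCNN but not its published code. The reduction ratio and module names are assumptions.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Fuse same-resolution CNN and transformer feature maps with
    squeeze-and-excitation channel attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        fused = 2 * channels
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # squeeze: global average
            nn.Conv2d(fused, fused // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, 1),
            nn.Sigmoid())                            # excitation: channel weights
        self.proj = nn.Conv2d(fused, channels, 1)    # project to decoder width

    def forward(self, cnn_feat, trans_feat):
        x = torch.cat([cnn_feat, trans_feat], dim=1)
        x = x * self.gate(x)                         # reweight channels
        return self.proj(x)
```

Applied at every decoder scale, a gate of this kind lets the network learn, per channel, how much local CNN detail versus global transformer context to pass on.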
4. Training Protocols, Loss Functions, and Data Efficiency
Most architectures employ U-shaped encoder–decoder networks with skip connections and deep supervision:
- Losses: Combining Dice loss and pixel-wise cross-entropy is standard (e.g., FCT, ParaTransCNN, TEC-Net). CASTformer augments these with WGAN-GP adversarial losses (You et al., 2022). A combined-loss sketch follows this list.
- Parameter and computational efficiency: Models such as TransUKAN, UKAST, ConvFormer, and TEC-Net show that efficient transformer enhancement can yield large accuracy gains with reduced or only modest increases in parameter/FLOP count (see Tables in (Wu et al., 23 Sep 2024, Sapkota et al., 6 Nov 2025, Lin et al., 2023)).
- Transfer learning: Pre-trained backbones (e.g., ImageNet initializations for ResNet, ViT) are critical for optimal performance in data-scarce domains (CASTformer, BEFUnet, SeUNet-Trans). Transfer learning can yield up to 7% Dice improvement (You et al., 2022).
- Few-shot and episodic training: TRFS (Sun et al., 2021) trains on support-query pairs, using transformer blocks for global enhancement and CNN modules for local refinement, achieving higher mIoU in few-shot segmentation settings.
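A minimal sketch of the standard combined objective (soft Dice averaged over classes plus pixel-wise cross-entropy); the loss weights and smoothing constant are illustrative assumptions, not values from the cited papers.

```python
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Soft Dice averaged over classes, combined with cross-entropy."""
    def __init__(self, weight_dice=0.5, weight_ce=0.5, smooth=1e-5):
        super().__init__()
        self.wd, self.wc, self.smooth = weight_dice, weight_ce, smooth

    def forward(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) integer class labels
        ce = F.cross_entropy(logits, target)
        probs = logits.softmax(dim=1)
        onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)
        intersection = (probs * onehot).sum(dims)
        union = probs.sum(dims) + onehot.sum(dims)
        dice = (2 * intersection + self.smooth) / (union + self.smooth)
        return self.wc * ce + self.wd * (1 - dice.mean())
```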
5. Quantitative Performance and Model Comparison
Transformer-enhanced segmentation models consistently surpass prior CNN-only and transformer-only methods across diverse medical segmentation tasks.
| Model | Dice % (Synapse) | Dice % (ISIC) | Params (M) | Comments |
|---|---|---|---|---|
| TransUNet | 77.48 | – | 105 | CNN encoder + ViT bottleneck |
| CASTformer | 82.55 | – | – | CAT + multi-scale + adversarial discriminator (You et al., 2022) |
| ParaTransCNN | 83.86 | – | 42.7 | Parallel CNN + transformer, channel attention (Sun et al., 27 Jan 2024) |
| BEFUnet | 80.47 | 86.8 | – | Edge & body branches, local cross-attention (Manzari et al., 13 Feb 2024) |
| ConvFormer | – | 88.9 | – | Hybrid residual stem + Enhanced DeTrans (Gu et al., 2022) |
| FCT | 83.53 | 94.0 | 31.7 | Fully convolutional transformer (Tragakis et al., 2022) |
| MISSFormer | 81.96 | – | 25 | Enhanced transformer blocks + bridge (Huang et al., 2021) |
| TransUKAN | 87.75 (Kvasir) | 91.17 | 20.85 | EfficientKAN, low parameter count (Wu et al., 23 Sep 2024) |
| UKAST | 81.7 (Kvasir) | 79.9 | 7.2 | SwinT + GR-KAN, data-efficient (Sapkota et al., 6 Nov 2025) |
Segmentation quality improvements are most pronounced on small, low-contrast, or structurally complex targets (pancreas, gallbladder, instrument threads) and in data-scarce scenarios (UKAST ablation, CASTformer transfer). Boundary delineation, robustness to annotation fuzziness, and fine-structure recovery are recurring benefits.
6. Design Insights, Ablations, and Open Issues
Extensive ablation studies have isolated the critical contributions of transformer-enhanced 2D segmentation mechanisms:
- Class-aware / local attention modules account for a 2–3 pp Dice gain (CASTformer) (You et al., 2022).
- EfficientKAN or rational FFN blocks cut the parameter count by 5× with no loss in accuracy (TransUKAN, UKAST) (Wu et al., 23 Sep 2024, Sapkota et al., 6 Nov 2025).
- Enhanced context bridges and multi-scale attention yield 2–4 pp gains via improved feature fusion (MISSFormer, BEFUnet, PCT-Fusion).
- Hybrid designs outperform purely CNN or transformer networks on virtually every benchmark; explicit scale merges (patch merging, context bridges, channel attention) are often more beneficial than self-attention alone (Roy et al., 2023).
- Attention collapse is mitigated by CNN-style attention and convolutional feedforward blocks (ConvFormer) (Lin et al., 2023).
- Parameter–FLOPs trade-offs are addressed by spatial-reduction attention (SeUNet-Trans, MISSFormer), local windowing (SwinT, TAFE, TEC-Net), and depthwise convolutions (FCT, ConvFormer); a minimal spatial-reduction attention sketch follows this list.
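The following is a minimal sketch of spatial-reduction attention of the kind referenced above (PVT/SegFormer-style), assuming PyTorch: keys and values are computed on a strided-convolution downsampling of the token grid, so the attention matrix shrinks by the square of the reduction ratio while queries keep full resolution. Names and the reduction ratio are assumptions.

```python
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Multi-head self-attention whose keys/values come from a spatially
    downsampled feature map, reducing the N x N attention cost."""
    def __init__(self, dim, heads=8, sr_ratio=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):              # x: (B, N, dim), N == h * w
        b, _, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(grid).flatten(2).transpose(1, 2)   # (B, N / r^2, dim)
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)   # queries keep full res
        return out
```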
Challenges include computational overhead for convolutional attention, generalization outside medical image modalities (TAFE, BEFUnet), and inference latency for high-capacity backbones.
7. Outlook and Generalization
Transformer enhancement strategies for 2D segmentation continue to evolve, with current research focusing on:
- Generalization to non-medical domains: Asymmetric strip convolutions (TAFE) and local cross-attention (BEFUnet) may carry over to aerial, industrial, or remote sensing contexts.
- Scalable attention and fusion: Rational KANs (UKAST, TransUKAN) and convolutional feedforward approaches (ConvFormer) suggest that expressivity and efficiency can be decoupled from network depth.
- Adaptive fusion and dynamic context modeling: DDConv, multi-branch cross-attention, double-level fusion, and adversarial discriminators illustrate the utility of domain-adaptive, class-aware segmentation supervision.
Transformer-enhanced 2D segmentation models represent the convergence of global reasoning and spatially grounded image analysis, establishing new empirical and methodological standards for segmentation in medical imaging and beyond.