Dual-Encoder TransUNet Architecture

Updated 29 December 2025
  • The paper introduces a dual-encoder design that fuses parallel convolutional encoders before a transformer bottleneck to boost feature learning in segmentation tasks.
  • The architecture captures modality-specific and multi-scale features, achieving improved segmentation accuracy as demonstrated by significant Dice score gains.
  • It integrates late channel-wise fusion with transformer-based global reasoning, mitigating information loss common in traditional single-encoder models.

The Dual-Encoder TransUNet architecture is a hybrid deep neural network model that integrates dual convolutional encoders with a transformer-based bottleneck to address multimodal or multi-scale feature learning, particularly in medical image segmentation. This architectural pattern is primarily exemplified by multimodal variants for diffusion MRI stroke lesion segmentation and by dual-scale Swin-Transformer extensions for semantic segmentation in medical contexts. The defining feature is the use of two parallel encoders—each extracting distinct feature representations by modality or semantic scale—whose outputs are fused before transformer-based global reasoning and subsequent decoding, yielding performance improvements over conventional single-encoder baselines (Usman et al., 23 Dec 2025, Lin et al., 2021, Tran et al., 24 Jul 2024).

1. Architectural Components and Data Flow

The core Dual-Encoder TransUNet architecture consists of:

  • Input stacking with spatial context: For multimodal processing, independent modalities (e.g., DWI and ADC in MRI) are stacked along the channel axis, typically with multiple adjacent slices per modality, forming a six-channel input of dimensions 128×128×6 (Usman et al., 23 Dec 2025).
  • Dual convolutional encoders: Two distinct encoders process the partitioned input—one per modality or semantic scale (e.g., patch-scale in Swin-UNet)—with identical architecture but independent weights. Each encoder has four downsampling stages with paired 3×3 convolutional layers, followed by BatchNorm and ReLU, succeeded by 2×2 max pooling. Feature channel dimensions double at each stage: 64, 128, 256, 512, yielding final feature maps of 512×8×8 per encoder for a 128×128 input (Usman et al., 23 Dec 2025). In the multi-scale variant, branch differences are determined by patch size and Swin-Transformer block channel counts (Lin et al., 2021).
  • Bottleneck fusion: Outputs from the encoders are concatenated along the channel dimension, producing a fused feature map (1024×8×8 in the MRI multimodal case), which is then flattened into a patch sequence for the transformer module (Usman et al., 23 Dec 2025).
  • Transformer module: The transformer bottleneck comprises a stack of L = 12 layers with Multi-Head Self-Attention (MHSA), Feed-Forward Networks (FFNs) with GELU activation, and residual connections followed by LayerNorm, using positional embeddings as in ViT/TransUNet (Usman et al., 23 Dec 2025).
  • Decoder: The transformer output is reshaped to spatial dimensions and upsampled via four decoding stages, mirroring the encoder. Each decoder stage uses 2× upsampling, two 3×3 conv+BN+ReLU blocks, and halves the feature count per stage (1024→64). For multimodal dual-encoder architectures, the decoder operates solely on the fused feature stream; skip connections from per-modality encoders are not utilized beyond the bottleneck (Usman et al., 23 Dec 2025). A minimal PyTorch sketch of this end-to-end data flow follows the list.
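
The following is a minimal PyTorch sketch of this data flow under the dimensions reported above (two 3-channel modality stacks, a 128×128 input, a 12-layer transformer bottleneck, and a four-stage decoder). Class and attribute names (Encoder, DualEncoderTransUNet, enc_dwi, enc_adc), the per-location tokenization of the fused map (the paper describes 2×2 patch embedding), and the 1×1 output head are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the dual-encoder TransUNet data flow; shapes follow the reported
# configuration, but layer names and the simplified decoder are assumptions.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # Two 3x3 conv + BatchNorm + ReLU layers, as in each encoder/decoder stage.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


class Encoder(nn.Module):
    # Four downsampling stages (64 -> 128 -> 256 -> 512 channels), each ending in 2x2 max pooling.
    def __init__(self, in_ch=3):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for c in (64, 128, 256, 512):
            self.stages.append(conv_block(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        for stage in self.stages:
            x = self.pool(stage(x))
        return x  # (B, 512, 8, 8) for a 128x128 input


class DualEncoderTransUNet(nn.Module):
    def __init__(self, d_model=1024, n_layers=12, n_heads=8):
        super().__init__()
        self.enc_dwi = Encoder(3)  # one encoder per modality stack (names are illustrative)
        self.enc_adc = Encoder(3)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pos_embed = nn.Parameter(torch.zeros(1, 8 * 8, d_model))
        # Decoder: four 2x upsampling stages, halving channels 1024 -> 64.
        blocks, chs = [], [1024, 512, 256, 128, 64]
        for i in range(4):
            blocks += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       conv_block(chs[i], chs[i + 1])]
        self.decoder = nn.Sequential(*blocks)
        self.head = nn.Conv2d(64, 1, kernel_size=1)  # binary lesion-mask logits

    def forward(self, x):
        # x: (B, 6, 128, 128) -- the channel-stacked modalities are split between encoders.
        dwi, adc = x[:, :3], x[:, 3:]
        fused = torch.cat([self.enc_dwi(dwi), self.enc_adc(adc)], dim=1)  # (B, 1024, 8, 8)
        tokens = fused.flatten(2).transpose(1, 2) + self.pos_embed        # (B, 64, 1024)
        tokens = self.transformer(tokens)
        feat = tokens.transpose(1, 2).reshape(-1, 1024, 8, 8)
        return self.head(self.decoder(feat))                              # (B, 1, 128, 128)


if __name__ == "__main__":
    model = DualEncoderTransUNet()
    print(model(torch.randn(2, 6, 128, 128)).shape)  # torch.Size([2, 1, 128, 128])
```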

2. Strategies for Multimodal and Multi-Scale Feature Extraction

Dual-encoder architectures address inherent limitations of single-branch models by:

  • Learning modality-specific representations: For medical images from different modalities (e.g., diffusion MRI with DWI and ADC), separate encoders prevent early information loss and capture distinct texture/contrast cues before fusion.
  • Leveraging multi-scale hierarchical features: In DS-TransUNet, parallel Swin-Transformer encoders process identical inputs at different patch sizes (fine: s=4, coarse: s=8), learning both fine-grained pixel-level details and coarse global context. The outputs are fused at multiple scales using a Transformer Interactive Fusion (TIF) block that enables self-attention across semantic granularities (Lin et al., 2021); a schematic cross-scale fusion sketch follows this list.
  • Adapting to complex anatomical variability: The separate processing and late fusion are particularly useful in domains with high lesion heterogeneity, such as ischemic stroke (Usman et al., 23 Dec 2025).
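
As a schematic illustration of cross-scale fusion in the DS-TransUNet spirit, the sketch below lets fine-branch tokens attend to coarse-branch tokens via cross-attention; the published TIF block differs in its exact formulation, so the module name, token counts, and embedding width here are assumptions.

```python
# Schematic cross-scale fusion: fine-scale tokens query coarse-scale tokens to pick up
# global context. Illustrative only -- not the exact published TIF design.
import torch
import torch.nn as nn


class CrossScaleFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine_tokens, coarse_tokens):
        # Queries come from the fine branch; keys/values come from the coarse branch.
        ctx, _ = self.attn(fine_tokens, coarse_tokens, coarse_tokens)
        return self.norm(fine_tokens + ctx)


if __name__ == "__main__":
    fine = torch.randn(2, 1024, 256)   # tokens from the fine branch (patch size s=4)
    coarse = torch.randn(2, 256, 256)  # tokens from the coarse branch (patch size s=8)
    print(CrossScaleFusion()(fine, coarse).shape)  # torch.Size([2, 1024, 256])
```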

3. Bottleneck Fusion and Transformer Design

The fusion and transformer bottleneck employ:

  • Late fusion by channel concatenation: Before entering the transformer, encoder outputs are concatenated, not added or averaged, maximizing retention of modality/scale-specific features.
  • Patch embedding and positional encoding: The fused feature map is divided into non-overlapping 2×2 patches (following standard ViT/TransUNet conventions); each patch is linearly embedded and supplemented with learnable positional embeddings before transformer processing.
  • MHSA and FFN blocks: Each transformer layer applies MHSA according to:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

followed by a two-layer FFN with expansion factor r = 4 and GELU nonlinearity (a minimal rendering of these blocks in code appears after this list).

  • No per-encoder skip connections post-fusion: Unlike classic U-Net/TransUNet, skip connections in the dual-encoder setting do not cross from each encoder’s features to decoder stages; only the fused stream is decoded (Usman et al., 23 Dec 2025).
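
To make the equations above concrete, the following is a single-head rendering of scaled dot-product attention and the r = 4 GELU feed-forward block; it omits multi-head projections, residual connections, and LayerNorm for brevity.

```python
# Direct rendering of the attention and FFN equations above (single head, illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v


class FeedForward(nn.Module):
    # Two-layer FFN with expansion factor r = 4 and GELU nonlinearity.
    def __init__(self, dim, r=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, r * dim), nn.GELU(), nn.Linear(r * dim, dim))

    def forward(self, x):
        return self.net(x)


if __name__ == "__main__":
    x = torch.randn(2, 64, 1024)               # 64 fused-bottleneck tokens of width 1024
    y = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
    print(FeedForward(1024)(y).shape)          # torch.Size([2, 64, 1024])
```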

4. Training Protocols and Implementation

  • Loss and optimization: Binary Cross-Entropy with logits is employed for lesion segmentation:

$$\mathcal{L}_\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1-y_i)\log(1-p_i)\right]$$

Batch size is 16, with 100 training epochs; the first five epochs freeze encoder weights (Usman et al., 23 Dec 2025). Optimizer type, learning rates, and regularization specifics are generally not reported.

  • Augmentation: The protocol includes random flips, rotations up to 270°, and intensity-preserving resizing.
  • Framework and tracking: PyTorch is the implementation backbone, with FastAI for U-Net variants and Weights & Biases for experiment tracking (Usman et al., 23 Dec 2025). GPU specifics and training time are not provided.
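
A hedged sketch of this training protocol is shown below. The loss, batch shape, epoch count, and encoder-freezing schedule follow the reported setup; the optimizer (Adam), learning rate, and the enc_dwi/enc_adc attribute names (from the Section 1 sketch) are assumptions, and augmentation is omitted.

```python
# Training-loop sketch: BCE-with-logits loss, batch size 16, 100 epochs, encoders frozen
# for the first 5 epochs. Optimizer choice and learning rate are NOT reported in the
# paper and are assumed here; encoder attribute names follow the earlier model sketch.
import torch
import torch.nn as nn


def train(model, train_loader, epochs=100, freeze_epochs=5, lr=1e-4, device="cuda"):
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(epochs):
        # Freeze both encoders for the first few epochs, then unfreeze them.
        freeze = epoch < freeze_epochs
        for enc in (model.enc_dwi, model.enc_adc):
            for p in enc.parameters():
                p.requires_grad = not freeze
        model.train()
        for images, masks in train_loader:  # images: (16, 6, 128, 128); masks: (16, 1, 128, 128)
            images, masks = images.to(device), masks.to(device)
            loss = criterion(model(images), masks.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```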

5. Empirical Results and Benchmarking

Empirical evaluation demonstrates robust performance gains:

  • Multimodal stroke segmentation: On ISLES 2022, the dual-encoder TransUNet with three-slice input achieves a Dice Similarity Coefficient (DSC) of 85.4%. This outperforms single-encoder TransUNet (81.3%), Swin-UNet (76.4%), and various CNN-based U-Nets (74.2%, 71.6%). Expanding input from single to three slices per modality increases DSC from 83.1% to 85.4%, highlighting the importance of 3D context (Usman et al., 23 Dec 2025).
  • Multi-scale segmentation: DS-TransUNet outperforms single-branch Transformer U-Nets and CNN baselines by integrating fine and coarse feature representations via dual Swin-Transformer encoders and TIF blocks (Lin et al., 2021).
  • Semantic segmentation with neural fusion: Alternative architectures (e.g., Trans2Unet) establish dual branches with classic U-Net and a TransUNet (CNN→ViT) path, merging high-resolution decoder outputs for improved accuracy, as confirmed by gains in DSC on nuclei segmentation (Tran et al., 24 Jul 2024).
  • Statistical significance: Formal significance testing is not reported in surveyed work.

6. Comparison of Representative Variants

The dual-encoder paradigm manifests in several notable variants:

| Model | Encoders Used | Fusion Location | Transformer Type |
|---|---|---|---|
| Dual-Encoder TransUNet (Usman et al., 23 Dec 2025) | Convolutional (per modality) | Bottleneck | Standard ViT/TransUNet |
| DS-TransUNet (Lin et al., 2021) | Swin-Transformer (fine/coarse) | Multi-scale (TIF) | Swin Transformer |
| Trans2Unet (Tran et al., 24 Jul 2024) | U-Net, TransUNet (CNN→ViT) | Decoder output | ViT (in TransUNet) |

  • Dual-Encoder TransUNet targets multimodal fusion at the bottleneck and avoids per-modality skip connections post-fusion, while DS-TransUNet applies cross-scale attention fusion at multiple levels.
  • Trans2Unet implements "neural fusion" by concatenating decoder outputs of a U-Net and a TransUNet branch, with a lightweight convolutional head post-fusion (Tran et al., 24 Jul 2024).
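
A minimal sketch of this decoder-output fusion pattern, with hypothetical channel counts and module names, is:

```python
# Illustrative decoder-output ("neural") fusion in the Trans2Unet style: full-resolution
# outputs of a U-Net branch and a TransUNet branch are concatenated and passed through a
# lightweight convolutional head. Channel counts and names are hypothetical.
import torch
import torch.nn as nn


class DecoderOutputFusion(nn.Module):
    def __init__(self, ch_unet=64, ch_transunet=64, n_classes=1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(ch_unet + ch_transunet, 64, 3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, n_classes, kernel_size=1))

    def forward(self, unet_out, transunet_out):
        # Both inputs are full-resolution decoder feature maps from the two branches.
        return self.head(torch.cat([unet_out, transunet_out], dim=1))


if __name__ == "__main__":
    a = torch.randn(2, 64, 256, 256)
    b = torch.randn(2, 64, 256, 256)
    print(DecoderOutputFusion()(a, b).shape)  # torch.Size([2, 1, 256, 256])
```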

A plausible implication is that dual-encoder TransUNet architectures are best suited for scenarios where either multiple imaging modalities offer complementary contrast mechanisms, or different spatial scales contribute distinct semantic information.

7. Limitations, Variations, and Future Directions

  • Skip connections: The lack of per-modality skip connections beyond bottleneck fusion in multimodal dual-encoder TransUNet may limit spatial detail recovery relative to architectures that support richer skip pathways.
  • Resource requirements: Doubling encoders increases computation and memory usage. Parameter counts rise accordingly, with, for example, Trans2Unet reporting ~5M more parameters than a single-branch TransUNet (Tran et al., 24 Jul 2024).
  • Potential extensions: Variants may introduce learnable attention-based fusion rather than simple concatenation, or apply deeper transformer fusion stages for richer modality interactions. Multi-branch configurations (beyond dual) and the integration of temporal context in volumetric data are areas of ongoing research.
  • Generalizability: While results on ISLES 2022 and DSB2018 benchmarks are strong, generalizability to less curated, multi-institutional datasets and to other imaging modalities is not fully established in published results.

The Dual-Encoder TransUNet approach thus constitutes a substantial advancement for segmentation tasks that demand explicit modeling of multimodal or multi-scale information, particularly in domains such as medical imaging, where feature diversity and context integration are key determinants of segmentation accuracy (Usman et al., 23 Dec 2025, Lin et al., 2021, Tran et al., 24 Jul 2024).
