Swin Transformer UNet3D Architecture

Updated 20 March 2026
  • Swin Transformer UNet3D is a deep learning architecture that fuses 3D shifted window transformer blocks with a U-shaped encoder-decoder to capture both local and global volumetric features.
  • It utilizes a hierarchical design featuring patch embedding, window-based and shifted window attention, and skip connections to efficiently restore high-resolution outputs.
  • Instantiations like Swin UNETR, Swin SMT, and SwinUNet3D achieve state-of-the-art performance in tasks such as MRI brain tumor segmentation, CT anatomical segmentation, and spatiotemporal traffic forecasting.

The Swin Transformer UNet3D family represents a class of deep learning architectures that integrate 3D shifted window-based transformer modules—originally developed as Swin Transformers—into the widely adopted U-shaped encoder–decoder framework of UNet. Designed for dense prediction tasks in volumetric data, these models efficiently capture both local and global volumetric dependencies, achieving leading performance in medical image segmentation and spatiotemporal forecasting. Prominent instantiations include Swin UNETR for 3D MRI segmentation (Hatamizadeh et al., 2022), Swin SMT for whole-body CT anatomical segmentation (Płotka et al., 2024), and SwinUNet3D for traffic forecasting (Bojesomo et al., 2022).

1. Architectural Foundations

The Swin Transformer UNet3D architecture comprises a hierarchical, multi-stage encoder built from 3D Swin Transformer blocks and a U-shaped decoder. The core pipeline is:

  • Input Partitioning: Volumetric or spatiotemporal tensors are partitioned into non-overlapping 3D patches (e.g., $2\times2\times2$ voxels).
  • Patch Embedding: Each patch is linearly projected to an embedding dimension, yielding a sequence of 3D tokens (e.g., for MRI: $32$-dimensional tokens projected to $C=48$).
  • Hierarchical 3D Swin Transformer Encoder: Multiple stages (typically four) process representations at successively lower spatial resolutions, doubling the embedding dimension at each stage. At each stage, multiple blocks alternate between Window-based Multi-head Self-Attention (W-MSA), Shifted Window MSA (SW-MSA), and feedforward layers.
  • U-shaped Decoder: A symmetric decoder upsamples features using transpose convolutions or patch expanding layers, fusing encoder features from matched resolutions via skip connections. The decoder restores segmentation or forecasting predictions to the original resolution.
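The resolution and channel bookkeeping of the encoder pipeline above can be traced with a few lines of shape arithmetic. This is a minimal sketch, not any of the cited implementations: the function name `encoder_stage_shapes`, the $128^3$ input crop, and the assumption that every stage after the first halves each axis while doubling the channels are illustrative choices.

```python
def encoder_stage_shapes(input_size, patch_size=2, embed_dim=48, num_stages=4):
    """Return (token_grid, channels) for each encoder stage.

    Assumption: stage 0 runs at the patch-embedding resolution and every
    later stage halves each spatial axis while doubling the channel count.
    """
    shapes = []
    for stage in range(num_stages):
        factor = patch_size * 2 ** stage
        if any(s % factor for s in input_size):
            raise ValueError(f"input size {input_size} is not divisible by {factor}")
        grid = tuple(s // factor for s in input_size)
        shapes.append((grid, embed_dim * 2 ** stage))
    return shapes


# Hypothetical 128^3 crop with 2x2x2 patches and C = 48.
for stage, (grid, channels) in enumerate(encoder_stage_shapes((128, 128, 128))):
    print(f"stage {stage}: token grid {grid}, {channels} channels")
```

The decoder walks the same list in reverse, which is why each decoder stage can be paired with an encoder feature map of matching resolution.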

The following table summarizes architectural variants:

| Name | Encoder Blocks | Skip/Decoder Type | FFN Variant | Application Domain |
|:-----------|:--------|:-------------------------|:-------------------|:--------------------------------|
| Swin UNETR | Swin 3D | FCNN/Transpose Conv | 2-layer MLP | 3D MRI semantic segmentation |
| Swin SMT | Swin 3D | FCNN/Transpose Conv | Soft MoE (stage 2+) | Whole-body CT segmentation |
| SwinUNet3D | Swin 3D | Pure Swin (Patch Expand) | 2-layer MLP | Spatiotemporal traffic forecast |

2. 3D Shifted Window Transformer Blocks

Central to the architecture, each Swin Transformer block applies attention within non-overlapping local 3D windows (e.g., $7\times7\times7$ for MRI, or $8\times8\times1$ for traffic data). Consecutive blocks alternate between:

  • Window-based MSA (W-MSA): Computes self-attention in parallel within all local windows; for a fixed window size, the cost grows linearly with the number of tokens rather than quadratically, as it would under global attention.
  • Shifted Window MSA (SW-MSA): Before attention, shifts the window grid by $(\lfloor M/2\rfloor, \lfloor M/2\rfloor, \lfloor M/2\rfloor)$ voxels, where $M$ is the window size, so that neighboring regions can interact across window boundaries, thereby increasing the effective receptive field.
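To see why the shift lets information cross window borders, it helps to compute which window a voxel falls into before and after the shift. The sketch below is illustrative only: `window_index` is a hypothetical helper, the $8^3$ token grid and $M=4$ are toy values, and the shift is modeled as a cyclic roll of the volume.

```python
def window_index(coord, dims, window, shift=(0, 0, 0)):
    """Return the 3D index of the attention window that contains a voxel.

    A non-zero shift emulates SW-MSA: the volume is cyclically rolled by
    -shift along each axis before the regular window grid is applied.
    """
    return tuple(((c - s) % d) // w for c, d, w, s in zip(coord, dims, window, shift))


dims = (8, 8, 8)                        # toy token grid (assumption)
window = (4, 4, 4)                      # window size M = 4 on every axis
shift = tuple(w // 2 for w in window)   # (floor(M/2), floor(M/2), floor(M/2))

a, b = (3, 0, 0), (4, 0, 0)             # neighbours on opposite sides of a border
print(window_index(a, dims, window), window_index(b, dims, window))
print(window_index(a, dims, window, shift), window_index(b, dims, window, shift))
```

Under the regular grid the two neighbours sit in different windows and cannot attend to each other; after the shift they share a window. In the original Swin design, tokens brought together only by the cyclic wrap-around are masked out of each other's attention; that mask is omitted here.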

Within each window, standard query, key, and value projections and scaled dot-product attention are applied:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V,$$

$$\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\Bigl( \frac{Q K^T}{\sqrt{d_h}} + B \Bigr) V,$$

where $d_h$ is the per-head channel dimension and $B$ denotes the (optional) learnable relative-position bias.
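The attention formula can be checked on a toy window with a dependency-free sketch. It handles a single head and a single window, takes the projected $Q$, $K$, $V$ as plain lists, and treats $B$ as an optional $n \times n$ matrix; the function names are illustrative, not taken from any of the cited codebases.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(q, k, v, bias=None):
    """Softmax(Q K^T / sqrt(d_h) + B) V for the n tokens of one window.

    q, k: n vectors of length d_h; v: n value vectors; bias: optional n x n matrix B.
    """
    n, d_h = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = [
            sum(q[i][c] * k[j][c] for c in range(d_h)) / math.sqrt(d_h)
            + (bias[i][j] if bias is not None else 0.0)
            for j in range(n)
        ]
        weights = softmax(scores)
        out.append([sum(weights[j] * v[j][c] for j in range(n))
                    for c in range(len(v[0]))])
    return out


# With all-zero queries every score is equal, so each output is the mean of V.
q = [[0.0, 0.0], [0.0, 0.0]]
k = [[1.0, 2.0], [3.0, 4.0]]
v = [[1.0, 0.0], [3.0, 2.0]]
print(window_attention(q, k, v))
```

A strongly negative entry in `bias` removes the corresponding key from a query's softmax, which is also how masking is usually expressed in this formulation.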

The feed-forward network (FFN) in standard blocks is a two-layer MLP. In Swin SMT, this FFN is replaced in stages 2–4 by a Soft Mixture-of-Experts (Soft MoE) layer, enabling higher capacity without excessive computational burden (Płotka et al., 2024). SwinUNet3D uses an MLP without channel expansion (mlp-ratio=1) (Bojesomo et al., 2022).

3. Hierarchical Encoder–Decoder Design

The U-shaped UNet backbone is realized via hierarchical downsampling and upsampling:

  • Encoder Downsampling: After each pair (or quadruple) of Swin Transformer blocks at a given resolution, a patch merging operation groups neighboring tokens (spatially or spatiotemporally), concatenates features, and linearly projects to a higher channel dimension while halving (or further reducing) spatial resolution.
  • Decoder Upsampling: Decoder stages invert the merging process using transpose convolutions (Swin UNETR, Swin SMT) or patch expanding (SwinUNet3D). Each decoder stage receives a skip connection from its encoder counterpart—features are either concatenated or combined via element-wise addition.
  • Final Output: The representation is projected to the target output via a $1\times1\times1$ convolution (segmentation mask/logits) or fully-connected prediction head (traffic forecasting).
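The patch merging step described above is mostly shape bookkeeping, which the following sketch makes explicit. The helper `patch_merging_shape` is hypothetical, and the projection to exactly twice the input channels is an assumption based on the channel doubling described for the encoder stages.

```python
def patch_merging_shape(grid, channels, merge=(2, 2, 2)):
    """Shape bookkeeping for one patch-merging step.

    Returns (new_grid, concatenated_channels, projected_channels).
    Assumption: the linear projection maps to 2 * channels.
    """
    if any(g % m for g, m in zip(grid, merge)):
        raise ValueError("token grid must be divisible by the merge factor")
    new_grid = tuple(g // m for g, m in zip(grid, merge))
    group = merge[0] * merge[1] * merge[2]   # neighbouring tokens fused into one
    return new_grid, group * channels, 2 * channels


# Spatial merge on all three axes: 8 neighbours are concatenated (8C), then projected.
print(patch_merging_shape((64, 64, 64), 48))
# Merge on the last two axes only, leaving the first (e.g. temporal) axis untouched.
print(patch_merging_shape((12, 64, 64), 48, merge=(1, 2, 2)))
```

The decoder inverts this bookkeeping: a transpose convolution or patch-expanding layer restores the finer grid so that the skip feature map from the matching encoder stage can be concatenated or added.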

In SwinUNet3D, all blocks—including the decoder—are pure Swin Transformer layers, with no convolutional operations (Bojesomo et al., 2022). In Swin UNETR and Swin SMT, the decoder is convolutional.

4. Application Domains and Task Formulations

Swin Transformer UNet3D architectures have been adopted for various dense prediction tasks:

  • 3D Brain Tumor Segmentation: Swin UNETR processes a four-channel MRI input (T1, T1c, T2, FLAIR), reformulating the segmentation task as a sequence prediction problem. The network outputs probability maps for enhancing tumor (ET), whole tumor (WT), and tumor core (TC) (Hatamizadeh et al., 2022).
  • Whole-body CT Anatomical Segmentation: Swin SMT applies an enhanced Swin UNETR backbone, incorporating Soft MoE layers, to TotalSegmentator-V2, predicting $117$ anatomical structures in resampled CT volumes (Płotka et al., 2024).
  • Spatiotemporal Traffic Forecasting: SwinUNet3D takes as input $T\times C\times H\times W$ tensors (12 frames × 8 channels), uses a feature mixing fully connected layer to blend time and channel axes, and outputs the next predicted sequence (6 frames) (Bojesomo et al., 2022).

Training, inference, and augmentation strategies are closely tailored to task and data domain, with approaches such as sliding window inference and extensive data normalization applied as needed.

5. Performance Results and Empirical Comparisons

Swin Transformer UNet3D models have established state-of-the-art results in their respective domains.

  • BraTS 2021 Validation (Swin UNETR): On 219 MRI cases, Dice scores achieved were (ET, WT, TC) = (0.858, 0.926, 0.885), with corresponding Hausdorff 95 distances (6.02, 5.83, 3.77 mm). On the hidden test set, Dice = (0.853, 0.927, 0.876) (Hatamizadeh et al., 2022).
  • BraTS 2021 Cross-Validation:

| Method         |  ET   |  WT   |  TC   |  Avg  |
|:---------------|:-----:|:-----:|:-----:|:-----:|
| Swin UNETR     | 0.891 | 0.933 | 0.917 | 0.913 |
| nnU-Net        | 0.883 | 0.927 | 0.913 | 0.908 |
| SegResNet      | 0.883 | 0.927 | 0.913 | 0.907 |
| TransBTS (ViT) | 0.868 | 0.911 | 0.898 | 0.891 |

Swin UNETR outperformed the next-best method by up to 0.8 Dice points (ET: 0.891 vs. 0.883).

  • TotalSegmentator-V2 (Swin SMT): Average Dice (n=32 experts) was $85.09\%$, compared to Swin UNETR-L ($83.59\%$), nnU-Net ($83.44\%$), and others. Soft MoE parameter scaling increased model size from $62.2$M to $170.8$M parameters and improved Dice; the gains are statistically significant (ANOVA, $p < 0.05$) (Płotka et al., 2024).
  • Traffic4cast2021 (SwinUNet3D): Core task MSE was $49.7208$ (embedding $d=192$, feature mixing enabled), outperforming GCN baseline ($51.7143$) and UNet baseline ($51.2826$) (Bojesomo et al., 2022).
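For readers unfamiliar with the Dice metric used in the segmentation results above, the overlap score between a predicted and a reference binary mask can be sketched as follows. How the cited papers aggregate it (per case, per class, across folds) is not reproduced here, and returning 1.0 when both masks are empty is one common convention rather than something taken from the source.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]      # toy flattened prediction
target = [1, 0, 0, 0]    # toy flattened reference mask
print(round(dice_score(pred, target), 3))
```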

6. Distinct Architectural Innovations

Shifted Window Attention

Alternating W-MSA and SW-MSA blocks enables hierarchical, computationally efficient self-attention across the entire 3D volume, facilitating effective long-range modeling unattainable with shallow CNNs.
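The efficiency argument can be made concrete by counting how many query-key pairs one attention layer has to score. The sketch below is a back-of-the-envelope count, not a FLOP model; the $56^3$ token grid is a hypothetical size chosen so that $7\times7\times7$ windows tile it exactly.

```python
def attention_pairs(grid, window=None):
    """Count query-key pairs scored by one attention layer on a 3D token grid.

    window=None means global attention; otherwise attention is restricted to
    non-overlapping windows that tile the grid exactly.
    """
    n = grid[0] * grid[1] * grid[2]
    if window is None:
        return n * n
    if any(g % w for g, w in zip(grid, window)):
        raise ValueError("window must tile the grid exactly")
    tokens_per_window = window[0] * window[1] * window[2]
    return n * tokens_per_window


grid = (56, 56, 56)                       # hypothetical token grid
global_pairs = attention_pairs(grid)
local_pairs = attention_pairs(grid, (7, 7, 7))
print(global_pairs // local_pairs)        # reduction factor = number of windows
```

The reduction factor equals the number of windows, so it grows with the volume while the per-token cost of windowed attention stays fixed.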

Soft Mixture-of-Experts (MoE) Layers

Swin SMT replaces traditional FFNs (stages 2–4) with Soft MoE: a learned routing mechanism assigns tokens (using gating networks) to expert-specific MLPs, aggregates expert outputs, and blends them per token using learned weights. This design increases capacity and modeling flexibility without a proportional increase in computation (Płotka et al., 2024).
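As a rough illustration of soft routing, the sketch below follows the generic Soft MoE formulation (dispatch weights normalized over tokens, combine weights normalized over slots) with one slot per expert. It is not the Swin SMT code: the slot parameters `phi`, the toy experts, and the one-slot-per-expert choice are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe(tokens, phi, experts):
    """Soft MoE layer with one slot per expert (illustrative sketch).

    tokens: n vectors of length d; phi: d x s slot parameters, s = len(experts);
    experts: callables mapping a length-d vector to a length-d vector.
    """
    n, d, s = len(tokens), len(tokens[0]), len(experts)
    logits = [[sum(tokens[i][c] * phi[c][j] for c in range(d)) for j in range(s)]
              for i in range(n)]
    # Dispatch weights: softmax over tokens, computed separately for each slot.
    dispatch = [softmax([logits[i][j] for i in range(n)]) for j in range(s)]
    slots = [[sum(dispatch[j][i] * tokens[i][c] for i in range(n)) for c in range(d)]
             for j in range(s)]
    slot_out = [experts[j](slots[j]) for j in range(s)]
    # Combine weights: softmax over slots, computed separately for each token.
    combine = [softmax(logits[i]) for i in range(n)]
    return [[sum(combine[i][j] * slot_out[j][c] for j in range(s)) for c in range(d)]
            for i in range(n)]


tokens = [[1.0, 3.0], [3.0, 5.0]]
phi = [[0.0, 0.0], [0.0, 0.0]]            # zero logits give uniform routing
experts = [lambda x: [2.0 * v for v in x], lambda x: list(x)]   # toy "MLPs"
print(soft_moe(tokens, phi, experts))
```

Each expert processes a fixed number of slots regardless of how many experts exist, which is the sense in which capacity can grow faster than per-token compute.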

Feature Mixing for Spatiotemporal Signals

SwinUNet3D introduces a "feature mixing" fully connected layer that mixes the time and channel axes before patch embedding, enabling the model to preserve temporally rich feature relations during spatial partitioning (Bojesomo et al., 2022). Ablation studies confirm this layer improves forecasting accuracy.
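One plausible reading of the feature mixing layer is a single linear map applied at every pixel to the flattened time and channel axes. The sketch below illustrates that reading only; the function `feature_mixing`, the output width of 4, the omitted bias, and the tiny $2\times2$ grid are assumptions, not details from Bojesomo et al.

```python
def feature_mixing(frames, weight):
    """Mix the time and channel axes with one linear map shared by all pixels.

    frames: nested list indexed [t][c][h][w]; weight: (T*C) x out_features.
    Returns a nested list indexed [h][w][out_feature]. Bias is omitted.
    """
    T, C = len(frames), len(frames[0])
    H, W = len(frames[0][0]), len(frames[0][0][0])
    out_features = len(weight[0])
    mixed = []
    for h in range(H):
        row = []
        for w in range(W):
            # Flatten the (time, channel) axes of this pixel into one vector.
            flat = [frames[t][c][h][w] for t in range(T) for c in range(C)]
            row.append([sum(flat[k] * weight[k][o] for k in range(T * C))
                        for o in range(out_features)])
        mixed.append(row)
    return mixed


T, C, H, W = 12, 8, 2, 2                     # 12 frames x 8 channels, toy grid
frames = [[[[1.0] * W for _ in range(H)] for _ in range(C)] for _ in range(T)]
weight = [[1.0] * 4 for _ in range(T * C)]   # hypothetical 96 -> 4 projection
mixed = feature_mixing(frames, weight)
print(len(mixed), len(mixed[0]), len(mixed[0][0]))
```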

7. Implementation and Optimization Practices

Implementations leverage PyTorch and MONAI (Hatamizadeh et al., 2022); typical training procedures include:

  • Augmentation: Random crops, intensity shifts/scaling, flip operations.
  • Loss Functions: Soft Dice loss for segmentation (Swin UNETR/Swin SMT), mean squared error for regression (SwinUNet3D).
  • Optimization: AdamW optimizer (typical initial learning rates $8\times10^{-4}$ to $10^{-4}$), linear warmup followed by cosine annealing.
  • Hardware: High-memory GPUs (V100, A100), with batch size typically 1 per GPU to accommodate volumetric inputs.
  • Inference: Sliding-window processing (window size $128^3$ voxels), model ensembling for robustness.
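The learning-rate schedule named in the optimization bullet (linear warmup followed by cosine annealing) can be written as a small function of the training step. This is a generic sketch: the warmup length, total step count, and minimum learning rate are placeholders, since the source does not specify them.

```python
import math


def learning_rate(step, total_steps, warmup_steps, base_lr=1e-4, min_lr=0.0):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches base_lr exactly.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1000 steps with 100 warmup steps.
for step in (0, 99, 100, 550, 1000):
    print(step, f"{learning_rate(step, 1000, 100):.2e}")
```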

Inference time varies by model and hardware (e.g., Swin UNETR, $\sim$10 s per volume on V100 (Hatamizadeh et al., 2022); Swin SMT, $\sim$60 s per full scan on A100 (Płotka et al., 2024)).
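Inference time depends largely on how many windows a sliding-window pass has to evaluate. The sketch below computes window start offsets along one axis and the resulting window count for a volume; the 50% overlap and the $320\times320\times200$ volume are assumed values for illustration, and `window_starts` is a hypothetical helper rather than a MONAI function.

```python
def window_starts(length, window, overlap=0.5):
    """Start offsets of sliding windows covering one axis of a volume."""
    if window >= length:
        return [0]                       # a single window covers the whole axis
    stride = max(1, int(window * (1.0 - overlap)))
    starts = list(range(0, length - window, stride))
    starts.append(length - window)       # last window sits flush with the far edge
    return starts


volume = (320, 320, 200)                 # hypothetical resampled scan size
per_axis = [window_starts(size, 128) for size in volume]
num_windows = 1
for starts in per_axis:
    num_windows *= len(starts)
print(per_axis[0], per_axis[2], num_windows)
```

Predictions in overlapping regions are then typically averaged or blended before the final label map is produced.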


Swin Transformer UNet3D architectures, as instantiated in Swin UNETR (Hatamizadeh et al., 2022), Swin SMT (Płotka et al., 2024), and SwinUNet3D (Bojesomo et al., 2022), achieve state-of-the-art performance across diverse 3D prediction tasks by unifying scalable hierarchical transformers, efficient window-local self-attention, and task-aligned decoder/skip designs. These architectural paradigms establish a template for high-capacity, data-adaptive modeling in volumetric segmentation and 3D spatiotemporal inference.
