Vision-xLSTM U-Nets for Efficient Segmentation

Updated 13 April 2026

Vision-xLSTM U-Nets are segmentation architectures that integrate extended LSTM cells into U-Net encoder–decoder frameworks to capture long-range spatial dependencies with linear complexity.
The models replace standard CNN or transformer modules with optimized vision-LSTM blocks, enhancing feature localization and improving segmentation accuracy across diverse domains.
They demonstrate competitive performance on 2D and 3D benchmarks while reducing computational costs by 20–30% compared to traditional transformer-based U-Nets.

Vision-xLSTM U-Nets are a class of segmentation architectures that systematically embed Extended Long Short-Term Memory (xLSTM) cells—optimized for visual tasks—into U-Net–style encoder–decoder networks. These models are designed for efficient and robust semantic and medical image segmentation, capturing long-range spatial dependencies with linear complexity in sequence length. Vision-xLSTM U-Nets have demonstrated strong empirical performance across 2D and 3D segmentation benchmarks, surpassing CNNs, Vision Transformer (ViT)-based models, and state space models (e.g., Mamba) in several settings (Chen et al., 2024, Fang et al., 2024, Guo et al., 13 Jan 2025). Architectural variations span medical, remote sensing, and dermatological domains.

1. Vision-xLSTM Cell and Its Visual Instantiations

The Extended LSTM (xLSTM) cell generalizes the standard LSTM with optimized gating, memory structures, and linear-time recurrence. In the Vision-LSTM (ViL) variant, the cell operates on sequences of patch or grid tokens extracted from images or feature maps.

The recurrence that xLSTM builds on is the standard LSTM update at timestep $t$ for token $x_t \in \mathbb{R}^C$:

$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$
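The gate equations above can be traced with a minimal pure-Python sketch. It uses plain lists instead of tensors, and the helper names (`affine`, `lstm_step`, `init_params`), the hidden size `D`, and the random initialization are illustrative assumptions rather than part of any cited implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, U, b, x, h):
    """Return W @ x + U @ h + b for list-of-lists matrices."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]

def lstm_step(params, x_t, h_prev, c_prev):
    """One token update following the i, f, o, c-tilde equations."""
    pre = {k: affine(*params[k], x_t, h_prev) for k in ("i", "f", "o", "c")}
    i_t = [sigmoid(z) for z in pre["i"]]
    f_t = [sigmoid(z) for z in pre["f"]]
    o_t = [sigmoid(z) for z in pre["o"]]
    c_tilde = [math.tanh(z) for z in pre["c"]]
    # c_t = f_t * c_{t-1} + i_t * c_tilde (element-wise)
    c_t = [f * cp + i * g for f, cp, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # h_t = o_t * tanh(c_t) (element-wise)
    h_t = [o * math.tanh(cv) for o, cv in zip(o_t, c_t)]
    return h_t, c_t

def init_params(C, D, seed=0):
    """Hypothetical small random weights: W is D x C, U is D x D, b has length D."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {k: (mat(D, C), mat(D, D), [0.0] * D) for k in ("i", "f", "o", "c")}

C, D = 4, 3
params = init_params(C, D)
h, c = [0.0] * D, [0.0] * D
tokens = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2]]  # two toy patch tokens
for x_t in tokens:
    h, c = lstm_step(params, x_t, h, c)
```

Each token costs a fixed amount of work, so processing a sequence of $N$ tokens takes time proportional to $N$.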

Extensions include:

sLSTM: scalar state cell per head for robustness to long sequences.
mLSTM: matrix-valued cell states and matrix gating for enhanced memory capacity.
Exponential gating: sharper gating with element-wise exponential functions.
Directional scanning: bidirectional or alternating scanning along patch/token sequences for pseudo-global image context (Zhu et al., 2024, Guo et al., 13 Jan 2025).
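Directional scanning can be illustrated with a small sketch that flattens a patch grid in row-major order and reverses the order on alternating layers. The function names and the even/odd layer convention are assumptions made for illustration; the cited models may use different scan patterns.

```python
def row_major_scan(height, width):
    """Token order that reads the patch grid left-to-right, top-to-bottom."""
    return [(r, c) for r in range(height) for c in range(width)]

def scan_order(height, width, layer_index):
    """Alternate direction per layer: even layers forward, odd layers reversed."""
    order = row_major_scan(height, width)
    return order if layer_index % 2 == 0 else order[::-1]

def reorder(tokens, order, width):
    """Gather tokens (stored row-major) in the requested scan order."""
    return [tokens[r * width + c] for r, c in order]

H, W = 2, 3
tokens = ["p00", "p01", "p02", "p10", "p11", "p12"]
forward = reorder(tokens, scan_order(H, W, 0), W)
backward = reorder(tokens, scan_order(H, W, 1), W)
print(forward)   # ['p00', 'p01', 'p02', 'p10', 'p11', 'p12']
print(backward)  # ['p12', 'p11', 'p10', 'p02', 'p01', 'p00']
```

Stacking layers that alternate between the two orders lets every token eventually receive information from tokens on both sides of it in the sequence.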

In medical and general vision segmentation, these xLSTM or mLSTM blocks are inserted at different encoder stages, after patchification or spatial convolutional processing.

2. Architectures of Vision-xLSTM U-Nets

Vision-xLSTM U-Nets preserve the canonical U-Net topology—multi-stage encoder with skip connections to a symmetric decoder—but substitute standard convolutional blocks or transformer layers with Vision-xLSTM modules at strategic locations. Variants include xLSTM-UNet (Chen et al., 2024), Seg-LSTM (Zhu et al., 2024), XLSTM-VMUNet (Fang et al., 2024), and UNETVL (Guo et al., 13 Jan 2025).

Generalized architecture:

Encoder: At each scale, local feature extraction is performed via convolutional blocks or Visual State Space (VSS) modules. Feature maps are flattened into sequences and passed through one or more Vision-xLSTM blocks.
Skip connections: Feature maps before xLSTM transformation, after, or both, are routed to the decoder for fine localization.
Decoder: Mirroring the encoder, using upsampling (transposed convolution, linear upsampling) and further convolutional or VSS blocks, optionally fused with xLSTM outputs.
Bridging modules: Hybrid models such as XLSTM-VMUNet embed an xLSTM over the sequence of multi-scale encoder features (i.e., modeling scale-wise dependencies in addition to spatial).
Augmented projections: UNETVL combines Vision-LSTM with Chebyshev KAN non-linear projections, replacing standard MLP channel mixing blocks (Guo et al., 13 Jan 2025).

A representative encoder–decoder pipeline:

Stage	Operation	Description
Input	Patchification/Convolution	Extract spatial tokens
Encoder $l$	Conv/Norm/ReLU $\to$ flatten $\to$ xLSTM	Local context then long-range sequential modeling
Skip $l$	Branch before/after xLSTM	Connect to decoder for localization
Bridge	Extra xLSTM / mLSTM / scale-wise LSTM	Deepest representation for context aggregation
Decoder $l$	Upsample $\to$ Conv/Norm/ReLU $\to$ fusion	Combine upsampled + skip (with xLSTM or VSS fusion)
Head	1×1 conv (per-pixel or per-voxel logits)	Segmentation output
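To make the "flatten then xLSTM" step in the table concrete, the following sketch computes how long the token sequence is at each encoder stage. It assumes stride-2 downsampling between stages and a first stage at full input resolution; both are hypothetical choices, and real models often patchify first.

```python
def encoder_token_counts(spatial_shape, num_stages, downsample=2):
    """Return (stage, feature-map shape, sequence length) for each encoder stage."""
    shape = tuple(spatial_shape)
    rows = []
    for stage in range(num_stages):
        seq_len = 1
        for dim in shape:
            seq_len *= dim  # flattening: one token per spatial position
        rows.append((stage, shape, seq_len))
        # assumed stride-`downsample` reduction before the next stage
        shape = tuple(max(1, d // downsample) for d in shape)
    return rows

for row in encoder_token_counts((256, 256), num_stages=4):
    print(row)
for row in encoder_token_counts((128, 128, 128), num_stages=4):
    print(row)
```

Under these assumptions a 256×256 slice yields 65536, 16384, 4096, and 1024 tokens across four stages, while a 128³ volume starts at 2097152 tokens, which is why the cost of the sequence model at shallow stages matters most in 3D.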

3. Training Protocols and Implementation Details

Extensive empirical evaluations across biomedical and remote sensing datasets follow task-specific protocols, with commonalities:

Losses: Combined Dice and cross-entropy losses; for binary segmentation, BCE + Dice (BceDice) (Fang et al., 2024).
Optimization: AdamW with weight decay (0.01–0.05), learning rates adapted per dataset and backbone, batch sizes tuned for hardware (Chen et al., 2024, Fang et al., 2024, Guo et al., 13 Jan 2025).
Augmentation: Standard nnU-Net-style augmentations for 2D/3D images, including random cropping, flipping, rotation, and intensity jitter.
Epochs and hardware: Medical datasets are typically trained for 1000 epochs, commonly on NVIDIA A100 or RTX 4090 GPUs.
Hyperparameters: Number of xLSTM layers matches encoder depth; hidden/channel sizes scale per U-Net convention; Chebyshev KAN polynomial degrees fixed per model.
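The combined BCE + Dice objective listed under Losses can be sketched as follows for a binary mask. The equal weighting of the two terms, the `smooth` constant, and the clamping value `eps` are assumptions; the cited papers may weight or smooth the terms differently.

```python
import math

def bce_dice_loss(probs, targets, eps=1e-7, smooth=1.0):
    """Mean binary cross-entropy plus soft Dice loss over flat pixel lists."""
    n = len(probs)
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= n
    # Soft Dice uses probabilities directly instead of thresholded masks.
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    dice = (2.0 * intersection + smooth) / (denom + smooth)
    return bce + (1.0 - dice)

targets = [1, 0, 1, 0]
confident = bce_dice_loss([1.0, 0.0, 1.0, 0.0], targets)
uncertain = bce_dice_loss([0.5, 0.5, 0.5, 0.5], targets)
print(confident < uncertain)  # True
```

The cross-entropy term gives per-pixel gradients, while the Dice term directly targets region overlap, which helps when the foreground occupies a small share of the image.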

4. Quantitative Performance and Comparative Results

Vision-xLSTM U-Nets demonstrate competitive or superior performance across benchmarks:

2D Medical Segmentation (Mean ± Std) (Chen et al., 2024):

Method	Abd MRI Dice	Endoscopy Dice	Microscopy F1	Params (M)
UNETR	0.5747±0.17	0.5017±0.32	0.4357±0.26	90
U-Mamba_Enc	0.7625±0.11	0.6303±0.31	0.5607±0.28	38
Ours_enc	0.7747±0.10	0.6843±0.30	0.6036±0.24	40

3D MRI Segmentation (BraTS2023; Dice ↑ / HD95 ↓):

Method	Avg Dice	Avg HD95	Params (M)
SwinUNETR-V2	89.39	4.51	88
SegMamba	91.32	3.56	36
Ours	91.80	4.03	42

3D Abdomen MRI (Dice ↑/NSD ↑):

Method	Dice	NSD	Params (M)
nnU-Net	0.8309	0.8996	30
U-Mamba_Bot	0.8453	0.9121	32
Ours_bot	0.8483	0.9153	35

ISIC2018 Skin Lesion Segmentation (DSC/IoU) (Fang et al., 2024):

Method	DSC	IoU
VM-UNet	0.8975	0.8142
XLSTM-VMUNet	0.9100	0.8349
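For readers comparing the DSC and IoU columns, the sketch below computes both metrics from binary masks and shows the identity IoU = DSC / (2 − DSC), which holds when both are derived from the same confusion counts. The function names are illustrative, and the identity is only approximate for scores averaged over many images.

```python
def dice_and_iou(pred, target):
    """Dice (DSC) and IoU for two binary masks given as flat 0/1 lists."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dsc, iou

def iou_from_dice(dsc):
    """Convert Dice to IoU, valid for scores computed from the same counts."""
    return dsc / (2.0 - dsc)

dsc, iou = dice_and_iou([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(dsc, iou)                         # 0.666..., 0.5
print(round(iou_from_dice(0.9100), 4))  # 0.8349
```

Applied to the reported DSC of 0.9100, the conversion gives about 0.8349, matching the IoU listed for XLSTM-VMUNet.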

3D Medical Benchmarks (UNETVL, mean Dice) (Guo et al., 13 Jan 2025):

Method	ACDC	AMOS	Params (M)
UNETR	85.34	76.59	146.6
SwinUNETR	91.29	83.81	62.2
UNETVL (KAN)	91.59	88.57	158.61

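Notably, xLSTM-UNet and UNETVL outperform the transformer-based and Mamba-based U-Nets listed above in mean Dice, and are reported to converge faster. Parameter overhead depends on the baseline: the xLSTM variants add a few million parameters over their Mamba counterparts, while UNETVL (158.61M) is close to UNETR (146.6M) but well above SwinUNETR (62.2M).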

5. Computational Complexity, Efficiency, and Scalability

The Vision-xLSTM cell achieves linear time complexity in sequence length ($\mathcal{O}(N)$ per layer for $N$ tokens), compared to $\mathcal{O}(N^2)$ for full self-attention in ViTs. For 2D and 3D segmentation:

FLOPs: xLSTM-UNet reduces floating-point operations by 20–30% versus Transformer-based U-Nets at comparable capacity (Chen et al., 2024).
Inference times: On NVIDIA A100, end-to-end 2D inference is ~40 ms per slice for xLSTM-UNet versus 60 ms (SwinUNETR) and 30 ms (nnU-Net); 3D volumes run at ~0.8 s (xLSTM-UNet), ~1.1 s (ViT-based), and ~0.7 s (CNN-only).
Memory footprint: Scales linearly with sequence length, supporting larger images or higher 3D resolutions.
KAN-enhanced variants: KAN imposes minimal parameter overhead relative to a standard MLP and achieves higher representational flexibility (Guo et al., 13 Jan 2025).
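The gap between linear and quadratic token mixing can be made concrete with a rough scaling model. The cost formulas below ignore constants, projections, and kernel-level effects, so they illustrate growth rates only and do not predict the measured FLOPs or latencies quoted above; `token_mixing_cost` is a hypothetical helper.

```python
def token_mixing_cost(num_tokens, channels):
    """Rough per-layer operation counts, ignoring constants and projections."""
    recurrent = num_tokens * channels               # O(N): one state update per token
    attention = num_tokens * num_tokens * channels  # O(N^2): every token pair interacts
    return recurrent, attention

# Doubling each side of a 32x32 token grid quadruples the sequence length.
small = token_mixing_cost(32 * 32, channels=64)
large = token_mixing_cost(64 * 64, channels=64)
print(large[0] // small[0])  # 4   (recurrent cost grows with N)
print(large[1] // small[1])  # 16  (attention cost grows with N^2)
```

The same reasoning explains the memory bullet: a recurrent layer stores per-token states rather than an $N \times N$ interaction map, so its footprint grows in proportion to the number of tokens.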

6. Hybrid and Advanced Variants: Integration with State Space Models and Nonlinear Projections

Recent architectures hybridize Vision-xLSTM with additional advanced modules:

XLSTM-VMUNet: Interleaves VSS (Mamba-inspired) blocks for spatial feature extraction and applies xLSTM across multi-scale features, modeling cross-level dependencies. Fusion with learnable weights at each skip enhances contextual sensitivity, boosting skin lesion segmentation performance (Fang et al., 2024).
UNETVL: Replaces ViT modules in the encoder with bidirectional mLSTM (ViL) blocks. Channel mixing/projection uses Chebyshev Kolmogorov-Arnold Networks (KAN), outperforming B-spline, MLP, and RBF channels for non-linear mixing (Guo et al., 13 Jan 2025).
Ablative evidence: Incorporation of matrix-valued memory (mLSTM), exponential gating, and scale-wise xLSTM all yield steady gains over VSS-only or CNN-only baselines.
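A Chebyshev KAN projection replaces fixed activations with learnable combinations of Chebyshev polynomials on each input-output edge. The sketch below is a minimal pure-Python version; the `tanh` squashing into $[-1, 1]$ and the coefficient layout `coeffs[i][j][k]` follow a common Chebyshev-KAN formulation and are assumptions, not a confirmed description of UNETVL.

```python
import math

def chebyshev_basis(x, degree):
    """T_0..T_degree at x in [-1, 1] via T_k = 2x * T_{k-1} - T_{k-2}."""
    basis = [1.0, x]
    for _ in range(2, degree + 1):
        basis.append(2.0 * x * basis[-1] - basis[-2])
    return basis[: degree + 1]

def cheby_kan_layer(x, coeffs):
    """y_j = sum_i sum_k coeffs[i][j][k] * T_k(tanh(x_i))."""
    out_dim = len(coeffs[0])
    degree = len(coeffs[0][0]) - 1
    y = [0.0] * out_dim
    for i, x_i in enumerate(x):
        # Squash the input so the polynomials are evaluated on [-1, 1].
        basis = chebyshev_basis(math.tanh(x_i), degree)
        for j in range(out_dim):
            y[j] += sum(c * t for c, t in zip(coeffs[i][j], basis))
    return y

# Toy layer: 2 inputs, 1 output, polynomial degree 2 (hypothetical coefficients).
coeffs = [[[0.0, 1.0, 0.0]], [[1.0, 0.0, 0.5]]]
print(cheby_kan_layer([0.3, -0.7], coeffs))
```

The parameter count is `in_dim * out_dim * (degree + 1)`, so the polynomial degree directly controls how much larger the projection is than a single linear map of the same shape.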

7. Limitations, Current Challenges, and Future Directions

Vision-xLSTM U-Nets offer robust global context modeling at lower computational cost than ViT-based models (a reported 20–30% reduction in FLOPs), but face open problems:

Implementation maturity: PyTorch kernels lack low-level fusion for xLSTM, impeding throughput relative to mature CNNs.
Dataset and scale: Published results target moderate-size, well-annotated datasets; large-scale unlabeled or ultra-high-res histopathology remains unexplored (Chen et al., 2024).
Global context coverage: Alternating or bidirectional xLSTM scanning only partially restores 2D/3D global context—architectures with multi-directional or scale-aware sequence processing are open avenues (Zhu et al., 2024).
Pretraining: No multi-modal or multi-scale pretraining of the xLSTM backbone has been reported.
Foundation models: Prospective extensions include pretraining Vision-LSTM on large, multi-modal medical corpora for general-purpose segmentation (Chen et al., 2024).

Identified future directions include development of optimized CUDA kernels, exploration of hybrid block layouts (xLSTM with SSM/cross-attention), testing on more diverse and larger-scale datasets, and systematic transfer learning protocols for medical applications.

References

(Chen et al., 2024) xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart.
(Fang et al., 2024) When Mamba Meets xLSTM: An Efficient and Precise Method with the xLSTM-VMUNet Model for Skin lesion Segmentation.
(Guo et al., 13 Jan 2025) UNetVL: Enhancing 3D Medical Image Segmentation with Chebyshev KAN Powered Vision-LSTM.
(Zhu et al., 2024) Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images.