LoG-VMamba: Efficient 2D/3D Segmentation
- The paper introduces LoG-VMamba, a neural module that significantly outperforms CNN- and Transformer-based baselines in medical image segmentation benchmarks.
- It employs dual token extraction with a Local Token eXtractor (LTX) and Global Token eXtractor (GTX) to capture spatial adjacency and compress global features while maintaining linear complexity.
- Experimental results on datasets like MICCAI EndoVis, NeurIPS Cell, BraTS, and ACDC demonstrate improved Dice scores and reduced computational cost, underscoring its clinical potential.
LoG-VMamba (Local-Global Vision Mamba) is a neural network module designed for computationally efficient and accurate 2D and 3D medical image segmentation. It extends the State Space Model (SSM)-based “Mamba” architecture with explicit local and global spatial context encoding and is demonstrated to substantially outperform both CNN- and Transformer-based baselines in medical imaging benchmarks. LoG-VMamba achieves this by enforcing spatial adjacency among local features, compressing global features, and maintaining linear complexity relative to image size, addressing the challenges of high-dimensional medical imaging data (Dang et al., 2024).
1. State-Space Models and Vision Mamba Foundations
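State Space Models (SSMs) describe dynamical systems via latent state trajectories and have been adapted for deep learning. The continuous-time linear SSM is described as:
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$
where $A$, $B$, and $C$ are learnable parameters.
Mamba leverages a selective SSM (S6) discretization, enabling recurrent computation over sequences:
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$
where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$ for a step size $\Delta$.
Mamba’s distinctiveness lies in making the parameters $B$, $C$, and $\Delta$ input-dependent via lightweight learned projections, and in hardware-aware kernels whose compute is linear in the sequence length.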
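The recurrent SSM computation can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a diagonal state matrix, a single input channel, and fixed (not input-dependent) parameters; the function name `ssm_scan` and all shapes are illustrative and not taken from the paper.

```python
import numpy as np

def ssm_scan(x, A_diag, B, C, delta):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t over a 1D sequence.

    Assumptions (illustrative only): A is diagonal with entries A_diag of shape (N,),
    B and C are vectors of shape (N,), and delta is a fixed scalar step size.
    """
    A_bar = np.exp(delta * A_diag)          # zero-order hold for a diagonal A
    B_bar = (A_bar - 1.0) / A_diag * B      # (delta*A)^-1 (exp(delta*A) - I) delta*B
    h = np.zeros_like(A_diag)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):             # single pass: cost linear in len(x)
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y

A_diag = np.array([-1.0, -2.0])
y = ssm_scan(np.ones(2000), A_diag, B=np.ones(2), C=np.ones(2), delta=0.1)
```

For a constant unit input, the state of this stable system settles at $-B/A$ per dimension, so the output of the example approaches $1 + 0.5 = 1.5$.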
The original Vision Mamba (VSS) block processes visual data as:
- Input: a feature map with height, width, and channel dimensions.
- LayerNorm, channel expansion, then a split into two branches:
  - Depthwise convolution (DWC) followed by SiLU activation, then flattening and a 1D SSM scan (optionally in multiple directions).
  - Pointwise SiLU.
- The branches are multiplied, projected back to the input channel dimension, and added residually to the input.
While VSS grants global receptive fields at linear complexity, it struggles to jointly capture spatial locality and global context, particularly in high-dimensional medical images due to its sequential processing constraints (Dang et al., 2024).
2. Architectural Innovations in LoG-VMamba
LoG-VMamba introduces explicit mechanisms for capturing both local and global dependencies by replacing the vanilla token extractor with two parallel modules—a Local Token eXtractor (LTX) and Global Token eXtractor (GTX):
- LTX: Squeezes the channel dimension of the input with a DWC, then applies an “unfolding” window that gathers the spatial neighbors of every position and flattens their features, guaranteeing that each token encodes local context. Each spatial position becomes one token whose dimension is the squeezed channel count multiplied by the number of positions in the window.
- GTX: Applies a dilated, strided DWC for spatial compression, followed by channel grouping and a linear projection to fixed-dimensional tokens. This yields a compressed global summary with fewer tokens than spatial positions.
- Token Concatenation: Tokens from LTX and GTX are interleaved into a single token sequence. Interleaving, rather than appending, ensures both local and global contexts are available early in SSM processing.
- LoG-VMamba Block: The full processing pipeline is:
  - LayerNorm of the input.
  - LTX and GTX extraction, followed by token interleaving.
  - Single-direction horizontal SSM scan over the interleaved sequence.
  - Spatial stacking, projection, gating, and addition to the residual branch.
This approach allows local and global dependencies to be encoded prior to sequential SSM computation, overcoming the locality/globality tension that limits prior SSM-based vision models.
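The local and global token extraction and the interleaving step can be sketched in NumPy as follows. This is a simplified illustration rather than the authors' implementation: zero padding at the border, strided average pooling in place of the dilated DWC, a random projection matrix, and the rule "one global token before each equally sized chunk of local tokens" are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def local_tokens(x, k=3):
    """LTX-style unfolding: each position gathers its k x k neighbourhood.
    x: (H, W, C) -> (H*W, k*k*C). Zero padding at the border is an assumption."""
    h, w, c = x.shape
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    patches = [xp[i:i + h, j:j + w] for i in range(k) for j in range(k)]
    return np.concatenate(patches, axis=-1).reshape(h * w, k * k * c)

def global_tokens(x, stride, proj):
    """GTX-style compression, approximated by strided average pooling and a
    linear projection. Requires H and W to be divisible by stride."""
    h, w, c = x.shape
    pooled = x.reshape(h // stride, stride, w // stride, stride, c).mean(axis=(1, 3))
    return pooled.reshape(-1, c) @ proj          # (H/stride * W/stride, D)

def interleave(local, glob):
    """Place one global token before each equally sized chunk of local tokens.
    Assumes the number of global tokens divides the number of local tokens."""
    chunk = len(local) // len(glob)
    parts = []
    for g in range(len(glob)):
        parts.append(glob[g:g + 1])
        parts.append(local[g * chunk:(g + 1) * chunk])
    return np.concatenate(parts, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
local = local_tokens(x, k=3)                     # (64, 36)
glob = global_tokens(x, stride=4, proj=rng.normal(size=(4, 36)))  # (4, 36)
seq = interleave(local, glob)                    # (68, 36)
```

Because global tokens are spread through the sequence instead of appended at the end, a single forward scan encounters global context before it has processed most local tokens.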
3. Network Architectures for 2D and 3D Segmentation
LoG-VMamba is instantiated within U-shaped segmentation networks for both 2D and 3D data:
- 2D Model: Based on a Swin-UMamba encoder with patch merging and ImageNet pretraining. The decoder consists of four upsampling stages, each concatenating skip connections and applying 1–2 LoG-VMamba blocks at fixed spatial resolution and channel count.
- 3D Model: Based on U-Mamba-Enc, replacing all encoder Mamba blocks with LoG-VMamba. Input volume sizes differ between datasets (BraTS and ACDC). The decoder retains transposed-convolution upsampling, with skip connections concatenated before the LoG-VMamba blocks.
A summary of feature-map sizes, per-stage pooling, kernel, and other design choices is provided in the original work. The architecture is explicitly designed to maintain linear complexity and facilitate efficient fusion of hierarchical features (Dang et al., 2024).
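The U-shaped wiring, with skip connections concatenated before each block, can be sketched as below. This is a structural illustration only: `block` is a hypothetical stand-in (a fixed channel projection) for a LoG-VMamba block, average pooling and nearest-neighbour upsampling replace the patch merging and transposed convolutions of the actual models, and the stage widths are arbitrary.

```python
import numpy as np

def block(x, out_channels):
    """Hypothetical stand-in for a LoG-VMamba block: a fixed channel projection.
    A real block would run LTX/GTX extraction, the SSM scan, gating, and the residual."""
    c_in = x.shape[-1]
    w = np.full((c_in, out_channels), 1.0 / c_in)   # averaging projection, illustration only
    return x @ w

def downsample(x):
    """2x2 average pooling on an (H, W, C) map (H and W must be even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour 2x upsampling on an (H, W, C) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def u_shaped_forward(x, widths=(8, 16, 32)):
    skips = []
    for i, c in enumerate(widths):                  # encoder path
        x = block(x, c)
        if i < len(widths) - 1:
            skips.append(x)                         # keep features for the decoder
            x = downsample(x)
    for c, skip in zip(reversed(widths[:-1]), reversed(skips)):  # decoder path
        x = upsample(x)
        x = np.concatenate([x, skip], axis=-1)      # skip concatenated before the block
        x = block(x, c)
    return x

out = u_shaped_forward(np.random.default_rng(0).normal(size=(16, 16, 3)))
```

The output has the input's spatial resolution and the width of the first stage, here `(16, 16, 8)`.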
4. Computational Complexity and Memory Analysis
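LoG-VMamba’s dominant computation is the single horizontal SSM scan over the concatenated token sequence, whose cost grows linearly, $\mathcal{O}(L)$, in the number of tokens $L$. For comparison:
- Self-attention (ViT): $\mathcal{O}(L^2)$ compute per layer and $\mathcal{O}(L^2)$ memory for the attention map.
- CNNs: compute and memory linear in the number of pixels for fixed-size convolution kernels; global context grows only with depth.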
LoG-VMamba’s efficient scan and controlled channel expansions/compressions reduce computational cost by 20–50% in GFLOPs relative to ViT-based decoders for the studied medical segmentation tasks, without sacrificing the capacity for global or local spatial modeling.
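The difference between linear and quadratic scaling can be made concrete with a rough cost model. The formulas below are illustrative assumptions (tokens × channels × state size for a scan, and pairwise scores plus value mixing for attention), and the channel width of 96 and state size of 16 are arbitrary; they are not figures from the paper.

```python
def scan_cost(num_tokens, dim, state=16):
    """Illustrative cost of a linear-time SSM scan: tokens x channels x state size."""
    return num_tokens * dim * state

def attention_cost(num_tokens, dim):
    """Illustrative cost of self-attention: pairwise scores plus value mixing."""
    return 2 * num_tokens ** 2 * dim

rows = []
for side in (32, 64, 128):
    n = side * side                      # one token per pixel of a side x side map
    rows.append((side, scan_cost(n, 96), attention_cost(n, 96)))

for side, scan, attn in rows:
    print(f"{side}x{side}: scan={scan:.2e}  attention={attn:.2e}  ratio={attn / scan:.0f}")
```

Doubling the image side quadruples the token count, which multiplies the scan cost by 4 and the attention cost by 16 under this model.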
5. Experimental Protocols and Comparative Results
LoG-VMamba’s efficacy was evaluated on representative 2D/3D segmentation benchmarks:
- 2D Endoscopy (MICCAI EndoVis '17): Custom train/val splits (1440/360), RGB input.
- 2D Cell (NeurIPS '22): 800 train / 200 val.
- 3D BraTS 2020: 236 train, 59 val, 74 test.
- 3D ACDC: 160/40/100 train/val/test.
Augmentations (random flip, elastic deformation, color jitter, intensity scaling, noise), Adam optimizer, Dice+Cross-Entropy loss, and 5-fold validation were used. Metrics included Dice, IoU, Normalized Surface Dice (NSD, 2D), and the 95th-percentile Hausdorff distance (HD95, 3D).
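As a reminder of how the overlap metrics are defined, Dice and IoU for a binary mask can be computed as in the following sketch. The function name and the small `eps` term guarding against empty masks are illustrative choices, not details from the paper.

```python
import numpy as np

def dice_and_iou(pred, target, eps=1e-8):
    """Dice and IoU between two binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2 * inter / (pred.sum() + target.sum() + eps)
    iou = inter / (union + eps)
    return dice, iou

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
dice, iou = dice_and_iou(pred, target)   # 2 shared pixels, 3 + 3 foreground, union of 4
```

Here the masks share 2 foreground pixels, giving Dice = 4/6 ≈ 0.667 and IoU = 2/4 = 0.5.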
Main Results:
| Task | Method | Params | GFLOPs | Dice % | IoU % | NSD % | HD95 (mm) |
|---|---|---|---|---|---|---|---|
| 2D Endo | Swin-UMamba† | 27.5M | 45.4 | 71.23 ± 1.00 | 67.81 ± 0.99 | 72.77 ± 1.02 | - |
| | Ours (LoG) | 30.3M | 48.6 | 75.17 ± 0.24 | 71.68 ± 0.23 | 76.83 ± 0.25 | - |
| 2D Cell | Swin-UMamba† | - | - | 73.50 ± 0.86 | - | 83.31 ± 0.66 | - |
| | Ours (LoG) | - | - | 76.21 ± 0.10 | - | 86.44 ± 0.09 | - |
| 3D BraTS | U-Mamba-Enc | - | - | 87.01 ± 0.10 | - | - | 4.38 ± 0.09 |
| | SegMamba | - | - | 87.62 ± 0.16 | - | - | 4.73 ± 0.22 |
| | Ours (LoG) | - | - | 88.06 ± 0.08 | - | - | 3.97 ± 0.04 |
| 3D ACDC | U-Mamba-Enc | - | - | 91.65 ± 0.31 | - | - | 1.13 ± 0.02 |
| | U-Mamba-Bot | - | - | 91.94 ± 0.05 | - | - | 1.26 ± 0.15 |
| | Ours (LoG) | - | - | 92.18 ± 0.13 | - | - | 1.10 ± 0.00 |
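LoG-VMamba consistently outperformed prior baselines across all reported accuracy metrics, with a modest increase in parameters and GFLOPs over Swin-UMamba in the 2D endoscopy setting (30.3M vs. 27.5M parameters; 48.6 vs. 45.4 GFLOPs).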
Ablation studies showed LTX and GTX each contributed to improvements in Dice and NSD/HD95, with their combination yielding the highest gains. Interleaved token placement was more effective than naive concatenation. Multi-direction SSM scans did not provide further benefit beyond single-horizontal direction (Dang et al., 2024).
6. Component Analysis and Design Choices
The table below delineates the contribution of each architectural component (results from Endoscopy/BraTS):
| Block | Dice % (Endo/BraTS) | NSD % / HD95 (Endo/BraTS) |
|---|---|---|
| VSS | 71.23 / 87.01 | 72.77 / 4.38 |
| +GTX only | 72.64 / 87.99 | 74.24 / 4.05 |
| +LTX only | 74.15 / 87.71 | 75.81 / 4.12 |
| LoG | 75.17 / 88.06 | 76.83 / 3.97 |
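Early global context (GTX) and explicit local adjacency (LTX) each improve Dice over the VSS baseline: GTX alone adds 1.41 points on Endoscopy and 0.98 on BraTS, and LTX alone adds 2.92 and 0.70. Using both yields the largest gains (3.94 and 1.05 points). Interleaved token placement maximizes performance.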
A single-direction SSM scan suffices for effective context propagation; adding additional scanning directions did not increase accuracy. This suggests LoG-VMamba’s local/global token structure enables sufficient information flow within a linear scan (Dang et al., 2024).
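The per-component Dice gains over the VSS baseline can be read off the table with a short calculation. The scores below are copied from the table; the dictionary layout and variable names are illustrative.

```python
# Dice scores (Endoscopy, BraTS) copied from the component table above.
dice = {
    "VSS": (71.23, 87.01),
    "+GTX only": (72.64, 87.99),
    "+LTX only": (74.15, 87.71),
    "LoG": (75.17, 88.06),
}

base = dice["VSS"]
# Gain of each variant over the VSS baseline, in Dice points, rounded to 2 decimals.
gains = {
    name: tuple(round(score - ref, 2) for score, ref in zip(scores, base))
    for name, scores in dice.items()
    if name != "VSS"
}

for name, (endo, brats) in gains.items():
    print(f"{name}: Endo +{endo:.2f}, BraTS +{brats:.2f}")
```

The calculation shows that the local extractor contributes most on Endoscopy, while the global extractor contributes slightly more on BraTS.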
7. Implementation Outline and Summary
LoG-VMamba is integrated into segmentation training by using LoG_VMamba blocks as the building blocks of the U-shaped networks described above, where each LoG_VMamba block encapsulates the LTX/GTX extraction, interleaved sequencing, SSM scan, gating, and residual addition. Networks employing LoG-VMamba demonstrate marked segmentation accuracy benefits with strict linear complexity scaling, making this approach particularly suitable for high-dimensional 2D/3D medical imaging tasks (Dang et al., 2024).
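The Dice plus cross-entropy training objective mentioned in the experimental protocol can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: equal weighting of the two terms, a soft Dice averaged over classes, and flattened per-pixel logits; the paper's exact weighting and reduction are not given here.

```python
import numpy as np

def dice_ce_loss(logits, target, eps=1e-6):
    """Soft Dice + cross-entropy for logits of shape (N, K) and integer labels (N,).
    Equal weighting of the two terms is an assumption for illustration."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerically stable shift
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)
    onehot = np.eye(logits.shape[1])[target]

    ce = -log_probs[np.arange(len(target)), target].mean()    # cross-entropy term
    inter = (probs * onehot).sum(axis=0)                      # per-class soft overlap
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()   # soft Dice term
    return ce + dice

target = np.array([0, 1, 1, 0])
good = dice_ce_loss(20.0 * np.eye(2)[target], target)         # confident and correct
bad = dice_ce_loss(20.0 * np.eye(2)[1 - target], target)      # confident and wrong
```

A confident correct prediction drives both terms toward zero, while a confident wrong prediction is penalized by both the cross-entropy and the Dice term.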