Tri-Plane Mamba: Efficient 3D Segmentation
- Tri-Plane Mamba (TP-Mamba) is a parameter-efficient 3D adaptation architecture that extends SAM using multi-scale 3D conv adapters and tri-plane state space models.
- It employs custom residual adapters to inject local depth context and capture long-range volumetric dependencies across axial, coronal, and sagittal planes.
- Validated on CT and MRI segmentation, TP-Mamba preserves most of SAM’s 2D pretraining and excels in low-annotation regimes with state-of-the-art performance.
Tri-Plane Mamba (TP-Mamba) refers to a family of parameter-efficient 3D adaptation architectures for integrating the Segment Anything Model (SAM)—an extensively pre-trained, general-purpose vision transformer model—with volumetric medical images. By combining multi-scale 3D convolutional adapters for local context and tri-plane structured state-space models (Mamba SSMs) for long-range modeling in three orthogonal planes, TP-Mamba achieves efficient and effective 3D segmentation while retaining most of SAM’s 2D pretraining and scaling favorably to large input volumes (Shahraki et al., 31 Jan 2026, Wang et al., 2024). This technique has been validated in multiple recent studies for CT and MRI segmentation, demonstrating state-of-the-art performance, especially in low-annotation regimes.
1. Core Architectural Principles
TP-Mamba architectures are constructed by introducing two complementary modules into a frozen, slice-wise SAM backbone:
- Multi-Scale 3D Convolutional Adapters: Designed to inject localized depth-wise inductive bias, these adapters reshape slice-wise features back into a pseudo-3D tensor and apply multiple parallel, dilated 3D convolutions along the depth axis. This enables depth-aware context aggregation without disturbing the 2D feature extraction pipeline of the pre-trained SAM ViT encoder.
- Tri-Plane State Space Models (SSMs): To capture non-local, volumetric correlations, features are flattened along three major orthogonal planes, Axial (Height–Width), Coronal (Depth–Height), and Sagittal (Depth–Width), and processed with specialized Mamba SSM blocks. Each SSM block operates along one plane and captures long-range dependencies efficiently because its time and memory cost grow linearly with sequence length.
Both modules are inserted as residual adapters after each major sub-block (Multi-Head Self-Attention and MLP) or, equivalently, after each ViT layer, so that global non-local modeling and local spatial sensitivity are combined at every depth of the encoder (Shahraki et al., 31 Jan 2026, Wang et al., 2024).
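To give a sense of scale for the linear-complexity argument, the following back-of-the-envelope sketch compares the number of pairwise scores that full 3D self-attention would compute with the number of recurrent steps taken by three plane-wise scans. The grid size and both helper functions are hypothetical. The count assumes each plane scan visits every token exactly once, and it ignores constant factors such as state size and channel width.

```python
def full_attention_pairs(depth: int, height: int, width: int) -> int:
    """Pairwise attention scores if all D*H*W tokens attended to each other."""
    tokens = depth * height * width
    return tokens * tokens  # quadratic in the number of tokens


def tri_plane_scan_steps(depth: int, height: int, width: int) -> int:
    """Recurrent steps for three plane-wise scans that each visit every token once."""
    tokens = depth * height * width
    return 3 * tokens  # linear in the number of tokens


if __name__ == "__main__":
    d, h, w = 32, 64, 64  # hypothetical feature-grid size, not taken from the papers
    print("attention pairs:", full_attention_pairs(d, h, w))
    print("tri-plane steps:", tri_plane_scan_steps(d, h, w))
```

Doubling any one grid dimension doubles the scan cost but quadruples the attention cost, which is why the plane-wise scans scale more favorably to large volumes.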
2. Mathematical Formulations
Let $X$ denote a batch of 3D volumes, and let $F$ be the feature output of a ViT encoder block.
Multi-Scale 3D Conv Adapter:
- Channel reduction: 3D convolutions reduce the feature channels from the encoder width $C$ to a narrower bottleneck width $C'$.
- Parallel convolutions: Four parallel 3D convolutions with kernel size $k$, dilations $d_1, \dots, d_4$, and corresponding padding embed multi-scale depth features.
- Channel concatenation: Output tensors are concatenated along the channel dimension and projected back to the original feature width.
- Residual path: The result is added back to the input for efficient bottleneck adaptation.
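The adapter steps above can be illustrated with a dependency-free sketch restricted to a single channel and to the depth axis only. The function names and the dilation set (1, 2, 3, 4) are assumptions for illustration. The averaging step stands in for channel concatenation followed by a learned projection, and the channel bottleneck is omitted.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, length-preserving 1D cross-correlation along the depth axis."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2  # centres the kernel for odd kernel sizes
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i - pad + j * dilation
            if 0 <= idx < len(signal):  # positions outside the volume count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def multi_scale_depth_adapter(signal, kernel, dilations=(1, 2, 3, 4)):
    """Parallel dilated branches, fused and added back to the input."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    # Averaging is a stand-in for "concatenate channels, then project back".
    fused = [sum(values) / len(branches) for values in zip(*branches)]
    return [x + f for x, f in zip(signal, fused)]  # residual path


if __name__ == "__main__":
    depth_profile = [0, 0, 0, 1, 0, 0, 0]  # one voxel column along depth
    print(multi_scale_depth_adapter(depth_profile, kernel=[1, 1, 1]))
```

Larger dilations let a branch see slices further apart without adding weights, so the fused output mixes short-range and long-range depth context at the same parameter cost per branch.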
Tri-Plane Mamba Module:
- Plane slicing: The 3D feature tensor, with $C$ channels over a $D \times H \times W$ grid, is flattened along three axes:
- Axial (HW): Sequences of spatial tokens for all slices, sequence length $H \cdot W$, feature dimension $C$.
- Coronal (DH): $D \cdot H$ positions, $C$ features.
- Sagittal (DW): $D \cdot W$ positions, $C$ features.
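- SSM application: Each sequence is processed with a Mamba SSM block. Discrete SSM updates are of the form:
$$h_t = \bar{\mathbf{A}}_t h_{t-1} + \bar{\mathbf{B}}_t x_t, \qquad y_t = \mathbf{C}_t h_t,$$
where the matrices $\bar{\mathbf{A}}_t$, $\bar{\mathbf{B}}_t$, and $\mathbf{C}_t$ are input-dependent (dynamic, "selective scan").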
- Re-fusion: Each plane’s output is reshaped to a 3D volume, the outputs are summed (or concatenated and linearly fused), projected, and residually added to the pre-adapter input.
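The slicing, scanning, and re-fusion steps above can be sketched on a toy single-channel volume stored as nested lists. This is an illustration under stated assumptions, not the published implementation. The scalar state, the weights `a`, `w_delta`, `w_b`, and `w_c`, and all function names are hypothetical. The axis that does not belong to a plane is assumed to act as a batch dimension, outputs are fused by summation, and the final projection is omitted.

```python
import math


def plane_index_sequences(D, H, W, plane):
    """Voxel index sequences for one plane; the remaining axis indexes the batch."""
    if plane == "axial":     # D sequences of length H*W
        return [[(d, h, w) for h in range(H) for w in range(W)] for d in range(D)]
    if plane == "coronal":   # W sequences of length D*H
        return [[(d, h, w) for d in range(D) for h in range(H)] for w in range(W)]
    if plane == "sagittal":  # H sequences of length D*W
        return [[(d, h, w) for d in range(D) for w in range(W)] for h in range(H)]
    raise ValueError(f"unknown plane: {plane}")


def softplus(z):
    """Numerically stable softplus, used to keep the step size positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a=-1.0, w_delta=0.5, w_b=1.0, w_c=1.0):
    """Toy scalar selective SSM: h_t = A_bar_t*h_{t-1} + B_bar_t*x_t, y_t = C_t*h_t."""
    h, ys = 0.0, []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # discretised state transition
        b_bar = delta * (w_b * x)      # simplified discretisation of B_t = w_b*x
        c_t = w_c * x                  # input-dependent readout
        h = a_bar * h + b_bar * x
        ys.append(c_t * h)
    return ys


def tri_plane_mamba(vol):
    """Scan each plane, sum the three outputs, and add them to the input."""
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])
    out = [[[vol[d][h][w] for w in range(W)] for h in range(H)] for d in range(D)]
    for plane in ("axial", "coronal", "sagittal"):
        for idxs in plane_index_sequences(D, H, W, plane):
            ys = selective_scan([vol[d][h][w] for d, h, w in idxs])
            for (d, h, w), y in zip(idxs, ys):
                out[d][h][w] += y  # sum fusion on top of the residual copy
    return out
```

Each voxel appears exactly once per plane, so every output position receives one contribution from each of the three scans on top of its residual value.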
3. Integration into SAM Backbones
Adapters are interleaved into every layer of the frozen pre-trained SAM ViT encoder to minimize the number of trainable parameters and maximize transfer from 2D pretraining. A typical integration flow for each ViT block is as follows (Wang et al., 2024):
- Multi-Head Self-Attention with LoRA low-rank adapters (frozen base).
- MLP block.
- Multi-Scale 3D Conv Adapter.
- Tri-Plane Mamba Module.
- Output features are propagated through the stack and the outputs of the last several blocks are decoded with a lightweight 3D upsampling segmentation head.
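The ordering above can be sketched as plain function composition. Every callable here is a hypothetical placeholder and layer normalization is omitted. The residual connections around attention and the MLP follow the standard ViT pattern, the two adapters are assumed to include their own residual paths, and `keep_last=3` is an arbitrary illustrative choice for "the last several blocks".

```python
def adapted_vit_block(x, attention, mlp, conv_adapter, tri_plane_mamba):
    """One adapted encoder block, applied in the order listed above."""
    x = x + attention(x)    # frozen attention; LoRA updates folded into `attention`
    x = x + mlp(x)          # frozen MLP
    x = conv_adapter(x)     # trainable multi-scale 3D conv adapter (own residual)
    x = tri_plane_mamba(x)  # trainable tri-plane module (own residual)
    return x


def run_encoder(x, blocks, keep_last=3):
    """Run all blocks and return the outputs of the last few for the decoder."""
    outputs = []
    for block in blocks:
        x = block(x)
        outputs.append(x)
    return outputs[-keep_last:]


if __name__ == "__main__":
    # Toy scalar "features" and stand-in callables, purely for illustration.
    block = lambda x: adapted_vit_block(
        x,
        attention=lambda v: 0.1 * v,
        mlp=lambda v: 0.1 * v,
        conv_adapter=lambda v: v,
        tri_plane_mamba=lambda v: v,
    )
    print(run_encoder(1.0, [block] * 4))
```

Only the LoRA weights inside `attention` and the two adapter callables would be trained in this arrangement; the remaining operations reuse the frozen SAM weights unchanged.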
This adapterized approach allows for specialization on volumetric data while freezing ≥ 90–95% of base SAM weights, yielding