Tri-Plane Mamba: Efficient 3D Segmentation

Updated 14 April 2026

Tri-Plane Mamba (TP-Mamba) is a parameter-efficient 3D adaptation architecture that extends SAM using multi-scale 3D conv adapters and tri-plane state space models.
It employs custom residual adapters to inject local depth context and capture long-range volumetric dependencies across axial, coronal, and sagittal planes.
Validated on CT and MRI segmentation, TP-Mamba preserves most of SAM’s 2D pretraining and reports state-of-the-art performance, particularly in low-annotation regimes.

Tri-Plane Mamba (TP-Mamba) refers to a family of parameter-efficient 3D adaptation architectures for integrating the Segment Anything Model (SAM)—an extensively pre-trained, general-purpose vision transformer model—with volumetric medical images. By combining multi-scale 3D convolutional adapters for local context and tri-plane structured state-space models (Mamba SSMs) for long-range modeling in three orthogonal planes, TP-Mamba achieves efficient and effective 3D segmentation while retaining most of SAM’s 2D pretraining and scaling favorably to large input volumes (Shahraki et al., 31 Jan 2026, Wang et al., 2024). This technique has been validated in multiple recent studies for CT and MRI segmentation, demonstrating state-of-the-art performance, especially in low-annotation regimes.

1. Core Architectural Principles

TP-Mamba architectures are constructed by introducing two complementary modules into a frozen, slice-wise SAM backbone:

Multi-Scale 3D Convolutional Adapters: Designed to inject localized depth-wise inductive bias, these adapters reshape slice-wise features back into a pseudo-3D tensor and apply multiple parallel, dilated 3D convolutions along the depth axis. This enables depth-aware context aggregation without disturbing the 2D feature extraction pipeline of the pre-trained SAM ViT encoder.
Tri-Plane State Space Modeling (SSMs): To capture non-local, volumetric correlations, features are flattened along three orthogonal planes, Axial (Height–Width), Coronal (Depth–Height), and Sagittal (Depth–Width), and processed with specialized Mamba SSM blocks. Each SSM operates along one plane and captures long-range dependencies efficiently because its time and memory cost is linear in sequence length.

Both modules are inserted as residual adapters after every major block (Multi-Head Self-Attention and MLP) or, equivalently, after each ViT layer, so that global non-local modeling and local depth context are combined at every level of the encoder (Shahraki et al., 31 Jan 2026, Wang et al., 2024).
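The reshaping between slice-wise tokens and a pseudo-3D tensor can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the batch-major `(B*D, h*w, C)` token layout, and the sizes are assumptions made for the example.

```python
import numpy as np

def slices_to_volume(tokens, batch, depth, h, w):
    """(B*D, h*w, C) slice-wise tokens -> (B, C, D, h, w) pseudo-3D tensor.

    Assumes slices are stacked batch-major, i.e. index = b * depth + d.
    """
    c = tokens.shape[-1]
    vol = tokens.reshape(batch, depth, h, w, c)
    return vol.transpose(0, 4, 1, 2, 3)

def volume_to_slices(vol):
    """(B, C, D, h, w) -> (B*D, h*w, C), the inverse of slices_to_volume."""
    b, c, d, h, w = vol.shape
    return vol.transpose(0, 2, 3, 4, 1).reshape(b * d, h * w, c)

# Hypothetical sizes, chosen only to keep the example small.
B, D, h, w, C = 2, 8, 4, 4, 16
tokens = np.random.default_rng(0).standard_normal((B * D, h * w, C))

vol = slices_to_volume(tokens, B, D, h, w)
print(vol.shape)  # (2, 16, 8, 4, 4)

restored = volume_to_slices(vol)
print(np.array_equal(restored, tokens))  # True
```

Because the round trip is lossless, an adapter can operate on the 3D view and hand features back to the frozen 2D encoder in the layout it expects.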

2. Mathematical Formulations

Let $X\in\mathbb{R}^{B\times1\times D\times H\times W}$ denote a batch of 3D volumes, and $F_{\text{in}}\in\mathbb{R}^{(H/16\cdot W/16)\times D_{\text{sam}}}$ be the feature output of a ViT encoder block.

Multi-Scale 3D Conv Adapter:

Channel reduction: 3D $1\times1\times1$ or $3\times1\times1$ convolutions reduce feature channels from $C$ to $r$.
Parallel convolutions: Four parallel 3D convolutions of kernel size $3\times1\times1$ with dilations $d\in\{1,2,4,8\}$ and corresponding padding extract multi-scale depth features.
Channel concatenation: Output tensors are concatenated along the channel dimension and projected back to the original feature width.
Residual path: The result is added back to the input for efficient bottleneck adaptation.
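The four adapter steps above can be sketched in NumPy. This is an illustrative sketch under stated assumptions rather than the published implementation: it uses the $1\times1\times1$ option for channel reduction, pads each branch by its dilation so depth length is preserved, omits any nonlinearity or normalization, and all names and sizes are hypothetical.

```python
import numpy as np

def depth_conv3(x, kernel, dilation):
    """Dilated 3x1x1 convolution along depth.

    x: (r_in, D, h, w); kernel: (r_out, r_in, 3). Padding equals the
    dilation, so the output keeps depth D.
    """
    depth = x.shape[1]
    xp = np.pad(x, ((0, 0), (dilation, dilation), (0, 0), (0, 0)))
    out = 0.0
    for k in range(3):
        tap = xp[:, k * dilation : k * dilation + depth]  # shifted along depth
        out = out + np.einsum("oi,idhw->odhw", kernel[:, :, k], tap)
    return out

def multi_scale_adapter(x, w_down, kernels, w_up, dilations=(1, 2, 4, 8)):
    """x: (C, D, h, w) -> same shape, via a bottleneck of width r."""
    z = np.einsum("rc,cdhw->rdhw", w_down, x)             # 1x1x1 reduction C -> r
    branches = [depth_conv3(z, k, d) for k, d in zip(kernels, dilations)]
    cat = np.concatenate(branches, axis=0)                # (4r, D, h, w)
    return x + np.einsum("cr,rdhw->cdhw", w_up, cat)      # project back + residual

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, r, D, h, w = 16, 4, 8, 4, 4
x = rng.standard_normal((C, D, h, w))
w_down = rng.standard_normal((r, C)) * 0.1
kernels = [rng.standard_normal((r, r, 3)) * 0.1 for _ in range(4)]
w_up = rng.standard_normal((C, 4 * r)) * 0.1

y = multi_scale_adapter(x, w_down, kernels, w_up)
print(y.shape)  # (16, 8, 4, 4)
```

In this sketch, setting `w_up` to zeros makes the adapter an exact identity, which is one common way to start bottleneck adapters from the pre-trained behaviour; whether TP-Mamba initializes this way is not stated in the text above.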

Tri-Plane Mamba Module:

Plane slicing: The 3D tensor is flattened along three axes:
- Axial (HW): Sequences of spatial tokens for all slices, sequence length $h\cdot w$, feature dimension $D\cdot r$.
- Coronal (DH): $D\cdot h$ positions, $w\cdot r$ features.
- Sagittal (DW): $D\cdot w$ positions, $h\cdot r$ features.
SSM application: Each sequence is processed with a Mamba SSM block. Discrete SSM updates are of the form:

$h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t, \qquad y_t = C_t\,h_t$

where matrices $\bar{A}_t, \bar{B}_t, C_t$ are input-dependent (dynamic, "selective scan"), and $h_t$ denotes the hidden state at step $t$ (not the token-grid height $h$).

Re-fusion: Each plane’s output is reshaped to a 3D volume, the outputs are summed (or concatenated and linearly fused), projected, and residually added to the pre-adapter input.
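The plane flattening, selective update, and re-fusion steps above can be sketched end to end in NumPy. Several things here are assumptions made for illustration: the coronal and sagittal shapes are taken by analogy with the axial case, the scan is a slow per-step loop with a simplified discretization (not the optimized Mamba kernel), the planes are fused by plain summation with no output projection, and all names and sizes are hypothetical.

```python
import numpy as np

# Axis orders for a (D, h, w, r) volume: the first two axes after the
# permutation form the sequence, the last two fold into the features.
PLANES = {"axial": (1, 2, 0, 3), "coronal": (0, 1, 2, 3), "sagittal": (0, 2, 1, 3)}

def to_plane(vol, perm):
    p = vol.transpose(perm)
    return p.reshape(p.shape[0] * p.shape[1], -1), p.shape

def from_plane(seq, shape, perm):
    # argsort of a permutation is its inverse permutation.
    return seq.reshape(shape).transpose(tuple(np.argsort(perm)))

def selective_scan(x, w_delta, w_b, w_c, a):
    """Toy selective scan. x: (L, F) -> y: (L, F).

    Step size, B and C are computed from x_t, so the recurrence
    state = A_bar_t * state + B_bar_t * x_t, y_t = C_t . state is input-dependent.
    """
    n_feat, n_state = a.shape
    state = np.zeros((n_feat, n_state))
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        delta = np.log1p(np.exp(x[t] @ w_delta))   # softplus step size, (F,)
        b_t, c_t = x[t] @ w_b, x[t] @ w_c          # (N,), (N,)
        a_bar = np.exp(delta[:, None] * a)         # discretized A, (F, N)
        b_bar = delta[:, None] * b_t[None, :]      # simplified discretized B
        state = a_bar * state + b_bar * x[t][:, None]
        y[t] = state @ c_t
    return y

def make_scan(n_feat, n_state=4, seed=0):
    """Build a scan with random (untrained) parameters for n_feat channels."""
    g = np.random.default_rng(seed)
    w_delta = g.standard_normal((n_feat, n_feat)) * 0.1
    w_b = g.standard_normal((n_feat, n_state)) * 0.1
    w_c = g.standard_normal((n_feat, n_state)) * 0.1
    a = -np.exp(g.standard_normal((n_feat, n_state)))  # negative => decaying state
    return lambda x: selective_scan(x, w_delta, w_b, w_c, a)

def tri_plane(vol, scans):
    """Scan each plane, reshape back to (D, h, w, r), sum, add residual."""
    fused = np.zeros_like(vol)
    for name, perm in PLANES.items():
        seq, shape = to_plane(vol, perm)
        fused += from_plane(scans[name](seq), shape, perm)
    return vol + fused

rng = np.random.default_rng(1)
D, h, w, r = 6, 4, 5, 3                      # hypothetical sizes
vol = rng.standard_normal((D, h, w, r))
scans = {"axial": make_scan(D * r), "coronal": make_scan(w * r),
         "sagittal": make_scan(h * r)}
out = tri_plane(vol, scans)
print(out.shape)  # (6, 4, 5, 3)
```

Each plane needs its own scan parameters here because the feature width differs per plane ($D\cdot r$, $w\cdot r$, and $h\cdot r$ in this sketch). The loop visits every position once, which is where the linear cost in sequence length comes from.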

3. Integration into SAM Backbones

Adapters are interleaved into every layer of the frozen pre-trained SAM ViT encoder to minimize the number of trainable parameters and maximize transfer from 2D pretraining. A typical integration flow for each ViT block (Wang et al., 2024):

Multi-Head Self-Attention with LoRA low-rank adapters (frozen base).
MLP block.
Multi-Scale 3D Conv Adapter.
Tri-Plane Mamba Module.
Output features are propagated through the stack and the outputs of the last several blocks are decoded with a lightweight 3D upsampling segmentation head.
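To make the "frozen base plus low-rank adapter" idea in the attention step concrete, here is a small NumPy sketch of a LoRA-style linear layer with parameter accounting. It shows the general LoRA technique, not TP-Mamba's specific configuration: the class name, the width of 768, and the rank of 4 are hypothetical, and the resulting fraction says nothing about the actual trainable budget of the full model.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank, rng):
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # stand-in for frozen pre-trained weights
        self.A = rng.standard_normal((rank, d_in)) * 0.01   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero-initialized

    def __call__(self, x):
        # Base path plus low-rank path; equals the base path while B is zero.
        return x @ self.W.T + (x @ self.A.T) @ self.B.T

    def param_counts(self):
        """Return (frozen, trainable) parameter counts."""
        return self.W.size, self.A.size + self.B.size

rng = np.random.default_rng(0)
layer = LoRALinear(d_in=768, d_out=768, rank=4, rng=rng)  # hypothetical sizes
x = rng.standard_normal((10, 768))

frozen, trainable = layer.param_counts()
print(frozen, trainable)  # 589824 6144
print(f"trainable fraction: {trainable / (frozen + trainable):.2%}")
```

Since `B` starts at zero, the layer initially reproduces the frozen projection exactly, and only `A` and `B` would receive gradient updates during adaptation.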

This adapterized approach allows for specialization on volumetric data while freezing ≥ 90–95% of base SAM weights, yielding


References (2)

Shahraki et al. (2026). A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation.

Wang et al. (2024). Tri-Plane Mamba: Efficiently Adapting Segment Anything Model for 3D Medical Images.
