WinMamba: 3D SSM Backbone for LiDAR

Updated 24 November 2025

WinMamba is a pure state-space model-based 3D backbone that employs linear-complexity Mamba kernels and custom window fusion strategies for efficient LiDAR object detection.
It integrates Window Shift Fusion and Adaptive Window Fusion to overcome fixed-window limitations, ensuring superior multi-scale spatial representation and robust context recovery.
Evaluations on the KITTI and Waymo datasets show mAP improvements over SSM baselines.

WinMamba is a pure state-space model (SSM)-based 3D backbone designed to provide a computationally efficient, long-range modeling framework for high-fidelity voxel feature extraction in LiDAR-based object detection. By integrating linear-complexity Mamba kernels with two custom window-based fusion strategies, Window Shift Fusion (WSF) and Adaptive Window Fusion (AWF), WinMamba addresses key limitations of conventional axis-aligned, fixed-window methods, enabling multi-scale spatial representation and cross-window context recovery. Evaluation on standard autonomous driving datasets shows performance improvements over strong SSM baselines, and ablations isolate the contribution of each architectural component (Zheng et al., 17 Nov 2025).

1. System Architecture

WinMamba replaces the standard 3D-sparse-convolution backbone in a voxel-based 3D detection pipeline with a series of N stacked WinMamba Blocks within a four-stage feature pyramid network (FPN). The canonical pipeline consists of:

Voxel Feature Encoder (VFE): Employs PointPillars or SECOND variants to encode the point cloud into a dense $C$-dimensional embedding $f \in \mathbb{R}^{C \times X \times Y \times Z}$.
3D Backbone: Each stage is implemented by a WinMamba Block, which spatially downsamples by a factor $d$, applies WinMamba Layers at both main (lower resolution) and auxiliary (higher resolution) streams, and fuses the outputs.
BEV Backbone & Detection Head: Follows established architectures such as LION-Mamba and CenterPoint.

The high-level computation in one WinMamba Block can be summarized by the following pseudocode:

def WinMambaBlock(I_in, d, ws):
    # I_in: input voxel features
    # d: downsampling factor; ws: window size (main path)
    # 1. Main (low-res) path
    I_main = Downsample(I_in, factor=d)
    O1 = WinMambaLayer(I_main, ws)
    # 2. Auxiliary (high-res) path
    ws_aux = d * ws
    O2 = Downsample(WinMambaLayer(I_in, ws_aux), d)
    # 3. Fuse
    O_out = O1 + O2
    return O_out

By stacking multiple blocks with increasing $d$ (typically $2 \times$ per stage), WinMamba achieves efficient multiscale feature representation suitable for downstream 3D detection.
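The pseudocode above leaves Downsample and WinMambaLayer undefined. The following is a minimal runnable sketch on a 2D grid, assuming average pooling for downsampling and an identity stand-in for the layer. Both choices, and the names `downsample` and `win_layer_stub`, are illustrative assumptions rather than the authors' implementation. The sketch only shows that the two paths yield feature maps of equal shape that can be added element-wise.

```python
def downsample(grid, d):
    """Average-pool a 2D list-of-lists by factor d (assumed stand-in for Downsample)."""
    rows, cols = len(grid), len(grid[0])
    assert rows % d == 0 and cols % d == 0, "grid must be divisible by d"
    out = []
    for r in range(0, rows, d):
        out_row = []
        for c in range(0, cols, d):
            block = [grid[r + i][c + j] for i in range(d) for j in range(d)]
            out_row.append(sum(block) / (d * d))
        out.append(out_row)
    return out


def win_layer_stub(grid, ws):
    """Hypothetical placeholder for WinMambaLayer: checks tiling, returns a copy."""
    assert len(grid) % ws == 0 and len(grid[0]) % ws == 0, "grid must tile into windows"
    return [row[:] for row in grid]


def win_mamba_block(i_in, d, ws):
    # Main (low-res) path: downsample first, then use window size ws.
    o1 = win_layer_stub(downsample(i_in, d), ws)
    # Auxiliary (high-res) path: window size d * ws, then downsample.
    o2 = downsample(win_layer_stub(i_in, d * ws), d)
    # Element-wise sum of the two aligned feature maps.
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(o1, o2)]


features = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
out = win_mamba_block(features, d=2, ws=2)
print(len(out), len(out[0]))  # 4 4
```

With the identity stand-in, both paths return the same pooled map, so the output is simply twice the pooled input. A real layer would make the two paths differ.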

2. Linear State-Space Model Design

At the core of each WinMamba Layer is a Mamba Block implementing a discrete linear SSM over serialized windowed voxel sequences. For a sequence $u_1,\ldots,u_L$, the recursion is:

$x_{t+1} = A x_t + B u_t, \quad y_t = C x_t + D u_t$

where $x_t \in \mathbb{R}^N$ (hidden state), $u_t \in \mathbb{R}^C$ (input token), $y_t \in \mathbb{R}^C$ (output token), and $A, B, C, D$ are shared and learnable. Structuring $A$ (e.g., diagonal plus low-rank) enables $O(L)$ scan complexity for efficient long-range aggregation.
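As a concrete illustration of the recursion, here is a minimal sketch for a single scalar input channel. It assumes a purely diagonal $A$ (simpler than the diagonal-plus-low-rank structure mentioned above) and a zero initial state. The function name `ssm_scan` and the parameter values are hypothetical.

```python
def ssm_scan(u, a_diag, b, c, d):
    """Run x_{t+1} = A x_t + B u_t, y_t = C x_t + D u_t on a scalar-input sequence.

    a_diag, b, c are length-N lists (A is assumed diagonal); d is a scalar.
    """
    n = len(a_diag)
    x = [0.0] * n  # zero initial state (an assumption)
    ys = []
    for u_t in u:
        # The output reads the current state and the current input.
        y_t = sum(c[i] * x[i] for i in range(n)) + d * u_t
        ys.append(y_t)
        # Each state update costs O(N), so the full scan is O(L * N).
        x = [a_diag[i] * x[i] + b[i] * u_t for i in range(n)]
    return ys


ys = ssm_scan(u=[1.0, 0.0, 0.0, 0.0], a_diag=[0.5], b=[1.0], c=[1.0], d=0.0)
print(ys)  # [0.0, 1.0, 0.5, 0.25]
```

The impulse at the first step decays geometrically through later outputs, which is how earlier tokens in the sequence influence later ones.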

Operationally, each WinMamba Layer applies the SSM along the X axis and then along the Y axis: windows are partitioned, serialized, processed by Mamba, and deserialized. This preserves linear complexity while capturing two-dimensional spatial interactions.
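The partition, serialize, process, deserialize cycle can be sketched on a small 2D grid. This is an illustrative sketch only: a causal prefix sum stands in for the Mamba scan, and the helper names (`serialize`, `apply_windowed`, `running_sum`) and the scan orders are assumptions, not the paper's code.

```python
def serialize(grid, ws, axis):
    """Return windows as lists of (row, col) coordinates in scan order.

    axis="x": the column index varies fastest inside a window.
    axis="y": the row index varies fastest inside a window.
    """
    rows, cols = len(grid), len(grid[0])
    windows = []
    for r0 in range(0, rows, ws):
        for c0 in range(0, cols, ws):
            if axis == "x":
                coords = [(r0 + i, c0 + j) for i in range(ws) for j in range(ws)]
            else:
                coords = [(r0 + i, c0 + j) for j in range(ws) for i in range(ws)]
            windows.append(coords)
    return windows


def apply_windowed(grid, ws, axis, seq_fn):
    """Serialize each window, run seq_fn on its tokens, write the results back."""
    out = [row[:] for row in grid]
    for coords in serialize(grid, ws, axis):
        tokens = [grid[r][c] for r, c in coords]
        for (r, c), value in zip(coords, seq_fn(tokens)):
            out[r][c] = value
    return out


def running_sum(tokens):
    """Stand-in for the Mamba scan: a causal prefix sum (assumption)."""
    total, out = 0, []
    for t in tokens:
        total += t
        out.append(total)
    return out


grid = [[1, 2], [3, 4]]
after_x = apply_windowed(grid, ws=2, axis="x", seq_fn=running_sum)
after_xy = apply_windowed(after_x, ws=2, axis="y", seq_fn=running_sum)
print(after_x)   # [[1, 3], [6, 10]]
print(after_xy)  # [[1, 10], [7, 20]]
```

Each pass touches every token once, so two axis passes keep the cost linear in the number of voxels.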

3. Window-Scale-Adaptive Sampling

Window-Scale Adaptation (WSA) is integral to the AWF module, ensuring that main and auxiliary streams at differing resolutions maintain physical alignment of their receptive fields. The window size for the auxiliary stream is set as $ws_{\text{aux}} = d \cdot ws_{\text{main}}$, guaranteeing

$\frac{\text{FS}_{\text{main}}}{\text{FS}_{\text{aux}}} = \frac{ws_{\text{aux}}}{ws_{\text{main}}}$

This ensures that each auxiliary window covers the identical physical area as its main-path counterpart, so multi-scale features fuse over congruent spatial support. Fusion is a direct element-wise addition, $O = O_1 + O_2$, producing an output feature map of the same size.
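A short numeric check makes the alignment concrete. The sketch below works directly with voxel edge lengths rather than the FS symbol. The values of the voxel size, $d$, and $ws_{\text{main}}$ are hypothetical and chosen only for illustration.

```python
def window_extent(voxel_size, ws):
    """Physical edge length covered by a window of ws voxels."""
    return voxel_size * ws


base_voxel = 0.25   # metres per voxel in the high-res stream (hypothetical)
d, ws_main = 2, 4   # downsampling factor and main-path window size (hypothetical)

main_voxel = base_voxel * d   # downsampling by d makes each voxel d times larger
ws_aux = d * ws_main          # window-scale adaptation rule from the text

main_extent = window_extent(main_voxel, ws_main)
aux_extent = window_extent(base_voxel, ws_aux)
print(main_extent, aux_extent)  # 2.0 2.0
```

Both windows span the same physical length, so the two streams can be summed position by position after the auxiliary output is downsampled.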

4. Adaptive Window Fusion, Positional Encoding, and WSF

Within WinMambaLayer, explicit 3D positional encoding is injected via a learned MLP:

$p = \omega_2(\mathrm{ReLU}(\mathrm{BN}(\omega_1 e + b_1))) + b_2, \qquad f' = f + p$

where $e = (i, j, k)$ is the voxel index.
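The positional encoding can be traced by hand with a tiny example. This sketch assumes batch normalization in inference mode with fixed statistics, unit scale, and zero shift. The weights, the two-channel feature, and the helper names are hypothetical.

```python
import math


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def positional_encoding(e, w1, b1, w2, b2, mean, var, eps=1e-5):
    """p = w2 * ReLU(BN(w1 * e + b1)) + b2, with BN in inference mode (assumed)."""
    h = [x + b for x, b in zip(matvec(w1, e), b1)]
    # Inference-mode batch norm with fixed statistics, unit scale, zero shift.
    h = [(x - m) / math.sqrt(v + eps) for x, m, v in zip(h, mean, var)]
    h = [max(0.0, x) for x in h]  # ReLU
    return [x + b for x, b in zip(matvec(w2, h), b2)]


# Hypothetical tiny weights: 3D index -> 2 hidden units -> 2 feature channels.
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.5, 0.5]
mean, var = [0.0, 0.0], [1.0, 1.0]

e = [2.0, 1.0, 3.0]   # voxel index (i, j, k)
f = [10.0, 20.0]      # voxel feature with C = 2 channels
p = positional_encoding(e, w1, b1, w2, b2, mean, var, eps=0.0)
f_prime = [a + b for a, b in zip(f, p)]
print(f_prime)  # [12.5, 20.5]
```

The second hidden unit is negative before the ReLU and is zeroed, so only the bias $b_2$ reaches the second channel.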

Window Shift Fusion (WSF) addresses the loss of context at window boundaries. For each windowed grid, a spatial shift $\Delta = (w_x/2, w_y/2, w_z/2)$ is applied, yielding two offset partitions. Both original and shifted partitions are serialized, concatenated, passed through a single Mamba SSM, then split and mapped back to their 3D locations. The fused output,

$F_{\text{out}} = F_0 + F_1,$

delivers overlapping, dense window coverage, mitigating artifacts for objects straddling window edges.
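A one-dimensional sketch shows why the shifted partition helps at window edges. It makes three simplifying assumptions: the shift wraps cyclically, a window mean stands in for the shared Mamba model, and that model is applied to each partition separately rather than to one concatenated sequence. All function names are hypothetical.

```python
def window_ids(length, w, shift=0):
    """Window index of each position after a shift (cyclic wrap assumed)."""
    return [((i + shift) % length) // w for i in range(length)]


def windowed_mean(values, ids):
    """Stand-in for the shared sequence model: each token becomes its window mean."""
    groups = {}
    for v, k in zip(values, ids):
        groups.setdefault(k, []).append(v)
    return [sum(groups[k]) / len(groups[k]) for k in ids]


def window_shift_fusion(values, w):
    ids_orig = window_ids(len(values), w)                 # original partition
    ids_shift = window_ids(len(values), w, shift=w // 2)  # partition offset by w/2
    f0 = windowed_mean(values, ids_orig)
    f1 = windowed_mean(values, ids_shift)
    return [a + b for a, b in zip(f0, f1)]                # F_out = F_0 + F_1


# An "object" occupying positions 3 and 4 straddles the original window edge.
values = [0.0, 0.0, 0.0, 8.0, 8.0, 0.0, 0.0, 0.0]
print(window_ids(8, 4))           # [0, 0, 0, 0, 1, 1, 1, 1]
print(window_ids(8, 4, shift=2))  # [0, 0, 1, 1, 1, 1, 0, 0]
fused = window_shift_fusion(values, 4)
print(fused)  # [2.0, 2.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0]
```

Positions 3 and 4 fall in different original windows but share a shifted window, so the fused output sees the object as a whole.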

AWF has three parts: dual-stream encoding, feature bridging, and collaborative decoding. Together they provide multi-scale context aggregation and replace traditional FPN residuals.

5. Integration of Window Modules and Block Design

A single WinMamba Layer includes:

Positional encoding augmentation.
Alternating axis partition (X, then Y).
Per-axis application of WSF using a unified Mamba SSM across original and shifted window partitions.

A WinMamba Block operates with:

Parallel main and auxiliary WinMamba Layers (with respective window sizes $ws$ and $d \cdot ws$).
Scale- and location-aligned feature fusion ($+$).
Auxiliary path fulfilling detail-preserving and residual functions, obviating external skip-connections.

This blockwise construction is repeated to form the backbone, with spatial downsampling progression per FPN stage.

6. Training Setup and Evaluation Protocols

Implementation leverages OpenPCDet and LION-Mamba training recipes. Representative hyperparameters:

Dataset	Epochs	Batch Size	LR	Hardware	Optimizer
KITTI	80	4	$3 \times 10^{-3}$	2× RTX3090 GPU	AdamW
Waymo	36	2	$3 \times 10^{-3}$	2× RTX3090 GPU	AdamW

KITTI uses 7.5K training examples with the official validation split and reports AP$_{3D}$ (R11), with IoU thresholds of 0.7 for Car and 0.5 for Pedestrian and Cyclist. Waymo uses 160K training frames (20% sampled) and reports AP/APH at L1 and L2.
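For readers unfamiliar with the R11 notation, the sketch below computes an 11-point interpolated average precision from a precision-recall curve. It assumes R11 refers to that standard interpolation over recall thresholds 0.0, 0.1, ..., 1.0. The precision-recall points are made up for illustration and are not results from the paper.

```python
def ap_r11(recalls, precisions):
    """11-point interpolated AP over recall thresholds 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for k in range(11):
        threshold = k / 10
        # Interpolated precision: best precision at any recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Hypothetical precision-recall points for one class and difficulty level.
recalls = [0.2, 0.4, 0.6, 0.8]
precisions = [1.0, 0.9, 0.7, 0.5]
ap = ap_r11(recalls, precisions)
print(round(ap, 4))  # 0.6545
```

Thresholds above the highest achieved recall contribute zero, which is why a detector that never reaches high recall is penalized.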

7. Quantitative Performance and Ablation Studies

On KITTI validation, WinMamba advances mAP from 70.8 (baseline) to 73.7 (+2.9), with sizable gains on Pedestrian (+4.0) and Cyclist (+3.1) classes. Ablation experiments detail:

Configuration	mAP (Car/Ped/Cyc, moderate)	Gain
Baseline (no WSF/AWF)	70.8	—
+WSF only	72.2	+1.4
+WSF + AWF	73.5	+2.7

AWF subcomponent ablation confirms that dual-stream encoding and collaborative decoding are crucial: mAP rises from 72.7 (PartA+B) to 73.5 (+PartC), while falling back to the original FPN residual lowers it to 71.2.

On Waymo (20% training set), WinMamba achieves 73.6 mAP and 71.5 mAPH (L2), improving over baseline (73.0, 71.0).

These findings indicate that WinMamba, by combining linear-complexity SSMs with multi-scale and cross-window fusion without increasing computational burden, improves 3D detection accuracy over its SSM baseline (Zheng et al., 17 Nov 2025).

References (1)

Zheng et al. WinMamba: Multi-Scale Shifted Windows in State Space Model for 3D Object Detection (2025)
