
WinMamba: 3D SSM Backbone for LiDAR

Updated 24 November 2025
  • WinMamba is a pure state-space model-based 3D backbone that employs linear-complexity Mamba kernels and custom window fusion strategies for efficient LiDAR object detection.
  • It integrates Window Shift Fusion and Adaptive Window Fusion to overcome fixed-window limitations, ensuring superior multi-scale spatial representation and robust context recovery.
  • Extensive evaluations on the KITTI and Waymo datasets show significant mAP gains over strong SSM baselines.

WinMamba is a pure state-space model (SSM)-based 3D backbone, designed to provide a computationally efficient, long-range modeling framework for high-fidelity voxel feature extraction in LiDAR-based object detection. By integrating linear-complexity Mamba kernels with two custom window-based fusion strategies—Window Shift Fusion (WSF) and Adaptive Window Fusion (AWF)—WinMamba addresses key limitations of conventional axis-aligned, fixed-window methods, enabling superior multi-scale spatial representation and robust cross-window context recovery. Extensive evaluation on standard autonomous driving datasets demonstrates significant performance improvements over strong SSM baselines, with detailed ablations isolating each architectural contribution (Zheng et al., 17 Nov 2025).

1. System Architecture

WinMamba replaces the standard 3D-sparse-convolution backbone in a voxel-based 3D detection pipeline with a series of N stacked WinMamba Blocks within a four-stage feature pyramid network (FPN). The canonical pipeline consists of:

  • Voxel Feature Encoder (VFE): Employs PointPillars or SECOND variants to encode the point cloud into a dense $C$-dimensional embedding $f \in \mathbb{R}^{C \times X \times Y \times Z}$.
  • 3D Backbone: Each stage is implemented by a WinMamba Block, which spatially downsamples by a factor $d$, applies WinMamba Layers at both main (lower resolution) and auxiliary (higher resolution) streams, and fuses the outputs.
  • BEV Backbone & Detection Head: Follows established architectures such as LION-Mamba and CenterPoint.

The high-level computation in one WinMamba Block can be summarized by the following pseudocode:

def WinMambaBlock(I_in, d, ws):
    # I_in: input 3D voxel features
    # d: downsampling factor; ws: window size (main path)
    # 1. Main (low-res) path
    I_main = Downsample(I_in, factor=d)
    O1 = WinMambaLayer(I_main, ws)
    # 2. Auxiliary (high-res) path
    ws_aux = d * ws
    O2 = Downsample(WinMambaLayer(I_in, ws_aux), d)
    # 3. Fuse
    O_out = O1 + O2
    return O_out

By stacking multiple blocks with increasing $d$ (typically $2\times$ per stage), WinMamba achieves efficient multiscale feature representation suitable for downstream 3D detection.
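
Continuing the pseudocode above, the following sketch shows how blocks could be stacked into the backbone; the stage count, window size, and per-stage factor are illustrative assumptions rather than the paper's exact configuration.

def WinMambaBackbone(f, num_stages=4, ws=13, d=2):
    # f: voxel features from the VFE; num_stages, ws, d are illustrative values only
    stage_features = []
    for _ in range(num_stages):
        f = WinMambaBlock(f, d, ws)    # each stage downsamples by d and fuses both streams
        stage_features.append(f)       # multi-scale outputs feed the BEV backbone and head
    return stage_features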

2. Linear State-Space Model Design

At the core of each WinMamba Layer is a Mamba Block implementing a discrete linear SSM over serialized windowed voxel sequences. For a sequence $u_1, \ldots, u_L$, the recursion is:

$$x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t + D u_t$$

where $x_t \in \mathbb{R}^N$ is the hidden state, $u_t \in \mathbb{R}^C$ the input token, $y_t \in \mathbb{R}^C$ the output token, and $A, B, C, D$ are shared, learnable parameters. Structuring $A$ (e.g., diagonal plus low-rank) enables $O(L)$ scan complexity for efficient long-range aggregation.

Operationally, each WinMambaLayer alternates applying the SSM along both X and Y axes: windows are partitioned, serialized, mapped by Mamba, and deserialized—preserving linear complexity while capturing two-dimensional spatial interactions.
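
For concreteness, a minimal dense reference implementation of this recursion is sketched below. It is a didactic stand-in for the structured, hardware-efficient Mamba scan; shapes follow the definitions above, and the matrix layouts are assumptions.

import numpy as np

def linear_ssm_scan(u, A, B, C, D):
    # Reference loop for x_{t+1} = A x_t + B u_t, y_t = C x_t + D u_t.
    # u: (L, C_in) serialized window tokens; A: (N, N); B: (N, C_in);
    # C: (C_out, N); D: (C_out, C_in). A single pass over t gives O(L) complexity.
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        ys.append(C @ x + D @ u_t)   # emit y_t from the current hidden state
        x = A @ x + B @ u_t          # advance the hidden state
    return np.stack(ys)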

3. Window-Scale-Adaptive Sampling

Window-Scale Adaptation (WSA) is integral to the AWF module, ensuring that main and auxiliary streams at differing resolutions maintain physical alignment of their receptive fields. The window size for the auxiliary stream is set as $ws_{\text{aux}} = d \cdot ws_{\text{main}}$, guaranteeing

$$\frac{\mathrm{FS}_{\text{main}}}{\mathrm{FS}_{\text{aux}}} = \frac{ws_{\text{aux}}}{ws_{\text{main}}}$$

This ensures each auxiliary window covers the identical physical area as its main-path counterpart, so multi-scale features fuse over congruent spatial support. Fusion is a direct element-wise addition, $O = O_1 + O_2$, producing an output feature map of the same size.
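
A small integer example (with an assumed voxel edge length and window size) illustrates why this scaling keeps the two streams' receptive fields physically aligned before the element-wise fusion:

# Illustrative numbers only: downsampling factor, main window size, voxel edge in cm
d, ws_main, voxel_cm = 2, 12, 10
ws_aux = d * ws_main                      # ws_aux = d * ws_main = 24

extent_main = ws_main * (d * voxel_cm)    # main path voxels are d times coarser: 12 * 20 cm = 240 cm
extent_aux = ws_aux * voxel_cm            # auxiliary path keeps the base voxel size: 24 * 10 cm = 240 cm
assert extent_main == extent_aux          # both windows span the same physical extent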

4. Adaptive Window Fusion, Positional Encoding, and WSF

Within WinMambaLayer, explicit 3D positional encoding is injected via a learned MLP:

$$p = \omega_2\big(\mathrm{ReLU}\big(\mathrm{BN}(\omega_1 e + b_1)\big)\big) + b_2, \qquad f' = f + p$$

where $e = (i, j, k)$ is the voxel index.
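
Assuming $\omega_1$ and $\omega_2$ denote linear layers and BN denotes batch normalization, this encoding can be sketched in PyTorch as follows (module and argument names are illustrative, not the paper's code):

import torch
import torch.nn as nn

class VoxelPositionalEncoding(nn.Module):
    # Sketch of p = w2(ReLU(BN(w1 e + b1))) + b2 and f' = f + p
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(3, dim)     # w1, b1: lift the voxel index e = (i, j, k)
        self.bn = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)   # w2, b2

    def forward(self, e, f):
        # e: (L, 3) integer voxel indices, f: (L, dim) voxel features
        p = self.fc2(torch.relu(self.bn(self.fc1(e.float()))))
        return f + p                     # f' = f + p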

Window Shift Fusion (WSF) addresses the loss of context at window boundaries. For each windowed grid, a spatial shift $\Delta = (w_x/2, w_y/2, w_z/2)$ is applied, yielding two offset partitions. Both original and shifted partitions are serialized, concatenated, passed through a single Mamba SSM, then split and mapped back to their 3D locations. The fused output,

$$F_{\text{out}} = F_0 + F_1,$$

delivers overlapping, dense window coverage, mitigating artifacts for objects straddling window edges.
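
A one-dimensional sketch of this mechanism is given below; it collapses the 3D half-window shift to a cyclic shift along one serialized axis and treats the shared Mamba SSM as an arbitrary sequence map, so it conveys the idea rather than the paper's exact implementation.

import numpy as np

def window_shift_fusion(F, window, mamba):
    # F: (L, C) serialized voxel features along one axis; window: window length;
    # mamba: any (L', C) -> (L', C) sequence map standing in for the shared Mamba SSM.
    shift = window // 2
    F_shifted = np.roll(F, shift, axis=0)           # offset partition via a half-window shift
    seq = np.concatenate([F, F_shifted], axis=0)    # serialize both partitions into one sequence
    out = mamba(seq)                                # single pass through the shared SSM
    F0, F1 = out[:len(F)], out[len(F):]             # split back into the two partitions
    F1 = np.roll(F1, -shift, axis=0)                # undo the shift to realign voxels
    return F0 + F1                                  # F_out = F0 + F1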

AWF, within its multi-part structure (dual-stream encoding, feature bridging, collaborative decoding), provides multi-scale context aggregation and replaces traditional FPN residuals.

5. Integration of Window Modules and Block Design

A single WinMamba Layer includes (see the sketch after this list):

  • Positional encoding augmentation.
  • Alternating axis partition (X, then Y).
  • Per-axis application of WSF using a unified Mamba SSM across original and shifted window partitions.
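
These steps can be summarized as a composition sketch; pos_enc and wsf_along_axis stand for the positional-encoding and WSF operations described in Section 4, and all names are illustrative rather than the paper's API.

def win_mamba_layer_sketch(f, e, ws, pos_enc, wsf_along_axis, mamba):
    # f: serialized voxel features, e: voxel indices, ws: window size
    f = pos_enc(e, f)                            # inject learned 3D positional encoding
    for axis in ("x", "y"):                      # alternate axis-aligned window partitions
        f = wsf_along_axis(f, axis, ws, mamba)   # per-axis WSF with the shared Mamba SSM
    return f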

A WinMamba Block operates with:

  • Parallel main and auxiliary WinMamba Layers (with respective window sizes $ws$ and $d \cdot ws$).
  • Scale- and location-aligned feature fusion by element-wise addition ($+$).
  • An auxiliary path that preserves fine detail and acts as a residual, obviating external skip connections.

This blockwise construction is repeated to form the backbone, with spatial downsampling progression per FPN stage.

6. Training Setup and Evaluation Protocols

Implementation leverages OpenPCDet and LION-Mamba training recipes. Representative hyperparameters:

| Dataset | Epochs | Batch Size | Learning Rate | Hardware | Optimizer |
|---------|--------|------------|----------------|----------|-----------|
| KITTI | 80 | 4 | $3 \times 10^{-3}$ | 2× RTX 3090 GPUs | AdamW |
| Waymo | 36 | 2 | $3 \times 10^{-3}$ | 2× RTX 3090 GPUs | AdamW |

KITTI uses 7.5K training examples and the official validation split, reporting AP$_{3D}$ (R11) for Car (IoU 0.7), Pedestrian, and Cyclist (IoU 0.5). Waymo uses 160K training frames (a 20% sample), reporting AP/APH at L1 and L2.

7. Quantitative Performance and Ablation Studies

On KITTI validation, WinMamba advances mAP from 70.8 (baseline) to 73.7 (+2.9), with sizable gains on Pedestrian (+4.0) and Cyclist (+3.1) classes. Ablation experiments detail:

| Configuration | mAP (moderate, Car/Ped/Cyc) | Gain |
|---------------|------------------------------|------|
| Baseline (no WSF/AWF) | 70.8 | — |
| + WSF only | 72.2 | +1.4 |
| + WSF + AWF | 73.5 | +2.7 |

AWF subcomponent ablation confirms that dual-stream encoding and collaborative decoding are crucial, raising mAP from 72.7 (Parts A+B) to 73.5 (adding Part C), while falling back to the original FPN residual reduces performance to 71.2.

On Waymo (20% training set), WinMamba achieves 73.6 mAP and 71.5 mAPH (L2), improving over baseline (73.0, 71.0).

These findings demonstrate that WinMamba, by synergizing linear-complexity SSMs with multi-scale and cross-window fusion—without increasing computational burden—sets a new standard for efficient 3D detection performance (Zheng et al., 17 Nov 2025).
