WinMamba: 3D SSM Backbone for LiDAR
- WinMamba is a pure state-space model-based 3D backbone that employs linear-complexity Mamba kernels and custom window fusion strategies for efficient LiDAR object detection.
- It integrates Window Shift Fusion and Adaptive Window Fusion to overcome fixed-window limitations, ensuring superior multi-scale spatial representation and robust context recovery.
- Evaluations on the KITTI and Waymo datasets show mAP improvements over SSM baselines.
WinMamba is a pure state-space model (SSM)-based 3D backbone, designed to provide a computationally efficient, long-range modeling framework for high-fidelity voxel feature extraction in LiDAR-based object detection. By integrating linear-complexity Mamba kernels with two custom window-based fusion strategies—Window Shift Fusion (WSF) and Adaptive Window Fusion (AWF)—WinMamba addresses key limitations of conventional axis-aligned, fixed-window methods, enabling superior multi-scale spatial representation and robust cross-window context recovery. Extensive evaluation on standard autonomous driving datasets demonstrates significant performance improvements over strong SSM baselines, as well as detailed ablation of architectural contributions (Zheng et al., 17 Nov 2025).
1. System Architecture
WinMamba replaces the standard 3D-sparse-convolution backbone in a voxel-based 3D detection pipeline with a series of N stacked WinMamba Blocks within a four-stage feature pyramid network (FPN). The canonical pipeline consists of:
- Voxel Feature Encoder (VFE): Employs PointPillars or SECOND variants to encode the point cloud into a dense per-voxel feature embedding.
- 3D Backbone: Each stage is implemented by a WinMamba Block, which spatially downsamples by a factor $d$, applies WinMamba Layers on both the main (lower-resolution) and auxiliary (higher-resolution) streams, and fuses the outputs.
- BEV Backbone & Detection Head: Follows established architectures such as LION-Mamba and CenterPoint.
The high-level computation in one WinMamba Block can be summarized by the following pseudocode:
```python
def WinMambaBlock(I_in, d, ws):
    # I_in: voxel features
    # d: downsampling factor; ws: window size (main path)
    # 1. Main (low-res) path
    I_main = Downsample(I_in, factor=d)
    O1 = WinMambaLayer(I_main, ws)
    # 2. Auxiliary (high-res) path
    ws_aux = d * ws
    O2 = Downsample(WinMambaLayer(I_in, ws_aux), d)
    # 3. Fuse
    O_out = O1 + O2
    return O_out
```
By stacking multiple blocks across progressively coarser stages, WinMamba builds an efficient multi-scale feature representation suitable for downstream 3D detection.
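As a minimal runnable sketch of the block's data flow, the toy below works on a 1D list of scalars rather than sparse 3D voxels. `downsample` (average pooling) and `toy_layer` (a per-window running mean) are hypothetical stand-ins for the real downsampling operator and WinMamba Layer; only the main/auxiliary wiring follows the block description above.

```python
def downsample(features, d):
    """Average-pool a 1D list by factor d (stand-in for sparse downsampling)."""
    return [sum(features[i:i + d]) / d for i in range(0, len(features), d)]


def toy_layer(features, ws):
    """Stand-in for a WinMamba Layer: causal running mean inside each window."""
    out = []
    for start in range(0, len(features), ws):
        total = 0.0
        for k, value in enumerate(features[start:start + ws], start=1):
            total += value
            out.append(total / k)
    return out


def win_mamba_block(features, d, ws):
    # Main path: downsample first, then process with window size ws.
    main = toy_layer(downsample(features, d), ws)
    # Auxiliary path: process at full resolution with window size d * ws,
    # then downsample so both outputs have the same length.
    aux = downsample(toy_layer(features, d * ws), d)
    # Fuse by element-wise addition.
    return [m + a for m, a in zip(main, aux)]


x = [float(i) for i in range(16)]  # assumes len(x) is divisible by d
out = win_mamba_block(x, d=2, ws=4)
```

Both paths end at the same resolution (8 elements here), which is what makes the plain addition well defined.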
2. Linear State-Space Model Design
At the core of each WinMamba Layer is a Mamba Block implementing a discrete linear SSM over serialized windowed voxel sequences. For a sequence $x_1, \dots, x_L$, the recursion takes the standard discrete form:

$$h_t = A\,h_{t-1} + B\,x_t, \qquad y_t = C\,h_t,$$

where $h_t$ (hidden state), $x_t$ (input token), $y_t$ (output token), and $A$, $B$, $C$ are shared and learnable. Structuring $A$ (e.g., diagonal plus low-rank) enables $O(L)$ scan complexity for efficient long-range aggregation.
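To make the linear-time scan concrete, here is a small sketch that assumes the standard discrete form $h_t = A h_{t-1} + B x_t$, $y_t = C h_t$ with a diagonal $A$ and scalar tokens. The function name and parameter values are illustrative only; the actual Mamba kernel uses input-dependent parameters and a hardware-aware scan, neither of which is modeled here.

```python
def ssm_scan(xs, a, b, c):
    """Sequential scan of h_t = a * h_{t-1} + b * x_t, y_t = sum(c * h_t).

    a, b, c are lists of equal length (diagonal A, so updates are elementwise).
    Cost is O(L * state_size): one constant-time update per token.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# Impulse input: the output decays as a mix of the two diagonal modes.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5, 0.9], b=[1.0, 1.0], c=[1.0, 1.0])
```

The hidden state is the only thing carried between tokens, which is why the cost grows linearly with sequence length rather than quadratically as in full attention.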
Operationally, each WinMamba Layer applies the SSM alternately along the X and Y axes: windows are partitioned, serialized, mapped by Mamba, and deserialized, preserving linear complexity while capturing two-dimensional spatial interactions.
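The partition, serialize, and deserialize steps can be sketched on a small 2D grid. The exact ordering used by WinMamba is not given here, so the window-major ordering below, along with the `serialize` and `deserialize` helpers, is an assumption chosen for illustration.

```python
def serialize(coords, ws, axis):
    """Order voxel indices window by window; `axis` varies fastest inside a window."""
    def key(i):
        x, y = coords[i]
        window = (x // ws, y // ws)
        inner = (y % ws, x % ws) if axis == "x" else (x % ws, y % ws)
        return window + inner
    return sorted(range(len(coords)), key=key)


def deserialize(sequence, order):
    """Scatter a processed sequence back to the original voxel positions."""
    out = [None] * len(order)
    for position, voxel_index in enumerate(order):
        out[voxel_index] = sequence[position]
    return out


coords = [(x, y) for y in range(4) for x in range(4)]  # 4x4 grid, index = y * 4 + x
order_x = serialize(coords, ws=2, axis="x")
order_y = serialize(coords, ws=2, axis="y")

# Round trip: gather along the X ordering, then scatter back.
values = list(range(100, 116))
restored = deserialize([values[i] for i in order_x], order_x)
```

Running the sequence model once per ordering lets each voxel see neighbours along both axes while every pass stays a 1D scan.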
3. Window-Scale-Adaptive Sampling
Window-Scale Adaptation (WSA) is integral to the AWF module, ensuring that the main and auxiliary streams, which operate at different resolutions, keep their receptive fields physically aligned. The window size for the auxiliary stream is set as $ws_{\text{aux}} = d \cdot ws$. With a voxel size of $v$ on the high-resolution grid and $d \cdot v$ after downsampling, this guarantees

$$ws_{\text{aux}} \cdot v = ws \cdot (d \cdot v).$$

Each auxiliary window therefore covers the same physical area as its main-path counterpart, so multi-scale features fuse over congruent spatial support. Fusion is a direct element-wise addition, $O_{\text{out}} = O_1 + O_2$, producing an output feature map of the same size.
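A quick numeric check of the alignment argument, using the rule that the auxiliary window size is `d * ws`. The voxel size, downsampling factor, and window size below are hypothetical values picked only for the arithmetic.

```python
import math


def window_extent(window_size, voxel_size):
    """Physical side length covered by one window (in metres)."""
    return window_size * voxel_size


voxel_size = 0.1  # hypothetical high-resolution voxel size in metres
d = 2             # hypothetical downsampling factor
ws = 8            # hypothetical main-path window size in voxels

main_extent = window_extent(ws, d * voxel_size)   # low-res grid, coarser voxels
aux_extent = window_extent(d * ws, voxel_size)    # high-res grid, larger window

aligned = math.isclose(main_extent, aux_extent)
```

Scaling the window by the same factor as the voxel size cancels out, so both streams see windows of equal physical extent (1.6 m per side with these values).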
4. Adaptive Window Fusion, Positional Encoding, and WSF
Within each WinMamba Layer, explicit 3D positional encoding is injected via a learned MLP applied to each voxel's index.
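One way such an encoding could be wired up is sketched below. The additive injection, the two-layer bias-free ReLU MLP, and the names `make_mlp` and `add_positional_encoding` are all assumptions for illustration; the text above only states that a learned MLP maps the voxel index to an encoding.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Tiny bias-free two-layer MLP with random weights (stand-in for learned ones)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def mlp(index):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, index))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return mlp


def add_positional_encoding(features, indices, mlp):
    """Assumed additive injection: feature_i + MLP(index_i) for every voxel."""
    return [
        [f + e for f, e in zip(feature, mlp(index))]
        for feature, index in zip(features, indices)
    ]


features = [[1.0, 2.0], [3.0, 4.0]]   # two voxels, two channels each
indices = [(0, 0, 0), (1, 2, 3)]      # integer (x, y, z) voxel indices
mlp = make_mlp(in_dim=3, hidden_dim=4, out_dim=2)
encoded = add_positional_encoding(features, indices, mlp)
```

Because serialization flattens 3D structure into a 1D sequence, an index-dependent term like this gives the sequence model a way to tell apart tokens that end up adjacent in the sequence but distant in space.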
Window Shift Fusion (WSF) addresses the loss of context at window boundaries. For each windowed grid, a spatial shift is applied, yielding two offset partitions. Both the original and shifted partitions are serialized, concatenated, passed through a single Mamba SSM, then split and mapped back to their 3D locations. The fused output delivers overlapping, dense window coverage, mitigating artifacts for objects straddling window edges.
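The effect of the shifted partition can be seen with two neighbouring voxels on either side of a window boundary. The half-window shift used below is an assumption (the shift amount is not given in this text), and `window_id` is a hypothetical helper.

```python
def window_id(coord, ws, shift=0):
    """Window index of a 2D voxel coordinate after shifting the grid by `shift`."""
    x, y = coord
    return ((x + shift) // ws, (y + shift) // ws)


ws = 4
shift = ws // 2  # assumed half-window shift

a, b = (3, 1), (4, 1)  # adjacent voxels straddling the x = 4 window boundary

# Original partition: the neighbours fall into different windows.
split_in_original = window_id(a, ws) != window_id(b, ws)

# Shifted partition: the same neighbours now share a window.
joined_in_shifted = window_id(a, ws, shift) == window_id(b, ws, shift)
```

Every boundary in one partition lies in the interior of a window in the other, so a pair of voxels cut apart by the first partition can still exchange information through the second.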
AWF, through its multi-part structure (dual-stream encoding, feature bridging, collaborative decoding), provides multi-scale context aggregation and replaces traditional FPN residuals.
5. Integration of Window Modules and Block Design
A single WinMamba Layer includes:
- Positional encoding augmentation.
- Alternating axis partition (X, then Y).
- Per-axis application of WSF using a unified Mamba SSM across original and shifted window partitions.
A WinMamba Block operates with:
- Parallel main and auxiliary WinMamba Layers (with respective window sizes $ws$ and $d \cdot ws$).
- Scale- and location-aligned feature fusion ($O_{\text{out}} = O_1 + O_2$).
- Auxiliary path fulfilling detail-preserving and residual functions, obviating external skip-connections.
This blockwise construction is repeated to form the backbone, with spatial downsampling progression per FPN stage.
6. Training Setup and Evaluation Protocols
Implementation leverages OpenPCDet and LION-Mamba training recipes. Representative hyperparameters:
| Dataset | Epochs | Batch Size | LR | Hardware | Optimizer |
|---|---|---|---|---|---|
| KITTI | 80 | 4 | | 2× RTX 3090 GPU | AdamW |
| Waymo | 36 | 2 | | 2× RTX 3090 GPU | AdamW |
KITTI uses 7.5K training examples with the official validation split and reports AP (R11) for Car (IoU 0.7) and for Pedestrian and Cyclist (IoU 0.5). Waymo uses a 20% sample of its 160K training frames and reports AP/APH at difficulty levels L1 and L2.
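For reference, AP (R11) denotes the 11-point interpolated average precision: precision is sampled at recall levels 0.0, 0.1, ..., 1.0, taking at each level the maximum precision achieved at any recall at or above it. The sketch below is a minimal illustration of that formula, not the official KITTI evaluation code; `ap_r11` and the toy precision/recall points are hypothetical.

```python
def ap_r11(recalls, precisions):
    """11-point interpolated AP from paired recall/precision samples."""
    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at any recall >= this level.
        reachable = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(reachable, default=0.0)
    return total / 11


# Toy curve: perfect precision up to 50% recall, 0.5 precision at full recall.
ap = ap_r11(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
```

In the toy curve, six recall levels (0.0 to 0.5) score 1.0 and five (0.6 to 1.0) score 0.5, giving 8.5 / 11.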
7. Quantitative Performance and Ablation Studies
On KITTI validation, WinMamba advances mAP from 70.8 (baseline) to 73.7 (+2.9), with sizable gains on Pedestrian (+4.0) and Cyclist (+3.1) classes. Ablation experiments detail:
| Configuration | mAP (Car/Ped/Cyc, moderate) | Gain |
|---|---|---|
| Baseline (no WSF/AWF) | 70.8 | — |
| +WSF only | 72.2 | +1.4 |
| +WSF + AWF | 73.5 | +2.7 |
AWF subcomponent ablation confirms that dual-stream encoding and collaborative decoding are crucial: mAP rises from 72.7 (PartA+B) to 73.5 (+PartC), while falling back to the original FPN residual reduces performance to 71.2.
On Waymo (20% training set), WinMamba achieves 73.6 mAP and 71.5 mAPH (L2), improving over baseline (73.0, 71.0).
These findings indicate that WinMamba, by combining linear-complexity SSMs with multi-scale and cross-window fusion without increasing computational burden, improves on SSM baselines for efficient 3D detection (Zheng et al., 17 Nov 2025).