Strip Pooling Module (SPM)
- SPM is a spatial pooling mechanism that uses horizontal and vertical strip kernels to efficiently capture long-range dependencies while preserving local details.
- It integrates 1D convolutions with strip pooling to generate attention maps, enhancing feature representation for pixel-wise prediction tasks.
- Applying SPM in architectures like SPNet yields state-of-the-art performance on benchmarks such as ADE20K and Cityscapes with a reduced parameter count.
The Strip Pooling Module (SPM) is a mechanism for spatial pooling introduced to improve contextual representation in pixel-wise prediction networks, especially in scene parsing. SPM redefines spatial pooling by employing long, narrow kernels—specifically $1 \times W$ (horizontal strips) or $H \times 1$ (vertical strips)—to efficiently model long-range dependencies along a single spatial axis while minimizing the loss of spatial detail inherent in conventional pooling. The module is designed for plug-and-play integration, lightweight computation, and strong ablation-tested performance gains in architectures such as those based on ResNet backbones. SPM is a central building block in SPNet, a scene parsing network that achieves state-of-the-art performance on ADE20K and Cityscapes benchmarks with reduced parameter count compared to prior spatial pooling approaches (Hou et al., 2020).
1. Core Architectural Concept
SPM operates on an input feature tensor $x \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote channels, height, and width. The module applies two parallel pooling operations:
- Horizontal Strip Pooling: Utilizes a kernel of size $1 \times W$, aggregating information across each row to produce $y^h \in \mathbb{R}^{C \times H}$.
- Vertical Strip Pooling: Employs a kernel of size $H \times 1$, aggregating information across each column to produce $y^v \in \mathbb{R}^{C \times W}$.
After strip pooling, each output is processed with a 1D convolution (kernel size 3), which enables local communication along the pooled axis. The outputs $y^h$ and $y^v$ are then broadcast-added to reconstruct a spatial map $y$ over the full $H \times W$ grid.
A subsequent $1 \times 1$ convolution and sigmoid activation produce a spatial attention map $s$. The final output $z$ is the element-wise product of the input $x$ and attention $s$: $z = x \odot s$. Each spatial location $(i, j)$ in $z$ is thus informed by the context of its entire row and column in $x$, but without intermingling large, square 2D neighborhoods (Hou et al., 2020).
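The pooling-and-broadcast step is easiest to see as a shape walkthrough. Below is a minimal PyTorch sketch, assuming average pooling for the strips and omitting the learnable 1D and $1 \times 1$ convolutions; the tensor names mirror the notation above.

```python
import torch

# Illustrative sizes; C, H, W are arbitrary.
C, H, W = 8, 6, 10
x = torch.randn(1, C, H, W)                # input feature map, N x C x H x W

# Horizontal strip pooling (kernel 1 x W): average each row -> (N, C, H)
y_h = x.mean(dim=3)
# Vertical strip pooling (kernel H x 1): average each column -> (N, C, W)
y_v = x.mean(dim=2)

# (The kernel-3 1D convolutions that refine y_h and y_v are omitted here.)

# Broadcast-add back onto the full H x W grid:
#   y[n, c, i, j] = y_h[n, c, i] + y_v[n, c, j]
y = y_h.unsqueeze(3) + y_v.unsqueeze(2)    # (N, C, H, W)

# Spot-check the merge rule at one location.
assert torch.isclose(y[0, 2, 3, 5], y_h[0, 2, 3] + y_v[0, 2, 5])
```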
2. Formal Mathematical Formulation
Strip pooling operations are defined for each channel $c$ as:

$$
y^h_{c,i} = \frac{1}{W} \sum_{j=0}^{W-1} x_{c,i,j}, \qquad
y^v_{c,j} = \frac{1}{H} \sum_{i=0}^{H-1} x_{c,i,j},
$$

where $y^h$ (horizontal) is constant along $j$ and $y^v$ (vertical) is constant along $i$ once expanded over the $H \times W$ grid. After 1D convolutions on these pooled features, the result is merged by broadcasting:
- Horizontal: $\tilde{y}^h_{c,i,j} = y^h_{c,i}$
- Vertical: $\tilde{y}^v_{c,i,j} = y^v_{c,j}$

The summed result $y = \tilde{y}^h + \tilde{y}^v$ is passed through a $1 \times 1$ convolution $f$ and a sigmoid $\sigma$, yielding an attention tensor for feature map reweighting: $z = x \odot \sigma(f(y))$.
3. Module Internal Structure
The SPM is structured as follows:
- Input: $x \in \mathbb{R}^{C \times H \times W}$
- Path A (Horizontal):
  - Strip pool each row: mean over the width axis, giving $y^h \in \mathbb{R}^{C \times H}$
  - 1D convolution (kernel 3) along the height axis, refining $y^h$
- Path B (Vertical):
  - Strip pool each column: mean over the height axis, giving $y^v \in \mathbb{R}^{C \times W}$
  - 1D convolution (kernel 3) along the width axis, refining $y^v$
- Merge: Broadcast $y^h$ and $y^v$ to shape $C \times H \times W$, sum to obtain $y$
- Attention: $1 \times 1$ convolution → sigmoid → attention map $s$
- Output: $z = x \odot s$
All strip and pointwise convolutions are typically followed by BatchNorm and ReLU, aside from the final $1 \times 1$ convolution before the sigmoid. The parameter count is minimal: each channel-preserving 1D convolution introduces $3C^2$ weights (kernel size 3), and the $1 \times 1$ convolution has $C^2$ weights (Hou et al., 2020).
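Putting the pieces together, below is a self-contained PyTorch sketch of this structure. It is an illustrative re-implementation of the description above, not the authors' reference code; the channel-preserving widths and the BatchNorm/ReLU placement follow the text.

```python
import torch
import torch.nn as nn

class StripPooling(nn.Module):
    """SPM-style block: strip pooling -> 1D convs -> broadcast merge -> attention."""

    def __init__(self, channels: int):
        super().__init__()
        # Path A: kernel-3 1D conv along the height axis after row pooling.
        self.conv_h = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
        )
        # Path B: kernel-3 1D conv along the width axis after column pooling.
        self.conv_v = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
        )
        # Final 1x1 conv before the sigmoid (no BN/ReLU, per the text above).
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_h = self.conv_h(x.mean(dim=3))           # (N, C, H)
        y_v = self.conv_v(x.mean(dim=2))           # (N, C, W)
        y = y_h.unsqueeze(3) + y_v.unsqueeze(2)    # broadcast to (N, C, H, W)
        s = torch.sigmoid(self.conv_out(y))        # spatial attention map
        return x * s                               # reweight the input features
```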
4. Integration into Backbone Architectures
SPM can be inserted into standard ResNet architectures using the following prescribed pattern:
- Use “dilated” convolutions in the last two ResNet stages (stages 4 and 5), so that the final feature stride is 8.
- Insert an SPM after the $3 \times 3$ convolution in the final residual block of each of the first three stages.
- For the last stage (typically 3–4 residual blocks), insert an SPM after every $3 \times 3$ convolution.
- The SPM output is integrated into the usual residual addition; no special skip connections are required.
- In code, this is implemented with a boolean flag activating SPM in the late stages/residual blocks, as in the sketch at the end of this section.
The module can be used as a plug-in attention block for any convolutional feature map of shape $C \times H \times W$. Standard training protocols apply; no alterations to loss functions, normalization, or learning rate schedules are required (Hou et al., 2020).
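As a sketch of this insertion pattern, the hypothetical bottleneck block below gates SPM with a boolean flag, reusing the `StripPooling` class from the previous section; the dilation and projection details are standard ResNet choices assumed for illustration, not prescriptions from the paper.

```python
import torch
import torch.nn as nn

# Assumes the StripPooling class from the sketch above is in scope.

class BottleneckWithSPM(nn.Module):
    """Hypothetical ResNet bottleneck with an optional SPM after the 3x3 conv."""

    def __init__(self, in_ch: int, mid_ch: int, use_spm: bool = False, dilation: int = 1):
        super().__init__()
        out_ch = mid_ch * 4
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=dilation,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.spm = StripPooling(mid_ch) if use_spm else nn.Identity()
        self.conv3 = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.proj = (nn.Conv2d(in_ch, out_ch, 1, bias=False)
                     if in_ch != out_ch else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.conv1(x))
        out = self.spm(out)                   # SPM applied after the 3x3 convolution
        out = self.conv3(out)
        return self.relu(out + self.proj(x))  # ordinary residual addition
```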
5. Hyperparameters and Ablation Insights
Key hyperparameters and ablation findings include:
- Kernel Lengths: Strip pooling uses the full width $W$ (horizontal) and height $H$ (vertical) of the input tensor, providing global 1D context without further dilation.
- Channel Widths: The module is channel-preserving; the 1D and $1 \times 1$ convolutions maintain the channel size $C$.
- Insertion Strategy: Maximum mIoU gain is realized by applying SPM primarily in deeper layers. Excessive use across all layers yields diminishing returns.
Ablation results on ADE20K (ResNet-50, single-scale test):
| Module | mIoU (%) |
|---|---|
| Base FCN | 37.63 |
| +SPM (late) | 41.66 |
| +2 MPM | 41.92 |
| +2 MPM + SPM | 44.03 |
6. Empirical Performance
SPM confers significant improvements in scene parsing benchmarks:
- ADE20K (ResNet-50, single model, single-scale):
- Base FCN: 37.63% mIoU
- +PPM (PSPNet): 41.68% mIoU
- +2 MPM: 41.92% mIoU
- +2 MPM + SPM (SPNet-50): 44.03% mIoU
- ADE20K (multi-scale+flip):
- SPNet-50: 45.03% mIoU
- SPNet-101: 45.60% mIoU
- Cityscapes (test, ResNet-101, fine only):
- DANet / CCNet / APCNet: 81.4–81.5% mIoU
- SPNet: 82.0% mIoU
- Parameter Overhead:
- ResNet-50 base: 27.7M; with SPM+MPM (SPNet-50): 39.6M, roughly 9M fewer than PSPNet with its PPM head (48.7M).
SPNet, which extensively uses SPM, surpasses PSPNet’s accuracy with substantially fewer parameters. This efficiency makes SPM attractive for large-scale, high-performance semantic segmentation tasks (Hou et al., 2020).
7. Practical Implementation Guidance
- SPM is modular and can be integrated into any 2D convolutional feature map of shape $C \times H \times W$.
- Strip pooling’s single-axis context aggregation preserves local spatial detail more effectively than global (square) pooling while yielding full-image context in each direction.
- Optimal cost-accuracy tradeoff is achieved by restricting SPM to moderate-resolution, late-stage feature maps (e.g., with $H$ and $W$ at $1/8$ or $1/16$ of the input resolution).
- Combining SPM with the lightweight mixed pooling module (MPM), which focuses on close-range context via pyramid pools, further enhances performance.
- Typical networks require only a modest number of SPM insertions—e.g., the last block of each ResNet stage and all blocks in the final stage—to achieve the roughly 4-point mIoU gain on ADE20K reported above.
- No special training tricks are necessary; SPM integrates seamlessly with standard learning rate schedules and loss functions (Hou et al., 2020).
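To close, a minimal usage sketch, assuming the `StripPooling` class defined above and a stride-8 feature map typical of dilated ResNet backbones:

```python
import torch

# Assumes the StripPooling class from the sketch above is in scope.
feat = torch.randn(2, 512, 60, 60)   # e.g., stride-8 features from a dilated ResNet
spm = StripPooling(512)
out = spm(feat)
assert out.shape == feat.shape       # SPM is shape- and channel-preserving

# Overhead per module: two 3*C^2 1D convs + one C^2 pointwise conv (plus BN).
n_params = sum(p.numel() for p in spm.parameters())
print(f"SPM parameters at C=512: {n_params:,}")  # ~7 * 512^2, about 1.8M
```

At $C = 512$ this is roughly 1.8M parameters per module, consistent with the modest overall overhead noted above.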