
Strip Pooling Module (SPM)

Updated 1 May 2026
  • SPM is a spatial pooling mechanism that uses horizontal and vertical strip kernels to efficiently capture long-range dependencies while preserving local details.
  • It integrates 1D convolutions with strip pooling to generate attention maps, enhancing feature representation for pixel-wise prediction tasks.
  • Applying SPM in architectures like SPNet yields state-of-the-art performance on benchmarks such as ADE20K and Cityscapes with a reduced parameter count.

The Strip Pooling Module (SPM) is a mechanism for spatial pooling introduced to improve contextual representation in pixel-wise prediction networks, especially in scene parsing. SPM redefines spatial pooling by employing long, narrow kernels, specifically $1 \times N$ (horizontal strips) or $N \times 1$ (vertical strips), to efficiently model long-range dependencies along a single spatial axis while minimizing the loss of spatial detail inherent in conventional $N \times N$ pooling. The module is designed for plug-and-play integration, lightweight computation, and strong ablation-tested performance gains in architectures such as those based on ResNet backbones. SPM is a central building block in SPNet, a scene parsing network that achieves state-of-the-art performance on ADE20K and Cityscapes benchmarks with a reduced parameter count compared to prior spatial pooling approaches (Hou et al., 2020).

1. Core Architectural Concept

SPM operates on an input feature tensor $x \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes channels, $H$ height, and $W$ width. The module applies two parallel pooling operations:

  • Horizontal Strip Pooling: Utilizes a kernel of size $1 \times W$, aggregating information across each row to produce $y^h \in \mathbb{R}^{C \times H}$.
  • Vertical Strip Pooling: Employs a kernel of size $H \times 1$, aggregating information across each column to produce $y^v \in \mathbb{R}^{C \times W}$.

After strip pooling, each output is processed with a 1D convolution (kernel size 3), which enables local communication along the pooled axis. The outputs $y^h$ and $y^v$ are then broadcast-added to reconstruct a spatial map $y \in \mathbb{R}^{C \times H \times W}$ over the full $H \times W$ grid.

A subsequent $1 \times 1$ convolution and sigmoid activation produce a spatial attention map $s$. The final output $z$ is the element-wise product of the input $x$ and attention $s$: $z = x \odot s$. Each spatial location $(i, j)$ in $z$ is thus informed by the context of its entire row and column in $x$ but without intermingling large, square 2D neighborhoods (Hou et al., 2020).
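Treating the intermediate 1D convolutions as identity for brevity, the row/column gating described above can be sketched in NumPy (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def strip_gate(x):
    """Gate x (C, H, W) by row/column strip-pooled context (1D convs omitted)."""
    yh = x.mean(axis=2)                  # horizontal strips: (C, H)
    yv = x.mean(axis=1)                  # vertical strips:   (C, W)
    y = yh[:, :, None] + yv[:, None, :]  # broadcast-add to (C, H, W)
    s = 1.0 / (1.0 + np.exp(-y))         # sigmoid attention map
    return x * s                         # element-wise reweighting z = x ⊙ s

x = np.random.rand(4, 8, 8).astype(np.float32)
z = strip_gate(x)
print(z.shape)  # (4, 8, 8)
```

Every output entry is a damped copy of the input, scaled by a gate that depends only on the mean of its row and its column.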

2. Formal Mathematical Formulation

Strip pooling operations are defined for each channel $c$ as:

$$y^h_{c,i} = \frac{1}{W} \sum_{j=0}^{W-1} x_{c,i,j}, \qquad y^v_{c,j} = \frac{1}{H} \sum_{i=0}^{H-1} x_{c,i,j}$$

where $y^h \in \mathbb{R}^{C \times H}$ (horizontal) is constant along $j$ once broadcast back to the spatial grid, and $y^v \in \mathbb{R}^{C \times W}$ (vertical) is constant along $i$. After 1D convolutions on these pooled features, the result is merged by broadcasting:

  • Horizontal: $\tilde{y}^h_{c,i,j} = y^h_{c,i}$
  • Vertical: $\tilde{y}^v_{c,i,j} = y^v_{c,j}$

The summed result $y_{c,i,j} = \tilde{y}^h_{c,i,j} + \tilde{y}^v_{c,i,j}$ is passed through a $1 \times 1$ convolution and sigmoid, yielding an attention tensor for feature map reweighting.
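As a concrete check of these definitions (treating the 1D convolutions as identity for illustration), consider a single-channel $2 \times 2$ input:

$$x = \begin{pmatrix} 1 & 3 \\ 5 & 7 \end{pmatrix}, \qquad y^h = \left(\tfrac{1+3}{2},\ \tfrac{5+7}{2}\right) = (2,\ 6), \qquad y^v = \left(\tfrac{1+5}{2},\ \tfrac{3+7}{2}\right) = (3,\ 5)$$

so the broadcast sum is

$$y_{i,j} = y^h_i + y^v_j = \begin{pmatrix} 5 & 7 \\ 9 & 11 \end{pmatrix},$$

and each entry of $y$ mixes exactly one row mean and one column mean of $x$.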

3. Module Internal Structure

The SPM is structured as follows:

  • Input: $x \in \mathbb{R}^{C \times H \times W}$
  • Path A (Horizontal):
    • Strip pool each row: average over the $W$ axis of $x$, giving $y^h \in \mathbb{R}^{C \times H}$
    • 1D convolution (kernel 3) along $H$
  • Path B (Vertical):
    • Strip pool each column: average over the $H$ axis of $x$, giving $y^v \in \mathbb{R}^{C \times W}$
    • 1D convolution (kernel 3) along $W$
  • Merge: Broadcast $y^h$ and $y^v$ to shape $C \times H \times W$, sum to obtain $y$
  • Attention: $1 \times 1$ convolution → sigmoid → $s$
  • Output: $z = x \odot s$

All strip and pointwise convolutions are typically followed by BatchNorm and ReLU, aside from the final $1 \times 1$ convolution before sigmoid. The parameter count is minimal: each channel-preserving 1D convolution introduces $3C^2$ parameters, and the $1 \times 1$ convolution has $C^2$ parameters (Hou et al., 2020).
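A fuller sketch of this forward pass in NumPy, ignoring BatchNorm/ReLU and using random stand-ins for the learned weights (the helper names are hypothetical, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 6, 6

def conv1d(y, w):
    """Channel-mixing 1D conv, kernel 3, zero padding. y: (C, L), w: (C, C, 3)."""
    yp = np.pad(y, ((0, 0), (1, 1)))
    cols = np.stack([yp[:, k:k + y.shape[1]] for k in range(3)], axis=-1)  # (C, L, 3)
    return np.einsum('oik,ilk->ol', w, cols)  # (C, L)

def spm(x, wh, wv, w1):
    yh = conv1d(x.mean(axis=2), wh)          # Path A: pool rows, conv along H
    yv = conv1d(x.mean(axis=1), wv)          # Path B: pool columns, conv along W
    y = yh[:, :, None] + yv[:, None, :]      # merge by broadcasting: (C, H, W)
    a = np.einsum('oi,ihw->ohw', w1, y)      # 1x1 convolution
    s = 1.0 / (1.0 + np.exp(-a))             # sigmoid attention map
    return x * s                             # output z = x ⊙ s

wh, wv = rng.normal(size=(C, C, 3)), rng.normal(size=(C, C, 3))
w1 = rng.normal(size=(C, C))
x = rng.normal(size=(C, H, W))
z = spm(x, wh, wv, w1)
# parameter count: two 3*C^2 1D convs plus one C^2 pointwise conv
n_params = wh.size + wv.size + w1.size
print(z.shape, n_params)  # (4, 6, 6) 112
```

With $C = 4$ the whole module has $2 \cdot 3C^2 + C^2 = 112$ weights, illustrating how light the module is relative to a dense $N \times N$ context block.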

4. Integration into Backbone Architectures

SPM can be inserted into standard ResNet architectures using the following prescribed pattern:

  • Conduct “dilated” convolutions in the last two ResNet stages (4 and 5), so the final feature stride is 8.
  • Insert an SPM after the $3 \times 3$ convolution in the final residual block of each of the first three stages.
  • For the last stage (typically 3–4 residual blocks), insert an SPM after every $3 \times 3$ convolution.
  • The SPM output is integrated into the usual residual addition; no special skip connections required.
  • In code, this is implemented with a boolean flag activating SPM in late stages/residual blocks.

The module can be used as a plug-in attention block for any convolutional feature map of shape $C \times H \times W$. Standard training protocols apply; no alterations for loss functions, normalization, or learning rate schedules are required (Hou et al., 2020).
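The boolean-flag insertion pattern might be sketched as follows; this toy block omits the real convolution stack and simply shows where the SPM call sits relative to the standard residual addition (all names are hypothetical, not from the reference implementation):

```python
import numpy as np

def spm_gate(x):
    """Stand-in for an SPM forward pass: strip pool, broadcast, sigmoid gate."""
    y = x.mean(axis=2)[:, :, None] + x.mean(axis=1)[:, None, :]
    return x * (1.0 / (1.0 + np.exp(-y)))

class ToyBlock:
    """Toy residual block: SPM (if enabled) feeds the usual residual addition."""
    def __init__(self, with_spm=False):
        self.with_spm = with_spm

    def __call__(self, x):
        out = x                     # placeholder for the block's conv stack
        if self.with_spm:
            out = spm_gate(out)     # SPM applied to the branch output
        return x + out              # standard residual addition, no extra skips

# Late stage: SPM enabled in every block; early stages would enable it
# only in their final block.
stage = [ToyBlock(with_spm=True) for _ in range(3)]
x = np.random.rand(2, 8, 8)
for blk in stage:
    x = blk(x)
print(x.shape)  # (2, 8, 8)
```

Because SPM is shape-preserving, toggling the flag changes no tensor shapes anywhere in the network, which is what makes the plug-in integration trivial.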

5. Hyperparameters and Ablation Insights

Key hyperparameters and ablation findings include:

  • Kernel Lengths: Strip pooling uses the full width $W$ (horizontal) and height $H$ (vertical) of the input tensor, providing global 1D context without further dilation.
  • Channel Widths: The module is channel-preserving; the 1D and $1 \times 1$ convolutions maintain the channel size $C$.
  • Insertion Strategy: Maximum mIoU gain is realized by applying SPM primarily in deeper layers. Excessive use across all layers yields diminishing returns.

Ablation results on ADE20K (ResNet-50, single-scale test):

Module          mIoU (%)
Base FCN        37.63
+SPM (late)     41.66
+2 MPM          41.92
+2 MPM + SPM    44.03

6. Empirical Performance

SPM confers significant improvements in scene parsing benchmarks:

  • ADE20K (ResNet-50, single model, single-scale):
    • Base FCN: 37.63% mIoU
    • +PPM (PSPNet): 41.68% mIoU
    • +2 MPM: 41.92% mIoU
    • +2 MPM + SPM (SPNet-50): 44.03% mIoU
  • ADE20K (multi-scale+flip):
    • SPNet-50: 45.03% mIoU
    • SPNet-101: 45.60% mIoU
  • Cityscapes (test, ResNet-101, fine only):
    • DANet / CCNet / APCNet: 81.4–81.5% mIoU
    • SPNet: 82.0% mIoU
  • Parameter Overhead:
    • ResNet-50: 27.7M (base), with SPM+MPM: 39.6M, which is ~9M fewer than PPM/PSPNet (48.7M).

SPNet, which extensively uses SPM, surpasses PSPNet’s accuracy with substantially fewer parameters. This efficiency makes SPM attractive for large-scale, high-performance semantic segmentation tasks (Hou et al., 2020).

7. Practical Implementation Guidance

  • SPM is modular and can be integrated into any $C \times H \times W$ convolutional feature map.
  • Strip pooling’s single-axis context aggregation preserves local spatial detail more effectively than global (square) pooling while yielding full-image context in each direction.
  • Optimal cost-accuracy tradeoff is achieved by restricting SPM to moderate-resolution, late-stage feature maps (e.g., with $H$ and $W$ at $1/8$ or $1/16$ spatial scale).
  • Combining SPM with the lightweight mixed pooling module (MPM), which focuses on close-range context via pyramid pools, further enhances performance.
  • Typical networks require only a modest number of SPM insertions, e.g., the last block of each ResNet stage and all blocks in the final stage, to achieve a multi-point mIoU gain on ADE20K (about 4 points in the ablation above).
  • No special training tricks are necessary; SPM integrates seamlessly with standard learning rate schedules and loss functions (Hou et al., 2020).
References (1)

  1. Hou, Q., Zhang, L., Cheng, M.-M., and Feng, J. (2020). Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
