Strip Attention Block (SAB) Overview
- Strip Attention Block (SAB) is an architectural module that efficiently captures long-range directional dependencies using sequential strip convolutions, reducing quadratic complexity to linear.
- It employs asymmetric horizontal and vertical convolutions to aggregate directional context for applications like medical image segmentation, 3D object detection, and image restoration.
- Empirical results show SABs maintain or improve accuracy while lowering model parameters and FLOPs, making them well suited to real-time and resource-constrained environments.
A Strip Attention Block (SAB) is an architectural module designed to efficiently capture long-range directional dependencies in high-dimensional inputs such as images, feature maps, or point clouds. SABs are characterized by their use of computationally lightweight, spatially anisotropic attention mechanisms—typically implemented via asymmetric or strip-wise convolutions that operate along one dimension (horizontal or vertical)—to aggregate context without incurring the quadratic complexity of conventional self-attention. SABs have been developed and applied in diverse domains, including 2D medical image segmentation, super-resolution, real-time semantic segmentation, 3D object detection, and remote sensing, with documented benefits in accuracy, efficiency, and hardware compatibility.
1. Structural Principles and Core Mechanisms
The defining feature of the SAB is the decomposition of a conventional 2D convolution or attention operation into two sequential, spatially asymmetric "strip" convolutions, typically a horizontal $1 \times k$ kernel followed by a vertical $k \times 1$ kernel, where $k$ denotes the strip length. This transforms a standard quadratic-complexity operation into two linear ones:

$$F' = \mathrm{Conv}_{k \times 1}\big(\mathrm{Conv}_{1 \times k}(F)\big),$$

with $F$ an initial projected feature map, as in (Wang et al., 7 Sep 2025).
The attention map is typically computed via a pointwise convolution and a nonlinear activation (e.g., sigmoid):

$$A = \sigma\big(\mathrm{Conv}_{1 \times 1}(F')\big).$$

The final output refines the original feature map through an element-wise product:

$$F_{\mathrm{out}} = A \odot F.$$

This factorization preserves long-range information along each spatial axis (row or column) at $\mathcal{O}(k)$ cost per strip convolution, yielding linear computation with respect to the strip length, compared to $\mathcal{O}(k^2)$ for standard 2D kernels. In multi-stage architectures, SABs can be stacked, overlapped, or combined with additional pooling, normalization, or multi-scale mechanisms, providing a flexible basis for context modeling.
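A minimal PyTorch sketch of this factorization follows; the module and variable names (e.g., `StripAttentionBlock`, `strip_h`, `strip_v`) and the depthwise grouping are illustrative assumptions rather than the exact implementation of any cited work.

```python
import torch
import torch.nn as nn

class StripAttentionBlock(nn.Module):
    """Minimal strip attention: a 1 x k then k x 1 strip convolution,
    a pointwise attention map, and element-wise refinement of the input."""
    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        # Horizontal then vertical strip convolutions; depthwise (groups=channels)
        # keeps the block lightweight.
        self.strip_h = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                 padding=(0, k // 2), groups=channels)
        self.strip_v = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                 padding=(k // 2, 0), groups=channels)
        # Pointwise convolution + sigmoid produce the attention map A.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = self.strip_v(self.strip_h(x))         # F': directional context
        attn = torch.sigmoid(self.pointwise(ctx))   # A : attention map in (0, 1)
        return x * attn                             # F_out = A * F (element-wise)


if __name__ == "__main__":
    feats = torch.randn(1, 64, 32, 32)
    out = StripAttentionBlock(64, k=7)(feats)
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Depthwise strip convolutions are one common lightweight choice here; dense (non-grouped) strip convolutions are an equally valid instantiation at higher cost.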
2. Computational Efficiency and Directional Context Modeling
The linear complexity of SABs arises from the sequential application of directional convolutions or pooling, which restrict computation to one axis at a time. This design has been shown to:
- Efficiently aggregate directional context—capturing elongated anatomical structures in medical images, as in block-wise local attention (Jiang et al., 2019).
- Drastically reduce model parameters and FLOPs; for example, StripDet attains only 0.65M parameters and 79.97% mAP in car detection on KITTI, a seven-fold reduction compared to PointPillars (Wang et al., 7 Sep 2025).
- Maintain or extend receptive field by stacking multiple SABs or using multi-scale kernel configurations (varying $k$), as in image restoration (Hao et al., 26 Jul 2024) and dehazing (Tong et al., 9 May 2024).
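The multi-scale configuration can be sketched as parallel strip branches with different kernel lengths whose outputs are fused before the attention map is formed. The PyTorch snippet below is an illustrative assumption of one such arrangement (class name, kernel sizes, and sum fusion are choices for demonstration), not the exact design of the cited restoration or dehazing networks.

```python
import torch
import torch.nn as nn

class MultiScaleStripAttention(nn.Module):
    """Illustrative sketch: parallel strip-convolution branches with different
    kernel lengths k, fused by summation before the attention map is formed."""
    def __init__(self, channels: int, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            )
            for k in kernel_sizes
        ])
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = sum(branch(x) for branch in self.branches)  # fuse multi-scale directional context
        return x * torch.sigmoid(self.pointwise(ctx))     # attention-weighted refinement
```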
In practical applications, SABs frequently outperform alternatives such as Point-wise Spatial Attention (PSA) in both segmentation accuracy and efficiency; e.g., DABs (stacked block-wise attention) achieve Dice coefficients up to 0.93 for mandible segmentation at negligible parameter overhead (Jiang et al., 2019). Table 1 summarizes SAB complexity versus standard convolutions:
| Operation | Complexity | Receptive Field | Suitable For |
|---|---|---|---|
| 2D Conv ($k \times k$) | $\mathcal{O}(k^2)$ | Square | Isotropic context |
| SAB ($1 \times k$ + $k \times 1$ strips) | $\mathcal{O}(k)$ | Rectilinear | Anisotropic context |
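The parameter scaling in the table can be verified directly. Below is a small PyTorch comparison of a $k \times k$ convolution against a $1 \times k$ plus $k \times 1$ strip pair; the channel count and kernel length are arbitrary illustrative choices.

```python
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

C, k = 64, 7
square = nn.Conv2d(C, C, kernel_size=k, padding=k // 2)        # k x k kernel
strips = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, k), padding=(0, k // 2)),  # 1 x k
    nn.Conv2d(C, C, kernel_size=(k, 1), padding=(k // 2, 0)),  # k x 1
)
# Square conv: C*C*k*k + C parameters; strip pair: 2*(C*C*k + C).
# Parameter growth is quadratic vs. linear in the strip length k.
print(param_count(square), param_count(strips))  # 200768 57472
```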
3. Design Variants & Integration Strategies
a) SABs for Dense Prediction and Segmentation
SABs underpin several recent efficient segmentation networks. For example, S-FPN employs a Scale-aware Strip Attention Module (SSAM) that aggregates vertical context for road scene segmentation, using parallel width-wise pooling followed by convolutions and element-wise fusion, in which a learned weight scales the features amplified along the strip (Elhassan et al., 2022).
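A rough sketch of this width-wise pooling idea is given below; it is a hypothetical simplification (class name, single pooling branch, fusion weight, and sigmoid activation are all assumptions), not the published SSAM.

```python
import torch
import torch.nn as nn

class StripPoolingAttention(nn.Module):
    """Hypothetical sketch: pool along the width to an (H x 1) strip,
    derive per-row attention, and fuse it back with a learned weight."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool_w = nn.AdaptiveAvgPool2d((None, 1))     # width-wise pooling -> (B, C, H, 1)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.fusion_weight = nn.Parameter(torch.ones(1))  # learned fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        strip = torch.sigmoid(self.conv(self.pool_w(x)))  # per-row attention, broadcast over width
        return x + self.fusion_weight * (x * strip)       # element-wise fusion with residual
```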
SCASeg further compresses queries and keys into strip-like patterns at the attention-head level (projecting channel dimension to one), minimizing computational overhead and memory consumption for large-scale semantic segmentation (Xu et al., 26 Nov 2024).
b) SABs in 3D and Point Cloud Object Detection
StripDet leverages SABs to process 3D point cloud data, where the inherent sparsity and spatial irregularity require lightweight, direction-aware feature extraction. SABs are combined with depthwise separable convolutions and simple multiscale fusion, bypassing heavy pyramid designs (Wang et al., 7 Sep 2025).
c) SABs for Image Restoration and Super-Resolution
Dilated Strip Attention Blocks (DSAB) use dilated strip convolutions to further enlarge the receptive field while keeping parameter counts low, beneficial for dehazing, deblurring, and desnowing (Hao et al., 26 Jul 2024). In SpikeSR, SABs regulate spiking activity by bridging temporal-channel correlations, employing attention-weighted membrane potential modulation for remote sensing SR (Xiao et al., 6 Mar 2025).
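A hedged sketch of the dilated variant is shown below: dilation stretches the effective strip length without adding parameters. The kernel size, dilation rate, and class name are illustrative choices, not the published DSAB configuration.

```python
import torch
import torch.nn as nn

class DilatedStripAttention(nn.Module):
    """Sketch of a dilated strip attention block: dilation enlarges the
    effective receptive field of each strip at constant parameter count."""
    def __init__(self, channels: int, k: int = 7, dilation: int = 2):
        super().__init__()
        pad = (k // 2) * dilation  # 'same' padding for odd k with dilation
        self.strip_h = nn.Conv2d(channels, channels, (1, k), padding=(0, pad),
                                 dilation=(1, dilation), groups=channels)
        self.strip_v = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0),
                                 dilation=(dilation, 1), groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.pointwise(self.strip_v(self.strip_h(x))))
        return x * attn
```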
d) SABs and Self-Attention Compression
S2AFormer features Strip Self-Attention Blocks that spatially downsample K and V and compress the Q and K channel dimensions to single-channel "strips," substantially reducing the cost of the attention computation relative to full multi-head self-attention (Xu et al., 28 May 2025).
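To make the compression concrete, here is a hedged sketch of single-channel ("strip") queries and keys combined with spatially downsampled keys and values; the projection layers, downsampling factor, and class name are assumptions for illustration, not the exact S2AFormer design.

```python
import torch
import torch.nn as nn

class StripSelfAttention(nn.Module):
    """Hypothetical sketch: Q and K are compressed to a single channel and
    K, V are spatially downsampled, so the similarity map is cheap to form."""
    def __init__(self, channels: int, down: int = 4):
        super().__init__()
        self.q = nn.Conv2d(channels, 1, kernel_size=1)   # single-channel query "strip"
        self.k = nn.Conv2d(channels, 1, kernel_size=1)   # single-channel key "strip"
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.down = nn.AvgPool2d(down)                   # spatial downsampling for K and V

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, N, 1), N = H*W
        k = self.k(self.down(x)).flatten(2)        # (B, 1, M), M = N / down^2
        v = self.v(self.down(x)).flatten(2)        # (B, C, M)
        attn = torch.softmax(q @ k, dim=-1)        # (B, N, M) similarity from 1-channel strips
        out = attn @ v.transpose(1, 2)             # (B, N, C)
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Compressing the channel dimension removes the channel factor from the similarity computation, while downsampling K and V shrinks the second attention axis; both levers trade a small amount of fidelity for large savings in memory and FLOPs.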
4. Quantitative Performance & Empirical Outcomes
Empirical studies consistently document SAB-related modules outperforming baselines in both accuracy and efficiency:
- Dual block-wise attention (DAB) in organ segmentation achieved Dice coefficients up to 0.93 (mandible) with only a marginal increase in parameters and computation time, whereas PSA reached comparable accuracy at a substantially higher parameter and time cost (Jiang et al., 2019).
- SAB in StripDet delivered 79.97% mAP with only 0.65M parameters on KITTI (Wang et al., 7 Sep 2025).
- S-FPN's SSAM improved mIoU while running at $87.3$ FPS with a ResNet18 backbone, achieving even higher mIoU with heavier backbones (Elhassan et al., 2022).
- Dilated SAB in DSAN achieved 40.60 dB PSNR for dehazing (SOTS-Indoor), outperforming transformer-based models with similar parameter count (Hao et al., 26 Jul 2024).
5. Integration, Hardware Considerations, and Scalability
SABs are integrated into various model backbones with the following design choices:
- Placement as bridging modules between encoder and decoder (e.g., MALUNet (Ruan et al., 2022)).
- Residual stacking and combination with local enhancement blocks (e.g., HPB in S2AFormer (Xu et al., 28 May 2025)).
- Pairing with depthwise separable convolutions for hardware efficiency (Wang et al., 7 Sep 2025); see the sketch after this list.
- Multi-scale fusion via feature concatenation rather than deep pyramid hierarchies, facilitating streamlined computation.
- Compatibility with standard frameworks (PyTorch, ONNX) and suitability for deployment on edge or mobile hardware.
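The sketch below illustrates the last two points: a hypothetical backbone stage (names and layer choices are assumptions for demonstration) that pairs a depthwise separable convolution with strip attention using only standard operators, followed by a direct `torch.onnx.export` call.

```python
import torch
import torch.nn as nn

class DWSeparableStripStage(nn.Module):
    """Hypothetical backbone stage: depthwise separable convolution followed
    by strip attention, built entirely from ONNX-exportable standard ops."""
    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.strip_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.strip_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
        self.attn = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pointwise(self.depthwise(x))                         # depthwise separable conv
        a = torch.sigmoid(self.attn(self.strip_v(self.strip_h(x))))   # strip attention map
        return x * a

# Only convolution, sigmoid, and multiply ops are involved, so ONNX export is direct:
stage = DWSeparableStripStage(64).eval()
torch.onnx.export(stage, torch.randn(1, 64, 64, 64), "strip_stage.onnx")
```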
6. Conceptual Extensions and Relation to Sparse/Band Attention
The SAB concept is closely related to block-wise attention, mix-of-experts attention, and band attention. MoBA (Lu et al., 18 Feb 2025) generalizes the block-wise paradigm by introducing a dynamic gating mechanism for block selection—suggesting that SAB may be interpreted as a static form of block sparse attention, whereas MoBA provides adaptive routing. SAB is also distinct from banded or criss-cross attention in scope and parameterization but shares the objective of reducing computational complexity for long context modeling.
7. Future Directions and Research Implications
Papers frequently suggest that SABs can be extended for:
- Broader application in tasks requiring long-range dependency modeling with anisotropic spatial structure—such as segmentation, detection, restoration, and super-resolution.
- Adaptive multi-scale kernel and strip dimensionality—allowing the SAB to dynamically adjust receptive fields in response to task needs (Tong et al., 9 May 2024, Hao et al., 26 Jul 2024).
- Integration with constant memory or permutation-invariant attention blocks for efficient online updates (Feng et al., 2023).
- Combined use with hardware-friendly convolutional strategies for real-time and low-resource environments.
This body of work points toward the SAB as a highly adaptable, efficient building block for modern neural architectures, balancing context aggregation with practical computational constraints and directional feature sensitivity.