C2PSA Module: Dual-Branch Attention
- The C2PSA module is a dual-branch attention unit that integrates lightweight channel and spatial self-attention mechanisms for enhanced feature representation.
- It seamlessly fits into YOLOv11 architectures by operating in parallel with convolutional layers, preserving feature map dimensions through residual connections.
- Empirical studies show that C2PSA and its TriAtt variant boost detection accuracy and efficiency in challenging small-target scenarios such as agriculture and industrial monitoring.
The C2PSA (Channel-and-Spatial Self-Attention) module is a deep neural network architectural unit designed to enhance feature representation by integrating lightweight channel and spatial self-attention mechanisms. Originally developed as a plug-in to the YOLOv11 object detection architecture, C2PSA targets the extraction of fine-grained cues in small object detection under challenging conditions, such as real-time agricultural and industrial remote sensing. The module addresses the limitations of traditional convolutional backbones, particularly in accurately identifying small-scale targets amidst complex, noisy, or imbalanced inputs (Wang et al., 17 Aug 2025, Li, 9 Feb 2025).
1. Block Architecture and Integration
The C2PSA module operates within modern object detectors by inserting parallel attention pathways at critical points in the network’s dataflow. In YOLOv11-based systems, it is positioned immediately after the last C3k2 block within the backbone and after key path aggregation operations in the feature pyramid neck. The module is configured to accept and output feature maps with identical dimensions, $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the channel count, and $H \times W$ indicates the spatial resolution. This seamless compatibility ensures that subsequent layers, such as spatial pyramid pooling (SPPF) or upsample-merge joins, require no modification (Wang et al., 17 Aug 2025).
The internal structure consists of two attention branches:
- Channel Self-Attention, which globally re-weights channels.
- Spatial Self-Attention, which re-weights spatial locations based on their relevance.
The outputs from these branches are aggregated, merged with the original input via residual connection, and projected through lightweight normalization and activation layers to form the final module response.
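A minimal PyTorch sketch of this layout, not the reference implementation: the class name, the SiLU/BatchNorm projection head, and the learnable branch scales (anticipating the fusion rule in Section 2) are assumptions.

```python
import torch
import torch.nn as nn

class DualBranchAttention(nn.Module):
    """Sketch of the dual-branch layout: two attention branches in
    parallel, residual merge, then a lightweight projection head."""

    def __init__(self, channels: int, channel_branch: nn.Module, spatial_branch: nn.Module):
        super().__init__()
        self.channel_branch = channel_branch       # "what": re-weights channels
        self.spatial_branch = spatial_branch       # "where": re-weights locations
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable branch scales
        self.beta = nn.Parameter(torch.zeros(1))   # (see fusion rule, Section 2)
        self.proj = nn.Sequential(                 # lightweight norm/activation head
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Aggregate both branches, merge with the residual input, project.
        y = x + self.alpha * self.channel_branch(x) + self.beta * self.spatial_branch(x)
        return self.proj(y)                        # same (B, C, H, W) as the input
```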
2. Mathematical Operations
The C2PSA module formulates independent channel and spatial self-attention maps using linear projections, softmax normalization, and weighted summation in the feature tensor domain.
- Channel Self-Attention: Feature maps are reshaped and projected to query, key, and value tensors with reduced channel dimension ($Q_c, K_c \in \mathbb{R}^{B \times C' \times HW}$ with $C' = C/r$), followed by matrix multiplication, scaling, and softmax to recalibrate channel responses via

$$\mathrm{Attn}_c(X) = \mathrm{softmax}\!\left(\frac{Q_c K_c^{\top}}{\sqrt{d_k}}\right) V_c.$$
- Spatial Self-Attention: Features are permuted to $\mathbb{R}^{B \times HW \times C}$, linearly projected, and similarly recombined to yield spatial importance maps,

$$\mathrm{Attn}_s(X) = \mathrm{softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d_k}}\right) V_s.$$
- Fusion: The final module response is computed as

$$Y = X + \alpha\,\mathrm{Attn}_c(X) + \beta\,\mathrm{Attn}_s(X),$$

where $\alpha$ and $\beta$ are learnable scalars, typically followed by convolution, batch normalization, and activation (Wang et al., 17 Aug 2025).
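A hedged PyTorch realization of the two branches and their reduced projections; the reduction ratio `r = 8`, the 1×1-convolution projections, and the class names are illustrative choices rather than the published parameterization.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Channel branch: affinities are computed in a reduced space C' = C // r."""

    def __init__(self, c: int, r: int = 8):
        super().__init__()
        c_red = max(c // r, 1)
        self.q = nn.Conv2d(c, c_red, 1, bias=False)
        self.k = nn.Conv2d(c, c_red, 1, bias=False)
        self.v = nn.Conv2d(c, c_red, 1, bias=False)
        self.out = nn.Conv2d(c_red, c, 1, bias=False)  # project back to C channels

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = (m(x).flatten(2) for m in (self.q, self.k, self.v))  # (B, C', HW)
        # Channel-by-channel affinities, scaled by sqrt of the inner dim HW.
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, C', C')
        return self.out((attn @ v).view(b, -1, h, w))

class SpatialSelfAttention(nn.Module):
    """Spatial branch: tokens are the HW positions (non-local attention)."""

    def __init__(self, c: int, r: int = 8):
        super().__init__()
        c_red = max(c // r, 1)
        self.q = nn.Conv2d(c, c_red, 1, bias=False)
        self.k = nn.Conv2d(c, c_red, 1, bias=False)
        self.v = nn.Conv2d(c, c, 1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        k = self.k(x).flatten(2)                       # (B, C', HW)
        v = self.v(x).flatten(2).transpose(1, 2)       # (B, HW, C)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (B, HW, HW)
        return (attn @ v).transpose(1, 2).view(b, c, h, w)

# Plugging the branches into the Section 1 skeleton:
# block = DualBranchAttention(256, ChannelSelfAttention(256), SpatialSelfAttention(256))
```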
3. Implementation Variants and Enhancements
Two major C2PSA block realizations are described in the literature:
| Variant | Attention Type | Distinctive Feature |
|---|---|---|
| Original C2PSA | Channel & Spatial (PSA head) | CSP split, concatenation, Polarized Self-Attention |
| C2PSA-TriAtt | Triplet + Channel & Spatial (PSA) | Triplet Attention after the 1×1 conv; explicit 3D reweighting |
The original C2PSA implementation in YOLOv11 deploys cross-stage partial (CSP) splitting for efficient aggregation, followed by a Polarized Self-Attention (PSA) operator that applies lightweight global and local interactions on the fused features (Li, 9 Feb 2025). C2PSA-TriAtt augments the initial 1×1-convolution output with a residual Triplet Attention block, which explicitly attends to correlations along the (Channel, Width), (Height, Channel), and (Height, Width) axis pairs via Z-pooling, convolution, and mask-based reweighting. This explicit cross-dimensional modeling is found to boost mAP by 0.5% for a parameter/FLOPs increase of approximately 0.1% per block (Li, 9 Feb 2025).
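A compact PyTorch sketch of the Triplet Attention operator referenced here, following the standard Z-pool, 7×7-convolution, and sigmoid-mask formulation; class names are illustrative, and in C2PSA-TriAtt the result is merged residually with the 1×1-convolution output as described above.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Concatenate max- and mean-pooled features along the channel axis -> 2 channels.
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    # Z-pool -> 7x7 conv -> BN -> sigmoid mask, applied multiplicatively.
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        return x * torch.sigmoid(self.bn(self.conv(self.pool(x))))

class TripletAttention(nn.Module):
    # Three gates attend to (C,W), (H,C), and (H,W) interactions via axis permutation.
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()   # H <-> C rotation: channel-width interaction
        self.hc = AttentionGate()   # W <-> C rotation: height-channel interaction
        self.hw = AttentionGate()   # identity orientation: plain spatial (H,W) gate

    def forward(self, x):
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # (B,H,C,W) branch
        x_hc = self.hc(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # (B,W,H,C) branch
        x_hw = self.hw(x)                                          # (B,C,H,W) branch
        return (x_cw + x_hc + x_hw) / 3.0                          # average the branches
```

Each gate costs only 7·7·2 = 98 convolution weights plus two BatchNorm parameters, roughly 300 parameters across all three branches, consistent with the sub-0.1% per-block growth reported above.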
4. Empirical Performance and Efficiency
Extensive ablation studies are reported for C2PSA modules integrated within YOLOv11 under real-world agricultural and industrial detection tasks.
For small-target cotton disease detection (Wang et al., 17 Aug 2025):
- The baseline YOLOv11S (no C2PSA): mAP₅₀=0.759, mAP₅₀–₉₅=0.638, Precision@small=0.70, Recall@small=0.76, params ≈ 2.18M, FLOPs ≈ 5.8 GFLOPs, FPS=158.
- YOLOv11S+C2PSA: mAP₅₀=0.820 (+6.1 pp), mAP₅₀–₉₅=0.705 (+6.7 pp), Precision@small=0.77 (+7 pp), Recall@small=0.82 (+6 pp), params ≈ 2.45M, FLOPs ≈ 6.5 GFLOPs (+12%), FPS=147 (−7%), remaining real-time.
For coal gangue detection with strict edge constraints (Li, 9 Feb 2025):
- Introducing C2PSA-TriAtt reduces model size and computational cost by ~40% while improving mAP by 0.5% relative to unaugmented C2PSA, with per-block parameter growth under 0.1% and virtually unchanged per-image runtime.
These results suggest that C2PSA modules deliver substantial accuracy gains in small-target detection at modest computational cost, making them suitable for resource-constrained edge and mobile inference; a rough way to reproduce such parameter/throughput comparisons is sketched below.
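A minimal probe in plain PyTorch; `profile` is a hypothetical helper, not the authors' benchmark protocol, and FLOPs would typically come from an external profiler such as thop or fvcore rather than this snippet.

```python
import time
import torch

def profile(model: torch.nn.Module, shape=(1, 3, 640, 640), warmup=10, iters=100):
    """Report parameter count (millions) and rough FPS for a detection model."""
    params = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(shape)
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up to stabilize kernels/caches
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    fps = iters / (time.perf_counter() - start)
    print(f"params: {params:.2f}M  fps: {fps:.1f}")
```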
5. Theoretical Properties and Design Rationale
The dual-branch design enables the module to explicitly recover both “what” (channel) and “where” (spatial) information lost in pure convolutional pipelines. Channel attention assigns greater importance to weak or rare features that might otherwise be overwhelmed by dominant background activations, which is especially influential for rare small-lesion detection. Spatial attention, in parallel, focuses on spatially localized responses even for token-sized objects. The explicit aggregation and lightweight projections are selected to preserve computational tractability while maintaining feature diversity and expressivity (Wang et al., 17 Aug 2025, Li, 9 Feb 2025).
A plausible implication is that such dual self-attention mechanisms, when appropriately parameterized, improve robustness not only to small-target occlusion but also to intra-class variation and scale jitter, as validated across multi-disease and variable field conditions in cited agricultural deployments.
6. Practical Integration in Detection Pipelines
C2PSA is architected for ease of integration:
- Input and output sizes match those of the preceding and following convolutional layers.
- In YOLOv11, insertion locations are after major backbone and intermediate neck stages; detection heads remain unaltered.
- Standard feature pyramid resolutions (80×80, 40×40, 20×20 with channels 256/512/1024) are directly compatible; typical per-block overhead remains minor (∼0.23M params per block at C=256).
- On edge platforms, further compression is possible via C2PSA-TriAtt; overhead for three 7×7 convolutional branches per block totals <0.01M params for the full FPN (Li, 9 Feb 2025).
These properties permit deployment in mobile and industrial real-time monitoring at minimal resource cost without sacrificing detection accuracy; the sketch below checks shape preservation and per-block overhead at the standard pyramid scales.
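A quick check of both properties, reusing the `DualBranchAttention` and branch sketches from Sections 1 and 2; the printed counts are those of the sketch and only approximate the cited ∼0.23M per-block figure, which depends on the chosen reduction ratio.

```python
import torch

# Reuses DualBranchAttention, ChannelSelfAttention, SpatialSelfAttention
# from the sketches above.
for c, s in [(256, 80), (512, 40), (1024, 20)]:        # standard FPN scales
    block = DualBranchAttention(c, ChannelSelfAttention(c), SpatialSelfAttention(c))
    x = torch.randn(1, c, s, s)
    assert block(x).shape == x.shape                    # dimensions preserved
    n = sum(p.numel() for p in block.parameters()) / 1e6
    print(f"C={c:4d}, {s}x{s}: {n:.2f}M params")        # per-block overhead
```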
7. Impact and Extensions
C2PSA has demonstrated significant empirical impact in precision agricultural vision (cotton disease-spot detection with sub-5 mm² lesions) and industrial quality control (coal gangue identification on edge devices). Beyond their initial context, C2PSA modules and their Triplet Attention-enhanced variants represent a broader class of lightweight attention mechanisms that address the representational bottlenecks of convolutional neural networks in small-object, high-variation scenarios.
Future research may assess the transferability of these modules to other low-resource detection problems and explore dynamic routing or even lighter-weight normalization strategies for further efficiency gains.
Key references: Wang et al. (17 Aug 2025); Li (9 Feb 2025).