Coordinate Attention in Neural Networks

Updated 7 June 2026

Coordinate Attention (CA) is an attention mechanism that decouples spatial aggregation into two 1D pooling operations, preserving positional information and capturing long-range context.
It recalibrates feature activations by fusing vertical and horizontal cues, enhancing performance in mobile vision and dense prediction tasks with minimal computational overhead.
The method underpins several extensions, including Grouped Coordinate Attention (GCA) and multimodal fusion techniques, and has reported gains such as +2.0% top-1 accuracy on ImageNet with MobileNetV2.

Coordinate Attention (CA) is an attention mechanism for neural networks, originally formulated to bridge the gap between channel attention and spatial selectivity by enabling efficient and explicit encoding of both long-range context and precise positional information in convolutional backbones. Unlike traditional channel attention methods that collapse spatial structure through 2D global pooling, CA decouples spatial aggregation into two 1D pooling operations (one per axis), factorizing attention to capture global dependencies and positional cues along orthogonal directions. It has proven effective in both mobile vision models and a variety of high-resolution and dense prediction scenarios, as well as being foundational for later variants such as Grouped Coordinate Attention (GCA) and spectral/spatial extensions in multimodal fusion.

1. Principle and Mathematical Foundation

Coordinate Attention addresses the limitation of channel attention modules (e.g., Squeeze-and-Excitation, SE) that discard spatial location by compressing feature maps into global channel descriptors. In CA, the input tensor $X\in\mathbb{R}^{C\times H\times W}$ is aggregated by performing separate 1D global pooling along the vertical (height) and horizontal (width) axes:

Height pooling: $F_h(c,i) = \frac{1}{W} \sum_{j=1}^W X(c,i,j) \in \mathbb{R}^{C\times H\times 1}$
Width pooling: $F_w(c,j) = \frac{1}{H} \sum_{i=1}^H X(c,i,j) \in \mathbb{R}^{C\times 1\times W}$
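
As a minimal sketch of the two pooling steps, the NumPy snippet below computes both directional descriptors on a toy tensor and contrasts them with SE-style 2D global pooling. The tensor sizes and the variable names (`F_h`, `F_w`, `F_se`) are illustrative assumptions, not taken from any reference implementation.

```python
import numpy as np

# Toy feature map with C=2 channels, H=3 rows, W=4 columns (illustrative sizes).
C, H, W = 2, 3, 4
X = np.arange(C * H * W, dtype=np.float64).reshape(C, H, W)

# Height descriptor: average over the width axis, one value per (channel, row).
F_h = X.mean(axis=2, keepdims=True)        # shape (C, H, 1)

# Width descriptor: average over the height axis, one value per (channel, column).
F_w = X.mean(axis=1, keepdims=True)        # shape (C, 1, W)

# SE-style 2D global pooling, for contrast: every spatial position collapses.
F_se = X.mean(axis=(1, 2), keepdims=True)  # shape (C, 1, 1)

print(F_h.shape, F_w.shape, F_se.shape)    # (2, 3, 1) (2, 1, 4) (2, 1, 1)
```

Each directional descriptor keeps one spatial axis intact, which is what lets the later gates depend on row and column position.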

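The two descriptors are concatenated along the spatial dimension and passed through a shared $1\times1$ convolution that reduces the channel count by a ratio $r$, followed by batch normalization and a non-linear activation $\delta$ (typically ReLU or similar):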

$f = \text{concat}(F_h, F_w) \in \mathbb{R}^{C\times(H+W)}$
$f' = \delta(\text{BN}(\text{Conv}_{1\times1}(f))) \in \mathbb{R}^{C/r\times(H+W)}$

Splitting $f'$ restores the two directions:

$f'_h \in \mathbb{R}^{C/r \times H \times 1}$
$f'_w \in \mathbb{R}^{C/r \times 1 \times W}$

Each is projected via a $1\times1$ convolution to recover channel-wise scale and activated by sigmoid to yield:

$g_h = \sigma(\text{Conv}^{h}_{1\times1}(f'_h)) \in \mathbb{R}^{C\times H\times 1}$
$g_w = \sigma(\text{Conv}^{w}_{1\times1}(f'_w)) \in \mathbb{R}^{C\times 1\times W}$

Final recalibration is performed as:

$Y(c,i,j) = X(c,i,j) \cdot g_h(c,i) \cdot g_w(c,j)$

This operation enables each feature activation to be modulated by its vertical and horizontal position, efficiently instilling spatial awareness with lightweight computation (Hou et al., 2021).
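
The whole pipeline described in this section (pool, concatenate, shared transform, split, per-axis gates, recalibration) can be sketched in a few lines of NumPy. This is a simplified illustration rather than the authors' implementation: batch normalization and biases are omitted, ReLU stands in for $\delta$, each $1\times1$ convolution is written as a matrix product over the channel axis, the weights are random, and the names `coordinate_attention`, `W1`, `Wh`, `Ww`, `g_h`, `g_w`, and `Y` are choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(X, W1, Wh, Ww):
    """Simplified coordinate attention forward pass on one feature map.

    X  : (C, H, W) input features
    W1 : (C//r, C) shared 1x1 convolution (channel reduction)
    Wh : (C, C//r) 1x1 convolution for the height branch
    Ww : (C, C//r) 1x1 convolution for the width branch
    Assumption: no batch norm, no biases, ReLU as the activation.
    """
    C, H, W = X.shape
    F_h = X.mean(axis=2)                      # (C, H): pooled over width
    F_w = X.mean(axis=1)                      # (C, W): pooled over height
    f = np.concatenate([F_h, F_w], axis=1)    # (C, H+W)
    f_prime = np.maximum(W1 @ f, 0.0)         # (C//r, H+W)
    f_h, f_w = f_prime[:, :H], f_prime[:, H:] # split back into the two axes
    g_h = sigmoid(Wh @ f_h)                   # (C, H) row gates
    g_w = sigmoid(Ww @ f_w)                   # (C, W) column gates
    # Each activation is scaled by its row gate and its column gate.
    Y = X * g_h[:, :, None] * g_w[:, None, :]
    return Y, g_h, g_w

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 6, 4                       # illustrative sizes
X = rng.standard_normal((C, H, W))
W1 = rng.standard_normal((C // r, C)) * 0.1
Wh = rng.standard_normal((C, C // r)) * 0.1
Ww = rng.standard_normal((C, C // r)) * 0.1

Y, g_h, g_w = coordinate_attention(X, W1, Wh, Ww)
print(Y.shape, g_h.shape, g_w.shape)          # (8, 5, 6) (8, 5) (8, 6)
```

The output has the same shape as the input, so the gate can sit after any convolutional block without changing the surrounding architecture.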

2. Architectural Variants and Extensions

Coordinate Attention has served as the basis for a spectrum of architectural variants.

Grouped Coordinate Attention (GCA): Extends CA by splitting the channels into $G$ groups, allowing attention to be computed independently per group, which is beneficial in settings with strong channel-wise semantic heterogeneity (Ding et al., 30 Dec 2025, Ding et al., 18 Nov 2025). In GCA, both average- and max-pooling are used along each axis, and group-local attention is fused, yielding improved boundary delineation at a minor increase in computational cost.
Spatial and Spectral Coordinate Attention (MHIF): In the context of multi-modal or spectral data fusion (e.g., multispectral/hyperspectral image fusion), CoFusion (Li, 12 Apr 2026) generalizes the CA idea into spatial (SpaCAM: multi-dilation, spatially softmax-gated mixing) and spectral (SpeCAM: frequency decomposition, coordinate mixing and Top–K selection in channel space) axes, enabling cross-modal, cross-scale collaborative representation.
Integration with ASPP and UNet: CA modules have been paired with ASPP and integrated throughout encoder, bottleneck, and decoder stages in modern UNet-like architectures to refine hierarchical representations at different scales (Wang et al., 2024).
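
To make the grouping idea concrete, the sketch below computes per-group average- and max-pooled descriptors along each axis, which is the first step the GCA description implies. It stops at the descriptors because the text does not specify how GCA fuses them. The function name `grouped_axis_descriptors`, the group count, and the tensor sizes are hypothetical.

```python
import numpy as np

def grouped_axis_descriptors(X, G):
    """Per-group average and max descriptors along each spatial axis.

    X : (C, H, W) features, with C divisible by the group count G.
    Returns arrays of shape (G, C//G, H) for the height axis and
    (G, C//G, W) for the width axis. Illustrative only: the fusion
    step of GCA is not reproduced here.
    """
    C, H, W = X.shape
    if C % G != 0:
        raise ValueError("channel count must be divisible by the group count")
    Xg = X.reshape(G, C // G, H, W)   # consecutive channels form one group
    return {
        "avg_h": Xg.mean(axis=3),     # pooled over width
        "max_h": Xg.max(axis=3),
        "avg_w": Xg.mean(axis=2),     # pooled over height
        "max_w": Xg.max(axis=2),
    }

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5, 6))    # C=8, H=5, W=6 (illustrative)
desc = grouped_axis_descriptors(X, G=4)
print(desc["avg_h"].shape, desc["max_w"].shape)  # (4, 2, 5) (4, 2, 6)
```

Because each group is pooled separately, a later per-group transform can learn different attention patterns for channel subsets with different semantics.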

3. Comparison with Other Attention Mechanisms

CA can be distinguished from other widely adopted attention designs as follows:

Mechanism	Core Operation	Spatial Encoding	Cost (vs CA)
SE	2D global pooling + FC	None (spatial pooled)	Lower params, less context
CBAM	Channel: SE; Spatial: 7×7 conv	Local (spatial branch), no long-range	Similar params, local context only
CA	1D pooling on H/W + shared transform	Directional/global	Slight MAdds/parameter increase
GCA	CA in groups, pooling (avg/max), expanded context	Enhanced, group- and axis-aware	Slightly higher, scales with G, r

Coordinate Attention achieves higher image classification, object detection, and segmentation performance than SE and CBAM, especially in dense prediction tasks, while still maintaining efficiency suitable for mobile architectures (Hou et al., 2021).
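
A rough weight count shows why the overhead relative to SE stays small. The functions below count only the main weights, under simplifying assumptions: biases and batch-norm parameters are ignored, both modules use the same reduction ratio, and the ratio of 16 is an illustrative choice. The helper names are hypothetical.

```python
def se_params(C, r):
    # SE: two fully connected layers, C -> C/r -> C (biases ignored).
    m = C // r
    return C * m + m * C

def ca_params(C, r):
    # CA: one shared 1x1 conv C -> C/r, then two 1x1 convs C/r -> C,
    # one per axis (biases and batch norm ignored).
    m = C // r
    return C * m + 2 * m * C

for C in (64, 256):
    print(C, se_params(C, 16), ca_params(C, 16))
# 64 512 768
# 256 8192 12288
```

Under these assumptions CA carries about 1.5 times the weights of an SE block of the same width, since the only structural addition is the second output projection.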

4. Empirical Performance and Computational Profile

Coordinate Attention and its variants deliver consistent improvements across diverse tasks with marginal overhead:

ImageNet (MobileNetV2-1.0×): +2.0% top-1 accuracy (from 72.3% baseline to 74.3% with CA), with only +0.45M params and +10M MAdds (Hou et al., 2021).
Object Detection (COCO, SSDLite320): AP increases from 22.3 baseline to 24.5 with CA, surpassing SE and CBAM.
Semantic Segmentation (DeepLabV3, Pascal VOC): mIoU improves from 70.84% to 73.32% (stride 16), with larger gains in high-resolution settings; on Cityscapes (stride 8), mIoU improves from 71.4% to 74.0%.
Medical Image Segmentation: GCA-ResUNet achieves 86.11% Dice on Synapse (vs. 77.61% no attention; Swin-UNet: 79.13%) and 92.64% Dice on ACDC (vs. 89.68% U-Net, 90.00% Swin-UNet) with computational overhead of ~2% in parameters and ~0.3% in runtime (Ding et al., 30 Dec 2025, Ding et al., 18 Nov 2025).
Dense/small structure segmentation: GCA improves boundary recovery and small target delineation, especially where channel groups correspond to different organ/semantic classes.
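
For quick comparison, the absolute gains implied by the figures above can be tabulated with a few lines of Python. The numbers are copied from the list; the dictionary keys are labels chosen for this sketch.

```python
# (baseline, with attention) pairs as quoted in the list above.
results = {
    "ImageNet top-1, MobileNetV2-1.0x (%)": (72.3, 74.3),
    "COCO AP, SSDLite320": (22.3, 24.5),
    "Pascal VOC mIoU, stride 16 (%)": (70.84, 73.32),
    "Cityscapes mIoU, stride 8 (%)": (71.4, 74.0),
    "Synapse Dice, GCA-ResUNet vs. no attention (%)": (77.61, 86.11),
    "ACDC Dice, GCA-ResUNet vs. U-Net (%)": (89.68, 92.64),
}

# Absolute improvement in points, rounded to two decimals.
deltas = {name: round(new - base, 2) for name, (base, new) in results.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta}")
```

The gains range from about 2 points on classification and detection to 8.5 Dice points on Synapse, in line with the observation that dense prediction benefits most.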

CA and GCA raise segmentation scores substantially over their baselines while requiring far fewer additional resources than self-attention or pure transformer modules.

5. Implementation in Canonical Architectures

The modularity of CA enables straightforward integration into mobile and high-resolution CNNs:

MobileNetV2/MobileNeXt/EfficientNet: CA is inserted at the end of residual blocks (replacing or augmenting SE).
UNet derivatives (Encoder/Decoder): CA is placed after each convolutional block, at the ASPP bottleneck, and at decoder stages; GCA is inserted into all ResNet-50 bottlenecks in GCA-ResUNet.
Medical image segmentation pipelines: CA operates pre- and post-skip connections, post-ASPP fusion, and facilitates sharper edge delineation (Wang et al., 2024).
Spectral/Multimodal fusion: Custom coordinate-aware modules (SpaCAM, SpeCAM) operate at each scale and domain, with SSCFM performing spatial-spectral alignment (Li, 12 Apr 2026).

CA is valued for its “plug-and-play” character, ease of parameter configuration, and transferability across tasks.
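
The plug-and-play placement at the end of a residual block can be sketched as follows. Everything here is a stand-in for illustration: `simple_coordinate_gate` is a parameter-free simplification of CA (it skips the learned $1\times1$ convolutions), `toy_body` replaces the block's real convolutions, and `residual_block` is a hypothetical wrapper, not an API from any framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_coordinate_gate(X):
    """Parameter-free stand-in for a CA module (illustrative assumption)."""
    g_h = sigmoid(X.mean(axis=2, keepdims=True))  # (C, H, 1) row gates
    g_w = sigmoid(X.mean(axis=1, keepdims=True))  # (C, 1, W) column gates
    return X * g_h * g_w

def toy_body(X):
    """Placeholder for the convolutions inside a residual block."""
    return np.maximum(0.5 * X, 0.0)

def residual_block(X, body, attention=None):
    """Apply the block body, optionally gate its output, then add the skip."""
    out = body(X)
    if attention is not None:
        out = attention(out)   # attention sits at the end of the residual branch
    return X + out

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
y_plain = residual_block(X, toy_body)
y_ca = residual_block(X, toy_body, attention=simple_coordinate_gate)
print(y_plain.shape, y_ca.shape)  # (4, 5, 6) (4, 5, 6)
```

Because the gate preserves the tensor shape, enabling or disabling it requires no change to the layers before or after the block.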

6. Practical Advantages and Limitations

Coordinate Attention offers several distinct advantages:

Direction-aware global context: Captures height- and width-wise dependencies while preserving spatial location.
Parameter/computational efficiency: Typically incurs only minor resource increase over SE and CBAM; orders-of-magnitude less than self-attention or transformer-based blocks.
Improvement in dense tasks: Consistently outperforms baseline and SE/CBAM-equipped models in object detection, semantic segmentation, and medical image analysis, particularly for multi-object and small region prediction.
Flexibility and extensibility: The grouping strategy in GCA supports specialization for heterogeneous features; variants support diverse contexts (e.g., spectral axes).

A plausible implication is that in domains with strong semantic channel clustering or cross-modal dependencies, grouped and multi-branch coordinate attention can approach the global modeling capacity of transformers with far less computation. However, the reliance on axis-specific pooling could, in principle, miss interactions between more complex spatial regions not aligned with axes; hybrid or hierarchical extensions may address this.
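
The axis-alignment limitation can be illustrated on a single channel. In the toy example below (sizes and gate values are arbitrary assumptions), a diagonal and an anti-diagonal activation pattern yield identical row and column means, so the pooled descriptors cannot distinguish them. In addition, a per-channel gate of the form $g_h(i)\,g_w(j)$ is an outer product and therefore has rank 1, whereas a mask that selects the diagonal has full rank.

```python
import numpy as np

H = W = 4
diag = np.eye(H)          # activations on the main diagonal
anti = np.fliplr(diag)    # activations on the anti-diagonal

# Identical pooled descriptors for two different spatial patterns.
same_rows = np.allclose(diag.mean(axis=1), anti.mean(axis=1))
same_cols = np.allclose(diag.mean(axis=0), anti.mean(axis=0))

# A separable gate is a rank-1 matrix; a diagonal mask is not.
g_h = np.array([0.9, 0.1, 0.5, 0.7])   # arbitrary row gates
g_w = np.array([0.2, 0.8, 0.6, 0.4])   # arbitrary column gates
gate = np.outer(g_h, g_w)

gate_rank = np.linalg.matrix_rank(gate)
diag_rank = np.linalg.matrix_rank(diag)
print(same_rows, same_cols, gate_rank, diag_rank)  # True True 1 4
```

In a full network, stacking several CA-equipped layers and mixing channels between them can partly compensate, but a single gate on a single channel remains separable in this sense.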

7. Recent Applications and Research Directions

Post-2021 work has actively extended CA:

Medical image segmentation: GCA-ResUNet establishes state-of-the-art performance on multi-organ CT/MRI and skin lesion datasets with minimal cost (Ding et al., 30 Dec 2025, Ding et al., 18 Nov 2025).
Multispectral/hyperspectral fusion: CoFusion exploits spatial and spectral coordinate-aware modules for superior detail recovery and spectral fidelity (Li, 12 Apr 2026).
Hybrid modules: Pairing CA/GCA with ASPP, attention UNet, and cross-modal fusion networks leverages their complementary strengths (Wang et al., 2024).
Effectiveness on heterogeneous data: Evidence supports use of grouped attention when targets/features exhibit divergent semantics, with grouping and pooling choices subject to ablation and optimization (Ding et al., 30 Dec 2025).

Further research continues on balancing group count and reduction ratio, expanding coordinate ideas into temporal, spectral, and cross-modal domains, and optimizing hardware efficiency for deployment in resource-limited settings.

References:

"Coordinate Attention for Efficient Mobile Network Design" (Hou et al., 2021)
"Improved Unet model for brain tumor image segmentation based on ASPP-coordinate attention mechanism" (Wang et al., 2024)
"GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention" (Ding et al., 30 Dec 2025)
"GCA-ResUNet: Image segmentation in medical images using grouped coordinate attention" (Ding et al., 18 Nov 2025)
"CoFusion: Multispectral and Hyperspectral Image Fusion via Spectral Coordinate Attention" (Li, 12 Apr 2026)