
Grouped Coordinate Attention (GCA)

Updated 24 November 2025
  • Grouped Coordinate Attention (GCA) is an attention mechanism that decomposes feature channels into groups and applies directional pooling to achieve efficient global context modeling.
  • GCA integrates with CNN backbones to enhance long-range dependency modeling and boundary segmentation, significantly boosting performance in tasks like medical image segmentation.
  • By using lightweight 1×1 bottlenecks and minimal overhead (under 5% additional FLOPs), GCA approximates global attention while preserving fine spatial details.

Grouped Coordinate Attention (GCA) is an attention mechanism designed to endow convolutional neural networks with efficient long-range dependency modeling by decomposing feature channels into groups, applying directional pooling along spatial axes, and producing global but lightweight attention maps. Originally introduced in the context of medical image segmentation, GCA addresses the limitations of both standard convolution (localized receptive field) and full self-attention (quadratic computational cost), enabling convolutional backbones to approximate global context with negligible parameter and FLOP overhead compared to transformer-based mechanisms (Ding et al., 18 Nov 2025).

1. Motivation and Conceptual Framework

Conventional convolution operations provide localized feature extraction and struggle to encode the long-range contextual information essential for tasks such as boundary delineation in medical images. While self-attention mechanisms, as instantiated in Transformers, allow arbitrary global modeling, they exhibit $O(H^2W^2C)$ cost for feature tensors of spatial resolution $H \times W$ and channel dimension $C$. Existing lightweight attention modules (e.g., SE, CBAM, CoordAtt) are either too aggressive in collapsing spatial inputs or incur non-trivial convolutional overhead to achieve full 2D spatial modeling.

GCA counters these limitations by:

  • Partitioning the channel dimension into $G$ groups of size $C_g = C/G$,
  • Performing coordinate pooling (average and max) independently along height and width within each group,
  • Fusing these axis-pooled descriptors through a two-stage, shared 1×1 bottleneck to produce directional attention maps,
  • Broadcasting these maps to reweight input features within each group, then recombining the results.

This hierarchical and group-wise approach yields efficient, directionally-aware global attention, preserving fine spatial information while incurring under 5% additional FLOPs per block relative to standard convolutions (Ding et al., 18 Nov 2025).

2. Mathematical Formulation and Mechanism

Let $X \in \mathbb{R}^{B \times C \times H \times W}$ denote the input tensor, where $B$ is the batch size, $C$ the number of channels, and $H$, $W$ the spatial dimensions. The steps are:

  1. Channel Grouping: $X$ is split into $G$ groups, $X = [X_1, \ldots, X_G]$, each $X_g \in \mathbb{R}^{B \times C_g \times H \times W}$.
  2. Directional (Coordinate) Pooling for each group $g$:
    • Height:
      • $f_{g}^{h,\text{avg}} = \text{AvgPool}_W(X_g) \in \mathbb{R}^{B \times C_g \times H \times 1}$
      • $f_{g}^{h,\text{max}} = \text{MaxPool}_W(X_g) \in \mathbb{R}^{B \times C_g \times H \times 1}$
    • Width:
      • $f_{g}^{w,\text{avg}} = \text{AvgPool}_H(X_g) \in \mathbb{R}^{B \times C_g \times 1 \times W}$
      • $f_{g}^{w,\text{max}} = \text{MaxPool}_H(X_g) \in \mathbb{R}^{B \times C_g \times 1 \times W}$
    • Descriptors: $F_g^h = [f_{g}^{h,\text{avg}}, f_{g}^{h,\text{max}}] \in \mathbb{R}^{B \times 2C_g \times H \times 1}$; similarly $F_g^w \in \mathbb{R}^{B \times 2C_g \times 1 \times W}$.
  3. Shared Bottleneck and Attention Generation:
    • The concatenated descriptors $[F_g^h; F_g^w]$ are processed by a two-stage 1×1 convolution bottleneck (reduction ratio $r$) with BN, ReLU, and a Sigmoid gate.
    • The output is split into $A_g^h \in \mathbb{R}^{B \times C_g \times H \times 1}$ and $A_g^w \in \mathbb{R}^{B \times C_g \times 1 \times W}$.
  4. Feature Reweighting and Merge:
    • $Y_g = X_g \odot A_g^h \odot A_g^w$.
    • Final output: $Y = \text{Concat}(Y_1, \ldots, Y_G) \in \mathbb{R}^{B \times C \times H \times W}$.
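
The following PyTorch sketch illustrates the computation above. It is a minimal illustration under stated assumptions, not the authors' reference implementation: the class name `GroupedCoordAttention`, the default `groups`/`reduction` values, and the choice to fold the group axis into the batch axis are all illustrative.

```python
import torch
import torch.nn as nn


class GroupedCoordAttention(nn.Module):
    """Minimal sketch of grouped coordinate attention (Section 2).

    Channels are split into `groups`; each group is pooled along H and W
    (average and max), the concatenated descriptors pass through a shared
    two-stage 1x1 bottleneck, and the resulting directional maps reweight
    the input group-wise.
    """

    def __init__(self, channels: int, groups: int = 8, reduction: int = 16):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by groups"
        self.groups = groups
        cg = channels // groups
        hidden = max(cg // reduction, 4)  # reduction ratio r applied to C_g
        # Shared two-stage 1x1 bottleneck over the (avg, max) descriptors.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(2 * cg, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, cg, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        cg = c // self.groups
        # Fold the group axis into the batch axis: (B*G, Cg, H, W).
        xg = x.reshape(b * self.groups, cg, h, w)

        # Height descriptors F^h: pool along W -> (B*G, 2Cg, H, 1).
        f_h = torch.cat([xg.mean(dim=3, keepdim=True),
                         xg.amax(dim=3, keepdim=True)], dim=1)
        # Width descriptors F^w: pool along H -> (B*G, 2Cg, 1, W).
        f_w = torch.cat([xg.mean(dim=2, keepdim=True),
                         xg.amax(dim=2, keepdim=True)], dim=1)

        # Concatenate [F^h; F^w] along the spatial axis, run the shared
        # bottleneck, apply the sigmoid gate, and split back into A^h, A^w.
        f = torch.cat([f_h, f_w.transpose(2, 3)], dim=2)   # (B*G, 2Cg, H+W, 1)
        a = torch.sigmoid(self.bottleneck(f))               # (B*G, Cg, H+W, 1)
        a_h, a_w = torch.split(a, [h, w], dim=2)
        a_w = a_w.transpose(2, 3)                            # (B*G, Cg, 1, W)

        # Y_g = X_g * A_g^h * A_g^w, then restore the (B, C, H, W) layout.
        y = xg * a_h * a_w
        return y.reshape(b, c, h, w)
```

With hidden width $C_g/r$, the two 1×1 convolutions in this sketch contribute roughly $2C_g \cdot C_g/r + (C_g/r) \cdot C_g = 3C_g^2/r$ parameters, consistent with the per-group estimate in Section 4.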

3. Integration into Residual Architectures

In the GCA-ResUNet architecture, the GCA module is integrated into the bottleneck block of a ResNet-50 backbone. The sequence is as follows:

  • Standard bottleneck: x → 1×1 Conv (reduction) → 3×3 Conv → 1×1 Conv (expansion) → BN.
  • GCA is applied to the output of this expansion+BN, prior to residual addition.
  • The GCA-reweighted features are then added to the skip (identity) connection and passed through a ReLU activation.

Applying GCA at this location ensures both local (convolutional) and long-range (grouped coordinate attention) dependencies are encoded in each residual unit, prior to merging with the skip connection (Ding et al., 18 Nov 2025).
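
A minimal sketch of this placement, reusing the `GroupedCoordAttention` class from the previous sketch; the block structure follows a standard ResNet-50 bottleneck, and the `downsample` projection argument is the usual torchvision-style convention rather than anything specified in the source.

```python
import torch.nn as nn


class GCABottleneck(nn.Module):
    """ResNet-50-style bottleneck with GCA applied before residual addition."""

    expansion = 4

    def __init__(self, in_channels: int, planes: int, stride: int = 1,
                 downsample=None, groups: int = 8, reduction: int = 16):
        super().__init__()
        out_channels = planes * self.expansion
        self.conv1 = nn.Conv2d(in_channels, planes, 1, bias=False)   # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=stride,
                               padding=1, bias=False)                # 3x3
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, out_channels, 1, bias=False)  # 1x1 expand
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.gca = GroupedCoordAttention(out_channels, groups, reduction)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample  # projects the identity branch if needed

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))   # expansion + BN
        out = self.gca(out)               # GCA reweighting before the skip connection
        return self.relu(out + identity)
```

Placing `self.gca` after the expansion convolution and its BN, and before the addition with the identity branch, mirrors the sequence described above.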

4. Computational Complexity and Comparison

The parameter and FLOP profile of GCA is:

  • Per group: the two 1×1 convolutions require $3C_g^2/r$ parameters.
  • Total: $3C^2/r$ parameters across the $G$ groups.

For typical settings ($C=256$, $H=W=56$, $r=16$), this amounts to less than 2% extra parameters relative to the original bottleneck and less than 5% FLOP overhead per block. In comparison:

Layer            FLOPs growth              Parameters
Self-attention   $O((HW)^2 C)$             $O(C^2)$
GCA              $O(C^2/r + HC + WC)$      $O(C^2/r)$

Self-attention scales quadratically with spatial size, whereas GCA's overhead grows only linearly with the spatial extent ($HC + WC$) and adds a single $C^2/r$ channel term. GCA thus delivers substantial computational savings, particularly for high-resolution inputs.
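
As a rough numeric illustration of the scaling gap in the table above, the snippet below evaluates only the leading asymptotic terms at the typical setting $C=256$, $H=W=56$, $r=16$; these are not measured FLOP counts, and the helper names are illustrative.

```python
def self_attention_term(H: int, W: int, C: int) -> int:
    """Leading term from the table: O((HW)^2 * C), constants ignored."""
    return (H * W) ** 2 * C


def gca_term(H: int, W: int, C: int, r: int) -> int:
    """Leading terms from the table: O(C^2/r + H*C + W*C), constants ignored."""
    return C * C // r + H * C + W * C


H, W, C, r = 56, 56, 256, 16
print(self_attention_term(H, W, C))  # 2,517,630,976
print(gca_term(H, W, C, r))          # 32,768
```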

5. Empirical Results and Ablation Studies

Extensive experimentation on Synapse multi-organ CT and ACDC cardiac MR segmentation benchmarks demonstrates the superiority of GCA-integrated architectures:

Synapse Dataset: Dice Similarity (DSC, %)

Method Avg DSC
R50-U-Net 77.61
Att-U-Net 77.77
Swin-U-Net 79.13
SelfReg-U-Net 80.54
VM-U-Net 81.08
GCA-ResUNet 86.11

ACDC Dataset: Dice Similarity (DSC, %)

Method Avg DSC
U-Net 89.68
Swin-U-Net 90.00
SelfReg-U-Net 91.49
GCA-ResUNet 92.64

Inference times and GPU memory utilization for GCA-ResUNet (under 4 GB at $224 \times 224$) match standard ResNet-U-Net pipelines, whereas hybrid transformer-based U-Nets typically require 8–12 GB (Ding et al., 18 Nov 2025).

Ablation studies compared variants of the residual block:

Variant Synapse Avg DSC (%)
Baseline 77.6
+SE Module 79.2
+CBAM 80.1
+GCA 86.1

GCA inserted into every residual block (Layers 1–4) provided the strongest performance, with incremental gains also observed for partial placements.

6. Strengths, Limitations, and Prospective Directions

Strengths:

  • Facilitates global context modeling along spatial axes and channels with minimal compute and parameter overhead.
  • Significantly enhances segmentation of boundaries and small anatomical structures.
  • Offers plug-and-play compatibility with existing 2D and 3D residual architectures.

Limitations:

  • Represents a “factorized” approximation that does not encode arbitrary off-axis pairwise relations as in full nonlocal attention.
  • Requires selection of two hyperparameters: $G$ (number of groups) and $r$ (reduction ratio).

Potential Extensions:

  • Generalization to 3D volumetric data through pooling along three orthogonal axes.
  • Synergy with lightweight MLP modules for further stepwise global modeling.
  • Application to non-medical image tasks such as object detection and instance segmentation where directional global context is advantageous.

7. Context and Significance

Grouped Coordinate Attention constitutes an efficient mechanism for integrating structured global dependencies into convolutional backbones, achieving state-of-the-art segmentation performance on medical image benchmarks while maintaining computational and memory efficiency. Its low overhead and modularity position it as a practical enhancement for residual network architectures, with implications for a broad range of vision tasks requiring both local and global spatial understanding (Ding et al., 18 Nov 2025).
