Regional Multi-head Self-Attention (R-MSA)

Updated 12 November 2025
  • R-MSA is a self-attention mechanism that decomposes 2D data into orthogonal 1D row and column tokens to achieve near-global context modeling with reduced complexity.
  • It employs Global Axial Segmentation to split feature maps into row and column sequences, enabling efficient multi-head attention with O(H²+W²) computational scaling.
  • Applied in hyperspectral change detection, R-MSA integrates spatial and temporal modules with convolutional fusion to deliver scalable and accurate analysis.

Regional Multi-head Self-Attention (R-MSA), as formalized in the Cross-Region Multi-Head Self-Attention (CR-MSA) mechanism within the GlobalMind architecture, is a memory- and computation-efficient variant of multi-head self-attention. It achieves near-global context modeling on 2D spatial (and spatio-temporal) data by leveraging a Global Axial Segmentation (GAS) scheme to decompose the input into orthogonal 1D sequences of tokens along rows and columns. This dramatically reduces attention complexity, enabling scalable application to large feature maps, as in hyperspectral change detection scenarios (Hu et al., 2023).

1. Motivation and Foundations

Traditional self-attention applied to an image feature map $F \in \mathbb{R}^{H \times W \times C}$ requires $O((HW)^2)$ memory and compute, which is prohibitive for moderate to large resolutions. R-MSA addresses this limitation by performing attention along a single axis at a time:

  • Global Row Segmentation (GRS) splits $F$ into $H$ tokens (each a full-width row).
  • Global Column Segmentation (GCS) splits $F$ into $W$ tokens (each a full-height column).

In GRS, each row is flattened into a token of dimension $E = W \cdot C$, yielding $X^{(\mathrm{row})} \in \mathbb{R}^{H \times (WC)}$. In GCS, each column forms a token of dimension $E = H \cdot C$, yielding $X^{(\mathrm{col})} \in \mathbb{R}^{W \times (HC)}$.

By sweeping attention in both row and column directions, the model recovers most global long-range context while reducing overall complexity to $O(H^2 + W^2)$.

Segmentation Axis | Sequence Length ($L$) | Token Dim ($E$)
------------------|------------------------|----------------
Row (GRS)         | $H$                    | $W \cdot C$
Column (GCS)      | $W$                    | $H \cdot C$
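
For concreteness, the following is a minimal PyTorch sketch of the two segmentations, assuming an unbatched $(H, W, C)$ tensor layout (the function names and sizes are illustrative, not part of the original formulation):

import torch

def global_row_segmentation(F):
    # GRS: (H, W, C) -> H row tokens, each of dimension E = W*C
    H, W, C = F.shape
    return F.reshape(H, W * C)                   # (H, W*C)

def global_column_segmentation(F):
    # GCS: (H, W, C) -> W column tokens, each of dimension E = H*C
    H, W, C = F.shape
    return F.permute(1, 0, 2).reshape(W, H * C)  # (W, H*C)

F = torch.randn(64, 48, 32)                      # H=64, W=48, C=32
print(global_row_segmentation(F).shape)          # torch.Size([64, 1536])
print(global_column_segmentation(F).shape)       # torch.Size([48, 2048])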

2. Multi-Head Attention Within Axial Segments

Once segmented, standard scaled dot-product multi-head self-attention is applied independently to each axial sequence. For $H$ heads (here $H$ denotes the head count, overloading the symbol used for the spatial height), the per-head dimensionality is $d_k = E/H$:

  • $Q_h = X_p W^Q_h$, $K_h = X_p W^K_h$, $V_h = X_p W^V_h$, with projection matrices $W^Q_h, W^K_h, W^V_h \in \mathbb{R}^{E \times d_k}$.
  • Attention output: $A_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{T}}{\sqrt{d_k}}\right) V_h \in \mathbb{R}^{L \times d_k}$.
  • Final output per axis: $\mathrm{GASAttn}(X_p) = [A_1; \cdots; A_H]\, W^O \in \mathbb{R}^{L \times E}$ with $W^O \in \mathbb{R}^{(H d_k) \times E}$.

Compactly, for an axial token sequence $X_p = \mathrm{split}_p(F)$:

$$Q_h = X_p W^Q_h,\quad K_h = X_p W^K_h,\quad V_h = X_p W^V_h$$

$$A_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{T}}{\sqrt{d_k}}\right) V_h$$

$$\mathrm{GASAttn}(X_p) = [A_1; \ldots; A_H]\, W^O$$
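
A minimal PyTorch sketch of the per-axis attention $\mathrm{GASAttn}$ described above (the class name, bias settings, and head count are illustrative assumptions rather than the reference implementation):

import torch
import torch.nn as nn

class AxialMHSA(nn.Module):
    # Scaled dot-product multi-head self-attention over one axial token sequence X_p of shape (L, E).
    def __init__(self, E, num_heads):
        super().__init__()
        assert E % num_heads == 0
        self.h, self.d_k = num_heads, E // num_heads
        self.q_proj = nn.Linear(E, E, bias=False)
        self.k_proj = nn.Linear(E, E, bias=False)
        self.v_proj = nn.Linear(E, E, bias=False)
        self.o_proj = nn.Linear(E, E, bias=False)     # W^O

    def forward(self, Xp):                            # Xp: (L, E)
        L, E = Xp.shape
        # project, then split into heads: (heads, L, d_k)
        Q = self.q_proj(Xp).view(L, self.h, self.d_k).transpose(0, 1)
        K = self.k_proj(Xp).view(L, self.h, self.d_k).transpose(0, 1)
        V = self.v_proj(Xp).view(L, self.h, self.d_k).transpose(0, 1)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # (heads, L, L)
        A = attn @ V                                  # (heads, L, d_k)
        return self.o_proj(A.transpose(0, 1).reshape(L, E))                      # (L, E)

X_row = torch.randn(64, 48 * 32)                      # e.g. H=64 row tokens of dimension E=W*C=1536
out = AxialMHSA(E=48 * 32, num_heads=8)(X_row)        # (64, 1536)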

3. Global-M (Spatial) and Global-D (Temporal) Modules

Both modules utilize the GAS+MHSA backbone but differ in how they source $Q, K, V$:

3.1 Global-M (Spatial Self-Attention)

  • Input: $F \in \mathbb{R}^{H \times W \times C}$
  • LayerNorm and three $1 \times 1$ convolutions yield $Q, K, V \in \mathbb{R}^{H \times W \times C}$.
  • Axial flattening (row or column) forms $X_p \in \mathbb{R}^{L \times E}$.
  • Apply $\mathrm{GASAttn}(X_p)$, then reshape back to $H \times W \times C$.
  • Fuse attention heads using a $3 \times 3$ convolution (merging head information spatially).
  • Add the input through a residual connection, followed by LayerNorm and an FFN (two $1 \times 1$ convolutions + ReLU + Dropout + residual).

Shape transitions (row axis example):

  1. $F$: $H \times W \times C$
  2. $Q, K, V$: $H \times W \times C$
  3. $X_p$: $H \times (WC)$
  4. $A_h$: $H \times d_k$ for each of the $H$ heads
  5. concatenate $\rightarrow$ $H \times (H d_k) = H \times E$
  6. reshape $\rightarrow$ $H \times W \times C$
  7. $3 \times 3$ conv, residual sum, FFN $\rightarrow$ $H \times W \times C$
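
Putting these steps together, here is a minimal sketch of the row-axis Global-M module; it assumes an unbatched $(H, W, C)$ layout, uses Linear layers in place of the $1 \times 1$ convolutions (equivalent on this layout), and picks hyperparameters for illustration only, so it is not the reference implementation:

import torch
import torch.nn as nn

class GlobalMRow(nn.Module):
    # Sketch of the Global-M spatial branch along the row axis, unbatched (H, W, C) layout.
    def __init__(self, H, W, C, num_heads, dropout=0.1):
        super().__init__()
        E = W * C
        assert E % num_heads == 0
        self.heads, self.d_k = num_heads, E // num_heads
        self.norm1 = nn.LayerNorm(C)
        self.q, self.k, self.v = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)
        self.w_o = nn.Linear(E, E)                               # W^O
        self.fuse = nn.Conv2d(C, C, kernel_size=3, padding=1)    # 3x3 head-fusion convolution
        self.norm2 = nn.LayerNorm(C)
        self.ffn = nn.Sequential(nn.Linear(C, C), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(C, C))

    def forward(self, F):                                        # F: (H, W, C)
        H, W, C = F.shape
        X = self.norm1(F)
        Q, K, V = self.q(X), self.k(X), self.v(X)                # each (H, W, C)

        def rows(T):                                             # GRS + head split -> (heads, H, d_k)
            return T.reshape(H, self.heads, self.d_k).transpose(0, 1)

        Qh, Kh, Vh = rows(Q), rows(K), rows(V)
        attn = torch.softmax(Qh @ Kh.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)  # (heads, H, H)
        A = (attn @ Vh).transpose(0, 1).reshape(H, W * C)        # concatenate heads -> (H, E)
        Z = self.w_o(A).reshape(H, W, C)
        # 3x3 convolution expects (N, C, H, W); fuse head information, then residual with the input
        Z = self.fuse(Z.permute(2, 0, 1).unsqueeze(0)).squeeze(0).permute(1, 2, 0) + F
        return Z + self.ffn(self.norm2(Z))                       # LayerNorm + FFN + residual

out = GlobalMRow(H=64, W=48, C=32, num_heads=8)(torch.randn(64, 48, 32))   # (64, 48, 32)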

3.2 Global-D (Temporal Interactive Attention)

  • Inputs: $F_{t_1}, F_{t_2} \in \mathbb{R}^{H \times W \times C}$
  • $Q = \mathrm{Conv}_{1 \times 1}(\mathrm{LN}(F_{t_1}))$, $K = \mathrm{Conv}_{1 \times 1}(\mathrm{LN}(F_{t_2}))$
  • $V = \mathrm{Conv}_{1 \times 1}(|F_{t_1} - F_{t_2}|)$ (the inter-temporal absolute difference)
  • Axial flattening, then cross-attention in the same GAS fashion (but with $Q, K$ drawn from distinct time steps)
  • Reshape, add $|F_{t_1} - F_{t_2}|$ through a residual connection, then LayerNorm and FFN
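
A minimal sketch of this temporal interaction for the row axis, assuming hypothetical nn.Linear(C, C) projections standing in for the $1 \times 1$ convolutions; the $W^O$ projection, trailing LayerNorm, and FFN are omitted for brevity:

import torch
import torch.nn as nn

def global_d_row(F1, F2, q_proj, k_proj, v_proj, num_heads):
    # Sketch of Global-D along the row axis, unbatched (H, W, C) layout (not the reference code).
    H, W, C = F1.shape
    d_k = (W * C) // num_heads
    diff = (F1 - F2).abs()                                    # inter-temporal absolute difference
    Q = q_proj(nn.functional.layer_norm(F1, (C,)))            # queries from time t1
    K = k_proj(nn.functional.layer_norm(F2, (C,)))            # keys from time t2
    V = v_proj(diff)                                          # values from |F_t1 - F_t2|

    def rows(T):                                              # GRS + head split -> (heads, H, d_k)
        return T.reshape(H, num_heads, d_k).transpose(0, 1)

    Qh, Kh, Vh = rows(Q), rows(K), rows(V)
    attn = torch.softmax(Qh @ Kh.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # (heads, H, H)
    A = (attn @ Vh).transpose(0, 1).reshape(H, W, C)
    return A + diff                                           # residual over the temporal difference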

3.3 Multi-level Feature Fusion

In the full siamese network, Global-M is applied in each branch at multiple scales, followed by Global-D for cross-temporal fusion. The outputs of several Global-D modules are concatenated and passed to a classifier, frequently a $1 \times 1$ convolution followed by sigmoid activation, to yield the final change map.
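
A minimal sketch of such a fusion head, assuming the per-scale Global-D outputs have already been resized to a common $H \times W$ (the function and argument names are hypothetical):

import torch
import torch.nn as nn

def change_head(globald_outputs, conv1x1):
    # globald_outputs: list of (H, W, C_i) tensors; conv1x1: e.g. nn.Conv2d(sum(C_i), 1, kernel_size=1)
    fused = torch.cat(globald_outputs, dim=-1)                # (H, W, sum(C_i))
    logits = conv1x1(fused.permute(2, 0, 1).unsqueeze(0))     # (1, 1, H, W)
    return torch.sigmoid(logits)[0, 0]                        # (H, W) change-probability map

outs = [torch.randn(64, 48, 32), torch.randn(64, 48, 32)]
prob = change_head(outs, nn.Conv2d(64, 1, kernel_size=1))     # (64, 48)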

4. Computational Complexity and Memory Analysis

CR-MSA achieves substantial computational savings relative to full 2D self-attention:

Attention Type      | Complexity               | Memory (attention map)
--------------------|--------------------------|------------------------------
Full SA (2D)        | $O((HW)^2 \cdot d)$      | $(HW) \times (HW)$
GAS, row only       | $O(H^2 \cdot d)$         | $H \times H$
GAS, column only    | $O(W^2 \cdot d)$         | $W \times W$
CR-MSA (both axes)  | $O((H^2 + W^2) \cdot d)$ | $H \times H$ and $W \times W$

For high-resolution data ($H, W \gg 1$), $O(H^2 + W^2) \ll O((HW)^2)$, making CR-MSA feasible for hyperspectral or remote-sensing imagery.
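
As a quick arithmetic check of this claim, consider a hypothetical $256 \times 256$ feature map (the channel-dependent factor $d$ is common to both expressions and cancels):

# Attention-map sizes for a hypothetical 256 x 256 feature map
H = W = 256
full_sa  = (H * W) ** 2        # 4,294,967,296 entries for full 2D self-attention
axial_sa = H ** 2 + W ** 2     # 131,072 entries for row + column attention (CR-MSA)
print(full_sa // axial_sa)     # 32768-fold reduction in attention-map entries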

5. Algorithmic Workflow

The following high-level pseudocode summarizes the CR-MSA pipeline and its integration in a change-detection network:

function CR_MSA(F, axis, num_heads):
    X = LayerNorm(F)
    Q = Conv1x1(X); K = Conv1x1(X); V = Conv1x1(X)
    if axis == 'row':
        L, E = H, W*C
        Qp, Kp, Vp = reshape_rows(Q), reshape_rows(K), reshape_rows(V)
    else:
        L, E = W, H*C
        Qp, Kp, Vp = reshape_cols(Q), reshape_cols(K), reshape_cols(V)
    d_k = E / num_heads
    heads = []
    for h in 1..num_heads:
        Qh = Qp @ WQ[h]; Kh = Kp @ WK[h]; Vh = Vp @ WV[h]
        heads.append(softmax(Qh @ Kh^T / sqrt(d_k)) @ Vh)
    Zp = concat(heads); Z = Zp @ WO
    Z2 = reshape_back(Z)                        # back to H x W x C
    Z3 = Conv3x3(Z2) + F                        # 3x3 head fusion + residual with the input
    Z4 = LayerNorm(Z3)
    FF = Conv1x1(Dropout(ReLU(Conv1x1(Z4))))    # FFN: two 1x1 convs + ReLU + Dropout
    return Z3 + FF

function GlobalM(F, axis):
    return CR_MSA(F, axis, num_heads)           # num_heads is a model hyperparameter

function GlobalD(F1, F2, axis):
    X1, X2 = LayerNorm(F1), LayerNorm(F2)
    Q, K = Conv1x1(X1), Conv1x1(X2)
    V = Conv1x1(abs(F1 - F2))
    # same axial attention as CR_MSA, but Q and K come from distinct time steps and the
    # residual connection adds abs(F1 - F2) before the LayerNorm + FFN
    return CR_MSA_cross(Q, K, V, axis)

Multi-scale hierarchical features are first computed by a convolutional backbone; GlobalM and GlobalD are then applied at each scale, and their outputs are fused and classified to produce the change probability map.
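
To make the single-scale data flow concrete, here is a toy wiring of the earlier sketches (all names are the hypothetical ones introduced above, not GlobalMind's actual API):

import torch
import torch.nn as nn

H, W, C, heads = 64, 48, 32, 8
F1, F2 = torch.randn(H, W, C), torch.randn(H, W, C)      # bi-temporal backbone features

global_m = GlobalMRow(H, W, C, num_heads=heads)           # spatial module (Section 3.1 sketch)
q_proj, k_proj, v_proj = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)

S1, S2 = global_m(F1), global_m(F2)                       # per-branch spatial attention
D = global_d_row(S1, S2, q_proj, k_proj, v_proj, heads)   # temporal interactive attention
prob = change_head([D], nn.Conv2d(C, 1, kernel_size=1))   # (H, W) change-probability map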

6. Applications and Empirical Outcomes

CR-MSA, instantiated as R-MSA in GlobalMind, enables accurate hyperspectral change detection by efficiently modeling both spatial and temporal dependencies. Extensive experiments across five standard hyperspectral datasets indicate superior accuracy and efficiency relative to state-of-the-art methods (Hu et al., 2023). The architecture supports fine- to coarse-grained change detection through multi-level feature fusion, and the $O(H^2 + W^2)$ scaling allows application to large-area, high-resolution remote sensing data.

7. Context and Implications

Global Axial Segmentation and R-MSA generalize axial attention schemes, leveraging sparse yet expressive attention patterns to approximate full global context without prohibitive cost. This approach is especially relevant in hyperspectral remote sensing, but plausibly extends to other structured 2D data where conventional self-attention proves intractable. A plausible implication is that such methods may become canonical in domains with high spatial or spectral dimensionality, where full attention is impractical. Optimal axis selection, aggregation strategies, and the impact of head fusion via convolution remain active research questions, particularly as model sizes grow and the efficiency/accuracy trade-off becomes central in production-scale systems.
