Efficient Patchwise Axial Self-Attention

Updated 30 April 2026
  • Patchwise axial self-attention is a method that factorizes global self-attention into efficient patch-based and axial operations for high-dimensional vision tasks.
  • It partitions data into non-overlapping patches to perform local attention before aggregating global context, balancing fine detail with long-range dependencies.
  • Architectures like the Axial Transformer, AEWin Transformer, and GASA-UNet leverage this approach to reduce computational cost while maintaining near-global receptive fields.

Patchwise axial self-attention is a family of self-attention mechanisms that factorize global attention into more memory- and computation-efficient operations by performing attention along single axes of high-dimensional data or within and across spatial patches. This structure enables context modeling in vision and volumetric tasks that would be infeasible for standard quadratic self-attention. Architectures including the Axial Transformer, AEWin Transformer, GASA-UNet, and multiscale Self-Attentive Convolutions (MSAC) instantiate various forms of patchwise axial self-attention in both 2D and 3D domains, with specific designs for balancing local detail and global context while controlling cost (Ho et al., 2019, Zhang et al., 2022, Sun et al., 2024, Barkan, 2019).

1. Principles of Axial and Patchwise Axial Self-Attention

Conventional self-attention on a tensor $X \in \mathbb{R}^{N^1 \times N^2 \times \cdots \times N^d \times D}$ requires $O((\prod_{i=1}^d N^i)^2)$ operations and memory, which quickly becomes infeasible for large spatial domains. Axial self-attention addresses this by performing attention sequentially along each axis (e.g., rows, columns, depth), reducing the cost to $O(dN^{1+1/d})$ for a hypercube tensor with $N$ total positions (Ho et al., 2019). For images, axial attention applies separate attention to each row and then to each column, factorizing global dependence.
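To make the gap between the two cost expressions concrete, the following minimal sketch counts query-key pairs for full and axial attention. It ignores the channel dimension and constant factors, and the 64 x 64 shape is a hypothetical example rather than a setting from the cited papers.

```python
from math import prod

def full_attention_pairs(shape):
    """Query-key pairs when every position attends to every position."""
    n = prod(shape)
    return n * n

def axial_attention_pairs(shape):
    """Query-key pairs when attention runs once along each axis in turn.

    Along axis i, each of the prod(shape) positions attends to shape[i] positions.
    """
    n = prod(shape)
    return sum(n * length for length in shape)

shape = (64, 64)  # hypothetical 64 x 64 feature map
full = full_attention_pairs(shape)
axial = axial_attention_pairs(shape)
print(f"full: {full}, axial: {axial}, ratio: {full // axial}")
```

For a square map with side 64, the axial count equals $d N^{1+1/d}$ with $d = 2$ and $N = 4096$ total positions.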

Patchwise axial self-attention further reduces cost and increases flexibility by (1) partitioning input data into patches or windows and (2) performing axial or windowed attention within and/or across these regions. Typical variants combine local window attentions with global axial operations or use a two-stage local-global block structure. This is motivated by the need to capture both fine local cues and long-range dependencies without incurring prohibitive compute/memory requirements (Zhang et al., 2022, Ho et al., 2019).

2. Canonical Algorithms and Variants

Several approaches elaborate the concept of patchwise axial self-attention, differing in their partitioning schemes, ordering, and fusion strategies:

  • Two-Stage Patchwise Axial Attention (Ho et al., 2019):

    1. Partitioning: The input is split into non-overlapping $g \times g$ patches.
    2. Intra-Patch (Local) Axial Attention: For each patch, independently apply row- and column-wise self-attention.
    3. Patch Summarization: Each patch is reduced to a vector (e.g., by mean pooling).
    4. Inter-Patch (Global) Axial Attention: Row- and column-wise attention across the grid of patch summaries.
    5. Feedback: Broadcast or add the patch-level global representations back to the fine spatial grid.
  • Axially Expanded Window Attention (AEWin) (Zhang et al., 2022):

    1. Tokens are divided into local, non-overlapping square windows and row/column stripes.
    2. Attention is computed in parallel on each local window (fine granularity) and on horizontal/vertical stripes (coarse granularity), with heads split among these groups.
    3. Outputs from all groups are concatenated and projected.
  • Global Axial Self-Attention for 3D Volumes (Sun et al., 2024):

    1. 3D feature maps are collapsed along axes using 2D convolutions to generate three sets of 1D patch sequences for width, height, and depth.
    2. The concatenated sequence undergoes multi-head self-attention.
    3. Learnable 1D positional embeddings are injected post-attention to retain spatial identity.
    4. Attention-enhanced features are “unflattened” by axis and concatenated, restoring a 3D tensor for fusion.
  • Self-Attentive Convolutions (SAC, MSAC) (Barkan, 2019):

    1. Q/K/V projections become $n \times m$ convolutions.
    2. Sliding window (patchwise) attention is computed locally within each patch.
    3. Multiscale context is captured by running SAC blocks at various patch sizes in parallel, with outputs summed or concatenated.
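The five steps of the two-stage variant listed above can be sketched in NumPy as follows. This is an illustrative sketch under simplifying assumptions, not the cited authors' implementation: attention uses Q = K = V = x with no learned projections, patches are summarized by mean pooling, and the feedback step is a broadcast add. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(x):
    """Attention over axis -2 of x (..., L, D), with Q = K = V = x (assumption)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def axial(x):
    """Row-wise then column-wise attention on the last three axes (..., R, C, D)."""
    x = self_attend(x)              # along C, within each row
    x = np.swapaxes(x, -3, -2)      # (..., C, R, D)
    x = self_attend(x)              # along R, within each column
    return np.swapaxes(x, -3, -2)

def two_stage_patchwise_axial(x, g):
    """x: (H, W, D) with H and W divisible by the patch size g."""
    H, W, D = x.shape
    ph, pw = H // g, W // g
    # 1. Partition into a (ph, pw) grid of g x g patches.
    patches = x.reshape(ph, g, pw, g, D).transpose(0, 2, 1, 3, 4)
    # 2. Local axial attention inside every patch.
    local = axial(patches)                      # (ph, pw, g, g, D)
    # 3. Summarize each patch by mean pooling.
    summary = local.mean(axis=(2, 3))           # (ph, pw, D)
    # 4. Global axial attention across the grid of patch summaries.
    global_ctx = axial(summary)                 # (ph, pw, D)
    # 5. Broadcast-add the global context back to every position.
    fused = local + global_ctx[:, :, None, None, :]
    return fused.transpose(0, 2, 1, 3, 4).reshape(H, W, D)

rng = np.random.default_rng(0)
out = two_stage_patchwise_axial(rng.normal(size=(8, 8, 4)), g=4)
print(out.shape)
```

The largest attention matrix built here is $g \times g$ in the local stage and $(H/g) \times (H/g)$ or $(W/g) \times (W/g)$ in the global stage, which is the source of the savings over full attention.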

3. Mathematical Formulation

Patchwise axial self-attention methods employ the standard QKV projection and scaled dot-product, but apply it on limited axes or regions.

Generic Axial Self-Attention (single axis):

Let $X \in \mathbb{R}^{L \times D}$ denote a sequence along a given axis.

$$Q = X W^Q,\quad K = X W^K,\quad V = X W^V$$

$$A = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)$$

$$O = A V$$
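The single-axis formulation above can be illustrated with a short NumPy sketch that applies the same scaled dot-product attention first along rows and then along columns of an image-like tensor. The weight shapes, the random initialization, and the function names are assumptions made for illustration; multi-head splitting and positional terms are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(x, w_q, w_k, w_v):
    """Scaled dot-product attention over the second-to-last axis of x.

    x: (..., L, D); w_q, w_k: (D, d_k); w_v: (D, d_v).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(w_k.shape[1])
    return softmax(scores, axis=-1) @ v

def axial_attention_2d(x, row_weights, col_weights):
    """Row attention followed by column attention on x of shape (H, W, D)."""
    x = attention_1d(x, *row_weights)   # each row is a sequence of length W
    x = np.swapaxes(x, 0, 1)            # (W, H, D): each column is now a sequence
    x = attention_1d(x, *col_weights)
    return np.swapaxes(x, 0, 1)         # back to (H, W, D)

rng = np.random.default_rng(0)
H, W, D = 6, 8, 4  # hypothetical sizes

def random_weights():
    # Square (D, D) projections so the output keeps the input shape.
    return tuple(0.1 * rng.normal(size=(D, D)) for _ in range(3))

x = rng.normal(size=(H, W, D))
out = axial_attention_2d(x, random_weights(), random_weights())
print(out.shape)  # (6, 8, 4)
```

After the row pass each position has seen its whole row, so the column pass lets every position receive information from every other position in two steps.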

Two-Stage Patchwise Axial Attention:

  • Local (intra-patch): For each patch, two rounds of attention, one per axis (row, column), with $g \times g$ attention matrices.

  • Global (inter-patch): Attention on the $(N/g) \times (N/g)$ patch grid, again per axis.

Summing the two stages, the total complexity is $O(N^2 g D + (N/g)^3 D)$, where $N$ is the image width/height, $g$ is the patch size, and $D$ is the channel dimension.
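One way to count the cost of the two stages is sketched below. It assumes an $N \times N$ image with $D$ channels, $g \times g$ patches, one row pass and one column pass per stage, and it counts only the attention products (projections and pooling are ignored).

```latex
\begin{aligned}
\text{local:}\quad
  & \underbrace{N^2}_{\text{tokens}} \cdot
    \underbrace{2g}_{\text{keys per token (row + column)}} \cdot D
    = 2 N^2 g D \\
\text{global:}\quad
  & \underbrace{\left(\tfrac{N}{g}\right)^{2}}_{\text{patch summaries}} \cdot
    \underbrace{2\,\tfrac{N}{g}}_{\text{keys per summary}} \cdot D
    = 2 \left(\tfrac{N}{g}\right)^{3} D \\
\text{total:}\quad
  & O\!\left(N^2 g D + \left(\tfrac{N}{g}\right)^{3} D\right)
\end{aligned}
```

For comparison, plain row-plus-column axial attention on the same image costs $2 N^3 D$ under the same counting, so the local term is smaller by a factor of $N/g$.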

Global Axial Self-Attention (GASA, 3D volumes):

Extract 1D patch sequences along each axis using specialized 2D convolutional projections, concatenate them into one sequence, and apply multi-head self-attention (MHSA). The attended sequence is then split by axis and broadcast back into the 3D volume.

Axially Expanded Window Attention (AEWin):

Divide the $h$ attention heads into three groups: window (local), horizontal, and vertical (axial). Attention is computed in three parallel streams and the outputs are concatenated.
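The three-stream idea (local window, horizontal, vertical) can be sketched as below. This is a simplified illustration, not the AEWin reference implementation: each group is treated as a single head over a slice of the channels, stripes are assumed to be one token wide, attention uses Q = K = V = x, and the channel split `(4, 2, 2)` is a hypothetical choice.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(x):
    """Attention over axis -2 of x (..., L, C), with Q = K = V = x (assumption)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def window_stream(x, w):
    """Attention among the w*w tokens of each non-overlapping w x w window."""
    H, W, C = x.shape
    t = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    t = self_attend(t.reshape(H // w, W // w, w * w, C))
    t = t.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return t.reshape(H, W, C)

def horizontal_stream(x):
    """Attention along each row (a horizontal stripe of width 1)."""
    return self_attend(x)

def vertical_stream(x):
    """Attention along each column (a vertical stripe of width 1)."""
    return np.swapaxes(self_attend(np.swapaxes(x, 0, 1)), 0, 1)

def three_stream_block(x, w, split, w_out):
    """Run the three streams on disjoint channel groups, concatenate, project."""
    c_win, c_hor, c_ver = split
    assert c_win + c_hor + c_ver == x.shape[-1]
    a, b = c_win, c_win + c_hor
    streams = [
        window_stream(x[..., :a], w),
        horizontal_stream(x[..., a:b]),
        vertical_stream(x[..., b:]),
    ]
    return np.concatenate(streams, axis=-1) @ w_out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 8))
out = three_stream_block(x, w=4, split=(4, 2, 2), w_out=np.eye(8))
print(out.shape)
```

Because the streams run on separate channel groups, they are independent and can be computed in parallel; only the final projection mixes local and axial information.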

4. Computational Complexity

Patchwise axial self-attention dramatically reduces the quadratic cost of global self-attention on high-dimensional data. Key metrics for major variants include:

Method | Time | Memory | Dominant Matrix Size
Global Self-Attention | $O(N^4 D)$ | $O(N^4)$ | $N^2 \times N^2$
Axial (Row+Col) | $O(N^3 D)$ | $O(N^3)$ | $N \times N$ per axis
Patchwise Axial (2-stage) | $O(N^2 g D + (N/g)^3 D)$ | $O(N^2 g)$ or $O((N/g)^3)$ | $g \times g$ (local), $(N/g) \times (N/g)$ (global)
GASA (3D volume) | $O(L^2 D)$ | $O(L^2)$ | $L \times L$ for the concatenated axis sequence

Here $N$ is the width/height of an $N \times N$ image, $g$ is the patch size, $D$ is the channel dimension, and $L$ is the length of the concatenated axis sequence in GASA.

These reductions enable use on larger spatial/volumetric data or higher resolution, while still providing global receptive fields through compounding local/global blocks (Ho et al., 2019, Sun et al., 2024, Zhang et al., 2022).

5. Positional Information and Fusing Axes

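Axial splitting or patch partitioning discards explicit information about a voxel's or pixel's global position. Patchwise axial attention variants re-inject positional awareness through positional embeddings or biases, such as the learnable 1D positional embeddings that GASA adds after attention (Sun et al., 2024).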

Outputs aggregated from attention heads assigned to different axes or windows are concatenated or summed, followed by a final linear projection or convolution to fuse information (Zhang et al., 2022, Barkan, 2019).
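Why positional information has to be re-injected can be seen in a small NumPy experiment. Attention without positional terms is permutation-equivariant, so it cannot tell where a token sits; adding a per-position embedding after attention breaks that symmetry. The random `pos` array is a stand-in for learned embeddings, and attention uses Q = K = V = x as a simplifying assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(x):
    """Attention over a sequence x of shape (L, D), with Q = K = V = x."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

rng = np.random.default_rng(0)
L, D = 5, 4
x = rng.normal(size=(L, D))
perm = np.array([1, 2, 3, 4, 0])  # a fixed, non-trivial reordering of the tokens

# Without positions: shuffling the input only shuffles the output rows.
equivariant = np.allclose(self_attend(x[perm]), self_attend(x)[perm])

# With a per-position embedding added after attention, order matters.
pos = rng.normal(size=(L, D))  # stand-in for learned 1D positional embeddings

def attend_then_add_pos(t):
    return self_attend(t) + pos

position_aware = not np.allclose(
    attend_then_add_pos(x[perm]), attend_then_add_pos(x)[perm]
)
print(equivariant, position_aware)
```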

6. Integration and Empirical Performance

Patchwise axial self-attention has been incorporated successfully into both encoder-decoder and hierarchical transformer architectures:

  • GASA-UNet integrates the GASA block between encoder and decoder, concatenating attention-enhanced 3D features with the encoder output (Sun et al., 2024). The GASA-enhanced nnUNet baseline is reported to improve Dice and NSD scores on small or ambiguous structures while adding only a small number of parameters and GFLOPs.
  • AEWin Transformers alternate between attention on windows and axial stripes, embedding this hybrid block into multi-stage architectures with patch merging for hierarchical context (Zhang et al., 2022).
  • Multiscale SAC (MSAC) modules run patchwise attention at different spatial scales in parallel and fuse outputs, supporting flexible backbone integration; however, large-scale benchmarking is not yet reported (Barkan, 2019).

7. Architectural and Hyperparameter Trade-Offs

Design choices critically affect trade-offs between expressiveness, efficiency, and practical feasibility:

  • Patch Size ($g$): A smaller $g$ lowers the local (intra-patch) cost but raises the global (across-patch) cost, because the patch grid grows; a larger $g$ does the reverse. A good choice of $g$ balances the two terms.
  • Number of Heads ($h$): Increasing heads distributes attention, allowing axes-local specialization at marginally higher cost.
  • Depth and Fusion: Interleaving local and global blocks or stacking more local/global layers can increase expressivity while controlling memory.
  • Positional Embeddings/Bias: Critical for restoring the unique spatial identity lost in axis or patch flattening.
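The patch-size trade-off can be explored numerically with a toy cost model. The sketch below assumes an $N \times N$ image, one row pass and one column pass in both the local and the global stage, and counts only attention products; the function name and the $N = 256$ example are hypothetical.

```python
def two_stage_cost(n, g, d=1):
    """Attention-product count for two-stage patchwise axial attention.

    n: image width/height, g: patch size (must divide n), d: channel dimension.
    """
    local = 2 * n * n * g * d          # every token attends to g keys, twice
    global_ = 2 * (n // g) ** 3 * d    # every patch summary attends to n/g keys, twice
    return local + global_

n = 256
candidates = [g for g in (1, 2, 4, 8, 16, 32, 64, 128, 256) if n % g == 0]
for g in candidates:
    print(g, two_stage_cost(n, g))

best = min(candidates, key=lambda g: two_stage_cost(n, g))
print("cheapest patch size under this model:", best)
```

Under this particular model the local term grows linearly in $g$ while the global term shrinks as $1/g^3$, so the minimum sits where the two terms are roughly equal. Other cost models, for example ones that include projections or dense rather than axial attention within patches, would shift the balance point.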

A plausible implication is that architectures employing patchwise axial self-attention achieve a near-global receptive field at a cost well below the quadratic cost of global attention, preserving both local detail and global coherence. This framework supports efficient scaling for high-resolution vision and medical imaging tasks, often outperforming pure local or axial-only baselines in empirical evaluation (Ho et al., 2019, Sun et al., 2024, Zhang et al., 2022).
