Criss-Cross Attention (CCA)
- Criss-Cross Attention (CCA) is an attention mechanism that restricts interactions to row and column axes in structured data, thereby reducing computational and memory overhead.
- Its one-pass design projects inputs into query, key, and value spaces to compute attention along criss-cross paths, with recurrent or dense stacking enabling effective global context aggregation.
- Empirical results in semantic segmentation and document-level relation extraction demonstrate state-of-the-art performance with significantly lower FLOPs and memory usage compared to full self-attention.
Criss-Cross Attention (CCA) is an attention mechanism designed to efficiently harvest contextual information along criss-cross paths within structured data, such as 2D spatial grids in vision or entity-pair matrices in document-level relation extraction. Distinct from dense self-attention, CCA reduces computational and memory costs by restricting interactions to spatially or semantically structured axes, and—when stacked or applied recurrently—enables effective context aggregation with sub-quadratic complexity.
1. Core Principles of Criss-Cross Attention
CCA operates by allowing each position in a structured data tensor to attend exclusively to elements that share either its row or column. In pixel-wise semantic segmentation, the input feature map is projected into query, key, and value subspaces via $1 \times 1$ convolutions ($Q$, $K$, $V$). For each position $u$, CCA computes affinities only among positions lying on the same criss-cross path—namely, those sharing $u$'s row or column—yielding a set of $H + W - 1$ connections, as opposed to all $H \times W$ possible pairings as in a non-local block (Huang et al., 2018).
In entity-pair-centric tasks, such as document-level relation extraction, the attention is analogously structured over an $N \times N$ entity-pair matrix, where rows index subject entities and columns index object entities. For each pair $(s, o)$, attention is computed only along (i) the entire subject row $s$ and (ii) the entire object column $o$ (Zhang et al., 2022).
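The criss-cross path is easy to enumerate explicitly. A minimal sketch (the helper name `criss_cross_path` is illustrative, not from the papers):

```python
# Enumerate the criss-cross path of a position on an H x W grid:
# all cells sharing u's row or column, counted once each.

def criss_cross_path(u, H, W):
    """Return the H + W - 1 positions attended to by u = (i, j)."""
    i, j = u
    row = [(i, c) for c in range(W)]             # same row (includes u itself)
    col = [(r, j) for r in range(H) if r != i]   # same column, avoid double-counting u
    return row + col

path = criss_cross_path((2, 3), H=5, W=7)
print(len(path))  # 5 + 7 - 1 = 11, versus 5 * 7 = 35 for a non-local block
```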
2. One-Pass Criss-Cross Attention Module
For a single forward pass, CCA proceeds as follows:
- Project inputs into query, key, and value feature spaces (commonly of reduced channel dimension for queries and keys).
- For each position $u$ (vision) or entity pair $(s, o)$ (NLP), gather key/value vectors along its row and column (vision: spatial axes; NLP: subject/object axes).
- Compute raw affinities as dot products between the query at $u$ and the corresponding keys, adding position-specific biases where used (in NLP).
- Normalize these affinities via softmax, separately for the row and column directions.
- Aggregate the attention-weighted values from the attended positions and add the original input as a residual connection.
Formally, on image grids, the output at pixel $u$ is

$$H'_u = \sum_{i \in \Omega_u} A_{i,u} \, \Phi_{i,u} + H_u,$$

where $A$ is the normalized attention map, $\Omega_u$ is the criss-cross path of $u$, $\Phi_{i,u}$ are the value vectors along that path, and $H_u$ is the input feature at $u$ (Huang et al., 2018).
In entity-pair grids, similar aggregation follows, with additional attention biases to encourage focus on likely related pairs (Zhang et al., 2022).
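The steps above can be sketched end to end. A minimal NumPy version (shapes, random initialization, and the helper name `cca_forward` are illustrative, not the published code; for simplicity, a single softmax is taken over the whole path rather than separately per direction):

```python
import numpy as np

# One-pass criss-cross attention on an H x W x C feature map (sketch).
rng = np.random.default_rng(0)
H, W, C, Cr = 4, 6, 8, 4            # Cr: reduced query/key dimension

X  = rng.standard_normal((H, W, C))
Wq = rng.standard_normal((C, Cr))   # stand-ins for the 1x1 conv projections
Wk = rng.standard_normal((C, Cr))
Wv = rng.standard_normal((C, C))

Q, K, V = X @ Wq, X @ Wk, X @ Wv

def cca_forward(Q, K, V, X):
    out = np.empty_like(X)
    for i in range(H):
        for j in range(W):
            # keys/values on the criss-cross path of (i, j): H + W - 1 entries
            ks = np.concatenate([K[i, :, :], np.delete(K[:, j, :], i, axis=0)])
            vs = np.concatenate([V[i, :, :], np.delete(V[:, j, :], i, axis=0)])
            logits = ks @ Q[i, j]               # raw affinities, shape (H + W - 1,)
            a = np.exp(logits - logits.max())
            a /= a.sum()                        # softmax over the path
            out[i, j] = a @ vs + X[i, j]        # weighted aggregation + residual
    return out

Y = cca_forward(Q, K, V, X)
print(Y.shape)  # (4, 6, 8)
```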
3. Recurrent or Densely Connected CCA for Global Context
A single CCA module grants each position access to all others in its row and column, but not to the entire input. To enable full global context, CCA is either:
- Recurrent (RCCA): For vision, $R$ sequential CCA modules are stacked with weight sharing. After $R = 2$ passes, any two positions are connected via at most two criss-cross hops (through the cells where their rows and columns intersect), covering all grid pairs. This extension suffices to propagate full-image dependencies with minimal added cost (Huang et al., 2018).
- Densely Connected Stacking: For entity-pair matrices, multiple CCA layers are stacked à la DenseNet, each layer's input formed by concatenating all previous outputs along the channel dimension, thus capturing both single-hop and multi-hop logical chains. Dense connectivity allows direct access to lower-level features, improving multi-hop reasoning and mitigating feature redundancy (Zhang et al., 2022).
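The two-hop claim for RCCA can be checked directly: squaring the boolean adjacency matrix of the criss-cross graph connects every pair of grid cells. A small verification sketch:

```python
import numpy as np

# Build the one-hop criss-cross adjacency over an H x W grid, then check
# that two hops reach every pair of cells.
H, W = 4, 5
N = H * W
A = np.zeros((N, N), dtype=bool)
for a in range(N):
    for b in range(N):
        same_row = a // W == b // W
        same_col = a %  W == b %  W
        A[a, b] = same_row or same_col        # one criss-cross hop (incl. self)

two_hop = (A.astype(int) @ A.astype(int)) > 0  # boolean reachability in 2 hops
print(A.all(), two_hop.all())                  # False True: 1 hop no, 2 hops yes
```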
A table recapping recursion strategies:
| Task Domain | Stacking Method | Context Connectivity | Empirical Layer Count |
|---|---|---|---|
| Vision | Recurrent CCA (RCCA) | Full image in $R$ passes | $R = 2$ or $3$ |
| NLP (RE) | Dense stacking | Up to $L$-hop reasoning chains | $L = 3$ |
4. Category/Clustering Regularization Mechanisms
In vision, over-smoothing from context aggregation is counteracted by an auxiliary category consistent loss. This loss, inspired by margin-based discriminative instance embedding, comprises:
- Intra-class variance reduction: Penalizes feature distances from the per-class mean that exceed a margin $\delta_v$.
- Inter-class separation: Encourages class means to be separated by at least $\delta_d$.
- Regularization: Penalizes large mean feature norms.
The cumulative loss is

$$\mathcal{L} = \mathcal{L}_{seg} + \alpha \, \mathcal{L}_{var} + \beta \, \mathcal{L}_{dis} + \gamma \, \mathcal{L}_{reg},$$

with scalar weights $\alpha$, $\beta$, $\gamma$ balancing the three terms. This regularization yields an observed further increase of +0.7% mIoU on Cityscapes (Huang et al., 2018).
In entity-pair reasoning, a clustering loss, leveraging cosine similarity and supervised “is_related”/“not_related” labels, further structures representations in embedding space (Zhang et al., 2022).
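A hedged sketch of such a margin-based loss, in the generic discriminative-embedding form (margins and the regularization weight below are placeholders, not the published hyperparameters):

```python
import numpy as np

# Illustrative category consistent loss: intra-class variance,
# inter-class separation, and mean-norm regularization.
def category_consistent_loss(feats, labels, delta_v=0.5, delta_d=1.5):
    classes = np.unique(labels)
    means = {c: feats[labels == c].mean(axis=0) for c in classes}
    # intra-class: penalize distances to the class mean beyond delta_v
    l_var = np.mean([
        np.maximum(np.linalg.norm(feats[labels == c] - means[c], axis=1)
                   - delta_v, 0.0).mean()
        for c in classes])
    # inter-class: penalize class means closer than delta_d
    l_dis, pairs = 0.0, 0
    for a in classes:
        for b in classes:
            if a < b:
                d = np.linalg.norm(means[a] - means[b])
                l_dis += max(delta_d - d, 0.0) ** 2
                pairs += 1
    l_dis /= max(pairs, 1)
    # regularization: penalize large mean feature norms
    l_reg = np.mean([np.linalg.norm(means[c]) for c in classes])
    return l_var + l_dis + 0.001 * l_reg

rng = np.random.default_rng(1)
feats = rng.standard_normal((20, 4))
labels = np.array([0] * 10 + [1] * 10)
loss = category_consistent_loss(feats, labels)
print(loss >= 0.0)  # True
```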
5. Computational Complexity and Efficiency
CCA’s main advantage over non-local or full self-attention mechanisms is sub-quadratic complexity.
- Vision: Standard non-local blocks require $\mathcal{O}((H \times W)^2)$ storage and computation for an $H \times W$ feature map. A single CCA layer attends to only $H + W - 1$ positions per cell, i.e., $\mathcal{O}(HW(H + W - 1))$ edges.
- Entity-Pair Grids: Full self-attention over the $N \times N$ entity-pair matrix would entail $\mathcal{O}(N^4)$ computations. CCA admits only $2N - 1$ attention links per cell, yielding $\mathcal{O}(N^3)$ overall.
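These counts are easy to make concrete. A back-of-envelope sketch (the feature-map size is illustrative):

```python
# Attention-edge counts for a hypothetical H x W feature map:
# non-local attends over every pair; one CCA pass over H + W - 1 per cell.
H, W = 97, 97
full  = (H * W) ** 2          # non-local: all position pairs
cca_1 = H * W * (H + W - 1)   # one criss-cross pass
print(full // cca_1)          # roughly HW / (H + W) ~ 48x fewer edges here
```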
Concrete measurements in Cityscapes segmentation (Huang et al., 2018):
| Method | Extra FLOPs | Extra GPU Memory | mIoU (val) |
|---|---|---|---|
| Non-local | ~108 GFLOPs | ~1,411 MB | 78.7% |
| RCCA ($R = 2$) | ~16.5 GFLOPs | ~127 MB | 80.5% |
CCA thus achieves ~85% fewer FLOPs and ~11× less memory, with improved accuracy.
6. Application Domains and Empirical Performance
Semantic Segmentation and Video Segmentation: CCNet, employing RCCA and category consistent loss, achieves state-of-the-art mean Intersection-over-Union (mIoU) across major benchmarks:
- Cityscapes: 80.5% (val), 81.9% (test)
- ADE20K: 45.76% (val)
- LIP: 55.47% (val)
- CamVid (3D-RCCA): 79.1% (test)
- COCO Instance Segmentation: Improves Mask AP by +1.3 points over baseline Mask-RCNN, outperforming comparable non-local alternatives (Huang et al., 2018).
Document-level Relation Extraction: Dense-CCNet, with densely stacked CCA, attains state-of-the-art performance on DocRED, CDR, and GDA by enabling direct reasoning over entity-pair matrices—an inference regime not accessible to mention- or entity-level graphs (Zhang et al., 2022).
CCA’s efficiency and modularity facilitate its deployment in both plug-and-play vision backbones and as core logical reasoners in NLP pipelines, scaling context aggregation without incurring the computational bottleneck of full self-attention.
7. Design Choices and Implementation Considerations
Key implementation details include:
- Projection Dimensions: Queries and keys are often of reduced channel dimension relative to the input, with output channels restored via pointwise convolutions.
- Normalization: Layer normalization follows each residual addition; batch normalization is not used within CCA blocks.
- Directional Extensions: In document-level RE, four L-shaped aggregation modes (covering row–row, column–column, and mirror paths) ensure comprehensive multi-hop information flow.
- Attention Biases: In Dense-CCNet, attention logits are augmented with learned pairwise biases, trained with a binary cross-entropy auxiliary loss.
- Multi-hop Layer Count: Empirically, two passes (vision) or three dense layers (NLP) suffice for near-optimal aggregation; deeper stacking does not yield commensurate gains and may risk overfitting.
- Activation: Softmax only in attention scoring; optional GeLU/ReLU in the transition modules.
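The densely connected stacking from Section 3 reduces to a simple channel-concatenation loop. A hypothetical sketch, in which `cca_layer` is a random stand-in for a real CCA pass:

```python
import numpy as np

# DenseNet-style stacking over an N x N entity-pair grid: each layer
# consumes the channel-wise concatenation of all previous outputs.
rng = np.random.default_rng(2)
N, C, L = 6, 8, 3                            # entities, channels, dense layers

def cca_layer(x, out_dim):
    Wp = rng.standard_normal((x.shape[-1], out_dim))
    return np.tanh(x @ Wp)                   # placeholder for a real CCA pass

feats = [rng.standard_normal((N, N, C))]
for _ in range(L):
    dense_in = np.concatenate(feats, axis=-1)  # concat all prior outputs
    feats.append(cca_layer(dense_in, C))

print(feats[-1].shape)  # (6, 6, 8): the last layer saw 3 * C input channels
```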
CCA modules are universally compatible with standard backbone architectures in vision (e.g., ResNet) and transformer-style text encoders (e.g., BERT), facilitating wide adoption in dense prediction and structural reasoning tasks (Huang et al., 2018, Zhang et al., 2022).