Cross-Layer Attention (CLA)
- Cross-Layer Attention (CLA) is a set of architectural mechanisms that transfer information between network layers by treating layer outputs as tokens, enhancing contextualization.
- It incorporates innovations like MRLA, feature pyramids, and plug-in modules to boost performance in vision, language, and audio tasks.
- Optimized implementations use linearization, pruning, and cross-layer KV sharing to reduce computational and memory cost while preserving or improving accuracy.
Cross-Layer Attention (CLA) is a class of architectural and algorithmic mechanisms that enable explicit information transfer, adaptation, or sharing between multiple layers in deep neural networks. By treating individual layers, or their intermediate outputs, as dimensionally structured “tokens,” CLA permits network processing steps to operate not only on intra-layer but also on inter-layer dependencies, resulting in richer contextualization, improved efficiency, and enhanced representational power. Modern CLA mechanisms are instantiated in convolutional, transformer, multi-modal, and graph-based architectures, and are widely adopted across vision, language modelling, speech recognition, and vision-language models (VLMs).
1. Core Principles and Mathematical Formulation
The canonical structure of CLA generalizes the standard self-attention paradigm by letting a given network layer’s representation (query) attend to features extracted from other (typically, preceding or all) layers (keys and values). This generalization is formalized as follows (Fang et al., 2023, Li et al., 9 Mar 2025):
Given layerwise features $x^1, x^2, \dots, x^L$, the output at layer $l$ is computed by aggregating context from all previous layers:

$$o^l = \mathrm{softmax}\!\left(\frac{Q^l\,[K^1;\dots;K^l]^{\top}}{\sqrt{d_k}}\right)[V^1;\dots;V^l], \qquad Q^l = x^l W_Q,\quad K^s = x^s W_K,\quad V^s = x^s W_V,$$

where $W_Q$, $W_K$, and $W_V$ are learnable projections and $[\,\cdot\,;\dots;\,\cdot\,]$ denotes stacking layer features along the token axis. The resulting output integrates information from all earlier layers, weighted by dynamic attention scores.
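A minimal PyTorch sketch of this generic formulation is given below; the single-head design, batch-first tensor layout, and names such as `CrossLayerAttention` are illustrative assumptions rather than a reference implementation from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    """Single-head cross-layer attention: the current layer's features (query)
    attend over features from all preceding layers (keys/values)."""

    def __init__(self, dim: int, d_k: int = 64):
        super().__init__()
        self.w_q = nn.Linear(dim, d_k, bias=False)
        self.w_k = nn.Linear(dim, d_k, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.scale = d_k ** -0.5

    def forward(self, x_l, history):
        # x_l: (B, dim) features of the current layer; history: list of (B, dim) features of layers 1..l
        q = self.w_q(x_l)                                     # (B, d_k)
        ks = torch.stack([self.w_k(h) for h in history], 1)   # (B, L, d_k) -- layers treated as "tokens"
        vs = torch.stack([self.w_v(h) for h in history], 1)   # (B, L, dim)
        scores = torch.einsum("bd,bld->bl", q, ks) * self.scale
        attn = F.softmax(scores, dim=-1)                      # attention distributed over layers
        return torch.einsum("bl,bld->bd", attn, vs)           # context aggregated across layers

# Usage: the query of layer 3 attends to the outputs of layers 1..3.
feats = [torch.randn(2, 128) for _ in range(3)]
cla = CrossLayerAttention(dim=128)
print(cla(feats[-1], feats).shape)   # torch.Size([2, 128])
```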
CLA extends beyond sequence-to-sequence frameworks: in transformers, keys and values may be cached and reused for multiple subsequent layers (Brandon et al., 21 May 2024, Mu et al., 4 Aug 2024), or compressed and selectively recomputed according to redundancy (Li et al., 9 Mar 2025).
CLA also generalizes to structural cases, such as multiplex graphs, where each per-layer edge embedding is concatenated into a sequence and processed by self-attention mechanisms that explicitly learn inter-layer fusions (Sharma et al., 27 Sep 2025).
2. Architectural Innovations and Design Variants
Research in CLA encompasses a variety of module designs, differing in domain, granularity, and optimization strategy.
- Multi-Head Recurrent Layer Attention (MRLA): Treats the outputs of all preceding layers as "tokens" in a sequence, allowing each layer’s query to retrieve information from all lower layers via multi-head attention. The lightweight MRLA-light variant exploits approximate collinearity of queries across layers to linearize the recurrence, making it scalable for deep networks (Fang et al., 2023).
- Cross-Layer Feature Pyramids: In detection and dense prediction, CLA is employed to integrate multi-scale features in one attention step, e.g., by channel-wise (CCA) and spatial-wise (CSA) cross-layer attention modules (Du et al., 29 Jul 2024), and globally via self-attention units that partition concatenated multiscale features for efficient aggregation (Xie et al., 16 Oct 2025).
- Plug-in Modules for Image Enhancement: CLA is often instantiated as modules inserted after residual blocks or at specific depths, with selective or adaptive key aggregation schemes, as in Adaptive CLA (ACLA) with learned gating masks and insertion-position search (Wang et al., 2022), or as continuous fusion of attention maps in dynamic scene deblurring (CCLAT) (Hua et al., 2022).
- Transformer Compression and Efficiency: In LLMs and vision transformers, CLA underpins cross-layer sharing of K/V caches (reducing per-token memory) (Brandon et al., 21 May 2024), as well as direct approximation or sharing of Q/K computations between adjacent layers, with tiny feedforward alignment and low-rank correction terms (LiSA) (Mu et al., 4 Aug 2024).
- Vision-Language and Multimodal Fusion: CLA mechanisms such as Cross-Layer Vision Smoothing (CLVS) maintain attention focus over critical visual tokens throughout the network depth by propagating a vision memory buffer, interpolated with current-layer attention (Zhao et al., 16 Sep 2025). In VLMs, CLA is also explicitly leveraged through stacked cross-attention blocks that jointly match patch and layer features for cross-modal alignment (Wang et al., 31 Jul 2025).
- Semantic Calibration in Supervised Transfer: CLA also enables learned, “soft” alignment between arbitrarily indexed layers of teacher and student networks through an attention-style matrix of alignment weights, replacing brittle, hard-coded layer correspondences in knowledge distillation (Chen et al., 2020); a minimal sketch of this soft alignment follows this list.
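The sketch below illustrates such attention-based soft layer alignment for distillation: each student layer learns a weighting over all teacher layers, and the distillation loss is the alignment-weighted sum of pairwise feature losses. The class name `SoftLayerAlignment`, the pooled-feature inputs, and the MSE pairwise loss are assumptions for illustration, not the exact formulation of Chen et al. (2020).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftLayerAlignment(nn.Module):
    """Attention-style soft alignment between student and teacher layers:
    each student layer learns a weighting over all teacher layers instead
    of a fixed one-to-one pairing."""

    def __init__(self, s_dim: int, t_dim: int, proj_dim: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(s_dim, proj_dim, bias=False)   # student -> query
        self.k_proj = nn.Linear(t_dim, proj_dim, bias=False)   # teacher -> key
        self.s_to_t = nn.Linear(s_dim, t_dim, bias=False)      # match feature dims for the loss

    def forward(self, student_feats, teacher_feats):
        # student_feats: (B, Ls, s_dim) pooled features of Ls student layers
        # teacher_feats: (B, Lt, t_dim) pooled features of Lt teacher layers
        q = self.q_proj(student_feats)                          # (B, Ls, proj_dim)
        k = self.k_proj(teacher_feats)                          # (B, Lt, proj_dim)
        align = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, Ls, Lt)

        s_hat = self.s_to_t(student_feats)                      # (B, Ls, t_dim)
        diff = s_hat.unsqueeze(2) - teacher_feats.unsqueeze(1)  # (B, Ls, Lt, t_dim)
        pair_loss = diff.pow(2).mean(-1)                        # per student/teacher layer pair
        return (align * pair_loss).sum(-1).mean()               # alignment-weighted distillation loss

# Usage: 4 student layers aligned against 6 teacher layers.
loss = SoftLayerAlignment(s_dim=64, t_dim=256)(torch.randn(8, 4, 64), torch.randn(8, 6, 256))
print(loss.item())
```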
3. Applications Across Domains
CLA mechanisms are fundamental in advancing the state of the art across a spectrum of tasks:
Dense Prediction and Detection
CLA blocks have demonstrated substantial gains in multi-scale object detection (Du et al., 29 Jul 2024, Xie et al., 16 Oct 2025), panoptic segmentation (Chang et al., 2020), and salient object detection (Tang et al., 2020). Notable advantages include one-shot fusion of all pyramid levels, elimination of semantic gaps caused by sequential feature aggregation, and preservation of fine spatial detail crucial for small object reasoning.
| Model / Task | Baseline | CLA-enhanced | Gain |
|---|---|---|---|
| SSD300 VOC | 75.5 | 78.6 | +3.1 mAP (Xie et al., 16 Oct 2025) |
| RetinaNet (VisDrone) | 21.0 | 22.2 | +1.2 AP (Du et al., 29 Jul 2024) |
| Panoptic Segmentation PQ | 37.4 | 38.6 | +1.2 PQ (Chang et al., 2020) |
Vision-Language Models (VLMs) and LVLMs
CLA enables sustained, semantically consistent vision-language alignments, either by structured vision memory smoothing (Zhao et al., 16 Sep 2025), layer-patch regional alignment (Wang et al., 31 Jul 2025), or efficient cross-modal fusion strategies that prune redundancy (Mu et al., 4 Aug 2024). Reported results include state-of-the-art F1 and accuracy on multi-benchmark evaluations with minimal parameter overhead.
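A minimal sketch of the vision-memory smoothing idea described above (and in Section 2): a running memory over visual-token attention is linearly interpolated with each layer's own attention so that focus on key objects persists with depth. The interpolation coefficient `alpha`, the renormalisation step, and the function name are assumptions, not the published CLVS algorithm.

```python
import torch

def smooth_vision_attention(per_layer_attn, alpha: float = 0.5):
    """Propagate a running 'vision memory' of attention over visual tokens
    across layers, interpolating it with each layer's own attention.

    per_layer_attn: list of (num_visual_tokens,) attention mass per layer.
    Returns the smoothed attention used at each layer.
    """
    memory = None
    smoothed = []
    for attn in per_layer_attn:
        memory = attn if memory is None else alpha * memory + (1 - alpha) * attn
        smoothed.append(memory / memory.sum())   # renormalise over visual tokens
    return smoothed

# Usage: attention over 4 visual tokens collected from 3 decoder layers.
layers = [torch.tensor([0.7, 0.1, 0.1, 0.1]),
          torch.tensor([0.2, 0.5, 0.2, 0.1]),
          torch.tensor([0.1, 0.1, 0.1, 0.7])]
print([a.tolist() for a in smooth_vision_attention(layers)])
```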
Audio-Visual Speech Recognition
Multi-Layer Cross-Attention (MLCA) modules inserted at multiple depths of audio/visual encoders in AVSR pipelines yield lower character error rates versus single-layer or post-hoc fusion, highlighting CLA’s role in cross-modal representation learning (Wang et al., 7 Jan 2024).
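The sketch below shows the general shape of such a fusion block: bidirectional audio-visual cross-attention intended to be inserted at several encoder depths rather than only after the final layer. The class name, the residual fusion, and the symmetric design are illustrative assumptions, not the exact MLCA-AVSR module.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Bidirectional audio/visual cross-attention block, meant to be
    interleaved with encoder layers at multiple depths."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.a_from_v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v_from_a = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, audio, visual):
        # audio: (B, Ta, dim), visual: (B, Tv, dim)
        a_ctx, _ = self.a_from_v(audio, visual, visual)   # audio queries attend to visual features
        v_ctx, _ = self.v_from_a(visual, audio, audio)    # visual queries attend to audio features
        return audio + a_ctx, visual + v_ctx              # residual fusion at this depth

# Usage: fuse at two intermediate depths of hypothetical audio/visual encoder stacks.
dim = 64
fusers = [CrossModalFusion(dim) for _ in range(2)]
a, v = torch.randn(2, 50, dim), torch.randn(2, 25, dim)
for fuse in fusers:                                       # in practice interleaved with encoder blocks
    a, v = fuse(a, v)
print(a.shape, v.shape)
```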
Multilayer Graph Link Prediction
CLA via transformer self-attention over per-layer edge “tokens” substantially improves multiplex link prediction; the Trans-SLE and Trans-GAT variants achieve consistent macro-F1 improvements while remaining scalable and architecture-agnostic (Sharma et al., 27 Sep 2025).
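The following sketch illustrates the layers-as-tokens idea for multiplex link prediction: per-layer edge embeddings for a candidate node pair are fused by a small transformer encoder and scored. The module name, the learned layer embedding, and the scoring head are illustrative assumptions, not the Trans-SLE/Trans-GAT implementation.

```python
import torch
import torch.nn as nn

class MultiplexEdgeAttention(nn.Module):
    """Link prediction in a multiplex graph: the same node pair's edge embedding
    from each graph layer is treated as one token, and self-attention fuses them."""

    def __init__(self, edge_dim: int, num_graph_layers: int, n_heads: int = 4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=edge_dim, nhead=n_heads,
                                               dim_feedforward=2 * edge_dim,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.layer_embed = nn.Embedding(num_graph_layers, edge_dim)   # which graph layer a token comes from
        self.scorer = nn.Linear(edge_dim, 1)

    def forward(self, edge_tokens):
        # edge_tokens: (batch, num_graph_layers, edge_dim), one embedding per multiplex layer
        ids = torch.arange(edge_tokens.shape[1], device=edge_tokens.device)
        h = self.encoder(edge_tokens + self.layer_embed(ids))          # inter-layer fusion via self-attention
        return torch.sigmoid(self.scorer(h.mean(dim=1))).squeeze(-1)   # link probability per node pair

# Usage: 8 candidate node pairs, 3 multiplex layers, 64-d edge embeddings.
model = MultiplexEdgeAttention(edge_dim=64, num_graph_layers=3)
print(model(torch.randn(8, 3, 64)).shape)   # torch.Size([8])
```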
Image Restoration and Enhancement
Adaptively learning where and how to attend to previous layers delivers significant PSNR gains on super-resolution, denoising, demosaicing, and JPEG artifact reduction, without incurring prohibitive computational overhead (Wang et al., 2022, Hua et al., 2022).
4. Computational Complexity, Redundancy, and Efficiency
While full cross-layer attention (in the absence of approximation or pruning) scales quadratically in network depth ($\mathcal{O}(L^2)$ for a network of $L$ layers), most practical designs employ optimizations:
- Linearization via Recurrent Approximations: MRLA-light stores only the previous output and a small set of scaling vectors, yielding $\mathcal{O}(L)$ computational and memory cost with minimal performance degradation (Fang et al., 2023).
- Layer Pruning using Divergence Measures: Redundancy in learned attention distributions is quantified via Kullback-Leibler divergence, and efficient pruning (e.g., Enhanced Beta Quantile Mapping) skips computation in layers with redundant retrievals, yielding roughly 30% time savings without a measurable accuracy tradeoff (Li et al., 9 Mar 2025).
- Cross-Layer KV Sharing in Transformers: By grouping adjacent layers so that only designated “KV-producer” layers compute fresh keys and values, a sharing factor of 2 halves the number of distinct KV caches with negligible impact on perplexity; combining this with MQA/GQA compresses cache requirements further (roughly 2x over MQA alone), pushing out the attainable accuracy-memory Pareto frontier (Brandon et al., 21 May 2024). See the sketch after this list.
- Sharing and Alignment Approximations: Attention computation may be aligned and shared across adjacent layers using lightweight head-alignment modules and low-rank correction terms (LiSA), compressing the Q/K projections and reducing total attention computation by more than 50% without performance loss (Mu et al., 4 Aug 2024).
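As referenced in the KV-sharing bullet above, the sketch below shows the grouping idea with a sharing factor of 2: only designated producer layers compute keys and values, and the following layer reuses them. It is single-head, omits causal masking, residual connections, and real cache management, and its class and parameter names are assumptions rather than the implementation of Brandon et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Attention layer that either produces fresh K/V or reuses K/V from a
    designated 'KV-producer' layer, halving the KV cache at sharing factor 2."""

    def __init__(self, dim: int, produces_kv: bool):
        super().__init__()
        self.produces_kv = produces_kv
        self.w_q = nn.Linear(dim, dim, bias=False)
        if produces_kv:                                   # only producer layers own K/V projections
            self.w_k = nn.Linear(dim, dim, bias=False)
            self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, x, shared_kv=None):
        # x: (batch, seq, dim); shared_kv: (k, v) borrowed from the producer layer, if any
        q = self.w_q(x)
        if self.produces_kv:
            k, v = self.w_k(x), self.w_v(x)
        else:
            k, v = shared_kv                              # reuse the previous layer's keys/values
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return self.w_o(attn @ v), (k, v)

# Usage: pairs of layers share one KV cache (layers 0/1, 2/3, ...).
dim = 64
layers = [SharedKVAttention(dim, produces_kv=(i % 2 == 0)) for i in range(4)]
x, kv = torch.randn(2, 10, dim), None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)
print(x.shape)   # torch.Size([2, 10, 64])
```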
5. Empirical Impact and Ablation Evidence
Consistent state-of-the-art results are reported across settings, with ablation studies confirming the necessity of cross-layer mechanisms and the specific contributions of their submodules:
- CLA consistently outperforms fixed-pair or single-layer baselines in knowledge distillation (+0.5–1.0% top-1 ImageNet) and multi-modal fusion.
- Multi-scale object detection accuracy rises substantially, with explicit global context modules responsible for the largest single jump in performance (Xie et al., 16 Oct 2025).
- In segmentation and saliency, cross-level (vs. self-) attention boosts completeness, uniformity, and structure recovery in salient object detection and panoptic segmentation (Tang et al., 2020, Chang et al., 2020).
- Pruning of redundant CLA modules maintains or improves accuracy, with speedups roughly proportional to the fraction of pruned components (Li et al., 9 Mar 2025).
- Sustained cross-layer attention in VLMs preserves focus on key visual regions, resulting in F1 gains of +3–8% and improved relation/attribute understanding (Zhao et al., 16 Sep 2025).
6. Limitations, Best Practices, and Future Directions
Several limitations and design insights are identified in the CLA literature:
- Sharing factors beyond 2 for KV reuse in transformers may degrade perplexity, and misalignment of shared attention weights can harm shallow-layer performance (Brandon et al., 21 May 2024, Mu et al., 4 Aug 2024).
- Full, dense cross-layer attention incurs quadratic cost in depth; practical implementations therefore favor linearized or partitioned designs, or adaptive sparsity via Gumbel-softmax, gating, or quantile masking (Wang et al., 2022, Li et al., 9 Mar 2025).
- CLA’s modularity enables plug-and-play integration into diverse backbones; best results are typically achieved with (a) selective deployment on non-redundant or semantically dissimilar regions of depth, (b) quantization or low-rank approximations where redundancy is high, and (c) domain-specific layer-pairing or soft alignment with attention for maximum cross-modal/scale synergy (Chen et al., 2020, Mu et al., 4 Aug 2024).
- Ablation studies recommend stacking one or two CLA blocks for the best accuracy/efficiency trade-off, placing channel-wise attention before spatial-wise attention, and using global positional/semantic encoding across layers for stability (Du et al., 29 Jul 2024, Xie et al., 16 Oct 2025).
Potential directions include integration of CLA with quantized or sparse attention for extreme memory-constrained deployment, extension to temporal or dynamic layer structures (e.g., evolving graphs), and calibration for hierarchical or multi-modality fusion beyond current concatenation-based approaches.
References
- (Fang et al., 2023) Cross-Layer Retrospective Retrieving via Layer Attention
- (Li et al., 9 Mar 2025) Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals
- (Brandon et al., 21 May 2024) Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
- (Mu et al., 4 Aug 2024) Cross-layer Attention Sharing for LLMs
- (Wang et al., 2022) Adaptive Cross-Layer Attention for Image Restoration
- (Hua et al., 2022) Dynamic Scene Deblurring Based on Continuous Cross-Layer Attention Transmission
- (Sharma et al., 27 Sep 2025) Mind the Links: Cross-Layer Attention for Link Prediction in Multiplex Networks
- (Du et al., 29 Jul 2024) Cross-Layer Feature Pyramid Transformer for Small Object Detection in Aerial Images
- (Xie et al., 16 Oct 2025) Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection
- (Chen et al., 2020) Cross-Layer Distillation with Semantic Calibration
- (Wang et al., 7 Jan 2024) MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition
- (Zhao et al., 16 Sep 2025) Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models
- (Wang et al., 31 Jul 2025) Optimizing Vision-Language Consistency via Cross-Layer Regional Attention Alignment
- (Tang et al., 2020) CLASS: Cross-Level Attention and Supervision for Salient Objects Detection
- (Chang et al., 2020) EPSNet: Efficient Panoptic Segmentation Network with Cross-layer Attention Fusion
- (Huang et al., 2022) Cross-layer Attention Network for Fine-grained Visual Categorization