MARS: Multi-Layer Attention for Robust Selection

Updated 10 January 2026

The paper presents MARS, which amplifies demonstration effectiveness by exploiting multi-layer self-attention and explicit gradient flow in large language models.
It employs a gradient-flow-based selection mechanism (GradS) to robustly distinguish and prioritize novel, relevant demonstrations for improved in-context learning.
Empirical results show that MARS achieves an average relative improvement of 6.8% over the strongest baseline on standard ICL benchmarks across several LLM architectures.

Multi-Layer Attention for Robust Selection (MARS) is a demonstration selection framework for in-context learning (ICL) that exploits the intrinsic property of multi-layer self-attention in LLMs to serve as an amplifier of demonstration effectiveness. MARS leverages explicit gradient flow calculations to identify demonstrations whose information is novel and relevant to the given user query, robustly distinguishing effective examples from ineffective or redundant ones. Empirical and theoretical analyses confirm that the magnitude of this distinction increases across model layers, enabling highly accurate selection of informative demonstrations for improved ICL performance (Wang et al., 1 Aug 2025).

1. Theoretical Basis: Multi-Layer Self-Attention as an Amplifier

MARS is motivated by the observation that multi-layer self-attention networks, when stacked in standard transformer architectures, increasingly sharpen the distinction between effective and ineffective demonstrations. The Linear Self-Attention (LSA) abstraction defines a single layer with the operation

$f_{LSA}(E; \theta) = E + W^{PV} E (E^T W^{KQ} E),$

where $E = [d \mid q] \in \mathbb{R}^{e \times 2}$ concatenates a demonstration $d$ and a query $q$, with attention matrices $W^{PV}, W^{KQ} \in \mathbb{R}^{e \times e}$. The predicted answer vector $\hat{q}_y$ is the final column of the output.
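A single LSA layer can be sketched in a few lines of plain Python. This is only an illustration of the formula above, not the authors' implementation: the helper names `matvec`, `dot`, and `lsa_layer` are hypothetical, and matrices are stored as lists of rows to avoid any dependency.

```python
def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def lsa_layer(d, q, W_pv, W_kq):
    """Apply f_LSA(E) = E + W_pv E (E^T W_kq E) to E = [d | q].

    Returns the updated (d, q) columns; the updated q column is the
    final column of the output, i.e. the prediction.
    """
    cols = [d, q]
    # A[i][j] = e_i^T W_kq e_j is the 2x2 attention matrix E^T W_kq E.
    kq_cols = [matvec(W_kq, c) for c in cols]
    A = [[dot(cols[i], kq_cols[j]) for j in range(2)] for i in range(2)]
    out = []
    for j in range(2):
        # Column j of E A is the A-weighted mix of the input columns.
        mixed = [sum(cols[i][r] * A[i][j] for i in range(2)) for r in range(len(d))]
        update = matvec(W_pv, mixed)
        out.append([c + u for c, u in zip(cols[j], update)])
    return out[0], out[1]


# Toy weights, chosen only so the arithmetic is easy to follow by hand.
identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
print(lsa_layer([1, 0], [0, 1], identity, swap))  # → ([1, 1], [1, 1])
```

In this toy case the query column picks up the demonstration's content because $d^T W^{KQ} q = 1$; with $d^T W^{KQ} q = 0$ the demonstration would contribute nothing to the final column.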

Within this framework, the gradient flow between a demonstration $d$ and query $q$ is formalized as the partial derivative of $\hat{q}_y$ with respect to $d$. If this gradient is zero, the demonstration is either irrelevant to the query or uninformative, i.e., its information has already been learned. When multiple LSA layers are composed, the ratio of gradient flows between effective and ineffective demonstrations increases with depth, a property formalized in Theorem 2 of Wang et al. (1 Aug 2025): for an effective demonstration $d_1$, an ineffective demonstration $d_2$, and any two layers $l_1 > l_2$,

$\frac{\|\nabla_{d^{(0)}} \hat{q}_y^{(l_1)}(d_1)\|}{\|\nabla_{d^{(0)}} \hat{q}_y^{(l_1)}(d_2)\|} \geq \frac{\|\nabla_{d^{(0)}} \hat{q}_y^{(l_2)}(d_1)\|}{\|\nabla_{d^{(0)}} \hat{q}_y^{(l_2)}(d_2)\|},$

thereby amplifying the signal-to-noise ratio for selection.

2. Formal Definition: Gradient Flow

Gradient flow in this context refers to the sensitivity of the model's prediction with respect to a particular demonstration. For predictions $\hat{q}_y(E; \theta)$, the gradient flow with respect to an input slot $x$ is

$\nabla_x \hat{q}_y = \frac{\partial \hat{q}_y}{\partial x},$

quantified by the Frobenius norm $\|\nabla_x \hat{q}_y\|$. In the LSA model, the gradient flow with respect to the demonstration $d$ reduces to

$\nabla_d \hat{q}_y = (W^{PV} d)_y (q^T W^{KQ})^T + (d^T W^{KQ} q) W^{PV}_y,$

capturing both the new information contributed by $d$ and its alignment with $q$.
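The closed-form expression above can be evaluated directly. The sketch below transcribes it term by term as written and returns its Frobenius norm. Treating $y$ as a list of row indices (`y_rows`) is an assumption of this sketch, as are the function names; matrices are lists of rows.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def grad_flow_score(d, q, W_pv, W_kq, y_rows):
    """Frobenius norm of the closed-form gradient-flow expression.

    y_rows: row indices treated as the answer block y (an assumption).
    """
    e = len(d)
    pv_d = matvec(W_pv, d)  # W^{PV} d
    # (q^T W^{KQ})^T as written above: entry j is sum_i q[i] * W_kq[i][j].
    kq_q = [sum(q[i] * W_kq[i][j] for i in range(e)) for j in range(e)]
    # Scalar alignment d^T W^{KQ} q between demonstration and query.
    align = sum(a * b for a, b in zip(d, matvec(W_kq, q)))
    # One gradient row per answer index: first term plus second term.
    grad = [
        [pv_d[r] * kq_q[c] + align * W_pv[r][c] for c in range(e)]
        for r in y_rows
    ]
    return math.sqrt(sum(x * x for row in grad for x in row))


identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
print(grad_flow_score([1, 0], [0, 1], identity, swap, y_rows=[1]))  # → 1.0
print(grad_flow_score([0, 0], [0, 1], identity, swap, y_rows=[1]))  # → 0.0
```

The all-zero demonstration scores zero because both terms vanish: it contributes no content through $W^{PV} d$ and has no alignment with the query.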

In multi-layer settings, the total gradient is the product of the per-layer Jacobians (by the chain rule), which further sharpens the contrast between demonstrations of varying effectiveness across successive layers.
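Written out under an assumed notation, where $E^{(l)} = f_{LSA}(E^{(l-1)}; \theta_l)$ denotes the output of layer $l$ and $E^{(0)} = [d^{(0)} \mid q^{(0)}]$ is the input, the chain rule for an $L$-layer stack reads

$\nabla_{d^{(0)}} \hat{q}_y^{(L)} = \frac{\partial \hat{q}_y^{(L)}}{\partial E^{(L-1)}} \, \frac{\partial E^{(L-1)}}{\partial E^{(L-2)}} \cdots \frac{\partial E^{(1)}}{\partial E^{(0)}} \, \frac{\partial E^{(0)}}{\partial d^{(0)}},$

so each additional layer contributes one more Jacobian factor to the gradient flow. The per-layer parameters $\theta_l$ and the superscript indexing of $E$ are introduced here only for illustration.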

3. MARS Algorithm: Gradient-Flow Based Demonstration Selection (GradS)

MARS operationalizes gradient flow as the principal selection criterion in the GradS algorithm. The approach involves:

Offline stage: For every candidate demonstration $d_i$, produce its final-layer embedding $\hat{d}_i$ via a forward pass through the LLM. These embeddings are cached for efficiency.
Online (per-query) stage: For an incoming query $q$, compute its final-layer embedding $\hat{q}$ and, for each $\hat{d}_i$, calculate the surrogate gradient-flow score:

$\nabla_{d_i} \hat{q}_y = (W^{PV} \hat{d}_i)_y (\hat{q}^T W^{KQ})^T + (\hat{d}_i^T W^{KQ} \hat{q}) W^{PV}_y.$

The top-$K$ demonstrations by $\|\nabla_{d_i} \hat{q}_y\|$ are prepended to $q$ for final answer generation.
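The two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: `embed` is a hypothetical callable standing in for the LLM's final-layer embedding, the matrices `W_pv` and `W_kq` are simply passed in (how they are obtained from the model is not specified here), and `y_rows` is an assumed list of answer-row indices.

```python
import heapq
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def grad_flow_score(d_hat, q_hat, W_pv, W_kq, y_rows):
    """Frobenius norm of the surrogate gradient-flow expression, as written."""
    e = len(d_hat)
    pv_d = matvec(W_pv, d_hat)
    kq_q = [sum(q_hat[i] * W_kq[i][j] for i in range(e)) for j in range(e)]
    align = sum(a * b for a, b in zip(d_hat, matvec(W_kq, q_hat)))
    return math.sqrt(sum(
        (pv_d[r] * kq_q[c] + align * W_pv[r][c]) ** 2
        for r in y_rows for c in range(e)
    ))


def build_cache(demos, embed):
    """Offline stage: embed every candidate once and keep the result."""
    return [(demo, embed(demo)) for demo in demos]


def select_top_k(query, cache, embed, W_pv, W_kq, y_rows, k):
    """Online stage: score cached embeddings against the query, keep top k."""
    q_hat = embed(query)
    scored = (
        (grad_flow_score(d_hat, q_hat, W_pv, W_kq, y_rows), demo)
        for demo, d_hat in cache
    )
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [demo for _, demo in best]


# Toy lookup table standing in for a real embedding model (hypothetical data).
toy_embeddings = {"demo A": [1, 0], "demo B": [0, 0], "demo C": [2, 0], "query": [0, 1]}
cache = build_cache(["demo A", "demo B", "demo C"], toy_embeddings.get)
chosen = select_top_k("query", cache, toy_embeddings.get,
                      [[1, 0], [0, 1]], [[0, 1], [1, 0]], y_rows=[1], k=2)
# Prepend the selected demonstrations to the query; the separator is an assumption.
prompt = "\n\n".join(chosen + ["query"])
print(chosen)  # → ['demo C', 'demo A']
```

Caching in `build_cache` means each candidate is embedded only once, so the per-query cost is one embedding of the query plus one score per candidate.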

The time complexity is dominated offline by producing and caching the embeddings, $O(\sum_i \mathcal{M}_\theta(d_i))$, and online by running a forward pass for selection, $O(\mathcal{M}_\theta(q+d_q))$, where $\mathcal{M}_\theta(\cdot)$ denotes the cost of a forward pass of the LLM on its argument.

Stage	Computation	Complexity
Offline	Encode and cache $\hat{d}_i$ for all $d_i$	$O(\sum_i \mathcal{M}_\theta(d_i))$
Online	Encode $\hat{q}$, compute scores, select top-$K$	$O(\mathcal{M}_\theta(q) + n e^2)$

4. Robustness via Multi-Layer Amplification

MARS achieves robustness in selection by relying on the multi-layer attention property that amplifies the differential contribution of demonstrations. Effective demonstrations, those introducing information that is both new and relevant, receive markedly higher gradient-flow scores in the upper layers, as confirmed by empirical evaluation. As a result, MARS's selection criterion is less sensitive to spurious features and query-irrelevant information, and it avoids ineffective in-context examples without explicit external supervision.

5. Empirical Validation and Benchmarking

Experiments on Llama2-7b, Llama3.1-8b, Deepseek-R1-8b, and Qwen3-8b (including Long-CoT variants) were performed on five standard ICL benchmarks: GSM8K, MATH, ARC-Challenge, MMLU-Pro, and Amazon Review (Wang et al., 1 Aug 2025). MARS was compared with BM25, cosine-similarity, maximal marginal relevance (MMR), and Mixture of Demonstrations (MoD) baselines across both one-shot (for ablations) and three-shot (final evaluation) settings.

Key findings include:

The gap between the average gradient flows of effective and ineffective demonstrations grows by up to an order of magnitude from the early layers to the final layer (see Figure 1 of Wang et al., 1 Aug 2025).
MARS (via GradS) produces an average relative improvement of 6.8% over the strongest baseline across models and datasets, with especially strong gains on knowledge-intensive tasks such as MATH and MMLU-Pro.
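For readers unfamiliar with the metric, a relative improvement is the gain expressed as a fraction of the baseline score, not a difference in percentage points. The snippet below shows the arithmetic with hypothetical accuracies chosen only for illustration; how the reported average is aggregated across models and datasets is not specified here.

```python
def relative_improvement(method_score, baseline_score):
    """Gain of a method over a baseline, as a fraction of the baseline."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (method_score - baseline_score) / baseline_score


# Hypothetical accuracies: 50.0% for the baseline, 53.4% for the method.
gain = relative_improvement(0.534, 0.500)
print(f"{gain:.1%}")  # → 6.8%
```

In this made-up example the absolute gain is 3.4 percentage points, which corresponds to a 6.8% relative improvement.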

MARS addresses a key limitation of existing demonstration selection methods, which prioritize query relevance but may fail to exclude demonstrations whose information has already been assimilated by the model. By directly measuring the impact of a demonstration through its gradient flow at the network's endpoint, MARS favors examples that yield new, query-relevant information. This approach is intended to mitigate shortcut learning, spurious correlations, and redundancy, which are core issues in both synthetic (e.g., reasoning) and real-world (e.g., retrieval) domains. Empirical support for multi-layer amplification further establishes a foundation for future robust selection algorithms in the broader ICL landscape (Wang et al., 1 Aug 2025).

References (1)

Multi-Layer Attention is the Amplifier of Demonstration Effectiveness (2025)
