
Attention Sorting in Transformers

Updated 6 February 2026
  • Attention Sorting is a set of techniques that reorder and select information based on attention weights to improve flow and mitigate biases in Transformer models.
  • Inference-time sorting reorders documents using aggregated attention mass, restoring short-context accuracy and significantly enhancing QA performance.
  • Architectural methods like differentiable sorting (e.g., Sinkhorn and sliced ReLU) reduce computational cost while preserving model expressivity and numerical stability.

Attention sorting refers to a family of methods that explicitly reorder, permute, or select information based on (i) the distribution of model attention weights or (ii) learned or computed sorting-like transformations within attention modules. These approaches are designed to optimize information flow, reduce attention redundancy, mitigate biases (e.g., recency or over-smoothness), and/or achieve computational efficiency in Transformer-based architectures. Attention sorting spans a spectrum from inference-time context reordering using observed attention statistics, to architectural modifications where sorting—often relaxed or differentiable—controls the receptive field or sparsifies the attention computation.

1. Mitigating Recency Bias in Long-Context LLMs: Inference-Time Attention Sorting

The "Attention Sorting" method of (Peysakhovich et al., 2023) addresses recency bias in long-context Transformers, i.e., the learned inductive prior of attending most strongly to tokens near the current prediction point. In retrieval-augmented generation (RAG), where a user query is accompanied by NN candidate documents, this bias leads to under-attention to relevant but early documents, degrading answer accuracy as context length grows.

The proposed procedure operates at inference time:

  • Run a partial decode step to extract per-token, per-layer, per-head attention weights $\alpha_{t,i}^{(h,\ell)}$ for all $i$ in the context.
  • Aggregate these into per-document attention mass

$$A_{t,d} = \frac{1}{HL} \sum_{\ell=1}^{L} \sum_{h=1}^{H} \sum_{i \in d} \alpha_{t,i}^{(h,\ell)}$$

and, if desired, sum over initial decode steps to obtain $A_d$.

  • Sort and permute documents in ascending order of $A_d$, so those receiving more attention move "closer" (i.e., later) to the prediction point.
  • Repeat for $K$ iterations and generate the answer with the sorted context.
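As a concrete sketch, the loop below implements the procedure above. The helper `doc_attention` is a hypothetical callable (not from the paper) assumed to run the partial decode for a given document ordering and return the per-document attention masses $A_d$:

```python
import numpy as np

def attention_sort(doc_attention, documents, n_iters=2):
    """One round of inference-time attention sorting (illustrative sketch).

    doc_attention: hypothetical callable that, given an ordered list of
    documents, runs a partial decode and returns the per-document attention
    mass A_d (averaged over heads/layers, summed over tokens in each doc).
    """
    order = list(range(len(documents)))
    for _ in range(n_iters):
        masses = doc_attention([documents[i] for i in order])
        # Ascending sort: high-attention documents move later in the
        # context, i.e. closer to the prediction point.
        ranking = np.argsort(masses, kind="stable")
        order = [order[i] for i in ranking]
    return [documents[i] for i in order]
```

With 1–2 iterations the ordering typically stabilizes, matching the small iteration counts reported effective in the paper.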

Empirically, 1–2 iterations of attention sorting on open-source Llama-2-based models restores most of the short-context accuracy (e.g., recovering 40–45 percentage points on 30k-token QA benchmarks versus unsorted baselines), while proprietary models show smaller but nontrivial gains. The process is lightweight relative to full re-decoding, and can be integrated with external retrieval-based pipelines using hybrid attention–similarity re-ranking. Notably, the method presupposes chunkable, reorderable context (document granularity), and may be vulnerable to adversarial distractors that artificially receive high attention (Peysakhovich et al., 2023).
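The hybrid attention–similarity re-ranking mentioned above can be sketched as a simple score blend. The min-max normalization and mixing weight below are illustrative assumptions, not a prescription from the paper:

```python
import numpy as np

def hybrid_rerank(attn_mass, retriever_sim, weight=0.5):
    """Blend per-document attention mass with an external retriever's
    similarity scores, then order documents ascending so the highest
    combined score sits nearest the prediction point.

    Both score vectors are min-max normalized before mixing (an
    assumption; the mixing scheme is left open in the source)."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    combined = weight * norm(attn_mass) + (1 - weight) * norm(retriever_sim)
    # Ascending order: low-scoring (likely distractor) documents first.
    return np.argsort(combined, kind="stable")
```

Blending with retriever similarity is the suggested guard against adversarial distractors that attract attention without being relevant.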

2. Architectural Approaches: Differentiable Sorting in Attention

Several research directions pursue differentiable or nearly-differentiable forms of sorting within the attention mechanism, motivated either by memory/computation savings or by the desire to allocate capacity more effectively.

Sparse Sinkhorn Attention

Sparse Sinkhorn Attention (Tay et al., 2020) generalizes attention by introducing a meta-sorting network producing a relaxed permutation (doubly stochastic matrix) over sequence blocks. The input tokens are partitioned into blocks, projected via a block-pooling operator, and passed through a small feedforward network. Sinkhorn balancing transforms the output to a soft permutation, which reorders or "bunches" globally relevant blocks together.
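Sinkhorn balancing itself is a short iterative procedure; a minimal sketch in log space (for numerical stability) follows, with the iteration count as a free parameter:

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Sinkhorn balancing (sketch): alternately normalize rows and
    columns of a score matrix in log space. As n_iters grows, the
    result approaches a doubly stochastic matrix, i.e. a relaxed
    (soft) permutation over sequence blocks."""
    Z = np.asarray(logits, dtype=float)
    for _ in range(n_iters):
        Z = Z - np.logaddexp.reduce(Z, axis=1, keepdims=True)  # row normalize
        Z = Z - np.logaddexp.reduce(Z, axis=0, keepdims=True)  # column normalize
    return np.exp(Z)
```

In the actual architecture these logits come from the meta-sorting network applied to block-pooled representations; here they are just an arbitrary score matrix.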

Causal Sinkhorn Balancing allows application in autoregressive (decoder) settings by masking out future tokens. After sorting blocks, either truncated local or quasi-global attention is applied within a small window of blocks, yielding $O(\ell n)$ memory and compute for sequence length $\ell$ (with $n \ll \ell$ active blocks).

Empirical results show that Sinkhorn attention matches or outperforms both vanilla and other efficient-attention baselines across tasks including language modeling, sequence sorting, image generation, and document classification, particularly under memory constraints and long sequence regimes (Tay et al., 2020).

3. Sorting-Based Quasi-Linear Attention Mechanisms

Sliced ReLU Attention

Sliced ReLU attention (Boufadène et al., 12 Dec 2025) departs from the classical softmax by projecting queries and keys onto learned one-dimensional directions, operating on the scalar differences, and applying a ReLU kernel. The resulting attention scores admit a sorting-based prefix-sum trick, allowing the attention to be computed in $O(n \log n)$ time with $O(n)$ memory.

In this framework, the convolutional form of the ReLU kernel in 1D can be reduced to cumulative sums on the sorted projected sequence, efficiently implementing global attention. The approach preserves theoretical expressive power—sequence disentangling and universal approximation—previously established for softmax attention, yet it is highly scalable for long contexts (Boufadène et al., 12 Dec 2025).
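A minimal sketch of the sorting/prefix-sum reduction, using scalar (1D) values for clarity and an unnormalized ReLU score $\max(u_i - v_j, 0)$ (the learned projections and normalization of the full method are omitted):

```python
import numpy as np

def relu_attention_naive(u, v, x):
    # O(n^2) reference: score ReLU(u_i - v_j) applied to values x_j.
    return np.maximum(u[:, None] - v[None, :], 0.0) @ x

def relu_attention_sorted(u, v, x):
    """O(n log n) trick: sort the keys once, then answer every query
    from two prefix sums, since
    sum_{j: v_j <= u_i} (u_i - v_j) x_j = u_i * S_x(i) - S_vx(i)."""
    order = np.argsort(v)
    v_s, x_s = v[order], x[order]
    S_x = np.concatenate([[0.0], np.cumsum(x_s)])         # prefix sums of x
    S_vx = np.concatenate([[0.0], np.cumsum(v_s * x_s)])  # prefix sums of v*x
    k = np.searchsorted(v_s, u, side="right")  # count of keys with v_j <= u_i
    return u * S_x[k] - S_vx[k]
```

Only keys with $v_j \le u_i$ contribute (the ReLU zeroes the rest), which is exactly what the sorted prefix sums exploit.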

Small-scale experiments (LRA, CIFAR-10, point clouds) indicate that sliced ReLU attention attains competitive or better accuracy and superior throughput for long sequences compared to softmax.

Sliceformer: Sorting as Attention

Sliceformer (Yuan et al., 2023) replaces multi-head QKV attention with a linear projection of the input, followed by per-feature sorting (permuting each projected “slice”). This produces implicit, sparse, full-rank, and doubly stochastic attention maps—a notable contrast to the low-rank and blurred maps induced by softmax, especially as $N \rightarrow \infty$.

The computational cost of the slicing–sorting operation is $O(N \log N \cdot D')$ for $D'$ output channels, a reduction from the $O(N^2 \cdot D')$ cost of standard attention. On discriminative tasks (LRA, image and text classification, and molecular property prediction), Sliceformer achieves comparable or superior accuracy, reduced memory, and faster inference relative to conventional Transformers (Yuan et al., 2023).
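A minimal sketch of the slicing–sorting operation, assuming a single learned projection matrix `W`:

```python
import numpy as np

def slice_sort_attention(X, W):
    """Sliceformer-style layer (sketch): linearly project the input,
    then sort each projected channel ('slice') independently along the
    sequence axis. Sorting a column is equivalent to applying an
    implicit permutation matrix, which is sparse, full-rank, and
    doubly stochastic.

    X: (N, D) input sequence, W: (D, D') projection."""
    Z = X @ W                  # (N, D') projection
    return np.sort(Z, axis=0)  # per-channel ascending sort, O(N log N * D')
```

Descending or interleaved sort orders per slice, as the paper discusses, would just replace the `np.sort` call per channel.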

The slicing–sorting paradigm is also numerically robust: it sidesteps the over-smoothing and floating-point instability of exponentials/divisions inherent in softmax, empirically demonstrates preserved spectral richness in embeddings, and can flexibly accommodate ascending/descending or interleaved orders to boost diversity.

4. Attention Sorting in Neural Sorting Networks

Generalized neural sorting networks (Kim et al., 2023) address sorting of high-dimensional inputs using permutation-equivariant Transformer encoders. The attention mechanism (sans positional encoding) produces scalar scores per instance, which are fed through a sequence of comparators (classical hardware sorting network structure).

A key technical advance is an error-free, differentiable swap function, operating as the $\min/\max$ in the forward pass and propagating gradients via a stop-gradient-over-soft-swap in the backward pass. This ensures non-decreasing and differentiable behavior with zero softening error even when chaining many swap layers. The resultant models are trained end-to-end to align soft and hard permutations, showing superior sorting accuracy for complex inputs such as multi-digit images, fragment reassembly, and challenging MNIST/SVHN/sequence benchmarks (Kim et al., 2023).
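The comparator structure can be illustrated with an odd-even transposition network. The `softswap` below stands in for the soft comparator that supplies gradients in the actual method; the straight-through wiring between the two is framework-specific and omitted in this forward-only sketch:

```python
import numpy as np

def softswap(a, b, tau=0.1):
    # Soft comparator: a sigmoid-relaxed swap. In training, gradients
    # would flow through this path (straight-through estimator), while
    # the forward pass uses the exact hard swap below.
    s = 1.0 / (1.0 + np.exp(-(b - a) / tau))
    return s * a + (1 - s) * b, (1 - s) * a + s * b

def hard_swap(a, b):
    # Forward pass: exact min/max comparator, zero softening error
    # even when many swap layers are chained.
    return np.minimum(a, b), np.maximum(a, b)

def odd_even_sort(scores):
    """Odd-even transposition sorting network built from comparators;
    n parallel stages of adjacent swaps sort n scalar scores."""
    x = np.asarray(scores, dtype=float).copy()
    n = len(x)
    for step in range(n):
        for i in range(step % 2, n - 1, 2):
            x[i], x[i + 1] = hard_swap(x[i], x[i + 1])
    return x
```

In the full model, the scalar scores entering the network are produced by the permutation-equivariant Transformer encoder described above.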

5. Specialized Applications: Attention Sorting for Domain-Specific Sorting Tasks

Few-Shot Spike Sorting

FS-SS (Fang et al., 23 Mar 2025) employs self-attention and multi-scale residual modules for rapid, high-accuracy sorting of neural spike waveforms. Here, "sorting" refers to few-shot classification and discrimination of spike waveform shapes. The model’s residual attention blocks and dilated convolutions enable simultaneous capture of global and local signal variation, achieving >99% accuracy even under strong class similarity and reduced training data. Learned attention maps in FS-SS directly reflect physiologically meaningful waveform landmarks.

Ore Sorting with Fused Attention

OreYOLO (Zhen et al., 2024) integrates efficient multi-scale attention (EMA) within a lightweight object detection backbone to enhance mineral ore classification. The EMA modules sort or emphasize spatial/channel components according to cross-spatial, cross-channel interactions. Ablation studies confirm that EMA substantially boosts mean average precision at negligible compute overhead, by focusing the network on diagnostically salient ore features.

6. Practical Trade-Offs and Theoretical Insights

  • Performance vs. Efficiency: Sorting-based attention modules, whether via meta-sorting nets (Sinkhorn), 1D sorting tricks (sliced ReLU), or pure permutation (Sliceformer), achieve substantial savings in memory/computation, especially for long sequences or edge-device scenarios (Tay et al., 2020, Boufadène et al., 12 Dec 2025, Yuan et al., 2023, Zhen et al., 2024).
  • Expressivity: Theoretical guarantees for sequence disentangling and contextual universality persist under sorting-based attention, confirming that computational gains need not unduly constrain representational capacity (Boufadène et al., 12 Dec 2025).
  • Robustness vs. Adversarial Manipulation: Attention sorting procedures may inadvertently prioritize highly salient but irrelevant distractors; hybrid reweighting with external retrievers is recommended for fact-critical tasks (Peysakhovich et al., 2023).
  • Numerical Stability: Sorting-based mechanisms are immune to exponentiation/division overflow and exhibit more stable singular value spectra compared to softmax attention (Yuan et al., 2023).
  • Applicability: Chunkable, reorderable context and defined block structures are typically required for direct application of document-level or block-level attention sorting. Adapting these methods to streaming or interleaved input modalities may require additional model modifications.

7. Outlook and Extensions

Attention sorting comprises a versatile set of strategies spanning inference-time reordering, architectural modifications, and end-to-end differentiable networks. These methods have shown significant benefits in improving the efficiency, stability, and accuracy of deep learning models across NLP, vision, structured data, spike waveform analysis, and industrial domain-specific sorting. Ongoing research focuses on hybrid schemes combining attention patterns with learned retrieval, efficient permutation learning with hard/soft/stochastic variants, and extension to non-chunkable free-form inputs.

Key references: (Peysakhovich et al., 2023, Tay et al., 2020, Boufadène et al., 12 Dec 2025, Yuan et al., 2023, Kim et al., 2023, Fang et al., 23 Mar 2025, Zhen et al., 2024).
