SortCut Algorithm for Efficient Sparse Attention

Updated 7 February 2026
  • SortCut is an algorithmic enhancement to Sparse Sinkhorn Attention that dynamically selects and truncates sequence blocks for efficient long-sequence modeling.
  • It employs a learned sorting network with Sinkhorn normalization to rank block importance, balancing global context with local attention.
  • Empirical results show SortCut achieves near-linear time and space complexity while maintaining performance close to full attention with significant memory savings.

SortCut is an algorithmic enhancement to the Sparse Sinkhorn Attention mechanism, designed to enable efficient and scalable self-attention for long sequences by dynamically truncating attention to the most salient regions. Its core principle is the explicit selection, via learned sorting and hard truncation, of a fixed budget of sequence blocks before applying local attention. This process yields near-linear time and space complexity in sequence length, while maintaining empirical performance comparable to full attention on a variety of natural language and sequence modeling tasks (Tay et al., 2020).

1. Conceptual Motivation and Design Principles

Standard Transformer self-attention incurs quadratic memory and computational costs in the sequence length $\ell$. Local attention reduces this but loses the ability to aggregate global context. Sparse Sinkhorn Attention addresses this by learning a permutation of sequence blocks, allowing local attention to span non-contiguous parts of the original sequence. SortCut introduces an additional compression step: after learning block importance through permutation, it truncates the permuted sequence by retaining only the highest-scoring $n$ blocks. This approach exploits input-dependent structure, focusing memory and compute resources on the most relevant content per example.

SortCut is especially effective when applied in encoder architectures, where a single learned sort suffices per layer. For decoder/autoregressive applications, the sorting must be performed at each timestep with special causal constraints, increasing computational overhead.

2. Formal Definition and Mathematical Specification

Consider sequence embeddings $X \in \mathbb{R}^{\ell \times d}$, partitioned into $N_B = \ell / B$ non-overlapping blocks of size $B$. A block-pooling operation $\psi_P(X) = B(X) \in \mathbb{R}^{N_B \times d}$ (typically mean or sum pooling) aggregates each block's embeddings.

A small block-sorting network $P(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^{N_B}$ assigns each block a row of sorting logits; stacked over blocks, these form $R_{\text{logits}} \in \mathbb{R}^{N_B \times N_B}$. The logits are optionally perturbed with Gumbel noise, scaled by a temperature parameter $\tau$, and then normalized with $N_{\text{iter}}$ Sinkhorn iterations (alternating row and column normalization) to yield a soft permutation matrix $R$.
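This normalization step can be sketched as follows. The snippet is an illustrative NumPy implementation, not the authors' code; the helper names (`sinkhorn_sort`, `_lse`) are chosen here for exposition:

```python
import numpy as np

def _lse(x, axis):
    """Numerically stable log-sum-exp along an axis, keeping dims."""
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))

def sinkhorn_sort(logits, tau=0.75, n_iter=8, gumbel=False, rng=None):
    """Map (N_B x N_B) sorting logits to a soft permutation matrix R.

    Optional Gumbel perturbation, temperature scaling by tau, then
    n_iter alternating column/row normalizations in log space.
    """
    log_r = np.asarray(logits, dtype=float)
    if gumbel:
        rng = np.random.default_rng() if rng is None else rng
        u = rng.uniform(1e-9, 1.0, size=log_r.shape)
        log_r = log_r - np.log(-np.log(u))   # Gumbel(0, 1) noise
    log_r = log_r / tau
    for _ in range(n_iter):
        log_r = log_r - _lse(log_r, axis=0)  # normalize columns
        log_r = log_r - _lse(log_r, axis=1)  # normalize rows
    return np.exp(log_r)                     # approximately doubly stochastic
```

With the final normalization applied over rows, every row of $R$ sums to one exactly, and the column sums approach one as the iteration count grows.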

Block-wise key and value representations $K_B$, $V_B$ are linearly projected and block-pooled, then permuted:

$K_S = R K_B, \qquad V_S = R V_B$

SortCut truncates the sorted blocks, retaining only the first $n$:

$K_{\text{cut}} = K_S[1:n, :], \qquad V_{\text{cut}} = V_S[1:n, :]$

Attention is computed using queries $Q$ against the unfolded truncated blocks:

$Y = \mathrm{Softmax}\left(Q\,[\psi_S(K)]_{:n}^\top\right)\,[\psi_S(V)]_{:n}$

where $\psi_S(X) = U(R \cdot B(X))$ denotes block-sorting followed by unpooling $U$.
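The pooling and unpooling operators can be made concrete with a small sketch. The mean-pooling choice and names (`block_pool` for $\psi_P$, `unpool` for $U$) are assumptions made here for illustration:

```python
import numpy as np

def block_pool(X, B):
    """psi_P: (L, d) tokens -> (L // B, d) block embeddings via mean pooling."""
    L, d = X.shape
    return X.reshape(L // B, B, d).mean(axis=1)

def unpool(XB, B):
    """U: (N_B, d) blocks -> (N_B * B, d) by repeating each block B times."""
    return np.repeat(XB, B, axis=0)

X = np.arange(8.0).reshape(4, 2)   # L = 4 tokens, d = 2, block size B = 2
XB = block_pool(X, B=2)            # two block means: [[1, 2], [5, 6]]
Xu = unpool(XB, B=2)               # back to token resolution, shape (4, 2)
```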

3. Algorithmic Workflow and Pseudocode

The SortCut variant of Sparse Sinkhorn Attention involves the following pipeline for each attention head:

  1. Compute $Q$, $K$, $V$ by projecting the input $X$.
  2. Pool $K$ and $V$ into blocks ($K_B$, $V_B$).
  3. Obtain sorting logits via the feed-forward network $P(\cdot)$.
  4. Optionally add Gumbel noise, then normalize via Sinkhorn iterations into the soft permutation matrix $R$.
  5. Apply $R$ to the block-pooled $K_B$ and $V_B$ for permutation.
  6. Truncate to the top-$n$ blocks ($K_{\text{cut}}$, $V_{\text{cut}}$).
  7. Expand the truncated blocks back to token-level representations.
  8. Compute scaled dot-product attention using $Q$ and the expanded $K_{\text{cut}}$, $V_{\text{cut}}$, yielding outputs $Y$.

Block pooling can be adapted (sum or mean) as necessary. For multi-head attention, each head maintains an independent sorting network and budget.

Step            | Input(s)                     | Output
Block pooling   | $K, V$                       | $K_B, V_B$
Sorting logits  | $K_B$                        | $R_{\text{logits}}$
Sinkhorn norm.  | $R_{\text{logits}}$, $\tau$  | $R$
Permutation     | $K_B, V_B, R$                | $K_S, V_S$
Truncation      | $K_S, V_S, n$                | $K_{\text{cut}}, V_{\text{cut}}$

This structured pipeline underpins SortCut’s efficiency and flexibility.
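As a concrete rendering of steps 1–8, the following single-head sketch uses NumPy with mean pooling and a linear sorting network. All names, shapes, and the omission of Gumbel noise are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def _lse(x, axis):
    """Numerically stable log-sum-exp, keeping dims."""
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))

def sortcut_head(X, Wq, Wk, Wv, Wp, B, n, tau=0.75, n_iter=8):
    """Single-head SortCut attention sketch (illustrative).

    X: (L, d) inputs; Wq, Wk, Wv: (d, d) projections;
    Wp: (d, N_B) linear sorting network; B: block size; n: block budget.
    """
    L, d = X.shape
    NB = L // B
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # 1. projections
    KB = K.reshape(NB, B, d).mean(axis=1)       # 2. block pooling (mean)
    VB = V.reshape(NB, B, d).mean(axis=1)
    log_r = (KB @ Wp) / tau                     # 3. sorting logits, scaled
    for _ in range(n_iter):                     # 4. Sinkhorn normalization
        log_r = log_r - _lse(log_r, axis=0)
        log_r = log_r - _lse(log_r, axis=1)
    R = np.exp(log_r)                           #    soft permutation matrix
    KS, VS = R @ KB, R @ VB                     # 5. permute blocks
    Kc, Vc = KS[:n], VS[:n]                     # 6. truncate to budget n
    Ke = np.repeat(Kc, B, axis=0)               # 7. unpool to token level
    Ve = np.repeat(Vc, B, axis=0)
    S = (Q @ Ke.T) / np.sqrt(d)                 # 8. scaled dot-product attention
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return A @ Ve                               # (L, d) outputs
```

Note that every query attends only to the $n \cdot B$ tokens belonging to the retained blocks, which is where the memory saving comes from.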

4. Computational Complexity and Efficiency

For sequence length $\ell$, block size $B$, number of blocks $N_B = \ell / B$, and SortCut budget $n$:

  • Full attention: $O(\ell^2 d)$ time, $O(\ell^2)$ memory
  • Block-local attention: $O(B^2 d + (\ell/B)^2 d)$ time, $O(B^2 + (\ell/B)^2)$ memory
  • Sinkhorn block-sort attention: adds $O(N_{\text{iter}} N_B^2)$ Sinkhorn cost, no truncation
  • SortCut (with truncation): $O(\ell n d + N_{\text{iter}} N_B^2)$ time, $O(\ell n + N_B^2)$ memory

Since typically $n \ll N_B \ll \ell$, SortCut achieves near-linear time and space in $\ell$, aside from the soft permutation (Sinkhorn) cost, which is amortized over training. Unlike masking-based sparse architectures, SortCut requires only standard matrix operations, simplifying implementation.
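Plugging representative values into the leading terms above makes the gap concrete. This treats the asymptotic expressions as exact operation counts and ignores constants, so the figures are back-of-the-envelope estimates for a hypothetical configuration, not measured costs:

```python
def attention_costs(l, d, B, n, n_iter):
    """Leading-term cost estimates from the complexity expressions above."""
    NB = l // B
    full_time = l * l * d                          # O(l^2 d) full attention
    sortcut_time = l * n * d + n_iter * NB * NB    # O(l n d + N_iter N_B^2)
    full_mem = l * l                               # O(l^2)
    sortcut_mem = l * n + NB * NB                  # O(l n + N_B^2)
    return full_time, sortcut_time, full_mem, sortcut_mem

# Hypothetical configuration: l = 2048, d = 64, B = 128, n = 8, N_iter = 8.
ft, st, fm, sm = attention_costs(l=2048, d=64, B=128, n=8, n_iter=8)
# ft = 268_435_456 vs st = 1_050_624: roughly a 255x reduction in the
# leading time term; fm = 4_194_304 vs sm = 16_640 in the memory term.
```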

5. Hyperparameterization and Influence

Key SortCut hyperparameters include:

  • Block size $B$: governs granularity; larger $B$ yields fewer blocks and cheaper Sinkhorn sorting, at a potential loss of block-level detail.
  • Sinkhorn temperature $\tau$ and iteration count $N_{\text{iter}}$: lower $\tau$ and higher $N_{\text{iter}}$ make the learned permutation more discrete. However, excessively sharp (hard) sorting impairs gradient flow. Empirical ablative analysis shows $\tau \in [0.5, 0.75]$ and $5 \leq N_{\text{iter}} \leq 10$ are typically optimal; $N_{\text{iter}} = 0$ (no Sinkhorn) severely degrades performance.
  • Truncation budget $n$: the number of blocks retained post-sort. Lower $n$ saves memory but potentially omits context; in practice, even $n = 8$ (on long NLP sequences) can nearly match full attention using $1/32$ of the memory.
  • Block-sorting network $P(\cdot)$: its architecture and capacity affect permutation quality.

Ablations confirm that small nn can suffice and that the balance between Sinkhorn softness and truncation aggressiveness is crucial to optimal trade-offs.
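The temperature effect can be observed directly: under the same logits, a lower $\tau$ concentrates each row of the resulting soft permutation. The small experiment below is illustrative only; the entropy measure and setup are choices made here, not from the paper:

```python
import numpy as np

def _lse(x, axis):
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))

def sinkhorn(logits, tau, n_iter=8):
    """Temperature-scaled Sinkhorn normalization to a soft permutation."""
    log_r = logits / tau
    for _ in range(n_iter):
        log_r = log_r - _lse(log_r, axis=0)
        log_r = log_r - _lse(log_r, axis=1)
    return np.exp(log_r)

def mean_row_entropy(R, eps=1e-12):
    """Average Shannon entropy of the rows; 0 for a hard permutation."""
    return float(-(R * np.log(R + eps)).sum(axis=1).mean())

logits = np.random.default_rng(0).normal(size=(8, 8))
sharp = sinkhorn(logits, tau=0.1)   # low temperature: near-discrete
soft = sinkhorn(logits, tau=5.0)    # high temperature: near-uniform
# Lower tau gives lower row entropy, i.e. a harder permutation.
```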

6. Integration with Meta-Sorting Networks and Causal Sinkhorn

The meta-sorting network $P(\cdot)$ emits per-block logits guiding the permutation. In encoder architectures, a single sort per layer suffices, maximizing efficiency. For causal (decoder) attention, the block pooling and Sinkhorn normalization must enforce causality:

  • Block pooling for block $i$ uses only positions up to $i$, ensuring no future context access.
  • Masked Sinkhorn normalization prohibits permutation entries for indices $j > i$.

Causal SortCut thus requires recomputation at every decoding step, making it better suited to encoder use cases.
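The masked normalization can be sketched as follows. This is one illustrative reading of the constraint $j > i$ above, with all naming local to this example:

```python
import numpy as np

def _lse(x, axis):
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))

def causal_sinkhorn(logits, tau=0.75, n_iter=5):
    """Masked Sinkhorn sketch: entry (i, j) weights placing source block j
    at position i; causality forbids j > i, so those entries are fixed at
    -inf (weight zero) and stay zero through all normalizations."""
    NB = logits.shape[0]
    future = np.triu(np.ones((NB, NB), dtype=bool), k=1)  # j > i disallowed
    log_r = np.where(future, -np.inf, logits / tau)
    for _ in range(n_iter):
        log_r = log_r - _lse(log_r, axis=1)  # rows: -inf entries stay -inf
        log_r = log_r - _lse(log_r, axis=0)  # columns likewise
    return np.exp(log_r)
```

In an autoregressive decoder this matrix must be recomputed as the prefix grows, which is the per-step overhead noted above.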

7. Empirical Results and Comparative Analysis

On standard NLP and sequence tasks (IMDb, SST, SNLI, MultiNLI, up to 2k tokens), SortCut encoders with $n = 8$ blocks demonstrate accuracy within $1$–$2\%$ of full attention while reducing memory usage by $30\times$–$50\times$.

Representative experimental results:

Model                      | IMDb Acc. (%) | SNLI Acc. (%) | Memory Use
Vanilla Transformer        | 85.12         | 78.87         | Baseline ($O(\ell^2)$)
SortCut ($B=128$, $n=8$)   | 84.32         | –             | $\sim 1/32$
SortCut ($B=128$, $n=32$)  | 84.43         | –             | $\sim 1/8$
SortCut ($n=16$)           | –             | 80.30         | Reduced

Examinations reveal that Sinkhorn normalization is essential for performance; setting $N_{\text{iter}} = 0$ collapses accuracy. Optimal results arise from a soft (but not fully hard) sorting regime and moderate block budgets.

Empirically, SortCut sometimes outperforms full attention models on NLI tasks, suggesting that focused content selection can act as a regularizer. The practical utility of hard truncation and learned permutation is consistently confirmed by ablations (Tay et al., 2020).


In summary, SortCut generalizes Sparse Sinkhorn Attention with post-sorting truncation, dynamically selecting the most pertinent subsequence blocks for attention computation. This results in a simple, efficient, and highly scalable mechanism—well-suited for long-sequence modeling without requiring specialized hardware or custom kernels, achieving substantial savings in computational resources while closely matching standard attention's empirical performance.

References

  • Tay et al. (2020), Sparse Sinkhorn Attention.