SortCut Algorithm for Efficient Sparse Attention
- SortCut is an algorithmic enhancement to Sparse Sinkhorn Attention that dynamically selects and truncates sequence blocks for efficient long-sequence modeling.
- It employs a learned sorting network with Sinkhorn normalization to rank block importance, balancing global context with local attention.
- Empirical results show SortCut achieves near-linear time and space complexity in sequence length while maintaining performance close to full attention at a fraction of the memory cost.
SortCut is an algorithmic enhancement to the Sparse Sinkhorn Attention mechanism, designed to enable efficient and scalable self-attention for long sequences by dynamically truncating attention to the most salient regions. Its core principle is the explicit selection, via learned sorting and hard truncation, of a fixed budget of sequence blocks before applying local attention. This process yields near-linear time and space complexity in sequence length, while maintaining empirical performance comparable to full attention on a variety of natural language and sequence modeling tasks (Tay et al., 2020).
1. Conceptual Motivation and Design Principles
Standard Transformer self-attention incurs quadratic memory and computational costs in the sequence length $\ell$. Local attention reduces this but loses the ability to aggregate global context. Sparse Sinkhorn Attention addresses this by learning a permutation of sequence blocks, allowing local attention to span non-contiguous parts of the original sequence. SortCut introduces an additional compression step: after learning block importance through permutation, it truncates the permuted sequence by retaining only the highest-scoring blocks. This approach exploits input-dependent structure, focusing memory and compute resources on the most relevant content per example.
SortCut is especially effective when applied in encoder architectures, where a single learned sort suffices per layer. For decoder/autoregressive applications, the sorting must be performed at each timestep with special causal constraints, increasing computational overhead.
2. Formal Definition and Mathematical Specification
Consider sequence embeddings $X \in \mathbb{R}^{\ell \times d}$, partitioned into $n_B = \ell / b$ non-overlapping blocks of size $b$. A block-pooling operation (typically mean or sum pooling) aggregates each block's embeddings into a single vector.
A small block-sorting network assigns each block $i$ a vector of sorting logits $r_i \in \mathbb{R}^{n_B}$. These logits are perturbed (optionally) with Gumbel noise, scaled by a temperature parameter $\tau$, then normalized using Sinkhorn iterations to yield a soft permutation matrix $P \in [0,1]^{n_B \times n_B}$.
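The Sinkhorn step can be sketched as follows; the log-space formulation and the defaults `tau=0.75` and `n_iters=8` are illustrative assumptions, not values from the paper's code:

```python
import numpy as np

def sinkhorn(logits, tau=0.75, n_iters=8):
    """Normalize an (n_B, n_B) logit matrix toward a doubly stochastic
    soft permutation by alternating row/column normalization in log space."""
    log_p = logits / tau
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # rows
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)  # cols
    return np.exp(log_p)

rng = np.random.default_rng(0)
P = sinkhorn(rng.normal(size=(4, 4)))  # rows and columns each sum to ~1
```

Working in log space keeps the alternating normalizations numerically stable; lower `tau` or more iterations push $P$ closer to a hard permutation.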
Block-wise key and value representations $K, V \in \mathbb{R}^{\ell \times d}$ are linearly projected and block-pooled into $K_B, V_B \in \mathbb{R}^{n_B \times d}$, then permuted:

$$\tilde{K}_B = P K_B, \qquad \tilde{V}_B = P V_B.$$

SortCut truncates the sorted blocks, retaining only the first $k$:

$$\tilde{K}_B^{[:k]} = (\tilde{K}_B)_{1:k}, \qquad \tilde{V}_B^{[:k]} = (\tilde{V}_B)_{1:k}.$$

Attention is computed using queries $Q$ against the unfolded truncated blocks:

$$Y = \mathrm{softmax}\!\left(\frac{Q\, \psi_k(K)^\top}{\sqrt{d}}\right) \psi_k(V),$$

where $\psi_k(\cdot)$ denotes block-sorting and truncation to the first $k$ blocks, followed by unpooling back to token-level representations.
3. Algorithmic Workflow and Pseudocode
The SortCut variant of Sparse Sinkhorn Attention involves the following pipeline for each attention head:
- Compute $Q$, $K$, $V$ by projecting the input $X$.
- Pool $K$ and $V$ into blocks ($K_B$, $V_B$).
- Obtain sorting logits $r$ via the feed-forward sorting network.
- Optionally add Gumbel noise, then normalize via Sinkhorn iterations into the soft permutation matrix $P$.
- Apply $P$ to the block-pooled $K_B$ and $V_B$.
- Truncate to the top-$k$ blocks ($\tilde{K}_B^{[:k]}$, $\tilde{V}_B^{[:k]}$).
- Expand truncated blocks back to token-level representations.
- Compute scaled dot-product attention using $Q$ and the expanded keys and values, yielding outputs $Y$.
Block pooling can be adapted (sum, mean) as necessary. For multi-head attention, each head maintains independent sorting networks and budgets.
| Step | Input(s) | Output |
|---|---|---|
| Block pooling | $K$, $V$ | $K_B$, $V_B$ |
| Sorting logits | $K_B$ | $r$ |
| Sinkhorn norm. | $r$, $\tau$ | $P$ |
| Permutation | $P$, $K_B$, $V_B$ | $\tilde{K}_B$, $\tilde{V}_B$ |
| Truncation | $\tilde{K}_B$, $\tilde{V}_B$, $k$ | $\tilde{K}_B^{[:k]}$, $\tilde{V}_B^{[:k]}$ |
This structured pipeline underpins SortCut’s efficiency and flexibility.
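The workflow above can be sketched as a minimal single-head NumPy implementation. The mean pooling, the one-matrix toy sorting "network" `W_sort`, and all variable names are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sinkhorn(logits, tau=0.75, n_iters=8):
    """Soft permutation via alternating row/column log-space normalization."""
    log_p = logits / tau
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)
    return np.exp(log_p)

def sortcut_attention(Q, K, V, b, k, W_sort, tau=0.75):
    ell, d = K.shape
    n_b = ell // b                                # number of blocks
    K_blk = K.reshape(n_b, b, d)
    V_blk = V.reshape(n_b, b, d)
    pooled = K_blk.mean(axis=1)                   # block summaries (mean pooling)
    P = sinkhorn(pooled @ W_sort, tau)            # (n_b, n_b) soft permutation
    K_srt = np.einsum('ij,jtd->itd', P, K_blk)    # permute key blocks
    V_srt = np.einsum('ij,jtd->itd', P, V_blk)    # permute value blocks
    K_top = K_srt[:k].reshape(k * b, d)           # truncate to budget k, unpool
    V_top = V_srt[:k].reshape(k * b, d)
    scores = Q @ K_top.T / np.sqrt(d)             # queries see only k*b tokens
    return softmax(scores) @ V_top

rng = np.random.default_rng(1)
ell, d, b, k = 32, 8, 4, 2
Q, K, V = (rng.normal(size=(ell, d)) for _ in range(3))
W_sort = rng.normal(size=(d, ell // b))           # toy one-layer sorting network
Y = sortcut_attention(Q, K, V, b, k, W_sort)
```

Each query here attends to only $k \cdot b = 8$ of the 32 tokens; in a trained model `W_sort` would be learned jointly with the attention projections.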
4. Computational Complexity and Efficiency
For sequence length $\ell$, block size $b$, number of blocks $n_B = \ell/b$, and SortCut budget $k$:
- Full attention: $O(\ell^2)$ time, $O(\ell^2)$ memory
- Block-local attention: $O(\ell b)$ time, $O(\ell b)$ memory
- Sinkhorn block-sort attention: block-local cost plus the $O(n_B^2)$ Sinkhorn sorting cost, no truncation
- SortCut (with truncation): $O(\ell k b)$ time, $O(\ell k b)$ memory for the attention map
Since $kb \ll \ell$ typically, SortCut achieves near-linear time and space in $\ell$, aside from the soft-permutation (Sinkhorn) cost, which is amortized over training. Unlike masking-based sparse architectures, SortCut requires only standard dense matrix operations, simplifying implementation.
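A back-of-envelope comparison of per-layer attention-map sizes makes the savings concrete; the values $\ell = 2048$, $b = 64$, $k = 2$ are illustrative choices, not taken from the paper:

```python
# Attention-map entry counts for illustrative hyperparameters.
ell, b, k = 2048, 64, 2

full = ell * ell          # full attention: every query scores every key
local = ell * b           # block-local attention
sortcut = ell * (k * b)   # SortCut: each query scores only k*b retained tokens

print(full // sortcut)    # → 16, i.e. 16x fewer score entries than full attention
```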
5. Hyperparameterization and Influence
Key SortCut hyperparameters include:
- Block size $b$: Governs granularity; larger $b$ yields fewer blocks and cheaper Sinkhorn sorting, at a potential loss of block-level detail.
- Sinkhorn temperature $\tau$ and iteration count: Lower $\tau$ and more iterations make the learned permutation more discrete. However, excessively sharp (hard) sorting impairs gradient flow. Ablations indicate that moderate temperatures and a small number of Sinkhorn iterations are typically optimal; removing Sinkhorn normalization entirely severely degrades performance.
- Truncation budget $k$: The number of blocks retained post-sort. Lower $k$ saves memory but potentially omits context; in practice, even a small budget on long NLP sequences can nearly match full attention using $1/32$ of the memory.
- Block-sorting network: Its architecture and capacity determine permutation quality.
Ablations confirm that a small $k$ can suffice and that the balance between Sinkhorn softness and truncation aggressiveness is crucial to optimal trade-offs.
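The optional Gumbel perturbation and temperature scaling of the sorting logits can be sketched as follows; the seed, shapes, and $\tau = 0.75$ are illustrative assumptions:

```python
# Gumbel-perturbed, temperature-scaled sorting logits, as fed into the
# Sinkhorn normalization; all concrete values here are illustrative.
import numpy as np

rng = np.random.default_rng(3)
logits = rng.normal(size=(4, 4))               # raw block-sorting logits
tau = 0.75                                     # Sinkhorn temperature

u = rng.uniform(1e-9, 1.0, size=logits.shape)
gumbel = -np.log(-np.log(u))                   # standard Gumbel(0, 1) samples
perturbed = (logits + gumbel) / tau            # input to the Sinkhorn iterations
```

The Gumbel noise injects stochastic exploration over block orderings during training, while $\tau$ controls how sharp the resulting soft permutation becomes.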
6. Integration with Meta-Sorting Networks and Causal Sinkhorn
The meta-sorting network emits per-block logits guiding the permutation. In encoder architectures, a single sort suffices per layer, maximizing efficiency. For causal (decoder) attention, block pooling and Sinkhorn normalization must enforce causality:
- Block pooling for block $i$ aggregates only positions up to the current timestep, ensuring no access to future context.
- Masked Sinkhorn normalization prohibits permutation entries $P_{ij}$ for future indices $j > i$.
Because causal SortCut requires recomputing the sort at every decoding step, the mechanism is in practice better suited to encoder use cases.
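The masked normalization can be sketched as follows; the masking convention (forbidding entries $P_{ij}$ with $j > i$) and all names are assumptions for illustration:

```python
import numpy as np

def causal_sinkhorn(logits, tau=0.75, n_iters=8):
    """Sinkhorn normalization with future-block entries masked out."""
    n = logits.shape[0]
    log_p = logits / tau
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # entries with j > i
    log_p = np.where(mask, -np.inf, log_p)             # forbid future blocks
    for _ in range(n_iters):
        log_p = log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        log_p = log_p - np.logaddexp.reduce(log_p, axis=0, keepdims=True)
    return np.exp(log_p)                               # masked entries are exactly 0

P = causal_sinkhorn(np.random.default_rng(2).normal(size=(4, 4)))
```

Note that the only doubly stochastic matrix supported on the lower triangle is the identity, so repeated iterations sharpen the masked permutation toward it; a soft regime (few iterations, moderate $\tau$) keeps the sort informative.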
7. Empirical Results and Comparative Analysis
On standard NLP and sequence tasks (IMDb, SST, SNLI, MultiNLI, with sequences up to 2k tokens), SortCut encoders demonstrate accuracy within about one point of full attention while substantially reducing memory usage.
Representative experimental results:
| Model | IMDb Accuracy (%) | SNLI Accuracy (%) | Memory Use |
|---|---|---|---|
| Vanilla Transformer | 85.12 | 78.87 | Baseline ($O(\ell^2)$) |
| SortCut | 84.32 | – | Reduced |
| SortCut | 84.43 | – | Reduced |
| SortCut | – | 80.30 | Reduced |
Ablations reveal that Sinkhorn normalization is essential for performance; removing it collapses accuracy. Optimal results arise from a soft, but not too hard, sorting regime and moderate block budgets.
Empirically, SortCut sometimes outperforms full attention models on NLI tasks, suggesting that focused content selection can act as a regularizer. The practical utility of hard truncation and learned permutation is consistently confirmed by ablations (Tay et al., 2020).
In summary, SortCut generalizes Sparse Sinkhorn Attention with post-sorting truncation, dynamically selecting the most pertinent subsequence blocks for attention computation. This results in a simple, efficient, and highly scalable mechanism—well-suited for long-sequence modeling without requiring specialized hardware or custom kernels, achieving substantial savings in computational resources while closely matching standard attention's empirical performance.