
Sparsefinder: Efficient Sparse Entmax

Updated 12 June 2026
  • Sparsefinder is a method that predicts the exact sparsity pattern of entmax-based attention, reducing dense matrix computation by focusing on nonzero entries.
  • It uses low-dimensional projections with contrastive hinge loss and grouping strategies (distance, quantization, clustering) to efficiently approximate attention edges.
  • Empirical evaluations in machine translation and language modeling show that Sparsefinder achieves near full recall with significant reductions in FLOPs and memory usage.

Sparsefinder is a method for efficiently predicting the exact sparsity pattern of entmax-based attention in transformer models, allowing one to compute only those attention edges that will receive nonzero probability, with the goal of significant reductions in both computation and memory. Originally introduced by Treviso et al. (2021), Sparsefinder addresses the challenge that, while entmax produces sparse attention distributions, it still requires forming and normalizing a dense attention matrix with $O(nm)$ complexity, where $n$ and $m$ are the sequence lengths. Sparsefinder predicts the sparse attention graph prior to the actual attention computation, enabling focused computation on attended positions and achieving improved efficiency without sacrificing the accuracy or recall of the attention mechanism (Treviso et al., 2021).

1. Mathematical Foundations of Entmax Attention

Sparsefinder builds directly on entmax$_\alpha$ attention mechanisms. Whereas standard Transformer attention computes $Z = QK^\top \in \mathbb{R}^{n\times m}$, with queries $Q \in \mathbb{R}^{n\times d}$ and keys $K \in \mathbb{R}^{m\times d}$, then applies softmax normalization giving dense distributions, entmax$_\alpha$ replaces softmax with a Tsallis-type entropy regularizer. The entmax$_\alpha$ function is defined by

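$$\text{entmax}_\alpha(z) = \arg\max_{p\in\Delta^m} \left( p^\top z + H_\alpha(p) \right)$$

where $H_\alpha$ is the Tsallis entropy: $H_1(p) = -\sum_j p_j \log p_j$ for $\alpha = 1$ (recovering softmax), and for $\alpha \neq 1$,

$$H_\alpha(p) = \frac{1}{\alpha(\alpha-1)} \sum_j \left( p_j - p_j^\alpha \right)$$

For $\alpha > 1$, the attention distribution is sparse: coordinates with $(\alpha-1) z_j \le \tau(z)$ are exactly zero, and nonzero values have the closed form

$$p_j = \left[ (\alpha-1) z_j - \tau(z) \right]^{1/(\alpha-1)}$$

where $\tau(z)$ is the threshold that makes $p$ sum to one.

Popular choices include $\alpha = 2$ (sparsemax) and $\alpha = 1.5$ (typical in this work for a trade-off between sparsity and ease of computation).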
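As a minimal sketch of how the threshold form can be evaluated, the following NumPy function finds the threshold by bisection, assuming the standard parameterization in which the nonzero entries equal $[(\alpha-1) z_j - \tau]^{1/(\alpha-1)}$. The name `entmax_bisect` and the iteration count are illustrative choices and are not taken from Treviso et al.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector via bisection on the threshold tau."""
    assert alpha > 1.0, "this sketch covers the sparse regime alpha > 1 only"
    s = (alpha - 1.0) * np.asarray(z, dtype=float)
    exponent = 1.0 / (alpha - 1.0)
    # Total mass decreases in tau: it is >= 1 at s.max() - 1 and 0 at s.max().
    lo, hi = s.max() - 1.0, s.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.clip(s - tau, 0.0, None) ** exponent)
        if mass >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(s - lo, 0.0, None) ** exponent
    return p / p.sum()  # remove the residual bisection error

scores = [1.0, 0.8, 0.1, -1.0]
print(entmax_bisect(scores, alpha=1.5))
print(entmax_bisect(scores, alpha=2.0))
```

Entries whose scaled score falls below the threshold come out as exact zeros, which is the property Sparsefinder exploits.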

The "attention graph" for each head is

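$$\mathcal{G}_h = \{ (q_i, k_j) : p_{i,j} > 0 \}$$

where $p_{i,j}$ is the attention probability that query $q_i$ assigns to key $k_j$. This graph is typically much sparser than the dense $n \times m$ matrix. Due to the "sparse-consistency" property of entmax, if a mask $\hat{\mathcal{G}}_h \supseteq \mathcal{G}_h$ is predicted and the attention computation is restricted to this mask, the result is exactly equal on all truly nonzero entries.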
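The sparse-consistency property can be checked numerically. The sketch below uses sparsemax (entmax with $\alpha = 2$), for which an exact sort-based solution is known, and compares the dense result with the result restricted to a mask that covers the true support. The helper names `sparsemax` and `masked_sparsemax` and the toy scores are assumptions made for illustration.

```python
import numpy as np

def sparsemax(z):
    """Exact sparsemax (entmax with alpha = 2) of a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    k = ks[1.0 + ks * z_sorted > cumsum].max()  # size of the support
    tau = (cumsum[k - 1] - 1.0) / k
    return np.clip(z - tau, 0.0, None)

def masked_sparsemax(z, mask):
    """Sparsemax computed only over the positions where mask is True."""
    z = np.asarray(z, dtype=float)
    p = np.zeros(z.size)
    p[mask] = sparsemax(z[mask])
    return p

z = np.array([1.0, 0.8, 0.1, -1.0, 0.7])
dense = sparsemax(z)
# A predicted mask that contains the true support plus one spurious edge.
superset = (dense > 0) | np.array([False, False, True, False, False])
restricted = masked_sparsemax(z, superset)
print(dense, restricted)
```

The spurious edge receives zero probability, so over-predicting edges costs computation but not accuracy. A mask that misses a truly attended position, by contrast, changes the output.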

2. Sparsefinder Model Architecture and Variants

Sparsefinder operates in two key stages:

(A) Learning Low-Dimensional Projections:

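For every head $h$, learn linear projections $W_h^Q, W_h^K \in \mathbb{R}^{r\times d}$ (with $r \ll d$) and compute projected queries and keys $q' = W_h^Q q$, $k' = W_h^K k$. Training is by a contrastive hinge loss:

$$\mathcal{L} = \left[ \omega + \|q' - k'_P\|_2^2 - \|q' - k'_N\|_2^2 \right]_+$$

where $k'_P$ and $k'_N$ are the projections of a positive key (attended by $q$) and a negative key (not attended by $q$), $[\cdot]_+ = \max(0, \cdot)$, and $\omega$ is a margin. The result is a space in which true attended pairs are close.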
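A minimal NumPy sketch of such a contrastive hinge loss follows. It assumes triples of (query, positive key, negative key) and squared Euclidean distances in the projected space. The function name `hinge_loss`, the argument names, and the default margin of 1.0 are hypothetical and are not values from the source.

```python
import numpy as np

def hinge_loss(q, k_pos, k_neg, W_q, W_k, margin=1.0):
    """Mean contrastive hinge loss over a batch of triples.

    q, k_pos, k_neg: arrays of shape (batch, d)
    W_q, W_k:        projection matrices of shape (r, d), with r < d in practice
    """
    q_p = q @ W_q.T        # projected queries, shape (batch, r)
    kp_p = k_pos @ W_k.T   # projected positive keys
    kn_p = k_neg @ W_k.T   # projected negative keys
    d_pos = np.sum((q_p - kp_p) ** 2, axis=1)
    d_neg = np.sum((q_p - kn_p) ** 2, axis=1)
    # Zero loss once each negative is farther than its positive by the margin.
    return np.mean(np.maximum(0.0, margin + d_pos - d_neg))
```

In a training setup the projection matrices would be updated by a gradient-based optimizer through an autodiff framework. The sketch evaluates the objective only.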

(B) Edge Prediction via Grouping Strategies:

  • Distance-Based Sparsefinder:

Predict $(q_i, k_j)$ as an edge if $\|q'_i - k'_j\|_2 \le t$ for threshold $t$. All $n \times m$ pairwise distances are computed, trading off sparsity and recall by tuning $t$.

  • Quantization-Based Sparsefinder:

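Each of the $r$ projected dimensions is partitioned into $\beta$ quantile bins. Tokens sharing a bin in at least one dimension are assumed adjacent. No additional learning is needed beyond projection. Because each quantile bin holds roughly a $1/\beta$ fraction of the tokens, the number of candidate edges is at most about $r\,nm/\beta$.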

  • Clustering-Based Sparsefinder:

Offline $k$-means clustering is run on projected vectors, yielding a fixed set of centroids. Each projected query $q'_i$ and key $k'_j$ is assigned to its nearest centroids (one or more per token). Edge $(q_i, k_j)$ is predicted present if their assigned clusters intersect. The number of centroids assigned per token controls the sparsity/recall trade-off.
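The three grouping strategies above can be sketched in NumPy as follows, taking already-projected queries `q_proj` of shape (n, r) and keys `k_proj` of shape (m, r). All function and parameter names are hypothetical. Two details are assumptions of this sketch rather than facts from the source: bin edges are taken as quantiles of the pooled query and key values, and centroids are supplied as if already fitted offline.

```python
import numpy as np

def distance_mask(q_proj, k_proj, t):
    """True where the projected query and key are within Euclidean distance t."""
    diff = q_proj[:, None, :] - k_proj[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1)) <= t

def quantization_mask(q_proj, k_proj, n_bins):
    """True where query and key share a quantile bin in at least one dimension."""
    mask = np.zeros((q_proj.shape[0], k_proj.shape[0]), dtype=bool)
    levels = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior quantile levels
    for dim in range(q_proj.shape[1]):
        pooled = np.concatenate([q_proj[:, dim], k_proj[:, dim]])
        edges = np.quantile(pooled, levels)
        q_bin = np.searchsorted(edges, q_proj[:, dim])
        k_bin = np.searchsorted(edges, k_proj[:, dim])
        mask |= q_bin[:, None] == k_bin[None, :]
    return mask

def nearest_centroids(x_proj, centroids, n_assign):
    """Indices of the n_assign closest centroids for each row of x_proj."""
    d = np.sum((x_proj[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    return np.argsort(d, axis=1)[:, :n_assign]

def clustering_mask(q_proj, k_proj, centroids, n_assign=1):
    """True where query and key share at least one assigned cluster."""
    n_clusters = centroids.shape[0]
    q_member = np.zeros((q_proj.shape[0], n_clusters), dtype=bool)
    k_member = np.zeros((k_proj.shape[0], n_clusters), dtype=bool)
    np.put_along_axis(q_member, nearest_centroids(q_proj, centroids, n_assign), True, axis=1)
    np.put_along_axis(k_member, nearest_centroids(k_proj, centroids, n_assign), True, axis=1)
    return (q_member.astype(int) @ k_member.astype(int).T) > 0
```

The distance variant materializes all pairwise distances for clarity. The quantization and clustering variants build a dense boolean mask here only to keep the example short; an efficient implementation would enumerate candidate pairs bucket by bucket instead.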

Sparsefinder can be combined with fixed local patterns (e.g., sliding windows of a chosen width) and global tokens as used in architectures such as Longformer or BigBird.
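One simple way to combine a predicted mask with fixed patterns is to take the union of the masks. The sketch below assumes a centered sliding window and treats global tokens as key positions that every query may attend to. Both conventions, and the names `window_mask` and `add_fixed_patterns`, are illustrative assumptions.

```python
import numpy as np

def window_mask(n, m, width):
    """True where |i - j| <= width // 2, i.e. a centered sliding window."""
    i = np.arange(n)[:, None]
    j = np.arange(m)[None, :]
    return np.abs(i - j) <= width // 2

def add_fixed_patterns(pred_mask, width, global_idx=()):
    """Union of a predicted mask, a local window, and global key columns."""
    n, m = pred_mask.shape
    mask = pred_mask | window_mask(n, m, width)  # new array, input is not modified
    cols = np.asarray(list(global_idx), dtype=int)
    mask[:, cols] = True  # every query may attend to the global keys
    return mask
```

Because entmax assigns zero probability to unattended positions inside the mask, adding fixed edges can only raise recall, at the cost of extra score computations.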

3. Computational and Memory Complexity

The computational improvements arise from predicting the attention graph $\hat{\mathcal{G}}_h$ in advance so that the full score matrix $Z = QK^\top$ need not be formed.

Approach FLOPs (score computation) Memory (mask)
Dense entmaxZ=QKRn×mZ = QK^\top \in \mathbb{R}^{n\times m}7 Z=QKRn×mZ = QK^\top \in \mathbb{R}^{n\times m}8 Z=QKRn×mZ = QK^\top \in \mathbb{R}^{n\times m}9
Sparsefinder-projection QRn×dQ \in \mathbb{R}^{n\times d}0 QRn×dQ \in \mathbb{R}^{n\times d}1
Sparsefinder-mask (dist.) QRn×dQ \in \mathbb{R}^{n\times d}2 QRn×dQ \in \mathbb{R}^{n\times d}3
Sparsefinder-dot products QRn×dQ \in \mathbb{R}^{n\times d}4 (QRn×dQ \in \mathbb{R}^{n\times d}5) --

For quantization, bucket assignment cost is QRn×dQ \in \mathbb{R}^{n\times d}6 and candidate edges QRn×dQ \in \mathbb{R}^{n\times d}7. Clustering-based mask generation is QRn×dQ \in \mathbb{R}^{n\times d}8 and candidate edges scale as QRn×dQ \in \mathbb{R}^{n\times d}9 for balanced clusters with KRm×dK \in \mathbb{R}^{m\times d}0 assignments per token. If KRm×dK \in \mathbb{R}^{m\times d}1 and KRm×dK \in \mathbb{R}^{m\times d}2, substantial savings are obtained relative to the standard KRm×dK \in \mathbb{R}^{m\times d}3 complexity.

4. Empirical Evaluation in Machine Translation and Language Modeling

Sparsefinder was evaluated in both machine translation (MT) and masked language modeling (MLM) settings:

MT Experiments:

A "transformer-large" (6 layers, 16 heads, KRm×dK \in \mathbb{R}^{m\times d}4) model was pretrained on Paracrawl and fine-tuned with KRm×dK \in \mathbb{R}^{m\times d}5-entmax (KRm×dK \in \mathbb{R}^{m\times d}6) on IWSLT'17 ENKRm×dK \in \mathbb{R}^{m\times d}7DE and ENKRm×dK \in \mathbb{R}^{m\times d}8FR, with BPE vocab size 32k. Projections from KRm×dK \in \mathbb{R}^{m\times d}9 to α_\alpha0 dimensions were learned per head, using one epoch of margin-hinge loss on 10k long sentences (α_\alpha18--9M positive pairs/layer). Hyperparameters were swept over distance thresholds α_\alpha2, bucket counts α_\alpha3, and window sizes α_\alpha4.

  • Pareto frontiers of sparsity vs. recall showed that distance-based and clustering-based Sparsefinder variants strictly dominate baselines (Longformer, BigBird, Reformer, Routing Transformer, local window) for any given sparsity
  • For both EN→DE and EN→FR, distance and clustering recover nearly full BLEU at the sparsity of the dense entmax Transformer, e.g., BLEU 34.47 (EN→DE), 42.65 (EN→FR)

MLM Experiments:

RoBERTa-base (α_\alpha9, α_\alpha0) was fine-tuned on WikiText-103 for 3k steps with α_\alpha1 entmax (α_\alpha2, overall sparsity α_\alpha3). Projection from α_\alpha4 to α_\alpha5 was trained, and grouping parameters swept.

  • Distance and clustering Sparsefinder achieved strictly better recall or perplexity for all sparsity levels vs. Longformer, BigBird, Reformer, Routing Transformer, and local baselines

All experiments ran on NVIDIA Titan Xp/GTX 1080Ti/RTX 2080 Ti GPUs and AMD/Intel CPUs; projection learning used 128 positive/negative pairs, Adam LR=α_\alpha6, margin α_\alpha7, for one epoch.

5. Design Trade-Offs and Practical Considerations

The primary controlling parameters are the distance threshold $t$ (distance-based) and the number of cluster assignments per token (clustering-based), which directly trade sparsity for recall: lower values increase sparsity while reducing recall; higher values increase recall at the expense of more computation. Quantization-based grouping, while algorithmically simplest, underperforms: it cannot adapt to irregular neighborhoods in the projection space.
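The trade-off can be made concrete with a short sketch that measures sparsity and recall of a predicted mask against a gold attention graph while sweeping a distance threshold. Sparsity is taken here as the fraction of query-key pairs left out of the mask, and recall as the fraction of gold edges that the mask covers. The toy distances, the gold graph, and the threshold values are invented for illustration and are not results from the paper.

```python
import numpy as np

def sparsity_and_recall(pred_mask, gold_mask):
    """Sparsity of the predicted mask and its recall of the gold edges."""
    sparsity = 1.0 - pred_mask.mean()
    recall = (pred_mask & gold_mask).sum() / max(gold_mask.sum(), 1)
    return sparsity, recall

# Toy projected distances for 2 queries and 3 keys, with a hand-written gold graph.
dist = np.array([[0.2, 0.9, 2.5],
                 [1.4, 0.3, 0.8]])
gold = np.array([[True, False, False],
                 [False, True, True]])

for t in (0.5, 1.0, 1.5):
    sparsity, recall = sparsity_and_recall(dist <= t, gold)
    print(f"t={t}: sparsity={sparsity:.2f}, recall={recall:.2f}")
```

Raising the threshold can only add edges, so recall never decreases and sparsity never increases along the sweep.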

Clustering achieves performance near that of distance-based grouping but at lower computational cost, since pairwise distances are not required. For balanced clusters, entmaxα(z)=argmaxpΔm(pzHα(p))\text{entmax}_\alpha(z) = \arg\max_{p\in\Delta^m} \left( p^\top z - H_\alpha(p) \right)0 and entmaxα(z)=argmaxpΔm(pzHα(p))\text{entmax}_\alpha(z) = \arg\max_{p\in\Delta^m} \left( p^\top z - H_\alpha(p) \right)1 provide entmaxα(z)=argmaxpΔm(pzHα(p))\text{entmax}_\alpha(z) = \arg\max_{p\in\Delta^m} \left( p^\top z - H_\alpha(p) \right)2 mask generation and high recall. Distance-based grouping remains an upper bound on the attainable trade-off, and is computationally efficient when entmaxα(z)=argmaxpΔm(pzHα(p))\text{entmax}_\alpha(z) = \arg\max_{p\in\Delta^m} \left( p^\top z - H_\alpha(p) \right)3.

In both MT and MLM tasks, Sparsefinder’s distance and clustering variants define new Pareto fronts in sparsity–recall and task performance, outperforming established fixed and learned pattern baselines. The evidence indicates that efficient prediction of content-driven sparse patterns is not only feasible but more accurate than prior fixed or dynamically-parameterized sparse frameworks.

A plausible implication is that future sparse attention benchmarks should report sparsity–recall curves, and that further research on hardware-sensitive implementations of these predicted mask patterns is warranted (Treviso et al., 2021).

6. Implications and Position within Sparse Attention Research

Sparsefinder proposes a different route to attention efficiency: instead of imposing fixed or random sparsity patterns (as in Longformer and BigBird) or grouping tokens by hashing or online clustering (as in Reformer and Routing Transformer), it learns to predict the specific nonzeros of entmax attention from low-dimensional projections and fast grouping. The method retains the content sensitivity of the original model and, whenever the predicted mask covers the true attention graph, recovers all truly attended positions exactly.

This approach advances the field beyond both prior fixed-sparsity (e.g., local windows, global tokens) and recent learned-sparsity methods (e.g., Routing Transformer), presenting a content-driven, highly efficient path for sparse attention computation. The separation of learning projections (one-time, lightweight) and inference-time mask prediction (highly parallelizable, hardware-friendly) positions Sparsefinder as an efficient component for large-scale transformer inference and training when entmax or sparsemax attention is used.

The analysis and empirical results of Treviso et al. suggest that further research should both incorporate such mask prediction benchmarks and explore hardware/software co-designs to optimally leverage Sparsefinder-style masking in high-throughput transformer systems (Treviso et al., 2021).
