Sparsefinder: Efficient Sparse Entmax
- Sparsefinder is a method that predicts the exact sparsity pattern of entmax-based attention, reducing dense matrix computation by focusing on nonzero entries.
- It uses low-dimensional projections with contrastive hinge loss and grouping strategies (distance, quantization, clustering) to efficiently approximate attention edges.
- Empirical evaluations in machine translation and language modeling show that Sparsefinder achieves near full recall with significant reductions in FLOPs and memory usage.
Sparsefinder is a method for efficiently predicting the exact sparsity pattern of entmax-based attention in transformer models, so that only the attention edges that will receive nonzero probability are computed, with the goal of significant reductions in both computation and memory. Originally introduced by Treviso et al. (2021), Sparsefinder addresses the challenge that, while entmax produces sparse attention distributions, it still requires forming and normalizing a dense attention matrix with $\mathcal{O}(nm)$ complexity, where $n$ and $m$ are the sequence lengths. Sparsefinder predicts the sparse attention graph before the actual attention computation, which focuses computation on attended positions and improves efficiency without sacrificing the accuracy or recall of the attention mechanism (Treviso et al., 2021).
1. Mathematical Foundations of Entmax Attention
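Sparsefinder builds directly on entmax attention mechanisms. Standard Transformer attention computes scores $Z = QK^\top/\sqrt{d}$, with queries $Q \in \mathbb{R}^{n \times d}$ and keys $K \in \mathbb{R}^{m \times d}$, then applies softmax normalization to each row, giving dense distributions. Entmax replaces softmax with a Tsallis-type entropy regularizer. The $\alpha$-entmax function is defined by
$$\alpha\text{-entmax}(\mathbf{z}) = \arg\max_{\mathbf{p} \in \triangle} \; \mathbf{p}^\top \mathbf{z} + H_\alpha(\mathbf{p}),$$
where $\triangle$ is the probability simplex, $H_\alpha(\mathbf{p}) = -\sum_j p_j \log p_j$ for $\alpha = 1$ (recovering softmax), and for $\alpha \neq 1$,
$$H_\alpha(\mathbf{p}) = \frac{1}{\alpha(\alpha-1)} \sum_j \left(p_j - p_j^\alpha\right).$$
For $\alpha > 1$, the attention distribution is sparse: coordinates with $(\alpha-1) z_j \le \tau$ are exactly zero, and nonzero values have the closed form
$$p_j = \left[(\alpha-1) z_j - \tau\right]_+^{1/(\alpha-1)},$$
where $\tau$ is the threshold for which $\mathbf{p}$ sums to one.
Popular choices include $\alpha = 2$ (sparsemax) and $\alpha = 1.5$ (typical in this work for a trade-off between sparsity and ease of computation).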
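The closed form above can be evaluated numerically by searching for the threshold $\tau$. The following is a minimal NumPy sketch using bisection, not the implementation used by Treviso et al.; the function name `entmax_bisect` and the iteration count are assumptions made for illustration.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax of a 1-D score vector (alpha > 1) via bisection on tau.

    Hypothetical helper for illustration only.
    """
    z = np.asarray(z, dtype=float) * (alpha - 1.0)
    # At tau = max(z) - 1 the largest entry alone gives mass 1 (sum >= 1);
    # at tau = max(z) - (1/n)^(alpha-1) every entry is <= 1/n (sum <= 1).
    lo = z.max() - 1.0
    hi = z.max() - (1.0 / z.size) ** (alpha - 1.0)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # renormalize away the remaining bisection error

p15 = entmax_bisect([1.0, 0.5, -2.0], alpha=1.5)
p20 = entmax_bisect([2.0, 1.0, -1.0], alpha=2.0)
```

In both calls the lowest-scoring coordinate falls below the threshold and receives exactly zero probability, which is the property Sparsefinder exploits.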
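The "attention graph" for each head $h$ is
$$\mathcal{G}_h = \{(\mathbf{q}_i, \mathbf{k}_j) \mid p_{i,j} > 0\},$$
which is typically much sparser than the dense $n \times m$ matrix. Due to the "sparse-consistency" property of entmax, if a predicted mask $\hat{\mathcal{G}}_h$ contains every edge of $\mathcal{G}_h$ and the attention computation is restricted to this mask, the result is exactly equal to the dense result. Edges missing from the mask are the only source of error.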
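The sparse-consistency property can be checked on a toy example. The sketch below uses the sort-based sparsemax ($\alpha = 2$) because it is exact; the helper names `sparsemax` and `masked_sparsemax` and the score vector are illustrative assumptions, not part of the original method's code.

```python
import numpy as np

def sparsemax(z):
    """Exact sparsemax (alpha = 2) of a 1-D vector via sorting."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    in_support = 1.0 + k * z_sorted > cumsum   # True for a prefix of sorted entries
    k_max = k[in_support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max
    return np.clip(z - tau, 0.0, None)

def masked_sparsemax(z, mask):
    """Sparsemax restricted to the positions where mask is True."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    out[mask] = sparsemax(z[mask])
    return out

z = np.array([1.2, 0.9, -0.5, 0.1, -2.0])
p_dense = sparsemax(z)
true_support = p_dense > 0

# A mask that covers the true support plus one extra position.
superset = true_support.copy()
superset[3] = True
p_superset = masked_sparsemax(z, superset)

# A mask that misses one truly attended position.
subset = true_support.copy()
subset[1] = False
p_subset = masked_sparsemax(z, subset)
```

Here `p_superset` matches `p_dense`, while `p_subset` does not, which is why recall of the predicted graph is the quantity that matters.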
2. Sparsefinder Model Architecture and Variants
Sparsefinder operates in two key stages:
(A) Learning Low-Dimensional Projections:
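For every head $h$, learn linear projections $W_h^Q, W_h^K \in \mathbb{R}^{r \times d}$ (with $r \ll d$) and compute projected queries and keys $\mathbf{q}' = W_h^Q \mathbf{q}$, $\mathbf{k}' = W_h^K \mathbf{k}$. Training uses a contrastive hinge loss over a query $\mathbf{q}$, a positive (attended) key $\mathbf{k}_P$, and a negative (unattended) key $\mathbf{k}_N$:
$$\mathcal{L}(\mathbf{q}, \mathbf{k}_P, \mathbf{k}_N) = \left[\omega + \|\mathbf{q}' - \mathbf{k}'_P\|_2^2 - \|\mathbf{q}' - \mathbf{k}'_N\|_2^2\right]_+$$
where $\omega$ is a margin. The result is a space in which true attended pairs are close.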
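A minimal NumPy sketch of this loss for a batch of (query, positive key, negative key) triples follows. The function name, the argument layout, and the default margin of 1.0 are assumptions for illustration; the source does not give a recoverable margin value here.

```python
import numpy as np

def contrastive_hinge_loss(q, k_pos, k_neg, W_q, W_k, margin=1.0):
    """Mean hinge loss over a batch of triples.

    q, k_pos, k_neg: arrays of shape (batch, d).
    W_q, W_k: projection matrices of shape (r, d).
    margin: hypothetical default, chosen only for this example.
    """
    q_proj = q @ W_q.T          # (batch, r)
    kp_proj = k_pos @ W_k.T     # (batch, r)
    kn_proj = k_neg @ W_k.T     # (batch, r)
    d_pos = np.sum((q_proj - kp_proj) ** 2, axis=-1)
    d_neg = np.sum((q_proj - kn_proj) ** 2, axis=-1)
    # Penalize triples where the positive is not closer than the negative by the margin.
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

W = np.eye(2)                                   # identity projection for the toy check
q = np.array([[0.0, 0.0]])
near = np.array([[0.0, 0.0]])
far = np.array([[3.0, 0.0]])
loss_good = contrastive_hinge_loss(q, near, far, W, W)   # positive is already close
loss_bad = contrastive_hinge_loss(q, far, near, W, W)    # positive is far away
```

In practice the projections would be trained by gradient descent on this loss with an automatic differentiation library; the sketch only shows the forward computation.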
(B) Edge Prediction via Grouping Strategies:
- Distance-Based Sparsefinder:
Predict $(\mathbf{q}_i, \mathbf{k}_j)$ as an edge if $\|\mathbf{q}'_i - \mathbf{k}'_j\|_2 \le t$ for a threshold $t$. All $n \times m$ distances are computed, and sparsity is traded against recall by tuning $t$.
- Quantization-Based Sparsefinder:
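Each of the $r$ projected dimensions is partitioned into $\beta$ quantile bins. Tokens sharing a bin in at least one dimension are assumed adjacent. No additional learning is needed beyond the projection. With balanced bins, each dimension contributes on the order of $n^2/\beta$ candidate edges, so the total is at most about $r n^2/\beta$.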
- Clustering-Based Sparsefinder:
Offline $k$-means clustering is run on the projected vectors, yielding $B$ centroids. Each $\mathbf{q}'_i$, $\mathbf{k}'_j$ is assigned to its $c$ nearest centroids. Edge $(\mathbf{q}_i, \mathbf{k}_j)$ is predicted present if the assigned clusters intersect. The parameter $c$ controls sparsity/recall.
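The distance-based and quantization-based strategies above can be sketched in a few lines of NumPy. The function names are hypothetical, and computing the bin edges from the pooled projected queries and keys is an assumption of this sketch rather than a detail confirmed by the source.

```python
import numpy as np

def distance_mask(q_proj, k_proj, t):
    """Edge (i, j) is predicted when ||q'_i - k'_j||_2 <= t."""
    diff = q_proj[:, None, :] - k_proj[None, :, :]     # (n, m, r)
    return np.linalg.norm(diff, axis=-1) <= t

def quantization_mask(q_proj, k_proj, n_bins):
    """Edge (i, j) is predicted when q'_i and k'_j share a bin in any dimension."""
    pooled = np.concatenate([q_proj, k_proj], axis=0)
    # Interior quantile levels, e.g. [0.5] for two bins (assumed binning scheme).
    levels = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.quantile(pooled, levels, axis=0)        # (n_bins - 1, r)
    mask = np.zeros((len(q_proj), len(k_proj)), dtype=bool)
    for dim in range(q_proj.shape[1]):
        q_bin = np.digitize(q_proj[:, dim], edges[:, dim])
        k_bin = np.digitize(k_proj[:, dim], edges[:, dim])
        mask |= q_bin[:, None] == k_bin[None, :]
    return mask

q_proj = np.array([[0.0, 0.0], [10.0, 10.0]])
k_proj = np.array([[0.1, 0.1], [9.9, 9.9]])
m_dist = distance_mask(q_proj, k_proj, t=1.0)
m_quant = quantization_mask(q_proj, k_proj, n_bins=2)
```

On this toy input both strategies keep only the two near pairs. The distance variant materializes all pairwise distances in the low-dimensional space, whereas the quantization variant only compares bin indices.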
Sparsefinder can be combined with fixed local patterns (e.g., sliding windows of width $w$) and global tokens as used in architectures such as Longformer or BigBird.
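As an illustration of the clustering-based strategy and its combination with a local window, the sketch below assumes the centroids have already been learned offline and are passed in. The helper names, the `c` argument, and the window convention $|i - j| \le \lfloor w/2 \rfloor$ are assumptions of this example.

```python
import numpy as np

def top_c_clusters(x, centroids, c):
    """Boolean membership matrix (tokens x clusters) for the c nearest centroids."""
    dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
    nearest = np.argsort(dist, axis=1)[:, :c]
    member = np.zeros(dist.shape, dtype=bool)
    np.put_along_axis(member, nearest, True, axis=1)
    return member

def cluster_mask(q_proj, k_proj, centroids, c=1):
    """Edge (i, j) is predicted when query i and key j share at least one cluster."""
    mq = top_c_clusters(q_proj, centroids, c).astype(int)
    mk = top_c_clusters(k_proj, centroids, c).astype(int)
    return (mq @ mk.T) > 0

def window_mask(n, m, w):
    """Local band: positions within floor(w / 2) of the diagonal (assumed convention)."""
    i = np.arange(n)[:, None]
    j = np.arange(m)[None, :]
    return np.abs(i - j) <= w // 2

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])   # stand-in for offline k-means output
q_proj = np.array([[0.5, 0.0], [9.0, 10.0]])
k_proj = np.array([[0.0, 1.0], [10.0, 9.0], [1.0, 1.0]])

m_c1 = cluster_mask(q_proj, k_proj, centroids, c=1)
m_c2 = cluster_mask(q_proj, k_proj, centroids, c=2)
combined = m_c1 | window_mask(2, 3, w=1)           # union with a fixed local pattern
```

Increasing `c` can only add edges, so it raises recall at the cost of sparsity; the union with a window mask likewise only adds edges.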
3. Computational and Memory Complexity
The computational improvements arise from predicting $\hat{\mathcal{G}}_h$ so that the full score matrix $QK^\top$ need not be formed. The costs below take $m = n$, as in self-attention.
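| Approach | FLOPs (score computation) | Memory (mask) |
|---|---|---|
| Dense entmax | $\mathcal{O}(n^2 d)$ | $\mathcal{O}(n^2)$ |
| Sparsefinder-projection | $\mathcal{O}(n d r)$ | $\mathcal{O}(n r)$ |
| Sparsefinder-mask (dist.) | $\mathcal{O}(n^2 r)$ | $\mathcal{O}(\lvert\hat{\mathcal{G}}_h\rvert)$ |
| Sparsefinder-dot products | $\mathcal{O}(\lvert\hat{\mathcal{G}}_h\rvert\, d)$ ($\lvert\hat{\mathcal{G}}_h\rvert \ll n^2$) | -- |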
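For quantization with $\beta$ bins per dimension, bucket assignment takes one bin lookup per token and projected dimension, roughly $\mathcal{O}(n r)$, and candidate edges number at most about $r n^2/\beta$. Clustering-based mask generation costs $\mathcal{O}(n B r)$ for assigning $n$ tokens to $B$ centroids, and candidate edges scale as $\mathcal{O}(c^2 n^2 / B)$ for balanced clusters with $c$ assignments per token. If $r \ll d$ and $\lvert\hat{\mathcal{G}}_h\rvert \ll n^2$, substantial savings are obtained relative to the standard $\mathcal{O}(n^2)$ complexity.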
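To make the comparison concrete, the following sketch counts multiply-accumulate operations for the distance-based pipeline against dense score computation. The function name and all numeric values (`n = 1024`, `d = 64`, `r = 4`, and a predicted graph keeping 5% of the pairs) are illustrative assumptions, not figures reported in the source, and constant factors are ignored.

```python
def attention_costs(n, d, r, n_edges):
    """Rough multiply-accumulate counts for one head (hypothetical helper)."""
    dense = n * n * d               # all query-key dot products in d dimensions
    projection = 2 * n * d * r      # project n queries and n keys from d to r
    pairwise = n * n * r            # all pairwise distances in r dimensions
    sparse_dots = n_edges * d       # dot products only on predicted edges
    return {
        "dense": dense,
        "projection": projection,
        "pairwise": pairwise,
        "sparse_dots": sparse_dots,
        "sparsefinder_total": projection + pairwise + sparse_dots,
    }

# Illustrative values only (assumptions, not taken from the paper).
n, d, r = 1024, 64, 4
n_edges = int(0.05 * n * n)
costs = attention_costs(n, d, r, n_edges)
ratio = costs["sparsefinder_total"] / costs["dense"]
```

Under these assumed values the pairwise-distance term dominates the Sparsefinder total, which is why the saving depends on $r$ being much smaller than $d$.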
4. Empirical Evaluation in Machine Translation and Language Modeling
Sparsefinder was evaluated in both machine translation (MT) and masked language modeling (MLM) settings:
MT Experiments:
A "transformer-large" model (6 layers, 16 heads) was pretrained on Paracrawl and fine-tuned with $\alpha$-entmax ($\alpha = 1.5$) on IWSLT'17 EN→DE and EN→FR, with a BPE vocabulary of 32k. Projections from the per-head dimension $d$ to $r$ dimensions were learned per head, using one epoch of margin-hinge loss on 10k long sentences (approximately 8--9M positive pairs per layer). Hyperparameters were swept over distance thresholds, bucket counts, and window sizes.
- Pareto frontiers of sparsity vs. recall showed that distance-based and clustering-based Sparsefinder variants strictly dominate baselines (Longformer, BigBird, Reformer, Routing Transformer, local window) for any given sparsity.
- For both EN→DE and EN→FR, distance and clustering recover nearly full BLEU at the sparsity of the dense entmax Transformer, e.g., BLEU 34.47 (EN→DE), 42.65 (EN→FR).
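The two axes of these Pareto frontiers can be computed from a predicted mask and the gold entmax graph. The sketch below assumes sparsity is the fraction of query-key pairs that are pruned and recall is the fraction of gold edges that the prediction keeps; the function name is hypothetical.

```python
import numpy as np

def sparsity_and_recall(pred_mask, gold_mask):
    """Return (sparsity, recall) for boolean masks of shape (n, m).

    Definitions assumed for this sketch:
      sparsity = 1 - |predicted edges| / (n * m)
      recall   = |predicted AND gold| / |gold|
    """
    sparsity = 1.0 - pred_mask.mean()
    recall = (pred_mask & gold_mask).sum() / max(gold_mask.sum(), 1)
    return float(sparsity), float(recall)

gold = np.array([[True, False], [False, True]])
pred = np.array([[True, True], [False, False]])
sparsity, recall = sparsity_and_recall(pred, gold)
```

A method dominates another on this frontier when it reaches higher recall at the same sparsity, or the same recall at higher sparsity.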
MLM Experiments:
RoBERTa-base (12 layers, 12 heads) was fine-tuned on WikiText-103 for 3k steps with $\alpha$-entmax ($\alpha = 1.5$). A projection from the per-head dimension $d$ to $r$ dimensions was trained, and the grouping parameters were swept.
- Distance and clustering Sparsefinder achieved strictly better recall or perplexity for all sparsity levels vs. Longformer, BigBird, Reformer, Routing Transformer, and local baselines.
All experiments ran on NVIDIA Titan Xp/GTX 1080Ti/RTX 2080 Ti GPUs and AMD/Intel CPUs; projection learning used 128 positive/negative pairs, the Adam optimizer, and a margin-based hinge loss, for one epoch.
5. Design Trade-Offs and Practical Considerations
The primary controlling parameters are the distance threshold $t$ (distance-based) and the number of cluster assignments per token $c$ (clustering-based), which directly trade sparsity for recall: lower values increase sparsity while reducing recall; higher values increase recall at the expense of more computation. Quantization-based grouping, while algorithmically simplest, underperforms because it cannot adapt to irregular neighborhoods in the projection space.
Clustering achieves performance near that of distance-based grouping but at lower computational cost, since pairwise distances are not required. For balanced clusters, $B \approx \sqrt{n}$ centroids and a small number of assignments $c$ per token provide $\mathcal{O}(n\sqrt{n})$ mask generation and high recall. Distance-based grouping remains an upper bound on the attainable trade-off, and is computationally efficient when $r \ll d$.
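The trade-off controlled by the distance threshold can be traced with a simple sweep. The sketch below uses random projected vectors and a random stand-in for the gold graph, so the numbers it produces are meaningless in themselves; it only illustrates that the predicted masks are nested as the threshold grows. All names and values are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
q_proj = rng.normal(size=(32, 4))     # synthetic projected queries (assumption)
k_proj = rng.normal(size=(32, 4))     # synthetic projected keys (assumption)
gold = rng.random((32, 32)) < 0.1     # random stand-in for the entmax graph

dist = np.linalg.norm(q_proj[:, None, :] - k_proj[None, :, :], axis=-1)

curve = []
for t in (0.5, 1.0, 2.0, 4.0, 8.0):
    pred = dist <= t
    sparsity = 1.0 - pred.mean()
    recall = (pred & gold).sum() / max(gold.sum(), 1)
    curve.append((t, float(sparsity), float(recall)))

sparsities = [s for _, s, _ in curve]
recalls = [r for _, _, r in curve]
```

Because every edge kept at a smaller threshold is also kept at a larger one, sparsity can only fall and recall can only rise along the sweep; the same nesting holds for the cluster-assignment count in the clustering variant.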
In both MT and MLM tasks, Sparsefinder’s distance and clustering variants define new Pareto fronts in sparsity–recall and task performance, outperforming established fixed and learned pattern baselines. The evidence indicates that efficient prediction of content-driven sparse patterns is not only feasible but more accurate than prior fixed or dynamically-parameterized sparse frameworks.
A plausible implication is that future sparse attention benchmarks would benefit from reporting sparsity–recall curves, and that hardware-sensitive implementations of these predicted mask patterns merit further research (Treviso et al., 2021).
6. Implications and Position within Sparse Attention Research
Sparsefinder proposes a new paradigm for attention efficiency: instead of imposing fixed or random sparsity patterns (as in Longformer and BigBird) or bucketing tokens by hashing or clustering without reference to the attention output (as in Reformer and Routing Transformer), it learns to predict the specific nonzeros of entmax attention from low-dimensional projections and fast grouping. The method retains the content sensitivity of the original model and, whenever the predicted mask covers the true attention graph, recovers all truly attended positions exactly.
This approach advances the field beyond both prior fixed-sparsity (e.g., local windows, global tokens) and recent learned-sparsity methods (e.g., Routing Transformer), presenting a content-driven, highly efficient path for sparse attention computation. The separation of learning projections (one-time, lightweight) and inference-time mask prediction (highly parallelizable, hardware-friendly) positions Sparsefinder as an efficient component for large-scale transformer inference and training when entmax or sparsemax attention is used.
The analysis and empirical results of Treviso et al. suggest that further research should both incorporate such mask prediction benchmarks and explore hardware/software co-designs to optimally leverage Sparsefinder-style masking in high-throughput transformer systems (Treviso et al., 2021).