LOTFormer: Doubly-Stochastic Transformer Attention
- LOTFormer is a linear-time, doubly-stochastic attention mechanism that leverages a low-rank optimal transport formulation with learnable pivots for scalability and robustness.
- It employs entropic optimal transport solved via the Sinkhorn algorithm to ensure both row and column normalization while reducing computational complexity to O(nr).
- Empirical results demonstrate that LOTFormer outperforms standard softmax attention in long-range tasks, vision benchmarks, and machine translation, highlighting its practical efficiency.
LOTFormer is a linear-time, doubly-stochastic attention mechanism for Transformers that leverages a low-rank entropic optimal transport (OT) construction via a learnable pivot (reference) measure. LOTFormer achieves $O(nr)$ complexity per head, where $n$ is the sequence length and $r$ the pivot size, while provably enforcing both row and column normalization (doubly-stochasticity) of the attention matrix. This doubly-stochastic property, combined with linear scaling, enables robust and efficient modeling of long-range dependencies and scalability to large input contexts across modalities (Shahbazi et al., 27 Sep 2025).
1. Motivation and Context
The quadratic time and space complexity of standard Transformer attention ($O(n^2)$ for $n$ tokens) is a principal constraint when modeling long texts, high-resolution images, or long-range dependencies in audio. Linear attention methods (e.g., Linear Transformer, Performer, Nyströmformer) address this by approximating the softmax kernel with feature maps for $O(n)$ computation, but they produce attention maps that are only row-stochastic. This row-only normalization frequently leads to over-concentration (“over-focusing”) on a few tokens and can degrade information flow and robustness.
Doubly-stochastic attention matrices, defined by non-negativity and the property that each row and column sums to one, distribute attention more evenly and empirically improve smoothness, interpretability, and robustness. Existing doubly-stochastic methods, typically based on OT, suffer from prohibitive overhead and do not scale to long contexts.
2. Attention and Optimal Transport
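LOTFormer interprets attention matrices as transportation plans between the empirical query and key distributions. With queries $\{q_i\}_{i=1}^{n}$ and keys $\{k_j\}_{j=1}^{n}$, the attention problem is cast as entropy-regularized OT:
$$\Gamma^{\star} = \arg\min_{\Gamma \ge 0} \; \langle C, \Gamma \rangle - \varepsilon H(\Gamma)$$
subject to $\Gamma \mathbf{1} = \mu$ and $\Gamma^{\top} \mathbf{1} = \nu$, where $\mu$ and $\nu$ are the query and key marginals, $C_{ij}$ is the transport cost between $q_i$ and $k_j$, and $H(\Gamma) = -\sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$ is the entropy of the plan. The optimal $\Gamma^{\star}$ is always doubly-stochastic and trades off similarity alignment and entropy. As $\varepsilon \to 0$, $\Gamma^{\star}$ becomes sparse; large $\varepsilon$ yields diffuse, high-entropy plans.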
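The entropic OT view above can be illustrated with a minimal NumPy sketch of Sinkhorn scaling. This is not the authors' implementation: the squared Euclidean cost, the cost rescaling into $[0, 1]$, the uniform marginals, and all names (`sinkhorn`, `entropy`, `plan_sharp`, `plan_diffuse`) are assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(C, a, b, eps, n_iters=1000):
    """Entropic OT plan between marginals a (rows) and b (columns)."""
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                # rescale rows toward a
        v = b / (K.T @ u)              # rescale columns toward b
    return u[:, None] * K * v[None, :]

def entropy(P):
    """Shannon entropy of a plan with strictly positive entries."""
    return float(-(P * np.log(P)).sum())

rng = np.random.default_rng(0)
n, d = 6, 4
Q = rng.normal(size=(n, d))            # toy queries
Kx = rng.normal(size=(n, d))           # toy keys

# Assumed cost: squared Euclidean distance, rescaled into [0, 1].
C = ((Q[:, None, :] - Kx[None, :, :]) ** 2).sum(-1)
C = C / C.max()

mu = np.full(n, 1.0 / n)               # uniform query marginal (assumption)
nu = np.full(n, 1.0 / n)               # uniform key marginal (assumption)

plan_sharp = sinkhorn(C, mu, nu, eps=0.2)
plan_diffuse = sinkhorn(C, mu, nu, eps=5.0)

# With uniform marginals 1/n, rescaling by n gives unit row and column sums.
attn = n * plan_sharp
print(attn.sum(axis=1), attn.sum(axis=0))
print(entropy(plan_sharp), entropy(plan_diffuse))
```

In this toy setting the larger regularization strength produces the higher-entropy plan, matching the sparse-versus-diffuse behavior described above. The dense $n \times n$ plan here still costs $O(n^2)$; it only illustrates the formulation.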
3. Low-Rank Pivot Construction
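To combine scalability with doubly-stochasticity, LOTFormer introduces a small, learnable “pivot” measure parameterized by pivot locations $\{z_k\}_{k=1}^{r}$ and pivot masses $p = (p_1, \dots, p_r)$. The construction involves two entropic OT subproblems:
- Queries → Pivot: solving OT between queries and pivots, giving a plan $\Gamma^{(1)} \in \mathbb{R}^{n \times r}$.
- Pivot → Keys: solving OT between pivots and keys, giving a plan $\Gamma^{(2)} \in \mathbb{R}^{r \times n}$.
Each subproblem has $O(nr)$ cost, as the matrices are $n \times r$ or $r \times n$. The final attention matrix is the “glued” coupling:
$$\Gamma = \Gamma^{(1)} \operatorname{diag}(p)^{-1} \Gamma^{(2)}$$
which is provably doubly-stochastic and has rank at most $r$.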
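The glued coupling can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's code: pivot locations are random rather than learned, the pivot masses `p` are fixed by hand, the cost is a rescaled squared Euclidean distance, and the marginals are uniform. The helper names (`sinkhorn`, `cost`, `G1`, `G2`) are hypothetical.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.2, n_iters=1000):
    """Entropic OT plan between marginals a (rows) and b (columns)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def cost(X, Y):
    """Assumed cost: squared Euclidean distance, rescaled into [0, 1]."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return C / C.max()

rng = np.random.default_rng(0)
n, r, d = 8, 3, 4
Q = rng.normal(size=(n, d))            # toy queries
Kx = rng.normal(size=(n, d))           # toy keys
Z = rng.normal(size=(r, d))            # pivot locations (random here, not learned)
p = np.array([0.5, 0.3, 0.2])          # pivot masses on the simplex (hand-picked)
mu = np.full(n, 1.0 / n)
nu = np.full(n, 1.0 / n)

G1 = sinkhorn(cost(Q, Z), mu, p)       # queries -> pivots, shape (n, r)
G2 = sinkhorn(cost(Z, Kx), p, nu)      # pivots -> keys,   shape (r, n)

# Glue the two plans through the shared pivot marginal, then rescale by n
# so that rows and columns sum to one.
A = n * (G1 / p) @ G2

print(A.sum(axis=1))                   # row sums
print(A.sum(axis=0))                   # column sums
print(np.linalg.matrix_rank(A))        # at most r
```

The marginals work out because the column sums of `G1` and the row sums of `G2` both equal `p`, so dividing by `p` in the middle cancels the pivot mass exactly once.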
4. Computational Implementation
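The attention operation is computed without explicitly forming the $n \times n$ matrix $\Gamma$. Writing $\Gamma^{(1)}$ and $\Gamma^{(2)}$ for the query-to-pivot and pivot-to-key plans and $p$ for the pivot masses, the product with the values $V$ is evaluated right to left:
$$\Gamma V = \Gamma^{(1)} \left( \operatorname{diag}(p)^{-1} \left( \Gamma^{(2)} V \right) \right)$$
The Sinkhorn algorithm, a differentiable fixed-point procedure, solves the entropic OT problems for $\Gamma^{(1)}$ and $\Gamma^{(2)}$ with $O(nr)$ complexity per iteration. The entropic parameter $\varepsilon$ interpolates between sharp (deterministic) and diffuse (smooth) attention regimes. Pseudocode is provided for Sinkhorn normalization, with attention to numerical stability and convergence characteristics.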
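A minimal sketch of this computation follows, using a log-domain Sinkhorn update as one common way to keep small regularization values numerically stable. It is an assumption-laden illustration rather than the published pseudocode: the function names (`sinkhorn_log`, `lot_plans`, `lot_attention`), the rescaled squared Euclidean cost, the uniform token marginals, and the final rescaling by $n$ are all choices made here.

```python
import numpy as np

def logsumexp(M, axis):
    """Numerically stable log(sum(exp(M))) along one axis."""
    m = M.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(M - m).sum(axis=axis))

def sinkhorn_log(C, a, b, eps, n_iters=200):
    """Log-domain Sinkhorn: iterate dual potentials f, g instead of scalings."""
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)

def cost(X, Y):
    """Assumed cost: squared Euclidean distance, rescaled into [0, 1]."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return C / C.max()

def lot_plans(Q, K, Z, p, eps):
    """Solve the two small OT problems: (n, r) and (r, m) plans."""
    mu = np.full(Q.shape[0], 1.0 / Q.shape[0])
    nu = np.full(K.shape[0], 1.0 / K.shape[0])
    G1 = sinkhorn_log(cost(Q, Z), mu, p, eps)
    G2 = sinkhorn_log(cost(Z, K), p, nu, eps)
    return G1, G2

def lot_attention(Q, K, V, Z, p, eps=0.05):
    """Apply the glued plan to V without forming the n x n matrix."""
    G1, G2 = lot_plans(Q, K, Z, p, eps)
    pivot_summary = (G2 @ V) / p[:, None]      # (r, d_v): values pooled per pivot
    return Q.shape[0] * (G1 @ pivot_summary)   # (n, d_v): pooled values sent to queries

rng = np.random.default_rng(0)
n, r, d, d_v = 16, 4, 8, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))
Z = rng.normal(size=(r, d))                    # pivot locations (random stand-in)
p = np.full(r, 1.0 / r)                        # pivot masses (uniform stand-in)

out = lot_attention(Q, K, V, Z, p)

# Reference: build the dense glued matrix explicitly (quadratic, for checking only).
G1, G2 = lot_plans(Q, K, Z, p, eps=0.05)
dense_out = (n * (G1 / p) @ G2) @ V
print(np.abs(out - dense_out).max())
```

Both matrix products in `lot_attention` touch only $n \times r$ or $r \times n$ plans, so the work per head grows linearly in $n$ for fixed $r$. The dense reference is included only to confirm that reordering the products does not change the result.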
5. End-to-End Learning and Integration
All LOTFormer parameters, including the pivot locations $\{z_k\}$, the pivot masses $p$, and the standard projection matrices $W_Q$, $W_K$, $W_V$, are learned via standard backpropagation through the Sinkhorn steps. In Transformer architectures, each attention head replaces softmax attention with the LOTAttn module. For vision tasks, a depth-wise convolution (DWC) on values can inject local inductive bias; for [CLS] tokens in ViT, a separate softmax aggregator is used to guarantee global pooling, with all other rows remaining doubly-stochastic.
6. Empirical Results and Performance
LOTFormer achieves strong empirical results across benchmarks:
6.1 Long Range Arena (LRA)
- Backbone: 6-layer Transformer, 8 heads, head dim 32.
- Compared to Softmax, Performer, Nyströmformer, BigBird, PolaFormer, ESPFormer.
- On LRA, LOTFormer with DWC attains the highest average score.
| Model | Text | ListOps | Retrieval | Pathfinder | Image | Avg. |
|---|---|---|---|---|---|---|
| Softmax | 61.6 | 38.7 | 80.9 | 70.4 | 39.1 | 58.1 |
| Performer | 65.4 | 18.0 | 53.8 | 77.1 | 42.8 | 51.4 |
| PolaFormer | 73.1 | 37.4 | 80.5 | 70.5 | 42.2 | 60.7 |
| LOTFormer | 65.2 | 38.5 | 80.4 | 73.2 | 45.7 | 60.6 |
| LOTFormer + DWC | 71.1 | 38.5 | 80.9 | 69.9 | 54.1 | 62.9 |
(All results are over 3 runs.)
6.2 Scaling
On long sequences, LOTFormer’s runtime scales linearly in the sequence length $n$, while quadratic attention and other OT-based methods scale as $O(n^2)$. The pivot size $r$ controls the balance between accuracy and speed, a trade-off that is observed empirically across pivot sizes.
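To make the scaling gap concrete, the short sketch below counts the plan entries a dense attention matrix needs against the two pivot plans of a low-rank construction. The sequence lengths and the pivot size `r = 64` are hypothetical values chosen for illustration, not settings reported in the experiments, and the function names are likewise made up here.

```python
def dense_plan_entries(n):
    """Entries in a full n x n attention matrix."""
    return n * n

def pivot_plan_entries(n, r):
    """Entries in the two pivot plans: n x r plus r x n."""
    return 2 * n * r

r = 64  # hypothetical pivot size
for n in (1024, 4096, 16384):
    dense = dense_plan_entries(n)
    pivot = pivot_plan_entries(n, r)
    print(f"n={n:6d}  dense={dense:>12,d}  pivot={pivot:>10,d}  ratio={dense / pivot:.0f}x")
```

The ratio is $n / (2r)$, so for a fixed pivot size the saving grows in proportion to the sequence length.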
6.3 Vision and Machine Translation
- On ImageNet-1K (DeiT-Tiny config), LOTFormer achieves a top-1 accuracy of 74.8%, outperforming PolaFormer and baseline DeiT-Tiny at similar parameter count and FLOPs.
- In machine translation (IWSLT’14 De→En), plug-and-play integration into pretrained Transformers yields BLEU scores of up to 33.4. After fine-tuning, LOTFormer surpasses ESPFormer/SinkFormer with BLEU 34.72 (vs 34.64/34.61).
7. Limitations, Trade-Offs, and Extensions
A principal trade-off in LOTFormer is between expressivity and speed: small $r$ yields $O(nr)$ scaling but may poorly approximate full doubly-stochastic softmax; large $r$ increases accuracy at higher cost, with $r$ approaching $n$ recovering the full softmax. Robustness benefits from doubly-stochastic constraints, but excessive entropic smoothing can diffuse attention, so appropriate tuning of $\varepsilon$ and careful pivot learning are essential.
Potential extensions include dynamic adaptation of $r$ or $\varepsilon$, multi-scale pivots for hierarchical coupling, incorporation of alternative cost functions (e.g., positional or geometric terms), and schedule-based adjustment of $\varepsilon$ or the number of Sinkhorn iterations during training and inference.
LOTFormer unifies linear efficiency, optimal transport theory, and doubly-stochastic attention in a differentiable framework compatible with end-to-end learning and transformer deployment for both language and vision modalities (Shahbazi et al., 27 Sep 2025).