LOTFormer: Doubly-Stochastic Transformer Attention

Updated 4 February 2026
  • LOTFormer is a linear-time, doubly-stochastic attention mechanism that leverages a low-rank optimal transport formulation with learnable pivots for scalability and robustness.
  • It employs entropic optimal transport solved via the Sinkhorn algorithm to ensure both row and column normalization while reducing computational complexity to O(nr).
  • Empirical results demonstrate that LOTFormer outperforms standard softmax attention in long-range tasks, vision benchmarks, and machine translation, highlighting its practical efficiency.

LOTFormer is a linear-time, doubly-stochastic attention mechanism for Transformers that leverages a low-rank entropic optimal transport (OT) construction via a learnable pivot (reference) measure. LOTFormer achieves $O(nr)$ complexity per head, where $n$ is the sequence length and $r \ll n$ is the pivot size, while provably enforcing both row and column normalization (doubly-stochasticity) of the attention matrix. This doubly-stochastic property, combined with linear scaling, enables robust and efficient modeling of long-range dependencies and scalability to large input contexts across modalities (Shahbazi et al., 27 Sep 2025).

1. Motivation and Context

The quadratic time and space complexity of standard Transformer attention ($O(n^2)$ for $n$ tokens) forms a principal constraint when modeling long texts, high-resolution images, or long-range dependencies in audio. Linear attention methods (e.g., Linear Transformer, Performer, Nyströmformer) address this by approximating the softmax kernel with feature maps for $O(n)$ computation but produce row-stochastic attention maps. This row normalization frequently leads to over-concentration (“over-focusing”) on a few tokens and can degrade information flow and robustness.

Doubly-stochastic attention matrices, defined by non-negativity and the property that each row and column sums to one, distribute attention more evenly and empirically improve smoothness, interpretability, and robustness. Existing doubly-stochastic methods, typically based on OT, suffer from prohibitive overhead and do not scale to long contexts.

2. Attention and Optimal Transport

LOTFormer interprets attention matrices as transportation plans between the empirical query and key distributions. With queries $Q \in \mathbb{R}^{n \times d_k}$ and keys $K \in \mathbb{R}^{n \times d_k}$, the attention problem is cast as entropy-regularized OT:

$$\min_{P \geq 0} \; \langle C, P\rangle - \varepsilon H(P)$$

subject to $P\mathbf{1} = \tfrac{1}{n}\mathbf{1}$ and $P^\top\mathbf{1} = \tfrac{1}{n}\mathbf{1}$, where $C_{ij} = -q_i^\top k_j$ and $H(P) = -\sum_{i,j} P_{ij} \log P_{ij}$ is the entropy of the plan. The optimal $P^*$ has uniform marginals, so $nP^*$ is doubly-stochastic, and it trades off similarity alignment against entropy. As $\varepsilon \to 0^+$, $P^*$ becomes sparse; large $\varepsilon$ yields diffuse, high-entropy plans.
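The entropic problem above can be illustrated with a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the helper name `sinkhorn_plan`, the toy input scale, the value of `eps`, and the fixed iteration count are all assumptions chosen for readability. The final rescaling by $n$ turns the plan with marginals $\tfrac{1}{n}\mathbf{1}$ into a matrix whose rows and columns sum to one.

```python
import numpy as np

def sinkhorn_plan(C, a, b, eps=1.0, n_iters=500):
    """Approximate entropic OT plan with row marginals a and column marginals b."""
    # Gibbs kernel; subtracting C.min() only rescales the kernel and avoids overflow.
    kernel = np.exp(-(C - C.min()) / eps)
    g = np.ones_like(b)
    for _ in range(n_iters):
        f = a / (kernel @ g)      # rescale rows to match a
        g = b / (kernel.T @ f)    # rescale columns to match b
    return f[:, None] * kernel * g[None, :]

rng = np.random.default_rng(0)
n, d_k = 6, 4
Q = 0.5 * rng.normal(size=(n, d_k))    # toy queries (hypothetical data)
K = 0.5 * rng.normal(size=(n, d_k))    # toy keys (hypothetical data)

C = -Q @ K.T                           # C_ij = -q_i^T k_j
uniform = np.full(n, 1.0 / n)
P = sinkhorn_plan(C, uniform, uniform)
A = n * P                              # rescale so each row and column sums to 1

print(A.sum(axis=1))  # row sums
print(A.sum(axis=0))  # column sums
```

Note that this dense version still costs $O(n^2)$ per iteration; it only illustrates the objective, not the low-rank construction.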

3. Low-Rank Pivot Construction

To combine scalability with doubly-stochasticity, LOTFormer introduces a small, learnable “pivot” measure $\sigma = \sum_{t=1}^r u_t \delta_{z_t}$, parameterized by pivot locations $Z \in \mathbb{R}^{r \times d_k}$ and pivot masses $u \in \Delta^{r-1}$. The construction involves two entropic OT subproblems:

  • Queries $\to$ Pivot: $\Gamma^{(1)}$, solving OT between queries and pivots.
  • Pivot $\to$ Keys: $\Gamma^{(2)}$, solving OT between pivots and keys.

Each subproblem has $O(nr)$ cost, as the matrices are $n \times r$ or $r \times n$. The final attention matrix is the “glued” coupling $A = (\Gamma^{(1)})^\top \mathrm{Diag}(u)^{-1} \Gamma^{(2)} \in \mathbb{R}^{n \times n}$, which is provably doubly-stochastic and has rank at most $r$.
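The glued coupling can be checked numerically on a toy example. The sketch below makes several assumptions that are not stated in the text: both couplings are stored as $r \times n$ matrices (which is what the transpose in the formula suggests), the pivot-to-token cost is the same negative inner product used for queries and keys, the token marginals are uniform with mass $\tfrac{1}{n}$, and the result is rescaled by $n$ so that rows and columns sum to one. The pivots here are random rather than learned, and `sinkhorn_plan` is a hypothetical helper.

```python
import numpy as np

def sinkhorn_plan(C, a, b, eps=1.0, n_iters=500):
    """Approximate entropic OT plan with row marginals a and column marginals b."""
    kernel = np.exp(-(C - C.min()) / eps)   # shift avoids overflow, plan is unchanged
    g = np.ones_like(b)
    for _ in range(n_iters):
        f = a / (kernel @ g)
        g = b / (kernel.T @ f)
    return f[:, None] * kernel * g[None, :]

rng = np.random.default_rng(0)
n, r, d_k = 12, 3, 4
Q = 0.5 * rng.normal(size=(n, d_k))       # toy queries
K = 0.5 * rng.normal(size=(n, d_k))       # toy keys
Z = 0.5 * rng.normal(size=(r, d_k))       # pivot locations (random here, learned in LOTFormer)
u = np.array([0.5, 0.3, 0.2])             # pivot masses on the simplex
uniform = np.full(n, 1.0 / n)

G1 = sinkhorn_plan(-Z @ Q.T, u, uniform)  # r x n coupling between pivots and queries
G2 = sinkhorn_plan(-Z @ K.T, u, uniform)  # r x n coupling between pivots and keys

# Glued coupling, rescaled by n so rows and columns sum to 1.
A = n * (G1.T @ np.diag(1.0 / u) @ G2)

print(A.sum(axis=1), A.sum(axis=0))
print(np.linalg.matrix_rank(A, tol=1e-10))
```

The row sums follow from $\Gamma^{(2)}\mathbf{1} = u$, which cancels $\mathrm{Diag}(u)^{-1}$ and leaves the query marginal of $\Gamma^{(1)}$; the column sums follow symmetrically. The rank bound holds because $A$ factors through an $r$-dimensional space.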

4. Computational Implementation

The attention operation $AV$ is computed without explicitly forming $A$:

  1. $Y = \Gamma^{(2)}V \in \mathbb{R}^{r \times d_v}$
  2. $Z' = \mathrm{Diag}(u)^{-1}Y \in \mathbb{R}^{r \times d_v}$
  3. $O = (\Gamma^{(1)})^\top Z' \in \mathbb{R}^{n \times d_v}$
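The three steps above amount to choosing a cheap order of matrix multiplication. A minimal sketch follows, assuming both couplings are stored as $r \times n$ arrays; random non-negative matrices stand in for the Sinkhorn outputs, since only the algebra is being illustrated. The function name `lot_apply` is hypothetical.

```python
import numpy as np

def lot_apply(G1, G2, u, V):
    """Compute (G1^T Diag(u)^{-1} G2) V without forming the n x n matrix."""
    Y = G2 @ V                  # (r, d_v): aggregate values onto the pivots
    Z_prime = Y / u[:, None]    # (r, d_v): same as Diag(u)^{-1} Y
    return G1.T @ Z_prime       # (n, d_v): distribute pivot summaries to queries

rng = np.random.default_rng(0)
n, r, d_v = 200, 16, 8
G1 = rng.random((r, n))         # stand-in for the pivot/query coupling
G2 = rng.random((r, n))         # stand-in for the pivot/key coupling
u = np.full(r, 1.0 / r)         # pivot masses
V = rng.normal(size=(n, d_v))   # values

O = lot_apply(G1, G2, u, V)

# Reference: build the dense n x n matrix explicitly (only feasible for small n).
A = G1.T @ np.diag(1.0 / u) @ G2
print(np.allclose(O, A @ V))
```

The factored path touches only $r \times n$ and $r \times d_v$ arrays, so memory and time grow with $nr$ rather than $n^2$.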

The Sinkhorn algorithm, a differentiable fixed-point procedure, solves the entropic OT problems for $\Gamma^{(1)}$ and $\Gamma^{(2)}$ with complexity $O(nr)$ per iteration. The entropic parameter $\varepsilon$ interpolates between sharp (deterministic) and diffuse (smooth) attention regimes. Pseudocode is provided for Sinkhorn normalization, with attention to numerical stability and convergence characteristics.
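One common way to address the numerical stability mentioned above is to run Sinkhorn on dual potentials in the log domain. The sketch below is a generic log-domain variant, not necessarily the one used by the authors; the function names, the stopping rule, and the toy cost are assumptions. The cost is built from 1-D queries and keys with a large common offset, so that the raw kernel $\exp(-C/\varepsilon)$ would overflow in float64 while the log-domain updates stay finite.

```python
import numpy as np

def logsumexp(M, axis):
    """Stable log(sum(exp(M))) along one axis."""
    m = M.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(M - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_log(C, a, b, eps, n_iters=200, tol=1e-9):
    """Log-domain Sinkhorn; returns a plan whose marginals approximate (a, b)."""
    log_a, log_b = np.log(a), np.log(b)
    f = np.zeros_like(a)   # dual potential for rows
    g = np.zeros_like(b)   # dual potential for columns
    for _ in range(n_iters):
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
        P = np.exp((f[:, None] + g[None, :] - C) / eps)
        # Columns match b after the g update, so monitor the row marginal.
        if np.abs(P.sum(axis=1) - a).max() < tol:
            break
    return P

n = 5
x = 30.0 + np.linspace(0.0, 1.0, n)   # 1-D toy queries and keys, q_i = k_i = x_i
C = -np.outer(x, x)                   # C_ij = -q_i k_j, entries near -900
eps = 1.0
uniform = np.full(n, 1.0 / n)

P = sinkhorn_log(C, uniform, uniform, eps)
print(P.sum(axis=1), P.sum(axis=0))
```

Each iteration still costs time proportional to the size of `C`, so applying the same routine to the $r \times n$ pivot couplings keeps the $O(nr)$ per-iteration cost.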

5. End-to-End Learning and Integration

All LOTFormer parameters, including the pivot locations $Z$, the pivot masses $u$, and the standard projection matrices $W_Q, W_K, W_V$, are learned via standard backpropagation through Sinkhorn steps. In Transformer architectures, each attention head replaces softmax attention with the LOTAttn module. For vision tasks, a depth-wise convolution (DWC) on values $V$ can inject local inductive bias; for [CLS] tokens in ViT, a separate softmax aggregator is used to guarantee global pooling, with all other rows remaining doubly-stochastic.

6. Empirical Results and Performance

LOTFormer achieves strong empirical results across benchmarks:

6.1 Long Range Arena (LRA)

  • Backbone: 6-layer Transformer, $d = 256$, 8 heads, head dim 32.
  • Compared to Softmax, Performer, Nyströmformer, BigBird, PolaFormer, ESPFormer.
  • On LRA, LOTFormer with DWC attains the highest average score.
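
| Model | Text | ListOps | Retrieval | Pathfinder | Image | Avg. |
|---|---|---|---|---|---|---|
| Softmax | 61.6 | 38.7 | 80.9 | 70.4 | 39.1 | 58.1 |
| Performer | 65.4 | 18.0 | 53.8 | 77.1 | 42.8 | 51.4 |
| PolaFormer | 73.1 | 37.4 | 80.5 | 70.5 | 42.2 | 60.7 |
| LOTFormer | 65.2 | 38.5 | 80.4 | 73.2 | 45.7 | 60.6 |
| LOTFormer + DWC | 71.1 | 38.5 | 80.9 | 69.9 | 54.1 | 62.9 |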

(All $\pm 1\sigma$ over 3 runs; best in bold.)

6.2 Scaling

On sequences up to $2^{17}$ tokens, LOTFormer’s runtime scales linearly, while standard quadratic attention and other OT-based methods scale as $O(n^2)$. The pivot size $r$ controls the trade-off between accuracy and speed, with $r = 16, 32, 64, 128$ empirically showing the $O(nr)$ scaling.
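To give a sense of scale, the following back-of-the-envelope sketch compares the number of entries in one $n \times r$ coupling with a dense $n \times n$ attention matrix at $n = 2^{17}$. It counts matrix entries only and ignores constants such as the number of Sinkhorn iterations and the feature dimension, so it is not a runtime prediction. The helper name `coupling_entries` is hypothetical.

```python
def coupling_entries(n: int, r: int) -> int:
    """Entries in one n x r (or r x n) coupling."""
    return n * r

n = 2 ** 17                 # longest sequence length mentioned above
full_entries = n * n        # entries in a dense n x n attention matrix

# Reduction factor n^2 / (n * r) = n / r for each pivot size.
savings = {r: full_entries // coupling_entries(n, r) for r in (16, 32, 64, 128)}

for r, factor in savings.items():
    print(f"r={r:>3}: {coupling_entries(n, r):>12,} entries per coupling, "
          f"{factor:,}x fewer than n^2")
```

Doubling $r$ doubles the coupling size and halves the reduction factor, which is the linear dependence on $r$ that the $O(nr)$ bound describes.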

6.3 Vision and Machine Translation

  • On ImageNet-1K (DeiT-Tiny config), LOTFormer achieves a top-1 accuracy of 74.8%, outperforming PolaFormer and baseline DeiT-Tiny at similar parameter count and FLOPs.
  • In machine translation (IWSLT’14 De→En), plug-and-play integration into pretrained Transformers yields BLEU $\sim 33.3$–$33.4$. After fine-tuning, LOTFormer surpasses ESPFormer/SinkFormer with BLEU 34.72 (vs 34.64/34.61).

7. Limitations, Trade-Offs, and Extensions

A principal trade-off in LOTFormer is between expressivity and speed: a small $r$ yields cheap $O(nr)$ scaling but may poorly approximate full doubly-stochastic attention, while a large $r$ increases accuracy at higher cost, with $r = n$ corresponding to the full-rank, quadratic-cost regime. Robustness benefits from doubly-stochastic constraints, but excessive entropic smoothing can diffuse attention, so appropriate tuning of $\varepsilon$ and careful pivot learning are essential.
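The smoothing effect of the entropic parameter can be made concrete by measuring the entropy of the transport plan at two values of $\varepsilon$. This is an illustrative sketch on a dense toy problem, not an experiment from the paper; the helper names, the toy inputs, and the two $\varepsilon$ values are assumptions.

```python
import numpy as np

def sinkhorn_plan(C, a, b, eps, n_iters=500):
    """Approximate entropic OT plan with row marginals a and column marginals b."""
    kernel = np.exp(-(C - C.min()) / eps)   # shift avoids overflow, plan is unchanged
    g = np.ones_like(b)
    for _ in range(n_iters):
        f = a / (kernel @ g)
        g = b / (kernel.T @ f)
    return f[:, None] * kernel * g[None, :]

def plan_entropy(P):
    """Shannon entropy -sum P log P, skipping exact zeros."""
    P = P[P > 0]
    return float(-(P * np.log(P)).sum())

rng = np.random.default_rng(0)
n, d_k = 8, 4
Q = 0.5 * rng.normal(size=(n, d_k))   # toy queries
K = 0.5 * rng.normal(size=(n, d_k))   # toy keys
C = -Q @ K.T
uniform = np.full(n, 1.0 / n)

entropies = {
    eps: plan_entropy(sinkhorn_plan(C, uniform, uniform, eps))
    for eps in (0.5, 5.0)
}
print(entropies)                 # larger eps gives a more diffuse plan
print(2 * np.log(n))             # entropy of the fully uniform plan, the upper bound
```

As $\varepsilon$ grows, the plan approaches the uniform coupling and every token attends almost equally to every other, which is the over-smoothing regime the paragraph above warns about.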

Potential extensions include dynamic adaptation of $r$ or $Z$, multi-scale pivots for hierarchical coupling, incorporation of alternative cost functions (e.g., positional or geometric terms), and schedule-based adjustment of $\varepsilon$ or Sinkhorn iterations during training and inference.

LOTFormer unifies linear efficiency, optimal transport theory, and doubly-stochastic attention in a differentiable framework compatible with end-to-end learning and transformer deployment for both language and vision modalities (Shahbazi et al., 27 Sep 2025).
