LOTFormer: Doubly-Stochastic Transformer Attention

Updated 4 February 2026

LOTFormer is a linear-time, doubly-stochastic attention mechanism that leverages a low-rank optimal transport formulation with learnable pivots for scalability and robustness.
It employs entropic optimal transport solved via the Sinkhorn algorithm to ensure both row and column normalization while reducing computational complexity to O(nr).
Empirical results demonstrate that LOTFormer outperforms standard softmax attention in long-range tasks, vision benchmarks, and machine translation, highlighting its practical efficiency.

LOTFormer is a linear-time, doubly-stochastic attention mechanism for Transformers that leverages a low-rank entropic optimal transport (OT) construction via a learnable pivot (reference) measure. LOTFormer achieves $O(nr)$ complexity per head—where $n$ is the sequence length and $r \ll n$ the pivot size—while provably enforcing both row and column normalization (doubly-stochasticity) of the attention matrix. This doubly-stochastic property, combined with linear scaling, enables robust and efficient modeling of long-range dependencies and scalability to large input contexts across modalities (Shahbazi et al., 27 Sep 2025).

1. Motivation and Context

The quadratic time and space complexity of standard Transformer attention ($O(n^2)$ for $n$ tokens) is a principal constraint when modeling long texts, high-resolution images, or long-range dependencies in audio. Linear attention methods (e.g., Linear Transformer, Performer, Nyströmformer) address this by approximating the softmax kernel with feature maps for $O(n)$ computation, but the attention maps they produce are only row-stochastic. Row normalization alone frequently leads to over-concentration (“over-focusing”) on a few tokens and can degrade information flow and robustness.

Doubly-stochastic attention matrices, defined by non-negativity and the property that each row and column sums to one, distribute attention more evenly and empirically improve smoothness, interpretability, and robustness. Existing doubly-stochastic methods, typically based on OT, suffer from prohibitive overhead and do not scale to long contexts.

2. Attention and Optimal Transport

LOTFormer interprets attention matrices as transportation plans between the empirical query and key distributions. With queries $Q \in \mathbb{R}^{n \times d_k}$ and keys $K \in \mathbb{R}^{n \times d_k}$, the attention problem is cast as entropy-regularized OT:

$\min_{P \geq 0} \langle C,P\rangle - \varepsilon H(P)$

subject to $P\mathbf{1} = \tfrac{1}{n}\mathbf{1}$ and $P^\top\mathbf{1} = \tfrac{1}{n}\mathbf{1}$, where $C_{ij} = -q_i^\top k_j$ and $H(P) = -\sum_{i,j} P_{ij} \log P_{ij}$. With these uniform marginals, the rescaled optimal plan $nP^*$ is always doubly-stochastic (every row and column sums to one), and it trades off similarity alignment against entropy. As $\varepsilon \to 0^+$, $P^*$ becomes sparse; large $\varepsilon$ yields diffuse, high-entropy plans.
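The behavior described above can be checked numerically on a toy problem. The following is a minimal NumPy sketch, not the paper's implementation: the helper names `sinkhorn` and `entropy`, the random toy data, and the two values of $\varepsilon$ are illustrative assumptions. It runs plain Sinkhorn scaling on the cost $C_{ij} = -q_i^\top k_j$ with uniform $1/n$ marginals, then compares the entropy of the plans obtained for a small and a large $\varepsilon$.

```python
import numpy as np

def sinkhorn(C, a, b, eps, n_iters=500):
    """Entropic OT plan between marginals a (rows) and b (columns)."""
    G = np.exp(-C / eps)                 # Gibbs kernel
    s = np.ones_like(a)
    t = np.ones_like(b)
    for _ in range(n_iters):
        s = a / (G @ t)                  # rescale rows toward a
        t = b / (G.T @ s)                # rescale columns toward b
    return s[:, None] * G * t[None, :]

def entropy(P):
    """Shannon entropy H(P) = -sum_ij P_ij log P_ij."""
    return float(-np.sum(P * np.log(P)))

rng = np.random.default_rng(0)
n, d_k = 6, 4
Q = rng.uniform(-0.5, 0.5, size=(n, d_k))       # toy queries (assumed data)
Kmat = rng.uniform(-0.5, 0.5, size=(n, d_k))    # toy keys (assumed data)
C = -(Q @ Kmat.T)                               # C_ij = -q_i^T k_j
unif = np.full(n, 1.0 / n)

P_small_eps = sinkhorn(C, unif, unif, eps=0.5)
P_large_eps = sinkhorn(C, unif, unif, eps=5.0)

row_sums = n * P_small_eps.sum(axis=1)          # rows of n*P
col_sums = n * P_small_eps.sum(axis=0)          # columns of n*P
print(np.round(row_sums, 6), np.round(col_sums, 6))
print(entropy(P_small_eps) < entropy(P_large_eps))
```

The plan itself has marginals $1/n$, so it is the rescaled matrix $nP$ whose rows and columns sum to one. Note that this full $n \times n$ solve is exactly the quadratic-cost formulation that the pivot construction is meant to avoid.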

3. Low-Rank Pivot Construction

To combine scalability with doubly-stochasticity, LOTFormer introduces a small, learnable “pivot” measure $\sigma = \sum_{t=1}^r u_t \delta_{z_t}$ parameterized by pivot locations $Z \in \mathbb{R}^{r \times d_k}$ and pivot masses $u \in \Delta^{r-1}$. The construction involves two entropic OT subproblems:

Queries $\to$ Pivot: $\Gamma^{(1)}$ solving OT between queries and pivots.
Pivot $\to$ Keys: $\Gamma^{(2)}$ solving OT between pivots and keys.

Each subproblem costs $O(nr)$ per Sinkhorn iteration, because each coupling has only $nr$ entries (shape $n \times r$ or $r \times n$). The final attention matrix is the “glued” coupling: $A = (\Gamma^{(1)})^\top \mathrm{Diag}(u)^{-1} \Gamma^{(2)} \in \mathbb{R}^{n \times n}$, which is provably doubly-stochastic and has rank at most $r$.
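The glued coupling can be illustrated with a short NumPy sketch. Several choices here are assumptions rather than details taken from the source: both couplings are stored as $r \times n$ arrays with pivots along the rows (the shape the transposed formula implies), both subproblems use the dot-product cost $-z_t^\top q_i$ and $-z_t^\top k_j$, the query and key marginals are uniform $1/n$, and the pivot locations and masses are fixed toy values instead of learned parameters.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iters=500):
    """Entropic OT plan between marginals a (rows) and b (columns)."""
    G = np.exp(-C / eps)
    s, t = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        s = a / (G @ t)
        t = b / (G.T @ s)
    return s[:, None] * G * t[None, :]

rng = np.random.default_rng(0)
n, r, d_k = 8, 3, 4
Q = rng.uniform(-0.5, 0.5, size=(n, d_k))       # toy queries
Kmat = rng.uniform(-0.5, 0.5, size=(n, d_k))    # toy keys
Z = rng.uniform(-0.5, 0.5, size=(r, d_k))       # pivot locations (learned in practice)
u = np.array([0.5, 0.3, 0.2])                   # pivot masses on the simplex
unif = np.full(n, 1.0 / n)

G1 = sinkhorn(-(Z @ Q.T), u, unif)              # r x n coupling, pivots vs. queries
G2 = sinkhorn(-(Z @ Kmat.T), u, unif)           # r x n coupling, pivots vs. keys

A = G1.T @ np.diag(1.0 / u) @ G2                # glued n x n coupling

print(np.round(n * A.sum(axis=1), 6))           # rows of n*A
print(np.round(n * A.sum(axis=0), 6))           # columns of n*A
print(np.linalg.matrix_rank(A, tol=1e-10))      # at most r
```

Under these assumptions $A$ inherits the $1/n$ marginals of the two couplings, so $nA$ has unit row and column sums. Forming $A$ explicitly is done here only for checking; it costs $O(n^2)$ memory.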

4. Computational Implementation

The attention operation $AV$ is computed without explicitly forming $A$ :

$Y = \Gamma^{(2)}V \in \mathbb{R}^{r\times d_v}$
$Z' = \mathrm{Diag}(u)^{-1}Y \in \mathbb{R}^{r \times d_v}$
$O = (\Gamma^{(1)})^\top Z' \in \mathbb{R}^{n \times d_v}$
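The three steps above are plain matrix algebra, so they can be checked against the explicit product $AV$. In this sketch the couplings are random nonnegative stand-ins of shape $r \times n$ (an assumption made only to test the identity; real couplings would come from Sinkhorn), and `Z_prime` plays the role of $Z'$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, d_v = 10, 3, 5
G1 = rng.random((r, n))            # stand-in for Gamma^(1), shape r x n
G2 = rng.random((r, n))            # stand-in for Gamma^(2), shape r x n
u = np.array([0.5, 0.3, 0.2])      # pivot masses
V = rng.normal(size=(n, d_v))      # values

# Factored computation: never builds an n x n matrix.
Y = G2 @ V                         # (r, d_v), costs O(n r d_v)
Z_prime = Y / u[:, None]           # Diag(u)^{-1} Y, costs O(r d_v)
O = G1.T @ Z_prime                 # (n, d_v), costs O(n r d_v)

# Reference computation with the explicit n x n matrix, for checking only.
A = G1.T @ np.diag(1.0 / u) @ G2
O_ref = A @ V

print(np.allclose(O, O_ref))
```

Dividing the rows of `Y` by `u` replaces the explicit diagonal matrix, which keeps the whole product at $O(nr\,d_v)$ time and $O(nr)$ memory.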

The Sinkhorn algorithm, a differentiable fixed-point procedure, solves the entropic OT problems for $\Gamma^{(1)}$ and $\Gamma^{(2)}$ with complexity $O(nr)$ per iteration. The entropic parameter $\varepsilon$ interpolates between sharp (near-deterministic) and diffuse (smooth) attention regimes. The paper gives pseudocode for the Sinkhorn normalization and pays attention to its numerical stability and convergence behavior.
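One common way to keep Sinkhorn numerically stable is to iterate on dual potentials in the log domain, so the kernel $\exp(-C/\varepsilon)$ is never formed directly. The sketch below illustrates that general technique; the source does not say whether LOTFormer uses this exact stabilization, and the names `sinkhorn_log` and `logsumexp` as well as the contrived offset cost are assumptions chosen for illustration.

```python
import numpy as np

def logsumexp(x, axis):
    """Stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def sinkhorn_log(C, a, b, eps, n_iters=200):
    """Log-domain Sinkhorn: updates dual potentials f, g instead of scalings."""
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)

rng = np.random.default_rng(0)
r, n = 3, 6
u = np.array([0.5, 0.3, 0.2])                 # pivot masses (row marginal)
unif = np.full(n, 1.0 / n)                    # token marginal (column marginal)

# Contrived cost with a large constant offset: the plain kernel underflows.
C = 50.0 + 0.1 * rng.random((r, n))
eps = 0.05
naive_kernel = np.exp(-C / eps)               # exp(-1000) underflows to 0.0

P = sinkhorn_log(C, u, unif, eps)
print(naive_kernel.max())                     # 0.0
print(np.round(P.sum(axis=1), 6))             # row sums, close to u
print(np.round(n * P.sum(axis=0), 6))         # n * column sums, close to 1
```

Each iteration still touches only the $r \times n$ cost matrix, so the per-iteration cost stays $O(nr)$; the price of stability is the extra `exp` and `log` evaluations.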

5. End-to-End Learning and Integration

All LOTFormer parameters, including the pivot locations $Z$, pivot masses $u$, and standard projection matrices $W_Q, W_K, W_V$, are learned by standard backpropagation through the Sinkhorn iterations. In Transformer architectures, each attention head replaces softmax attention with the LOTAttn module. For vision tasks, a depth-wise convolution (DWC) on the values $V$ can inject local inductive bias. For the [CLS] token in ViT, a separate softmax aggregator is used to guarantee global pooling, while all other rows remain doubly-stochastic.
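To make the data flow of a single head concrete, here is a forward-only NumPy sketch. It is an illustration under stated assumptions, not the LOTAttn module itself: the function name `lot_attention` is hypothetical, the pivot masses are parameterized as a softmax over free logits so they stay on the simplex, both subproblems use a dot-product cost, and the output is multiplied by $n$ so that each output row is a convex combination of value rows. `Q`, `Kmat`, and `V` stand for the already projected inputs $XW_Q$, $XW_K$, $XW_V$. Training would require an autodiff framework to differentiate through the Sinkhorn loop.

```python
import numpy as np

def sinkhorn(C, a, b, eps, n_iters=500):
    """Entropic OT plan between marginals a (rows) and b (columns)."""
    G = np.exp(-C / eps)
    s, t = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        s = a / (G @ t)
        t = b / (G.T @ s)
    return s[:, None] * G * t[None, :]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def lot_attention(Q, Kmat, V, Z, u_logits, eps=0.5):
    """Forward pass of one head (hypothetical sketch, no gradients)."""
    n = Q.shape[0]
    u = softmax(u_logits)                        # pivot masses on the simplex
    unif = np.full(n, 1.0 / n)
    G1 = sinkhorn(-(Z @ Q.T), u, unif, eps)      # r x n, pivots vs. queries
    G2 = sinkhorn(-(Z @ Kmat.T), u, unif, eps)   # r x n, pivots vs. keys
    Y = (G2 @ V) / u[:, None]                    # Diag(u)^{-1} Gamma^(2) V
    return n * (G1.T @ Y)                        # rescaled so weights sum to 1

rng = np.random.default_rng(0)
n, r, d_k, d_v = 12, 4, 4, 3
Q = rng.uniform(-0.5, 0.5, size=(n, d_k))
Kmat = rng.uniform(-0.5, 0.5, size=(n, d_k))
V = rng.normal(size=(n, d_v))
Z = rng.uniform(-0.5, 0.5, size=(r, d_k))        # would be a learned parameter
u_logits = np.zeros(r)                           # would be a learned parameter

out = lot_attention(Q, Kmat, V, Z, u_logits)
out_const = lot_attention(Q, Kmat, np.ones((n, 1)), Z, u_logits)
print(out.shape)                                 # (12, 3)
print(np.round(out_const.ravel(), 6))            # constant values pass through
```

Feeding constant values is a quick sanity check: if the effective attention weights of every query sum to one, a constant input must come out unchanged.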

6. Empirical Results and Performance

LOTFormer achieves strong empirical results across benchmarks:

6.1 Long Range Arena (LRA)

Backbone: 6-layer Transformer, $d=256$, 8 heads, head dim 32.
Baselines: Softmax, Performer, Nyströmformer, BigBird, PolaFormer, ESPFormer.
On LRA, LOTFormer with DWC attains the highest average score.

Model	Text	ListOps	Retrieval	Pathfinder	Image	Avg.
Softmax	61.6	38.7	80.9	70.4	39.1	58.1
Performer	65.4	18.0	53.8	77.1	42.8	51.4
PolaFormer	73.1	37.4	80.5	70.5	42.2	60.7
LOTFormer	65.2	38.5	80.4	73.2	45.7	60.6
LOTFormer + DWC	71.1	38.5	80.9	69.9	54.1	62.9

(All $\pm1\sigma$ over 3 runs; best in bold.)

6.2 Scaling

On sequences up to $2^{17}$ tokens, LOTFormer’s runtime scales linearly in $n$, while quadratic attention and other OT-based methods scale as $O(n^2)$. The pivot count $r$ controls the balance between accuracy and speed; experiments with $r = 16, 32, 64, 128$ illustrate the $O(nr)$ trade-off.
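A back-of-the-envelope count shows the size of the gap at the longest sequence length mentioned above. The snippet compares the number of entries in a full $n \times n$ attention matrix with the entries in two $r \times n$ couplings; it counts matrix entries only and is not a runtime measurement.

```python
n = 2 ** 17                                   # 131072 tokens
full_entries = n * n                          # entries of an n x n attention matrix

savings = {}
for r in (16, 32, 64, 128):
    lowrank_entries = 2 * n * r               # two r x n couplings
    savings[r] = full_entries // lowrank_entries

print(full_entries)                           # 17179869184
print(savings)                                # {16: 4096, 32: 2048, 64: 1024, 128: 512}
```

The ratio is $n / (2r)$, so doubling $r$ halves the saving while the cost stays linear in $n$.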

6.3 Vision and Machine Translation

On ImageNet-1K (DeiT-Tiny config), LOTFormer achieves a top-1 accuracy of 74.8%, outperforming PolaFormer and baseline DeiT-Tiny at similar parameter count and FLOPs.
In machine translation (IWSLT’14 De→En), plug-and-play integration into pretrained Transformers yields BLEU $\sim 33.3$–$33.4$. After fine-tuning, LOTFormer surpasses ESPFormer/SinkFormer with BLEU 34.72 (vs. 34.64/34.61).

7. Limitations, Trade-Offs, and Extensions

A principal trade-off in LOTFormer is between expressivity and speed: small $r$ gives the cheapest $O(nr)$ computation but may poorly approximate full-rank doubly-stochastic attention, while large $r$ increases accuracy at higher cost, and $r = n$ removes the rank constraint entirely at quadratic cost. Doubly-stochastic constraints improve robustness, but excessive entropic smoothing can diffuse attention, so appropriate tuning of $\varepsilon$ and careful pivot learning are essential.

Potential extensions include dynamic adaptation of $r$ or $Z$, multi-scale pivots for hierarchical coupling, incorporation of alternative cost functions (e.g., positional or geometric terms), and schedule-based adjustment of $\varepsilon$ or Sinkhorn iterations during training and inference.

LOTFormer unifies linear efficiency, optimal transport theory, and doubly-stochastic attention in a differentiable framework compatible with end-to-end learning and transformer deployment for both language and vision modalities (Shahbazi et al., 27 Sep 2025).

References (1)

LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport (2025)
