Sinkformers: Transformers with Doubly Stochastic Attention (2110.11773v2)

Published 22 Oct 2021 in cs.LG and stat.ML

Abstract: Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer. We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices as the number of epochs increases, justifying the use of Sinkhorn normalization as an informative prior. On the theoretical side, we show that, unlike the SoftMax operation, this normalization makes it possible to understand the iterations of self-attention modules as a discretized gradient-flow for the Wasserstein metric. We also show in the infinite number of samples limit that, when rescaling both attention matrices and depth, Sinkformers operate a heat diffusion. On the experimental side, we show that Sinkformers enhance model accuracy in vision and natural language processing tasks. In particular, on 3D shapes classification, Sinkformers lead to a significant improvement.

Citations (65)

Summary

  • The paper introduces Sinkformers, a Transformer variant that replaces traditional SoftMax attention with doubly stochastic matrices using Sinkhorn's algorithm.
  • It provides a novel gradient flow interpretation in Wasserstein space, deepening our theoretical understanding of attention mechanisms.
  • Empirical evaluations show improved accuracy across 3D shape classification, sentiment analysis, neural machine translation, and image tasks.

Sinkformers: An Advancement in Transformer Architecture through Doubly Stochastic Attention

Transformers have become a central framework in machine learning, particularly for NLP and computer vision tasks. The cornerstone of the Transformer is the self-attention mechanism, whose attention matrix is typically made row-stochastic by the SoftMax operator. The paper "Sinkformers: Transformers with Doubly Stochastic Attention" proposes an alternative: applying Sinkhorn's algorithm to produce doubly stochastic attention matrices, positing this normalization as an informative prior that facilitates more robust learning.
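As a concrete illustration of this normalization swap, here is a minimal PyTorch-style sketch; the log-space formulation, function names, and the `n_iters` default are illustrative assumptions, not the authors' implementation:

```python
import torch

def softmax_attention(scores: torch.Tensor) -> torch.Tensor:
    # Standard Transformer normalization: each row of the attention matrix sums to one.
    return torch.softmax(scores, dim=-1)

def sinkhorn_attention(scores: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    # Illustrative Sinkhorn normalization in log space: alternating row and column
    # normalizations drive exp(log_p) toward a doubly stochastic matrix whose rows
    # AND columns sum to one. `n_iters` is a hypothetical knob.
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # rows sum to one
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # columns sum to one
    return torch.exp(log_p)

# Sketch of usage inside a self-attention block:
# scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5
# out = sinkhorn_attention(scores) @ V
```

Note that stopping after the first row normalization would recover standard SoftMax attention; further iterations push the matrix toward being doubly stochastic.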

Core Contributions and Theoretical Insights

The paper makes significant strides in both theoretical foundations and practical performance enhancements.

  1. Doubly Stochastic Attention Matrices: The paper presents the Sinkformer, a variant of the Transformer in which the attention matrices are made doubly stochastic by iterating Sinkhorn's algorithm on the original attention kernel (see the sketch above). Rows and columns both sum to one, yielding a more balanced interaction model than row-only stochastic matrices. The authors also observe that the attention matrices of classical Transformers drift toward being doubly stochastic over the course of training, supporting Sinkhorn normalization as an informative prior.
  2. Gradient Flow Interpretation: A noteworthy theoretical contribution is the interpretation of Sinkformer attention in terms of gradient flows in Wasserstein space. Unlike the SoftMax operation, which does not admit such an interpretation, doubly stochastic normalization lets the iterations of self-attention modules be read as a discretized gradient flow for the Wasserstein metric, giving a sharper understanding of their dynamics within variational frameworks.
  3. Scalability and Heat Diffusion: In the infinite-sample limit, with both the attention matrices and the depth suitably rescaled, the Sinkformer is shown to operate a heat diffusion. This provides a continuum-limit description that draws a parallel between attention mechanisms and diffusion processes (a schematic statement of both limits follows this list).
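Points 2 and 3 can be summarized schematically as follows; this is a hedged restatement in which the exact energy functional, constants, and rescaling (all specified in the paper) are omitted:

```latex
% (i) Doubly stochastic self-attention as a discretized Wasserstein gradient flow
%     of an energy functional F acting on the token distribution \rho_t:
\partial_t \rho_t = \operatorname{div}\!\left( \rho_t \, \nabla \frac{\delta F}{\delta \rho}(\rho_t) \right)

% (ii) Infinite-sample limit, with attention and depth rescaled:
%      the Sinkformer dynamics reduce to a heat diffusion,
\partial_t \rho_t = \Delta \rho_t
```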

Empirical Evaluation

Empirically, Sinkformers demonstrate consistent improvements over traditional Transformers across several tasks:

  • 3D Shape Classification: On the ModelNet 40 dataset, Sinkformers achieve a significant accuracy improvement, demonstrating a clear benefit on 3D shape data.
  • Sentiment Analysis and Neural Machine Translation (NMT): On sentiment classification with the IMDb dataset and on IWSLT'14 German-to-English translation, Sinkformers improve accuracy and BLEU scores over Transformer baselines, showing that the benefit carries over to language tasks.
  • Vision Tasks: Applied to attention-based models for image classification on MNIST and a cats-and-dogs dataset, Sinkformers again reach higher accuracy, supporting their versatility across modalities.

Implications and Future Prospects

The switch to doubly stochastic attention not only improves model performance but also opens avenues for deeper scrutiny of the underlying mechanics of attention. This could lead to architectures that exploit alternative attention normalization strategies to achieve better generalization.

Future work could explore using the implicit function theorem to differentiate through Sinkhorn's algorithm more efficiently, particularly when the number of Sinkhorn iterations is kept small to meet hardware constraints. Further investigation of how doubly stochastic modules affect the stability and generalization of attention mechanisms could also yield richer insights.

In summary, the paper combines theoretical rigor with empirical evaluation to argue for rethinking the standard normalization of attention in Transformers, an innovation with implications for future AI research and applications.
