Expected Lipschitz Constant of Softmax Self-Attention in Transformers
Establish whether the softmax self-attention operator used in transformer layers has an expected squared Lipschitz constant of at most 1 under the stochastic assumptions specified in the paper. Concretely, for inputs U with K columns whose column vectors have L2 norm at most 1, query–key matrices A^(ℓ) with i.i.d. N(0,1) entries, and attention computed columnwise as Attn^(ℓ)(U) = Softmax(U^T A^(ℓ) U / sqrt(r)), determine whether the attention mapping satisfies E[||Attn^(ℓ)(U) − Attn^(ℓ)(U′)||_F^2] ≤ E[||U − U′||_F^2]. Resolving this question would allow the current quadratic dependence on depth in the paper's transformer estimation error bounds to be removed.
The quadratic depth dependence persists precisely because it is unknown whether softmax attention satisfies the condition that its expected squared Lipschitz constant is at most 1.
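To make the quantity concrete, the following is a minimal Monte Carlo sketch (not from the paper) that numerically probes the inequality: for a few randomly drawn input pairs (U, U′) with column norms at most 1, it estimates E_A[||Attn(U) − Attn(U′)||_F^2] / ||U − U′||_F^2 by averaging over i.i.d. N(0,1) matrices A. The dimensions r = 16 and K = 8, the input-sampling scheme, and the helper names (column_softmax, attn, sample_inputs, expected_ratio) are illustrative assumptions. An observed ratio above 1 would be a numerical counterexample; ratios below 1 are only weak evidence and cannot settle the question.

```python
import numpy as np


def column_softmax(S):
    """Numerically stable softmax applied to each column of the score matrix S."""
    S = S - S.max(axis=0, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)


def attn(U, A):
    """Attn(U) = Softmax(U^T A U / sqrt(r)) with columnwise softmax; U is r x K."""
    r = U.shape[0]
    return column_softmax(U.T @ A @ U / np.sqrt(r))


def sample_inputs(rng, r, K):
    """Draw an r x K matrix whose columns have L2 norm at most 1 (one arbitrary choice)."""
    X = rng.standard_normal((r, K))
    return X / np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1.0)


def expected_ratio(U, Up, n_samples=2000, rng=None):
    """Monte Carlo estimate of E_A[||Attn(U) - Attn(U')||_F^2] / ||U - U'||_F^2."""
    if rng is None:
        rng = np.random.default_rng(0)
    r = U.shape[0]
    num = 0.0
    for _ in range(n_samples):
        A = rng.standard_normal((r, r))  # i.i.d. N(0,1) query-key matrix A^(l)
        num += np.sum((attn(U, A) - attn(Up, A)) ** 2)
    return (num / n_samples) / np.sum((U - Up) ** 2)


if __name__ == "__main__":
    r, K = 16, 8  # illustrative dimensions, not taken from the paper
    rng = np.random.default_rng(1)
    # Probe the inequality on a few random input pairs; a ratio above 1 would be
    # a counterexample, while ratios below 1 prove nothing about the general case.
    ratios = [
        expected_ratio(sample_inputs(rng, r, K), sample_inputs(rng, r, K), rng=rng)
        for _ in range(20)
    ]
    print(f"max observed ratio over sampled pairs: {max(ratios):.4f}")
```

Such a probe only samples the input space, so it can at best falsify the inequality; establishing it in general requires an argument that holds uniformly over all admissible U, U′ under the paper's stochastic assumptions.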