Expected Lipschitz Constant of Softmax Self-Attention in Transformers

Establish whether the softmax self-attention operator used in transformer layers has expected squared Lipschitz constant at most 1 under the stochastic assumptions specified in the paper. Concretely, for inputs U and U′ with K columns, each of L2 norm at most 1, a query–key matrix A^(ℓ) with i.i.d. N(0,1) entries, and attention computed columnwise as Attn^(ℓ)(U) = Softmax(U^T A^(ℓ) U / sqrt(r)), determine whether the attention mapping obeys E[||Attn^(ℓ)(U) − Attn^(ℓ)(U′)||_F^2] ≤ E[||U − U′||_F^2]. Resolving this in the affirmative would eliminate the quadratic dependence on depth currently present in the paper's transformer estimation error bounds.
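
The following is a minimal Monte Carlo sketch (in Python/NumPy) of the two expectations being compared, under the assumptions stated above. The embedding dimension r, the sequence length K, the particular input distribution over matrices with column norms at most 1, the independent sampling of U and U′, and the columnwise-softmax convention are illustrative choices made here, not part of the problem statement; a simulation of this kind can only supply numerical evidence for or against the conjectured inequality, not a proof.

```python
import numpy as np


def softmax_columns(Z):
    """Columnwise softmax: each column of Z is mapped to a probability vector."""
    Z = Z - Z.max(axis=0, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)


def attn(U, A, r):
    """Softmax self-attention as stated above: Softmax(U^T A U / sqrt(r)), columnwise."""
    return softmax_columns(U.T @ A @ U / np.sqrt(r))


def random_bounded_columns(r, K, rng):
    """An r x K matrix whose columns have L2 norm at most 1 (illustrative input law)."""
    X = rng.standard_normal((r, K))
    X /= np.linalg.norm(X, axis=0, keepdims=True)   # unit-norm columns
    return X * rng.uniform(0.0, 1.0, size=(1, K))   # shrink into the unit ball


def expected_squared_ratio(r=16, K=8, n_trials=5000, seed=0):
    """Monte Carlo estimate of E[||Attn(U) - Attn(U')||_F^2] / E[||U - U'||_F^2]."""
    rng = np.random.default_rng(seed)
    num = den = 0.0
    for _ in range(n_trials):
        A = rng.standard_normal((r, r))        # query-key matrix with i.i.d. N(0,1) entries
        U = random_bounded_columns(r, K, rng)
        U_prime = random_bounded_columns(r, K, rng)
        num += np.sum((attn(U, A, r) - attn(U_prime, A, r)) ** 2)
        den += np.sum((U - U_prime) ** 2)
    return num / den


if __name__ == "__main__":
    # A ratio <= 1 is consistent with the conjectured inequality (evidence, not proof).
    print(f"estimated ratio: {expected_squared_ratio():.4f}")
```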

Background

In Section ‘Learning from Sequences,’ the authors derive estimation error bounds for autoregressive transformer processes. A key technical step involves bounding the squared Lipschitz constant of transformer layers, which currently leads to a quadratic dependence on depth in the final bounds.

The authors note that this depth dependence could be eliminated if the softmax self-attention operator were confirmed to be 1-Lipschitz in expectation with respect to the squared Frobenius norm. However, they explicitly state that it is unknown whether softmax attention satisfies this property, motivating the open problem.

References

This is due to the fact that it is unknown whether softmax attention obeys the condition that the expected squared Lipschitz constant is ≤ 1.

Information-Theoretic Foundations for Machine Learning (arXiv:2407.12288, Jeon et al., 17 Jul 2024), in Learning from Sequences → Transformer Process → Main Result.