Expected Lipschitz Constant of Softmax Self-Attention in Transformers
Establish whether the softmax self-attention operator used in transformer layers has an expected squared Lipschitz constant of at most 1 under the stochastic assumptions specified in the paper. Concretely, for inputs U with K columns whose column vectors have L2 norm at most 1, query–key matrices A^(ℓ) with i.i.d. N(0,1) entries, and attention computed columnwise as Attn^(ℓ)(U) = Softmax(U^T A^(ℓ) U / sqrt(r)), determine whether the attention mapping satisfies E[||Attn^(ℓ)(U) − Attn^(ℓ)(U′)||_F^2] ≤ E[||U − U′||_F^2]. Resolving this question would allow the current quadratic dependence on depth in the paper's transformer estimation error bounds to be removed.
The quadratic depth dependence persists precisely because it is unknown whether softmax attention satisfies the condition that its expected squared Lipschitz constant is at most 1.
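To make the quantity concrete, the following is a minimal Monte Carlo sketch (not from the paper) that numerically probes the inequality: for a few randomly drawn input pairs (U, U′) with column norms at most 1, it estimates E_A[||Attn(U) − Attn(U′)||_F^2] / ||U − U′||_F^2 by averaging over i.i.d. N(0,1) matrices A. The dimensions r = 16 and K = 8, the input-sampling scheme, and the helper names (column_softmax, attn, sample_inputs, expected_ratio) are illustrative assumptions. An observed ratio above 1 would be a numerical counterexample; ratios below 1 are only weak evidence and cannot settle the question.

```python
import numpy as np


def column_softmax(S):
    """Numerically stable softmax applied to each column of the score matrix S."""
    S = S - S.max(axis=0, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)


def attn(U, A):
    """Attn(U) = Softmax(U^T A U / sqrt(r)) with columnwise softmax; U is r x K."""
    r = U.shape[0]
    return column_softmax(U.T @ A @ U / np.sqrt(r))


def sample_inputs(rng, r, K):
    """Draw an r x K matrix whose columns have L2 norm at most 1 (one arbitrary choice)."""
    X = rng.standard_normal((r, K))
    return X / np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1.0)


def expected_ratio(U, Up, n_samples=2000, rng=None):
    """Monte Carlo estimate of E_A[||Attn(U) - Attn(U')||_F^2] / ||U - U'||_F^2."""
    if rng is None:
        rng = np.random.default_rng(0)
    r = U.shape[0]
    num = 0.0
    for _ in range(n_samples):
        A = rng.standard_normal((r, r))  # i.i.d. N(0,1) query-key matrix A^(l)
        num += np.sum((attn(U, A) - attn(Up, A)) ** 2)
    return (num / n_samples) / np.sum((U - Up) ** 2)


if __name__ == "__main__":
    r, K = 16, 8  # illustrative dimensions, not taken from the paper
    rng = np.random.default_rng(1)
    # Probe the inequality on a few random input pairs; a ratio above 1 would be
    # a counterexample, while ratios below 1 prove nothing about the general case.
    ratios = [
        expected_ratio(sample_inputs(rng, r, K), sample_inputs(rng, r, K), rng=rng)
        for _ in range(20)
    ]
    print(f"max observed ratio over sampled pairs: {max(ratios):.4f}")
```

Such a probe only samples the input space, so it can at best falsify the inequality; establishing it in general requires an argument that holds uniformly over all admissible U, U′ under the paper's stochastic assumptions.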