Analysis and control of mean-field self-attention dynamics in Transformers
Investigate the dynamics of the McKean–Vlasov partial differential equation ∂_s μ_s + div(μ_s A_{ω_s}(μ_s)) = 0 that models non-causal Transformer self-attention in the mean-field limit, and establish theoretical guarantees for optimizing the parameter path θ = (Q_s, K_s, V_s)_{s ∈ [0,1]} via gradient descent viewed as a PDE control problem.
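To make the object of study concrete, the PDE above can be approximated by a particle system: when μ_s is the empirical measure of n tokens x_1(s), ..., x_n(s), each token moves with velocity A_{ω_s}(μ_s)(x_i). The sketch below is illustrative only. It assumes the usual softmax form of the attention field, A_ω(μ)(x) = ∫ exp(⟨Qx, Ky⟩) V y dμ(y) / ∫ exp(⟨Qx, Ky⟩) dμ(y) with ω = (Q, K, V), which this entry does not spell out, and it assumes a forward Euler discretization in depth with a piecewise-constant parameter path. The helper names `matvec`, `attention_field`, and `simulate` are hypothetical.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def attention_field(points, Q, K, V):
    """Evaluate the assumed field A_omega(mu) at each particle of the
    empirical measure mu = (1/n) * sum_i delta_{x_i}."""
    queries = [matvec(Q, x) for x in points]
    keys = [matvec(K, x) for x in points]
    values = [matvec(V, x) for x in points]
    field = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max before exponentiating, for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Softmax-weighted average of the value vectors V x_j.
        field.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return field


def simulate(points, theta):
    """Forward Euler in depth s on [0, 1]; theta is a list of (Q, K, V)
    triples, one per step (an assumed piecewise-constant parameter path)."""
    step = 1.0 / len(theta)
    for Q, K, V in theta:
        field = attention_field(points, Q, K, V)
        points = [
            [x_c + step * f_c for x_c, f_c in zip(x, f)]
            for x, f in zip(points, field)
        ]
    return points


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    theta = [(identity, identity, identity)] * 10  # hypothetical constant path
    for x in simulate(tokens, theta):
        print(x)
```

In this discretized picture, the control problem amounts to choosing the list `theta` so that the final particle configuration minimizes some loss. The open question concerns the continuous-depth, mean-field version of that optimization, which this sketch does not address.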
References
Better understanding these evolutions, as well as the optimization of parameters $\theta = (Q_s, K_s, V_s)_{s \in [0,1]}$ via gradient descent (eq:grad-desc), remains an open problem. This problem can be viewed as a control problem for the PDE (eq:edp-transformer).
— The Mathematics of Artificial Intelligence
(2501.10465 - Peyré, 15 Jan 2025) in Section “Generative AI for Text”, Mean-Field Representation of Attention paragraph