Training in the PDE-based transformer formulation
Develop a rigorous methodology, with accompanying theory, for optimizing the depth-dependent parameters θ_t of the PDE-based continuous-depth transformer model, in which the token distribution α_t evolves according to the conservation law ∂_t α_t + div(α_t Γ_{θ_t}[α_t]) = 0, where Γ_θ[α](x) = (∫ exp(⟨Qx, Ky⟩) V y dα(y)) / (∫ exp(⟨Qx, Kz⟩) dα(z)). Establish how to carry out this optimization over the whole parameter path (θ_t)_t and characterize its properties, thereby giving a training procedure for transformers cast as PDEs.
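For an empirical measure α = (1/n) Σ_i δ_{x_i}, the vector field Γ_θ[α](x_i) reduces to standard softmax self-attention over the particles, and the continuity equation becomes a coupled ODE system along its characteristics. The sketch below (function names `attention_field` and `evolve`, the forward-Euler step, and the per-layer parameter list are illustrative assumptions, not constructions from the source) simulates this particle discretization; it does not address the open training problem, only the forward dynamics one would differentiate through.

```python
import numpy as np

def attention_field(X, Q, K, V):
    """Gamma_theta[alpha](x_i) for the empirical measure alpha = (1/n) sum_i delta_{x_i}:
    a softmax-weighted average of V x_j, i.e. standard self-attention on particles."""
    logits = (X @ Q.T) @ (X @ K.T).T          # logits[i, j] = <Q x_i, K x_j>
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)          # row-stochastic attention weights
    return W @ (X @ V.T)                       # sum_j W[i, j] * V x_j

def evolve(X0, thetas, dt=0.1):
    """Forward-Euler discretization of the characteristics
    d x_i / dt = Gamma_{theta_t}[alpha_t](x_i) of the continuity equation.
    `thetas` is a list of (Q, K, V) triples, one per depth step,
    playing the role of the depth-dependent parameters theta_t."""
    X = X0.copy()
    for Q, K, V in thetas:
        X = X + dt * attention_field(X, Q, K, V)
    return X
```

In this discretize-then-optimize view, training would mean differentiating a loss on `evolve(X0, thetas)` with respect to every (Q, K, V) (e.g. via automatic differentiation); the open question posed above is how to perform and analyze this optimization at the level of the continuous-depth PDE itself.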
References
The key challenge lies in understanding the training of the network, which corresponds to optimizing the parameters (θ_t)_t. This remains an open problem.
— Optimal Transport for Machine Learners
(arXiv:2505.06589, Peyré, 10 May 2025), Section 'Wasserstein (gradient) Flows', Subsection 'Application: Evolution in Depth of Transformers'