Training in the PDE-based transformer formulation

Develop a rigorous methodology and accompanying theory to optimize the depth-dependent parameters θ_t in the PDE-based continuous-depth transformer model, where token distributions evolve according to the conservation law ∂_t α_t + div(α_t Γ_{θ_t}[α_t]) = 0 with Γ_{θ}[α](x) = (∫ exp(⟨Qx, Ky⟩) V y dα(y)) / (∫ exp(⟨Qx, Kz⟩) dα(z)). Establish how to perform such optimization over (θ_t)_t and characterize its properties for training transformers cast as PDEs.
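For concreteness, Γ_θ[α] can be evaluated in closed form when α is an empirical measure α = (1/n) Σ_j δ_{y_j}: both integrals become softmax-weighted sums over the tokens. A minimal NumPy sketch of this evaluation (all function and variable names here are illustrative, not from the notes):

```python
import numpy as np

def gamma(theta, X, x):
    """Attention vector field Gamma_theta[alpha](x) for an empirical
    measure alpha = (1/n) sum_j delta_{y_j}, with the y_j as rows of X.
    theta = (Q, K, V) are the query/key/value matrices."""
    Q, K, V = theta
    logits = (Q @ x) @ (X @ K.T).T       # <Q x, K y_j> for each token y_j
    w = np.exp(logits - logits.max())    # numerically stabilized softmax
    w /= w.sum()
    return (X @ V.T).T @ w               # sum_j w_j * V y_j
```

Subtracting the maximum logit rescales numerator and denominator identically, so the softmax weights are unchanged.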

Background

The notes propose modeling very deep transformers by a continuous-depth limit, yielding a PDE that governs the evolution of the empirical token distribution α_t. With a single-head attention mechanism (ignoring MLP layers, normalization, and masking), the per-layer update x_i → x_i + τ Γ_{θ_t}[α_t](x_i) with step size τ = 1/T induces the pushforward evolution α_{t+τ} = (Id + τ Γ_{θ_t}[α_t])♯ α_t and, in the limit τ → 0, the PDE ∂_t α_t + div(α_t Γ_{θ_t}[α_t]) = 0.
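On an empirical measure, one such layer is an explicit Euler step that moves every particle along the field Γ. The sketch below (hypothetical helper, not from the notes) applies the pushforward (Id + τ Γ_{θ_t}[α_t])♯ to the rows of X simultaneously:

```python
import numpy as np

def euler_step(theta, X, tau):
    """One layer = one explicit Euler step of the transport equation:
    pushes each particle x_i by tau * Gamma_theta[alpha_t](x_i), where
    alpha_t is the empirical measure of the rows of X."""
    Q, K, V = theta
    logits = (X @ Q.T) @ (X @ K.T).T             # logits[i, j] = <Q x_i, K x_j>
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)            # row-stochastic attention matrix
    return X + tau * (W @ X @ V.T)               # (Id + tau * Gamma)_# on particles
```

Iterating this map T times with τ = 1/T recovers the discrete T-layer network; letting τ → 0 gives the continuity equation above.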

While this formulation connects transformers to transport PDEs, the notes highlight that understanding how to train such a model—i.e., how to optimize the time-dependent parameters θ_t = (K, Q, V) driving Γ_{θ_t}[α_t]—remains unresolved.
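One naive baseline—emphatically not a resolution of the open problem—is discretize-then-optimize: unroll T Euler steps with per-depth parameters θ_t and run gradient descent on a terminal cost. The sketch below makes that objective concrete; the terminal loss and the finite-difference gradient (a stand-in for backpropagation through the unrolled dynamics) are illustrative assumptions:

```python
import numpy as np

def forward(thetas, X0, tau):
    """Unroll one Euler step per layer; thetas is a list of
    per-depth (Q, K, V) triples."""
    X = X0
    for Q, K, V in thetas:
        logits = (X @ Q.T) @ (X @ K.T).T             # <Q x_i, K x_j>
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        W /= W.sum(axis=1, keepdims=True)            # attention weights
        X = X + tau * (W @ X @ V.T)                  # Euler / pushforward step
    return X

def loss(thetas, X0, Xtarget, tau):
    """Hypothetical terminal cost on the final particle positions."""
    return float(np.sum((forward(thetas, X0, tau) - Xtarget) ** 2))

def fd_gradient(thetas, X0, Xtarget, tau, eps=1e-5):
    """Forward finite-difference gradient w.r.t. every matrix entry."""
    f0 = loss(thetas, X0, Xtarget, tau)
    grads = []
    for t, triple in enumerate(thetas):
        g_triple = []
        for m, M in enumerate(triple):
            G = np.zeros_like(M)
            for idx in np.ndindex(*M.shape):
                pert = [[np.copy(A) for A in th] for th in thetas]
                pert[t][m][idx] += eps
                G[idx] = (loss(pert, X0, Xtarget, tau) - f0) / eps
            g_triple.append(G)
        grads.append(g_triple)
    return grads
```

Whether such pointwise-in-depth gradient descent is the right notion of optimization for the continuum model—rather than, say, a flow on the parameter path (θ_t)_t—is precisely what the open problem asks to be made rigorous.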

References

The key challenge lies in understanding the training of the network, which corresponds to optimizing the parameters (θ_t)_t. This remains an open problem.

Optimal Transport for Machine Learners (2505.06589 - Peyré, 10 May 2025) in Section 'Wasserstein (gradient) Flows', Subsection 'Application: Evolution in Depth of Transformers'