Training in the PDE-based transformer formulation

Develop a rigorous methodology and accompanying theory to optimize the depth-dependent parameters θ_t in the PDE-based continuous-depth transformer model, where token distributions evolve according to the conservation law ∂_t α_t + div(α_t Γ_{θ_t}[α_t]) = 0 with Γ_{θ}[α](x) = (∫ exp(⟨Qx, Ky⟩) V y dα(y)) / (∫ exp(⟨Qx, Kz⟩) dα(z)). Establish how to perform such optimization over (θ_t)_t and characterize its properties for training transformers cast as PDEs.
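For concreteness, Γ_θ[α] can be evaluated in closed form when α is an empirical measure α = (1/n) Σ_j δ_{y_j}: both integrals become softmax-weighted sums over the tokens. A minimal NumPy sketch of this evaluation (all function and variable names here are illustrative, not from the notes):

```python
import numpy as np

def gamma(theta, X, x):
    """Attention vector field Gamma_theta[alpha](x) for an empirical
    measure alpha = (1/n) sum_j delta_{y_j}, with the y_j as rows of X.
    theta = (Q, K, V) are the query/key/value matrices."""
    Q, K, V = theta
    logits = (Q @ x) @ (X @ K.T).T       # <Q x, K y_j> for each token y_j
    w = np.exp(logits - logits.max())    # numerically stabilized softmax
    w /= w.sum()
    return (X @ V.T).T @ w               # sum_j w_j * V y_j
```

Subtracting the maximum logit rescales numerator and denominator identically, so the softmax weights are unchanged.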

Background

The notes propose modeling very deep transformers by a continuous-depth limit, yielding a PDE that governs the evolution of the empirical token distribution α_t. With a single-head attention mechanism (ignoring MLP layers, normalization, and masking), the per-layer update x_i → x_i + τ Γ_{θ_t}[α_t](x_i) with step size τ = 1/T induces the pushforward evolution α_{t+τ} = (Id + τ Γ_{θ_t}[α_t])♯ α_t and, in the limit τ → 0, the PDE ∂_t α_t + div(α_t Γ_{θ_t}[α_t]) = 0.
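On an empirical measure, one such layer is an explicit Euler step that moves every particle along the field Γ. The sketch below (hypothetical helper, not from the notes) applies the pushforward (Id + τ Γ_{θ_t}[α_t])♯ to the rows of X simultaneously:

```python
import numpy as np

def euler_step(theta, X, tau):
    """One layer = one explicit Euler step of the transport equation:
    pushes each particle x_i by tau * Gamma_theta[alpha_t](x_i), where
    alpha_t is the empirical measure of the rows of X."""
    Q, K, V = theta
    logits = (X @ Q.T) @ (X @ K.T).T             # logits[i, j] = <Q x_i, K x_j>
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)            # row-stochastic attention matrix
    return X + tau * (W @ X @ V.T)               # (Id + tau * Gamma)_# on particles
```

Iterating this map T times with τ = 1/T recovers the discrete T-layer network; letting τ → 0 gives the continuity equation above.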

While this formulation connects transformers to transport PDEs, the notes highlight that understanding how to train such a model—i.e., how to optimize the time-dependent parameters θ_t = (K, Q, V) driving Γ_{θ_t}[α_t]—remains unresolved.
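One naive baseline—emphatically not a resolution of the open problem—is discretize-then-optimize: unroll T Euler steps with per-depth parameters θ_t and run gradient descent on a terminal cost. The sketch below makes that objective concrete; the terminal loss and the finite-difference gradient (a stand-in for backpropagation through the unrolled dynamics) are illustrative assumptions:

```python
import numpy as np

def forward(thetas, X0, tau):
    """Unroll one Euler step per layer; thetas is a list of
    per-depth (Q, K, V) triples."""
    X = X0
    for Q, K, V in thetas:
        logits = (X @ Q.T) @ (X @ K.T).T             # <Q x_i, K x_j>
        logits -= logits.max(axis=1, keepdims=True)
        W = np.exp(logits)
        W /= W.sum(axis=1, keepdims=True)            # attention weights
        X = X + tau * (W @ X @ V.T)                  # Euler / pushforward step
    return X

def loss(thetas, X0, Xtarget, tau):
    """Hypothetical terminal cost on the final particle positions."""
    return float(np.sum((forward(thetas, X0, tau) - Xtarget) ** 2))

def fd_gradient(thetas, X0, Xtarget, tau, eps=1e-5):
    """Forward finite-difference gradient w.r.t. every matrix entry."""
    f0 = loss(thetas, X0, Xtarget, tau)
    grads = []
    for t, triple in enumerate(thetas):
        g_triple = []
        for m, M in enumerate(triple):
            G = np.zeros_like(M)
            for idx in np.ndindex(*M.shape):
                pert = [[np.copy(A) for A in th] for th in thetas]
                pert[t][m][idx] += eps
                G[idx] = (loss(pert, X0, Xtarget, tau) - f0) / eps
            g_triple.append(G)
        grads.append(g_triple)
    return grads
```

Whether such pointwise-in-depth gradient descent is the right notion of optimization for the continuum model—rather than, say, a flow on the parameter path (θ_t)_t—is precisely what the open problem asks to be made rigorous.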

References

The key challenge lies in understanding the training of the network, which corresponds to optimizing the parameters (θ_t)_t. This remains an open problem.

Optimal Transport for Machine Learners (2505.06589 - Peyré, 10 May 2025) in Section 'Wasserstein (gradient) Flows', Subsection 'Application: Evolution in Depth of Transformers'