Next-token prediction capacity for multi-layer and non–back-loaded architectures
Analyze the next-token prediction capacity of multi-layer transformers and of modified architectures, such as Mixture of Softmaxes, that allocate a greater proportion of parameters to the embedding and self-attention sub-layers than to the feedforward (FFN) sub-layer, i.e., that are not back-loaded.
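For concreteness, the sketch below shows a minimal Mixture-of-Softmaxes output head in PyTorch, following the construction of Yang et al. (2018) rather than anything specified in the source paper; the class name, dimensions, and the choice of a shared vocabulary decoder are illustrative assumptions. It shows where MoS adds parameters: K latent projections and data-dependent mixture weights sit at the output side of the model, outside the FFN sub-layer, which is why such architectures need not be back-loaded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Minimal Mixture-of-Softmaxes head (after Yang et al., 2018).

    Replaces the single output softmax with a convex combination of
    K softmaxes, each computed from its own projection of the hidden
    state. All dimensions here are illustrative.
    """

    def __init__(self, d_model: int, vocab_size: int, n_components: int = 4):
        super().__init__()
        self.n_components = n_components
        # Mixture weights pi_k(h): one logit per component.
        self.prior = nn.Linear(d_model, n_components)
        # Component-specific projections h -> h_k.
        self.latent = nn.Linear(d_model, n_components * d_model)
        # Vocabulary decoder, shared across components (an assumption here).
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_model) hidden state from the final transformer layer.
        batch = h.size(0)
        pi = F.softmax(self.prior(h), dim=-1)          # (batch, K)
        hk = torch.tanh(self.latent(h))                # (batch, K*d)
        hk = hk.view(batch, self.n_components, -1)     # (batch, K, d)
        comp = F.softmax(self.decoder(hk), dim=-1)     # (batch, K, V)
        # Mix the K component distributions with weights pi.
        return torch.einsum("bk,bkv->bv", pi, comp)    # (batch, V)

# Usage: the output is itself a valid probability distribution.
head = MixtureOfSoftmaxes(d_model=64, vocab_size=100, n_components=4)
probs = head(torch.randn(8, 64))
assert torch.allclose(probs.sum(-1), torch.ones(8), atol=1e-5)
```

Because the mixture of softmaxes is not a single log-linear map, its output need not be low-rank in the way a standard softmax layer is, which is the motivation for studying its next-token prediction capacity separately.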
References
It is also left as a future direction to analyze the next-token prediction capacity for multi-layer transformers and for modified architectures, such as Mixture of Softmaxes, which are not back-loaded, i.e., which have a greater proportion of parameters in the embedding and self-attention sub-layers than in the FNN [feedforward] sub-layer.
— Next-token prediction capacity: general upper bounds and a lower bound for transformers
(arXiv:2405.13718, Madden et al., 22 May 2024) in Section 9 (Conclusion), first paragraph