Next-token prediction capacity for multi-layer and non–back-loaded architectures
Analyze the next-token prediction capacity of multi-layer transformers and of modified architectures, such as Mixture of Softmaxes, that allocate a greater proportion of parameters to the embedding and self-attention sub-layers than to the feedforward (FFN) sub-layer, i.e., that are not back-loaded.
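For concreteness, the sketch below shows a minimal Mixture-of-Softmaxes output head in PyTorch, following the construction of Yang et al. (2018) rather than anything specified in the source paper; the class name, dimensions, and the choice of a shared vocabulary decoder are illustrative assumptions. It shows where MoS adds parameters: K latent projections and data-dependent mixture weights sit at the output side of the model, outside the FFN sub-layer, which is why such architectures need not be back-loaded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Minimal Mixture-of-Softmaxes head (after Yang et al., 2018).

    Replaces the single output softmax with a convex combination of
    K softmaxes, each computed from its own projection of the hidden
    state. All dimensions here are illustrative.
    """

    def __init__(self, d_model: int, vocab_size: int, n_components: int = 4):
        super().__init__()
        self.n_components = n_components
        # Mixture weights pi_k(h): one logit per component.
        self.prior = nn.Linear(d_model, n_components)
        # Component-specific projections h -> h_k.
        self.latent = nn.Linear(d_model, n_components * d_model)
        # Vocabulary decoder, shared across components (an assumption here).
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, d_model) hidden state from the final transformer layer.
        batch = h.size(0)
        pi = F.softmax(self.prior(h), dim=-1)          # (batch, K)
        hk = torch.tanh(self.latent(h))                # (batch, K*d)
        hk = hk.view(batch, self.n_components, -1)     # (batch, K, d)
        comp = F.softmax(self.decoder(hk), dim=-1)     # (batch, K, V)
        # Mix the K component distributions with weights pi.
        return torch.einsum("bk,bkv->bv", pi, comp)    # (batch, V)

# Usage: the output is itself a valid probability distribution.
head = MixtureOfSoftmaxes(d_model=64, vocab_size=100, n_components=4)
probs = head(torch.randn(8, 64))
assert torch.allclose(probs.sum(-1), torch.ones(8), atol=1e-5)
```

Because the mixture of softmaxes is not a single log-linear map, its output need not be low-rank in the way a standard softmax layer is, which is the motivation for studying its next-token prediction capacity separately.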
References
It is also left as a future direction to analyze the next-token prediction capacity for multi-layer transformers and for modified architectures, such as Mixture of Softmaxes, which are not back-loaded, i.e., which have a greater proportion of parameters in the embedding and self-attention sub-layers than in the FNN [feedforward] sub-layer.
— Next-token prediction capacity: general upper bounds and a lower bound for transformers
(arXiv:2405.13718, Madden et al., 22 May 2024) in Section 9 (Conclusion), first paragraph