Practical usefulness of token-averaged FNNs
Investigate the practical effectiveness of the token-averaged feedforward neural network architecture defined by h_θ(α) = softmax(V^T ψ(W^T Z[:,α] u[1:|α|] + b)) for next-token prediction, including whether this model performs well in realistic training and evaluation settings relative to standard transformer baselines.
References
So, we leave it as a future research direction to explore the usefulness of token-averaged FNNs in practice.
— Next-token prediction capacity: general upper bounds and a lower bound for transformers
(2405.13718 - Madden et al., 22 May 2024) in Section 8 (Token-averaging), final paragraph