
Practical usefulness of token-averaged FNNs

Investigate the practical effectiveness of the token-averaged feedforward neural network architecture defined by h_θ(α) = softmax(V^T ψ(W^T Z[:,α] u[1:|α|] + b)) for next-token prediction, including whether this model performs well in realistic training and evaluation settings relative to standard transformer baselines.
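For concreteness, here is a minimal NumPy sketch of the architecture above. The names (Z, W, b, V, ψ) follow the formula in the question; the uniform averaging weights u = (1/t, ..., 1/t), the tanh activation, and the random initialization in the usage snippet are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def token_averaged_fnn(alpha, Z, W, b, V, psi=np.tanh):
    """Token-averaged FNN: h_theta(alpha) = softmax(V^T psi(W^T Z[:,alpha] u[1:t] + b)).

    alpha : length-t sequence of token ids
    Z     : (d, vocab) token-embedding matrix
    W, b  : (d, m) hidden-layer weights and (m,) bias
    V     : (m, vocab) output weights
    psi   : elementwise activation (tanh here is an assumption, not the paper's choice)
    """
    t = len(alpha)
    u = np.full(t, 1.0 / t)        # assumed uniform averaging weights u[1:t]
    ctx = Z[:, alpha] @ u          # (d,) average of the context's token embeddings
    hidden = psi(W.T @ ctx + b)    # (m,) hidden features
    return softmax(V.T @ hidden)   # (vocab,) next-token distribution

# Illustrative usage with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
d, m, vocab = 8, 16, 50
Z = rng.standard_normal((d, vocab))
W = rng.standard_normal((d, m))
b = rng.standard_normal(m)
V = rng.standard_normal((m, vocab))
probs = token_averaged_fnn([3, 7, 7, 41], Z, W, b, V)
print(probs.shape, round(probs.sum(), 6))  # (50,) 1.0
```

Note that the only sequence-dependent operation is the single matrix-vector average Z[:,α] u, which replaces attention entirely; everything after it is a standard one-hidden-layer network over the vocabulary.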


Background

In Section 8, the authors introduce token-averaging, a simplified alternative to self-attention that preserves the key injectivity property used in their proofs. They prove that the resulting token-averaged FNN achieves optimal next-token prediction capacity, analogous to their results for transformers.

Despite the theoretical capacity guarantees, the authors emphasize that memory capacity alone does not determine optimization or generalization performance. They therefore explicitly identify the practical utility of token-averaged FNNs as an open direction for empirical and applied work.

References

So, we leave it as a future research direction to explore the usefulness of token-averaged FNNs in practice.

Madden et al., "Next-token prediction capacity: general upper bounds and a lower bound for transformers" (arXiv:2405.13718, 22 May 2024), Section 8 (Token-averaging), final paragraph.