Analyzing transformer length generalization in other practical settings
Analyze and characterize the length generalization behavior of transformers beyond the q-sparse token selection task, identifying the conditions and mechanisms that allow models trained on short sequences to extrapolate reliably to longer sequences in practical settings.
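A minimal sketch of what such an analysis could measure is given below, assuming one common formalization of the q-sparse token selection task (target = mean of q indexed tokens); the function names, the naive baseline predictor, and the evaluation lengths are illustrative assumptions, not the authors' protocol. The idea is simply to train at one sequence length and track test error as the length grows.

```python
"""Length-generalization probe for a q-sparse-token-selection-style task.

Hypothetical setup: each example is a sequence of random token vectors plus
a set of q indices; the target is the mean of the selected tokens. A trained
transformer would replace the placeholder predictor; the shape of its
error-vs-length curve is the quantity of interest.
"""
import numpy as np

rng = np.random.default_rng(0)


def sample_batch(batch_size, seq_len, dim, q):
    """Draw random tokens and a random q-subset of indices per example."""
    X = rng.standard_normal((batch_size, seq_len, dim))
    idx = np.stack([rng.choice(seq_len, size=q, replace=False)
                    for _ in range(batch_size)])
    # Target: mean of the q selected tokens, shape (batch, dim).
    y = X[np.arange(batch_size)[:, None], idx].mean(axis=1)
    return X, idx, y


def evaluate(model_fn, seq_lens, dim=16, q=3, batch_size=512):
    """Mean-squared error of a predictor at each test sequence length."""
    errors = {}
    for L in seq_lens:
        X, idx, y = sample_batch(batch_size, L, dim, q)
        pred = model_fn(X, idx)
        errors[L] = float(np.mean((pred - y) ** 2))
    return errors


if __name__ == "__main__":
    train_len = 16
    test_lens = [16, 32, 64, 128, 256]  # lengths beyond train_len probe extrapolation

    # Placeholder predictor: ignores the indices and averages all tokens.
    naive = lambda X, idx: X.mean(axis=1)

    print("MSE by sequence length (naive baseline):")
    for L, err in evaluate(naive, test_lens).items():
        print(f"  L={L:>4}: {err:.4f}")
```

A flat error curve across test lengths would indicate length generalization; error that grows with L (as it does for the naive baseline here) would indicate a failure to extrapolate.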
References
There are still many open questions. Can we analyze the length generalization of transformers in other practical settings?
— Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot
(Wang et al., arXiv:2406.06893, 11 Jun 2024) in Conclusion