Linear Transformers with Learnable Kernel Functions are Better In-Context Models (2402.10644v2)
Abstract: Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in the rapidly evolving field of natural language processing. Current innovations, including State Space Models, were initially celebrated for surpassing Transformer performance on language modeling tasks. However, these models have revealed deficiencies in essential In-Context Learning capabilities, a domain where the Transformer traditionally shines. The Based model emerged as a hybrid solution, blending a Linear Transformer with a kernel inspired by the Taylor expansion of the exponential function, augmented by convolutional networks. Mirroring the Transformer's in-context adeptness, it became a strong contender in the field. In our work, we present a singular, elegant alteration to the Based kernel that amplifies its In-Context Learning abilities, evaluated with the Multi-Query Associative Recall task, and improves overall language modeling, as demonstrated on the Pile dataset.
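To make the "kernel inspired by the Taylor expansion" concrete, below is a minimal sketch (not the authors' implementation) of causal linear attention with a Based-style feature map phi(x) = [1, x, vec(x xᵀ)/√2], for which phi(q)·phi(k) = 1 + q·k + (q·k)²/2, the second-order Taylor expansion of exp(q·k). Function names, shapes, and scaling here are assumptions for illustration only; the paper's contribution replaces this fixed map with a learnable kernel, which is not reproduced here.

```python
import torch


def taylor_feature_map(x: torch.Tensor) -> torch.Tensor:
    """Map (..., d) -> (..., 1 + d + d*d) so that dot products of mapped
    queries and keys approximate exp(q . k) up to second order."""
    ones = torch.ones_like(x[..., :1])                        # 0th-order term
    quad = torch.einsum("...i,...j->...ij", x, x) / 2 ** 0.5  # 2nd-order term
    return torch.cat([ones, x, quad.flatten(start_dim=-2)], dim=-1)


def causal_linear_attention(q, k, v, feature_map=taylor_feature_map, eps=1e-6):
    """q, k, v: (batch, seq_len, d). Prefix sums over phi(k_s) v_s^T and phi(k_s)
    replace the full seq_len x seq_len attention matrix; the recurrent view of the
    same computation keeps only an O(f * d) state per step."""
    q, k = feature_map(q), feature_map(k)
    kv = torch.einsum("bsf,bsd->bsfd", k, v).cumsum(dim=1)  # running sum of phi(k_s) v_s^T
    z = k.cumsum(dim=1)                                     # running sum of phi(k_s)
    num = torch.einsum("bsf,bsfd->bsd", q, kv)
    den = torch.einsum("bsf,bsf->bs", q, z).unsqueeze(-1)
    return num / (den + eps)
```

The fixed map above is what the kernel alteration targets: swapping it for a learnable function changes only the `feature_map` argument in this sketch, leaving the subquadratic attention computation untouched.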