TRAMS: Training-free Memory Selection for Long-range Language Modeling (2310.15494v3)
Abstract: The Transformer architecture is crucial for numerous AI models, but it still faces challenges in long-range language modeling. Though several transformer architectures have been designed to tackle long-range dependencies, existing methods like Transformer-XL are plagued by a high percentage of ineffective memories. In this study, we present a plug-and-play strategy, known as TRAining-free Memory Selection (TRAMS), that selects the tokens participating in the attention calculation based on one simple metric. This strategy allows us to keep the tokens that are likely to have a high attention score with the current queries and to ignore the rest. We have tested our approach on the word-level benchmark (WikiText-103) and the character-level benchmark (enwik8), and the results indicate an improvement without additional training or added parameters.
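The abstract describes the selection step only at a high level, so the sketch below illustrates the general recipe under stated assumptions: cached memory keys and values are ranked with a cheap, query-independent score, and only the top-m tokens enter the attention computation. The `select_memory` and `attend` helpers and the L2 key-norm score are illustrative stand-ins, not the paper's exact metric or implementation.

```python
import torch


def select_memory(keys, values, m):
    """Training-free selection of m memory tokens before attention.

    Assumption: the selection metric here is the L2 norm of each cached key,
    used as a cheap proxy for how large its attention score could be; the
    paper's actual metric may differ.

    keys:   (mem_len, d_k) cached key vectors
    values: (mem_len, d_v) cached value vectors
    m:      number of memory tokens to keep
    """
    scores = keys.norm(dim=-1)                      # one score per memory token
    top = torch.topk(scores, k=min(m, keys.size(0))).indices
    top, _ = torch.sort(top)                        # preserve original temporal order
    return keys[top], values[top]


def attend(query, keys, values):
    """Standard scaled dot-product attention over the selected memory."""
    d_k = query.size(-1)
    att = torch.softmax(query @ keys.T / d_k ** 0.5, dim=-1)
    return att @ values


if __name__ == "__main__":
    mem_len, d = 2048, 64
    k_cache = torch.randn(mem_len, d)
    v_cache = torch.randn(mem_len, d)
    q = torch.randn(1, d)

    k_sel, v_sel = select_memory(k_cache, v_cache, m=256)
    out = attend(q, k_sel, v_sel)                   # attends over 256 instead of 2048 tokens
    print(out.shape)                                # torch.Size([1, 64])
```

Because the score is query-independent and requires no gradients, it can be computed once per cached segment and dropped into an existing Transformer-XL-style memory without retraining, which is what makes the approach plug-and-play.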
- Unlimiformer: Long-range transformers with unlimited length input. arXiv preprint arXiv:2305.01625.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Hybrid random features. arXiv preprint arXiv:2110.04367.
- Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.
- SMYRF: Efficient attention using asymmetric clustering. Advances in Neural Information Processing Systems, 33:6476–6489.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
- Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
- When and why is document-level context useful in neural machine translation? In Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pages 24–34.
- Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Reformer: The efficient transformer. In International Conference on Learning Representations.
- ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Matt Mahoney. 2011. Large text compression benchmark.
- Pointer sentinel mixture models. In International Conference on Learning Representations.
- ABC: Attention with bounded-memory control. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7469–7483.
- Random feature attention. In International Conference on Learning Representations.
- Sparsifying transformer models with trainable representation pooling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8616–8633.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68.
- Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335.
- Not all memories are created equal: Learning to forget by expiring. In International Conference on Machine Learning, pages 9902–9912. PMLR.
- Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.
- Attention is all you need. Advances in neural information processing systems, 30.
- Fast transformers with clustered attention. Advances in Neural Information Processing Systems, 33:21665–21674.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954.
- Linear complexity randomized self-attention mechanism. In International Conference on Machine Learning, pages 27011–27041. PMLR.
- Efficient attention via control variates. In The Eleventh International Conference on Learning Representations.
- RecurrentGPT: Interactive generation of (arbitrarily) long text. arXiv preprint arXiv:2305.13304.
- Haofei Yu
- Cunxiang Wang
- Yue Zhang
- Wei Bi