TRAMS: Training-free Memory Selection for Long-range Language Modeling (2310.15494v3)

Published 24 Oct 2023 in cs.CL

Abstract: The Transformer architecture is crucial for numerous AI models, but it still faces challenges in long-range language modeling. Though several specific transformer architectures have been designed to tackle issues of long-range dependencies, existing methods like Transformer-XL are plagued by a high percentage of ineffective memories. In this study, we present a plug-and-play strategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens participating in attention calculation based on one simple metric. This strategy allows us to keep tokens that are likely to have a high attention score with the current queries and ignore the other ones. We have tested our approach on the word-level benchmark (WikiText-103) and the character-level benchmark (enwik8), and the results indicate an improvement without additional training or added parameters.
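
The abstract does not spell out the selection rule beyond "one simple metric," so the sketch below only illustrates the general recipe: score each cached memory token with a query-independent statistic of its key, keep the top-m, and attend over the kept slice plus the current segment, Transformer-XL style. The scoring rule used here (L2 norm of the mean-centered key) and the helper names are assumptions for illustration, not the paper's actual formulation.

```python
import torch

def select_memory(mem_k, mem_v, m):
    """Keep the m cached memory tokens with the highest importance score.

    mem_k, mem_v: (mem_len, d_head) cached key/value states for one head.
    The score below (L2 norm of the mean-centered key) is an assumed
    stand-in for the paper's "one simple metric"; what matters for the
    plug-and-play claim is that it is query-independent and parameter-free.
    """
    centered = mem_k - mem_k.mean(dim=0, keepdim=True)
    scores = centered.norm(dim=-1)                     # (mem_len,)
    idx = scores.topk(min(m, mem_k.size(0))).indices   # indices of kept tokens
    idx, _ = idx.sort()                                # restore temporal order
    return mem_k[idx], mem_v[idx]

def attend_with_selected_memory(q, k, v, mem_k, mem_v, m=256):
    """Scaled dot-product attention over [selected memory; current segment].
    Causal masking within the current segment is omitted for brevity."""
    sel_k, sel_v = select_memory(mem_k, mem_v, m)
    keys = torch.cat([sel_k, k], dim=0)                # (m + cur_len, d_head)
    values = torch.cat([sel_v, v], dim=0)
    attn = torch.softmax(q @ keys.T / keys.size(-1) ** 0.5, dim=-1)
    return attn @ values

# Toy usage: 1024 cached tokens, keep 256, current segment of 128 tokens.
d = 64
mem_k, mem_v = torch.randn(1024, d), torch.randn(1024, d)
q, k, v = torch.randn(128, d), torch.randn(128, d), torch.randn(128, d)
out = attend_with_selected_memory(q, k, v, mem_k, mem_v, m=256)
print(out.shape)  # torch.Size([128, 64])
```

Because the score is computed only from cached keys, the selection step adds no parameters and needs no training, which is what would make such a strategy drop-in for an existing Transformer-XL-style memory cache.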

References (30)
  1. Unlimiformer: Long-range transformers with unlimited length input. arXiv preprint arXiv:2305.01625.
  2. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  3. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
  4. Hybrid random features. arXiv preprint arXiv:2110.04367.
  5. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.
  6. SMYRF: Efficient attention using asymmetric clustering. Advances in Neural Information Processing Systems, 33:6476–6489.
  7. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
  8. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.
  9. When and why is document-level context useful in neural machine translation? In Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pages 24–34.
  10. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  11. Reformer: The efficient transformer. In International Conference on Learning Representations.
  12. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
  13. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  14. Matt Mahoney. 2011. Large text compression benchmark.
  15. Pointer sentinel mixture models. In International Conference on Learning Representations.
  16. ABC: Attention with bounded-memory control. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7469–7483.
  17. Random feature attention. In International Conference on Learning Representations.
  18. Sparsifying transformer models with trainable representation pooling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8616–8633.
  19. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  20. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68.
  21. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335.
  22. Not all memories are created equal: Learning to forget by expiring. In International Conference on Machine Learning, pages 9902–9912. PMLR.
  23. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.
  24. Attention is all you need. Advances in neural information processing systems, 30.
  25. Fast transformers with clustered attention. Advances in Neural Information Processing Systems, 33:21665–21674.
  26. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
  27. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954.
  28. Linear complexity randomized self-attention mechanism. In International Conference on Machine Learning, pages 27011–27041. PMLR.
  29. Efficient attention via control variates. In The Eleventh International Conference on Learning Representations.
  30. RecurrentGPT: Interactive generation of (arbitrarily) long text. arXiv preprint arXiv:2305.13304.
Authors (4)
  1. Haofei Yu (17 papers)
  2. Cunxiang Wang (30 papers)
  3. Yue Zhang (618 papers)
  4. Wei Bi (62 papers)
Citations (3)