CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory (2402.13449v1)
Abstract: LLMs struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory capacity and require costly re-training to integrate with a new LLM. In this work, we introduce an associative memory module which can be coupled to any pre-trained (frozen) attention-based LLM without re-training, enabling it to handle arbitrarily long input sequences. Unlike previous methods, our associative memory module consolidates representations of individual tokens into a non-parametric distribution model, dynamically managed by properly balancing the novelty and recency of the incoming data. By retrieving information from this consolidated associative memory, the base LLM can achieve significant (up to 29.7% on Arxiv) perplexity reduction in long-context modeling compared to other baselines evaluated on standard benchmarks. This architecture, which we call CAMELoT (Consolidated Associative Memory Enhanced Long Transformer), demonstrates superior performance even with a tiny context window of 128 tokens, and also enables improved in-context learning with a much larger set of demonstrations.
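The memory mechanism described in the abstract, consolidating token-level key-value representations into slots that are managed by balancing novelty against recency, can be illustrated with a toy sketch. The code below is a minimal illustration, not the paper's actual module: the slot capacity `num_slots`, the cosine-similarity novelty threshold `tau`, the running-average consolidation rule, and the least-recently-used eviction policy are all assumptions made for this example.

```python
import numpy as np

class ConsolidatedAssociativeMemory:
    """Toy consolidation-based key-value memory (illustrative only).

    Hypothetical parameters: num_slots (capacity), dim (key/value size),
    tau (novelty threshold on cosine similarity). These are not taken
    from the paper; they only make the consolidate-or-insert logic concrete.
    """

    def __init__(self, num_slots=1024, dim=64, tau=0.8):
        self.num_slots = num_slots
        self.tau = tau
        self.keys = np.zeros((0, dim))    # consolidated key centroids
        self.values = np.zeros((0, dim))  # consolidated value averages
        self.counts = np.zeros(0)         # tokens merged into each slot
        self.recency = np.zeros(0)        # last step each slot was touched
        self.step = 0

    def _similarity(self, key):
        # Cosine similarity between an incoming key and all stored keys.
        norms = np.linalg.norm(self.keys, axis=1) * np.linalg.norm(key) + 1e-8
        return self.keys @ key / norms

    def write(self, key, value):
        """Consolidate one (key, value) pair from the incoming sequence."""
        self.step += 1
        if len(self.keys) > 0:
            sims = self._similarity(key)
            best = int(np.argmax(sims))
            if sims[best] >= self.tau:
                # Not novel: fold the token into the closest slot as a
                # running average (non-parametric consolidation) and
                # refresh that slot's recency.
                c = self.counts[best]
                self.keys[best] = (c * self.keys[best] + key) / (c + 1)
                self.values[best] = (c * self.values[best] + value) / (c + 1)
                self.counts[best] += 1
                self.recency[best] = self.step
                return
        if len(self.keys) >= self.num_slots:
            # Memory is full and the token is novel: evict the least
            # recently used slot (assumed policy) to make room.
            evict = int(np.argmin(self.recency))
            self.keys = np.delete(self.keys, evict, axis=0)
            self.values = np.delete(self.values, evict, axis=0)
            self.counts = np.delete(self.counts, evict)
            self.recency = np.delete(self.recency, evict)
        # Novel token: open a new slot for it.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])
        self.counts = np.append(self.counts, 1.0)
        self.recency = np.append(self.recency, float(self.step))

    def read(self, query, k=4):
        """Return the k consolidated values most similar to the query."""
        sims = self._similarity(query)
        top = np.argsort(-sims)[:k]
        self.recency[top] = self.step
        return self.values[top], sims[top]
```

In CAMELoT, reads of this kind would supply consolidated key-value pairs for the frozen LLM's attention layers to use alongside the local context window; the exact retrieval and integration scheme is the one described in the paper, and the sketch above only mirrors the high-level consolidate/evict/retrieve behavior stated in the abstract.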
Authors: Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, Rogerio Feris