In-context Autoencoder for Context Compression in a Large Language Model (2307.06945v4)
Abstract: We propose the In-context Autoencoder (ICAE), which leverages the power of an LLM to compress a long context into short, compact memory slots that the LLM can directly condition on for various purposes. ICAE is first pretrained with both autoencoding and language modeling objectives on massive text data, enabling it to generate memory slots that accurately and comprehensively represent the original context. It is then fine-tuned on instruction data to produce desirable responses to various prompts. Experiments demonstrate that our lightweight ICAE, introducing about 1% additional parameters, effectively achieves $4\times$ context compression based on Llama, offering advantages in both latency and GPU memory cost during inference, providing an interesting insight into memorization, and showing potential for scalability. These promising results suggest a novel perspective on the connection between working memory in cognitive science and representation learning in LLMs, highlighting ICAE's significant implications for addressing the long context problem and motivating further research in LLM context management. Our data, code and models are available at https://github.com/getao/icae.
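To make the compression scheme described above concrete, the sketch below is a minimal, self-contained toy in plain PyTorch, not the authors' Llama/LoRA implementation: a set of learnable memory-token embeddings is appended to the context, the encoder's hidden states at those positions become the memory slots, and a decoder reconstructs the context conditioned only on the slots (the autoencoding pretraining objective). All names (`ToyICAE`, `compress`, `num_memory_slots`) and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyICAE(nn.Module):
    """Toy ICAE-style compressor: context -> k memory slots -> reconstruction."""

    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2,
                 num_memory_slots=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # k learnable memory-token embeddings appended after the context tokens.
        self.memory_tokens = nn.Parameter(torch.randn(num_memory_slots, d_model) * 0.02)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # The decoder cross-attends to the memory slots instead of the full context.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def compress(self, context_ids):
        """Encode (batch, ctx_len) token ids into (batch, k, d_model) memory slots."""
        batch = context_ids.size(0)
        mem = self.memory_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Positional encodings omitted for brevity in this toy sketch.
        x = torch.cat([self.embed(context_ids), mem], dim=1)
        h = self.encoder(x)
        return h[:, -mem.size(1):, :]  # hidden states at the memory-token positions

    def forward(self, context_ids, decoder_input_ids):
        """Autoencoding objective: predict the context from the memory slots alone."""
        slots = self.compress(context_ids)
        tgt_len = decoder_input_ids.size(1)
        causal = torch.triu(torch.full((tgt_len, tgt_len), float('-inf')), diagonal=1)
        dec = self.decoder(self.embed(decoder_input_ids), slots, tgt_mask=causal)
        return self.lm_head(dec)  # (batch, tgt_len, vocab)

# Usage: 4x compression of a 128-token context into 32 memory slots, then
# teacher-forced next-token reconstruction (the real ICAE additionally uses a
# dedicated [AE] token and LoRA adapters on a frozen Llama backbone).
model = ToyICAE(num_memory_slots=32)
ctx = torch.randint(0, 1000, (2, 128))
logits = model(ctx, ctx[:, :-1])
loss = nn.functional.cross_entropy(logits.transpose(1, 2), ctx[:, 1:])
```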