
In-context Autoencoder for Context Compression in a Large Language Model (2307.06945v4)

Published 13 Jul 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We propose the In-context Autoencoder (ICAE), leveraging the power of a large language model (LLM) to compress a long context into short compact memory slots that can be directly conditioned on by the LLM for various purposes. ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data, enabling it to generate memory slots that accurately and comprehensively represent the original context. Then, it is fine-tuned on instruction data for producing desirable responses to various prompts. Experiments demonstrate that our lightweight ICAE, introducing about 1% additional parameters, effectively achieves $4\times$ context compression based on Llama, offering advantages in both improved latency and GPU memory cost during inference, and showing an interesting insight in memorization as well as potential for scalability. These promising results imply a novel perspective on the connection between working memory in cognitive science and representation learning in LLMs, revealing ICAE's significant implications in addressing the long context problem and suggesting further research in LLM context management. Our data, code and models are available at https://github.com/getao/icae.

Authors (6)
  1. Tao Ge (53 papers)
  2. Jing Hu (50 papers)
  3. Lei Wang (975 papers)
  4. Xun Wang (96 papers)
  5. Si-Qing Chen (22 papers)
  6. Furu Wei (291 papers)
Citations (54)

Summary

In-context Autoencoder for Context Compression in a Large Language Model

The paper "In-context Autoencoder for Context Compression in a LLM" introduces the In-context Autoencoder (ICAE) as a novel approach to address the inherent limitations that LLMs face when processing long contexts, particularly due to the self-attention mechanism common to Transformer-based models. ICAE is designed to compress long contexts into shorter, memory-efficient representations called memory slots, which are then utilized by the LLM for various tasks. This approach is particularly beneficial in reducing inference latency and GPU memory cost, thus providing a practical solution to handling extensive sequences without the need for extensive architectural changes to the LLM itself.

ICAE is trained in two phases: pretraining and instruction fine-tuning. In the pretraining phase, ICAE is optimized with both autoencoding (AE) and language modeling (LM) objectives on a large corpus of text data. This dual-objective strategy teaches the model to generate memory slots that encapsulate the original context with high fidelity, so that the LLM can either reconstruct the original input or generate meaningful continuations from them. After pretraining, the model is fine-tuned on instruction data so that the memory slots interact effectively with diverse prompts, enabling practical deployment scenarios.
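
A rough illustration of the dual pretraining objective is sketched below: one cross-entropy term for reconstructing the original context from the memory slots and one for continuing the text beyond it. The argument names and the equal default weighting are assumptions made for the sketch, not details reported in the paper.

```python
import torch.nn.functional as F


def pretraining_loss(ae_logits, ae_targets, lm_logits, lm_targets, lm_weight=1.0):
    """Sketch of a dual-objective (AE + LM) pretraining loss; weighting is assumed.

    ae_logits/ae_targets: decoder predictions vs. the original context tokens
        when reconstructing the context from the memory slots (autoencoding).
    lm_logits/lm_targets: decoder predictions vs. ground-truth continuation
        tokens generated after the compressed context (language modeling).
    Logits are (batch, seq_len, vocab); targets are (batch, seq_len) token ids.
    """
    ae_loss = F.cross_entropy(ae_logits.flatten(0, 1), ae_targets.flatten())
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    return ae_loss + lm_weight * lm_loss
```

Instruction fine-tuning can then reuse the same cross-entropy, computed only over the response tokens, with the memory slots standing in for the original context in the prompt.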

Numerical results presented in the paper are compelling: ICAE achieves a 4x reduction in context length on the Llama model while adding only about 1% additional parameters. Such efficiency demonstrates ICAE's potential for significant reductions in computation and memory overhead during inference, a critical consideration in deployment scenarios involving real-world text processing tasks.
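
For intuition about what a 4x compression buys at inference time, the back-of-the-envelope arithmetic below assumes a 512-token context purely for illustration; the lengths are not taken from the paper's experiments.

```python
# Illustrative arithmetic only; the specific lengths are assumptions.
context_len = 512                                  # tokens in the original context
compression_ratio = 4                              # compression reported in the paper
memory_slots = context_len // compression_ratio    # slots the decoder conditions on

# The decoder attends over 128 memory slots instead of 512 raw tokens, so the
# context's share of the KV cache shrinks by ~4x and the quadratic
# self-attention cost over the context by roughly 16x.
print(memory_slots)                       # 128
print((context_len / memory_slots) ** 2)  # 16.0
```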

The experimental evaluation also varied the number of memory slots, showing that higher compression ratios degrade performance, primarily the fidelity with which the original context can be restored. Even so, the system handles typical linguistic inputs robustly, as evidenced by its autoencoding and text continuation results. Furthermore, the analysis revealed that the pretrained ICAE suffered less from hallucination than variants without pretraining, underscoring the importance of the extensive self-supervised learning phase for the model's context compression capabilities.

An intriguing aspect of ICAE lies in its potential bridge between cognitive science concepts of working memory and machine learning paradigms in LLMs. By providing a means to assess the memorization patterns of LLMs, ICAE opens a window into understanding how these models manage and retain information, paralleling research in human memory and learning processes. This conceptual association not only advances our understanding of LLMs' internal mechanisms but also suggests pathways for future research in context management and memory in AI systems.

The implications of ICAE's advancements are twofold: practical and theoretical. On the practical side, ICAE's context compression can streamline AI systems handling tasks with inherently long contexts, such as Retrieval-Augmented Generation and advanced prompting techniques. The reduction in required computational resources is particularly beneficial for scalability, facilitating the deployment of LLMs in resource-constrained environments. Theoretically, ICAE sets a precedent for exploring memory representation and context dynamics in LLMs, providing fertile ground for further investigation.

Looking to the future, ICAE presents several research opportunities. One avenue is scaling up ICAE to test its efficacy with larger and more powerful LLMs, which could potentially enhance the compression ratio without sacrificing performance. Additionally, extending ICAE to multimodal contexts, involving image, audio, and video data, could unify representations across modalities, offering a comprehensive approach to context management in AI systems. This could lead to innovations in how multimodal data are processed by AI, bridging the gap between different types of input data and compressing them into unified, concise memory representations.

Overall, the development of ICAE marks an important step in overcoming the constraints posed by long contexts in LLMs, providing an efficient and scalable methodology for context compression that aligns well with both practical application and theoretical exploration.
