Long Context Compression with Activation Beacon (2401.03462v3)

Published 7 Jan 2024 in cs.CL and cs.AI

Abstract: Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs. 1) We directly compress the activations (i.e. keys and values at every layer), rather than leveraging soft prompts to relay information (which constitute a major bottleneck to encapsulate the complex information within long contexts). 2) We tailor the compression workflow, where each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference. 3) We train the model through compression-based auto-regression, making full use of plain texts and instructional data to optimize the model's compression performance. 4) During training, we randomly sample a compression ratio at each step, teaching the model to support a wide range of compression configurations. Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains a comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache. Our data, model, and code have been released at \url{https://github.com/FlagOpen/FlagEmbedding/}.
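
To put the reported 8x KV-cache reduction into perspective, here is a back-of-the-envelope estimate of cache memory at a 128K context with and without compression. The model dimensions (32 layers, hidden size 4096, fp16 cache) are assumed to match a Llama-2-7B-style backbone and are not taken from the paper.

```python
# Rough KV-cache memory estimate; the model dimensions are assumptions
# (Llama-2-7B-style backbone), not figures from the paper.
LAYERS = 32
HIDDEN = 4096
BYTES_PER_VALUE = 2          # fp16 cache
CONTEXT = 128 * 1024         # 128K tokens, as in the paper's evaluations
COMPRESSION_RATIO = 8        # matches the reported 8x KV-cache reduction

def kv_cache_bytes(num_tokens: int) -> int:
    # Each cached token stores one key and one value vector per layer.
    return num_tokens * LAYERS * 2 * HIDDEN * BYTES_PER_VALUE

full = kv_cache_bytes(CONTEXT)
compressed = kv_cache_bytes(CONTEXT // COMPRESSION_RATIO)
print(f"uncompressed: {full / 2**30:.0f} GiB")   # ~64 GiB
print(f"compressed:   {compressed / 2**30:.0f} GiB")  # ~8 GiB
```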

Introduction

LLMs have transformed our ability to automate natural language tasks. However, their effectiveness is constrained by an intrinsic limitation: they can only attend to a fixed, and relatively short, span of text at any given time. This limited context window has been a persistent challenge, restricting the use of LLMs in scenarios where understanding lengthy documents or conversations is crucial. To remedy this, researchers have traditionally resorted to fine-tuning or re-training models on longer contexts, a procedure that incurs great computational cost and can compromise the model's performance on shorter texts.

The Activation Beacon Approach

In a promising development, the researchers introduce a methodology called "Activation Beacon" that targets the root of the context limitation problem. Building on the observation that LLM activations (the keys and values at every layer) are information-dense, Activation Beacon condenses these activations into a more compact form. As a result, even with a restricted attention window, the LLM can draw on a much broader range of context.

Activation Beacon works by inserting special tokens, known as "beacons", at intervals across the input sequence. Each fine-grained unit of the input is progressively compressed into the activations of its beacons, which then carry the essence of a much larger text segment. This strategy increases the amount of text an LLM can consider, and it does so efficiently and without affecting performance on existing, shorter contexts.
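
As a concrete illustration of this interleaving idea, the sketch below appends beacon tokens to fixed-size chunks of the input at a chosen compression ratio. The token id, chunk size, and function name are hypothetical and not taken from the released code; this is a minimal sketch of the mechanism, not the authors' implementation.

```python
from typing import List

BEACON_ID = 32000   # hypothetical id for the special beacon token
CHUNK_SIZE = 1024   # hypothetical size of each fine-grained input unit

def interleave_beacons(token_ids: List[int], ratio: int) -> List[int]:
    """Append beacon tokens after each chunk of the input.

    With compression ratio `ratio`, a chunk of CHUNK_SIZE ordinary tokens is
    followed by CHUNK_SIZE // ratio beacons; once the chunk is processed, only
    the beacons' keys/values need to be kept, shrinking the cache by ~ratio.
    """
    out: List[int] = []
    for start in range(0, len(token_ids), CHUNK_SIZE):
        chunk = token_ids[start:start + CHUNK_SIZE]
        out.extend(chunk)
        num_beacons = max(1, len(chunk) // ratio)
        out.extend([BEACON_ID] * num_beacons)
    return out

# Example: a 4096-token input at ratio 8 gains 4 * 128 = 512 beacon slots.
dummy_input = list(range(4096))
print(len(interleave_beacons(dummy_input, ratio=8)))  # 4608
```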

Streamlined Training and Compatibility

A remarkable aspect of Activation Beacon is that it trains efficiently on short-sequence data, consuming considerably less time and compute than methods that rely on extensive re-training. The beacons are introduced as a plug-and-play module atop a pre-existing LLM, keeping the original LLM parameters fixed. This preserves compatibility with the base model while potentially extending its context-handling capability a hundredfold, stretching a 4K context limit to roughly 400K.
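
A minimal sketch of this plug-and-play setup is shown below. It assumes the beacon module has already been attached to the base model and that its parameters carry "beacon" in their names (a hypothetical naming convention); the released code may organize parameters differently.

```python
import torch
from torch import nn

def freeze_base_train_beacons(model: nn.Module) -> torch.optim.AdamW:
    """Freeze the pre-trained LLM and train only the plug-in beacon parameters.

    Assumes the beacon module is already attached to `model` and that its
    parameters contain "beacon" in their names (an assumed convention).
    """
    for name, param in model.named_parameters():
        param.requires_grad = "beacon" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} / {total:,}")

    # Optimize only the unfrozen plug-in parameters.
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )
```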

Empirical Validation

Through comprehensive experiments, the effectiveness of Activation Beacon was assessed. The results showed that it extends the context window far beyond existing limits without the extensive costs typically associated with such extensions, demonstrating strong language modeling and understanding over long contexts while maintaining competitive processing speed and memory efficiency. The paper also confirmed that Activation Beacon can be trained with a range of compression ratios, which diversifies its applicability across varying context lengths.
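
The varied-ratio training can be pictured with a small sketch: at each training step a compression ratio is drawn from a candidate set before the batch is built. The candidate ratios below are illustrative, not the paper's exact schedule.

```python
import random

# Illustrative candidate compression ratios; the paper samples a ratio at each
# training step so that one model supports many compression configurations.
CANDIDATE_RATIOS = [2, 4, 8, 16, 32]

def sample_ratio(rng: random.Random) -> int:
    return rng.choice(CANDIDATE_RATIOS)

rng = random.Random(0)
for step in range(5):
    ratio = sample_ratio(rng)
    # The sampled ratio would control how many beacons are interleaved into
    # this step's batch (see the earlier interleave_beacons sketch).
    print(f"step {step}: compression ratio = {ratio}")
```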

Conclusion

In conclusion, Activation Beacon stands out as an inventive solution to the context window restriction in LLMs. It is a robust, scalable, and cost-effective module capable of significantly broadening the scope of contexts that LLMs can manage. Its plug-and-play nature, coupled with its training efficiency, opens up new horizons for longer-form language modeling and understanding tasks. Further, its compatibility ensures that existing LLM investments remain fruitful, adding yet another layer to the versatile applications of LLMs in modern computational linguistics.

Authors (6)
  1. Peitian Zhang
  2. Zheng Liu
  3. Shitao Xiao
  4. Ninglu Shao
  5. Qiwei Ye
  6. Zhicheng Dou
Citations (35)