Long Context Compression with Activation Beacon (2401.03462v3)

Published 7 Jan 2024 in cs.CL and cs.AI

Abstract: Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs. 1) We directly compress the activations (i.e. keys and values at every layer), rather than leveraging soft prompts to relay information (which constitute a major bottleneck to encapsulate the complex information within long contexts). 2) We tailor the compression workflow, where each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference. 3) We train the model through compression-based auto-regression, making full use of plain texts and instructional data to optimize the model's compression performance. 4) During training, we randomly sample a compression ratio at each step, teaching the model to support a wide range of compression configurations. Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains a comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache. Our data, model, and code have been released at \url{https://github.com/FlagOpen/FlagEmbedding/}.
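
To put the reported 8x KV-cache reduction into perspective, here is a back-of-the-envelope estimate of cache memory at a 128K context with and without compression. The model dimensions (32 layers, hidden size 4096, fp16 cache) are assumed to match a Llama-2-7B-style backbone and are not taken from the paper.

```python
# Rough KV-cache memory estimate; the model dimensions are assumptions
# (Llama-2-7B-style backbone), not figures from the paper.
LAYERS = 32
HIDDEN = 4096
BYTES_PER_VALUE = 2          # fp16 cache
CONTEXT = 128 * 1024         # 128K tokens, as in the paper's evaluations
COMPRESSION_RATIO = 8        # matches the reported 8x KV-cache reduction

def kv_cache_bytes(num_tokens: int) -> int:
    # Each cached token stores one key and one value vector per layer.
    return num_tokens * LAYERS * 2 * HIDDEN * BYTES_PER_VALUE

full = kv_cache_bytes(CONTEXT)
compressed = kv_cache_bytes(CONTEXT // COMPRESSION_RATIO)
print(f"uncompressed: {full / 2**30:.0f} GiB")   # ~64 GiB
print(f"compressed:   {compressed / 2**30:.0f} GiB")  # ~8 GiB
```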

Introduction

LLMs have transformed our ability to automate natural language tasks. However, their effectiveness is constrained by an intrinsic limitation: they can only attend to a fixed, and relatively short, span of text at any given time. This limited context window has been a persistent challenge, restricting the use of LLMs in scenarios where understanding lengthy documents or conversations is crucial. To remedy this, researchers have traditionally resorted to fine-tuning or re-training models on longer contexts, a procedure that incurs great computational cost and can compromise the model's performance on shorter texts.

The Activation Beacon Approach

In a promising development, the researchers introduce a methodology called "Activation Beacon" that targets the root of the context limitation problem. Building on the observation that LLM activations (the keys and values at every layer) are information-dense, Activation Beacon condenses these activations into a more compact form. As a result, even with a restricted attention window, the LLM can draw on a much broader range of context.

Activation Beacon works by inserting special tokens, known as "beacons", at intervals across the input sequence. Each fine-grained unit of the input is progressively compressed into the activations of its beacons, which then carry the essence of a much larger text segment. This strategy increases the amount of text an LLM can consider, and it does so efficiently and without affecting performance on existing, shorter contexts.
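
As a concrete illustration of this interleaving idea, the sketch below appends beacon tokens to fixed-size chunks of the input at a chosen compression ratio. The token id, chunk size, and function name are hypothetical and not taken from the released code; this is a minimal sketch of the mechanism, not the authors' implementation.

```python
from typing import List

BEACON_ID = 32000   # hypothetical id for the special beacon token
CHUNK_SIZE = 1024   # hypothetical size of each fine-grained input unit

def interleave_beacons(token_ids: List[int], ratio: int) -> List[int]:
    """Append beacon tokens after each chunk of the input.

    With compression ratio `ratio`, a chunk of CHUNK_SIZE ordinary tokens is
    followed by CHUNK_SIZE // ratio beacons; once the chunk is processed, only
    the beacons' keys/values need to be kept, shrinking the cache by ~ratio.
    """
    out: List[int] = []
    for start in range(0, len(token_ids), CHUNK_SIZE):
        chunk = token_ids[start:start + CHUNK_SIZE]
        out.extend(chunk)
        num_beacons = max(1, len(chunk) // ratio)
        out.extend([BEACON_ID] * num_beacons)
    return out

# Example: a 4096-token input at ratio 8 gains 4 * 128 = 512 beacon slots.
dummy_input = list(range(4096))
print(len(interleave_beacons(dummy_input, ratio=8)))  # 4608
```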

Streamlined Training and Compatibility

A remarkable aspect of Activation Beacon is that it trains efficiently on short-sequence data, consuming considerably less time and compute than methods that rely on extensive re-training. The beacons are introduced as a plug-and-play module atop a pre-existing LLM, keeping the original LLM parameters fixed. This preserves compatibility with the base model while potentially extending its context-handling capability a hundredfold, stretching a 4K context limit to roughly 400K.
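
A minimal sketch of this plug-and-play setup is shown below. It assumes the beacon module has already been attached to the base model and that its parameters carry "beacon" in their names (a hypothetical naming convention); the released code may organize parameters differently.

```python
import torch
from torch import nn

def freeze_base_train_beacons(model: nn.Module) -> torch.optim.AdamW:
    """Freeze the pre-trained LLM and train only the plug-in beacon parameters.

    Assumes the beacon module is already attached to `model` and that its
    parameters contain "beacon" in their names (an assumed convention).
    """
    for name, param in model.named_parameters():
        param.requires_grad = "beacon" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} / {total:,}")

    # Optimize only the unfrozen plug-in parameters.
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )
```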

Empirical Validation

Through comprehensive experiments, the effectiveness of Activation Beacon was assessed. The results showed that it extends the context window far beyond existing limits without the extensive costs typically associated with such extensions, demonstrating strong language modeling and understanding over long contexts while maintaining competitive processing speed and memory efficiency. The paper also confirmed that Activation Beacon can be trained with a range of compression ratios, which diversifies its applicability across varying context lengths.
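
The varied-ratio training can be pictured with a small sketch: at each training step a compression ratio is drawn from a candidate set before the batch is built. The candidate ratios below are illustrative, not the paper's exact schedule.

```python
import random

# Illustrative candidate compression ratios; the paper samples a ratio at each
# training step so that one model supports many compression configurations.
CANDIDATE_RATIOS = [2, 4, 8, 16, 32]

def sample_ratio(rng: random.Random) -> int:
    return rng.choice(CANDIDATE_RATIOS)

rng = random.Random(0)
for step in range(5):
    ratio = sample_ratio(rng)
    # The sampled ratio would control how many beacons are interleaved into
    # this step's batch (see the earlier interleave_beacons sketch).
    print(f"step {step}: compression ratio = {ratio}")
```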

Conclusion

In conclusion, Activation Beacon stands out as an inventive solution to the context window restriction in LLMs. It is a robust, scalable, and cost-effective module capable of significantly broadening the scope of contexts that LLMs can manage. Its plug-and-play nature, coupled with its training efficiency, opens up new horizons for longer-form language modeling and understanding tasks. Further, its compatibility ensures that existing LLM investments remain fruitful, adding yet another layer to the versatile applications of LLMs in modern computational linguistics.

Authors (6)
  1. Peitian Zhang
  2. Zheng Liu
  3. Shitao Xiao
  4. Ninglu Shao
  5. Qiwei Ye
  6. Zhicheng Dou
Citations (35)