
LoMA: Lossless Compressed Memory Attention (2401.09486v2)

Published 16 Jan 2024 in cs.LG and cs.CL

Abstract: LLMs face limitations due to the high demand on GPU memory and computational resources when handling long contexts. While sparsifying the Key-Value (KV) cache of transformer models is a typical strategy to alleviate resource usage, it unavoidably results in the loss of information. We introduce Lossless Compressed Memory Attention (LoMA), a novel approach that enables lossless compression of the KV cache, thereby reducing the memory and computational demands during autoregressive generation. LoMA incorporates a specialized training or fine-tuning procedure alongside an autoregressive generation algorithm optimized for the compressed context. Our method compresses the KV cache after every $tc$ generated tokens with a compression ratio of $c$ and a target compressed length $t$, and this process occurs within a single inference pass without dependency on auxiliary models. We engineered an efficient training scheme involving specific inputs, attention masks, and position identifiers to instill this compression capability. Experimental validation demonstrates that LoMA significantly reduces computational consumption and memory usage by achieving lossless KV cache compression.
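
To make the compression schedule concrete, the sketch below shows a greedy decoding loop in the spirit of the abstract: after every $t \cdot c$ newly generated tokens, $t$ memory tokens summarize them and only the memory tokens' KV pairs are retained. Everything here is an illustrative assumption (a Hugging Face-style model returning legacy tuple caches, a reserved `MEMORY_TOKEN_ID`, and omitted position-id handling), not the authors' implementation.

```python
import torch

MEMORY_TOKEN_ID = 32000   # assumed id of a reserved <memory> token
T_MEM, RATIO = 4, 4       # t = 4 memory tokens per t * c = 16 generated tokens

def compress_kv(model, past_kv, batch_size=1):
    """Hypothetical compression step: feed T_MEM memory tokens so that they
    attend to (and summarize) the last T_MEM * RATIO cached tokens, then keep
    only the memory tokens' KV pairs in place of that uncompressed span.
    Assumes the legacy Hugging Face cache layout [batch, heads, seq, dim]."""
    mem_ids = torch.full((batch_size, T_MEM), MEMORY_TOKEN_ID, dtype=torch.long)
    out = model(mem_ids, past_key_values=past_kv, use_cache=True)
    drop = T_MEM * RATIO
    new_kv = []
    for k, v in out.past_key_values:
        k = torch.cat([k[..., :-(drop + T_MEM), :], k[..., -T_MEM:, :]], dim=-2)
        v = torch.cat([v[..., :-(drop + T_MEM), :], v[..., -T_MEM:, :]], dim=-2)
        new_kv.append((k, v))
    return tuple(new_kv)

def generate_with_loma(model, input_ids, max_new_tokens=256):
    """Greedy decoding that compresses the KV cache every T_MEM * RATIO tokens."""
    past_kv, cur, generated, pending = None, input_ids, [], 0
    for _ in range(max_new_tokens):
        out = model(cur, past_key_values=past_kv, use_cache=True)
        past_kv = out.past_key_values
        cur = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(cur)
        pending += 1
        if pending == T_MEM * RATIO:
            past_kv = compress_kv(model, past_kv, batch_size=cur.shape[0])
            pending = 0
    return torch.cat(generated, dim=-1)
```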

Introduction

As LLMs continue to traverse the frontiers of NLP, the efficient management of extensive textual data has become pivotal. This efficiency is crucial for tasks involving prolonged contexts where capturing the intricacies of dialogue, document comprehension, and information retrieval becomes challenging. Several methods aimed at compressing the key-value (KV) cache have been proposed to mitigate resource consumption, but they often lead to lossy compression—resulting in the degradation of vital information.

Lossless Compressed Memory Attention (LoMA)

Against this backdrop, we introduce Lossless Compressed Memory Attention (LoMA), a novel framework that losslessly compresses contextual information into the KV pairs of special memory tokens at a fixed compression ratio. LoMA restructures input tokens into segments with dedicated reading, memory, and repetition areas, each serving a distinct function within the model's attention mechanism. The attention mask in LoMA is carefully constructed to control how these segments attend to one another, which enables efficient learning during backpropagation without loss of information.
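
As an illustration of this structure, the following sketch partitions a token window into reading, memory, and repetition areas and builds a block attention mask. The visibility pattern used here (causal attention in the reading area, memory tokens attending to the reading area, and the repetition area attending only to the memory tokens and itself) is an assumption chosen to convey the idea, not a verbatim reproduction of the paper's mask.

```python
import torch

def loma_attention_mask(n_read, n_mem, n_rep):
    """Boolean attention mask over [reading | memory | repetition] segments.
    True means the query token (row) may attend to the key token (column).
    The visibility pattern is an illustrative assumption."""
    n = n_read + n_mem + n_rep
    mask = torch.zeros(n, n, dtype=torch.bool)
    r, m = n_read, n_read + n_mem
    # Reading area: ordinary causal self-attention.
    mask[:r, :r] = torch.tril(torch.ones(r, r)).bool()
    # Memory area: sees the whole reading area, causal within itself.
    mask[r:m, :r] = True
    mask[r:m, r:m] = torch.tril(torch.ones(n_mem, n_mem)).bool()
    # Repetition area: sees only the memory tokens plus causal self-attention,
    # so reconstructing the reading area forces lossless compression into memory.
    mask[m:, r:m] = True
    mask[m:, m:] = torch.tril(torch.ones(n_rep, n_rep)).bool()
    return mask

# Example: 16 reading tokens compressed into 4 memory tokens (4:1 ratio),
# followed by a 16-token repetition area.
print(loma_attention_mask(16, 4, 16).shape)  # torch.Size([36, 36])
```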

What distinguishes LoMA from previous methods is its capacity for lossless compression, allowing the model to make full use of long-context data. This is validated by experimental results showing that the Llama-2-7B model can be efficiently fine-tuned to perform lossless memory compression at a 4:1 ratio.

Methodology

The methodology behind LoMA is intricate yet elegant. LoMA interleaves parallel compression into the autoregressive inference process, segmenting tokens and applying a custom attention mask. Crucially, the fine-tuning process equips the model to generate text from highly compressed context without requiring additional annotated data or changes to the architecture. The fine-tuning strategy is underpinned by a dual-component loss function, combining the standard LLM loss with a repetition-zone loss, to ensure comprehensive learning.
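
One plausible way to write this objective, with $R$ denoting the repetition-area positions, $m$ the memory tokens, and $\lambda$ an assumed weighting coefficient (the paper may simply sum the two terms), is

$$\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{rep}}, \qquad \mathcal{L}_{\text{rep}} = -\frac{1}{|R|} \sum_{i \in R} \log p_\theta\!\left(x_i \mid m, x_{<i}\right),$$

where $\mathcal{L}_{\text{LM}}$ is the standard next-token cross-entropy over the ordinary text tokens.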

Furthermore, the attention masks and positional embeddings within LoMA are carefully calibrated to preserve the autoregressive property and the associations of the memory zone. Gradient analysis of LoMA's architecture indicates that the absence of direct supervisory signals in the memory area does not impede the model's ability to learn memory compression.
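
One way to realize such position identifiers, stated purely as an illustrative assumption, is to let the memory tokens reuse the positional span of the reading tokens they compress, while the repetition area mirrors the reading area's positions so that it can reproduce it token for token:

```python
def loma_position_ids(n_read, n_mem, n_rep):
    """Assign position ids for [reading | memory | repetition] segments.
    Illustrative assumption: memory tokens are spread evenly over the span
    of the reading tokens they compress, and the repetition area restarts
    at the reading area's first position so it can reproduce it verbatim."""
    read_pos = list(range(n_read))
    stride = n_read // n_mem
    # e.g. 16 reading tokens, 4 memory tokens -> positions 3, 7, 11, 15
    mem_pos = [stride * (i + 1) - 1 for i in range(n_mem)]
    rep_pos = list(range(n_rep))   # restart at 0, mirroring the reading area
    return read_pos + mem_pos + rep_pos

print(loma_position_ids(16, 4, 16))
```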

Experimental Validation

The efficacy of LoMA is underscored by a series of experiments on the Llama-2-7B model, showing that fine-tuning converges rapidly and transfers to other datasets. Notably, the fine-tuning process requires only a small dataset, which simplifies model adaptation and scaling. Metrics such as repetition accuracy and token repetition accuracy were used to quantify the model's generalization ability, bearing out the strength of the LoMA methodology. The experiments also suggest that requiring fewer compression operations correlates with higher inference accuracy.
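
As a rough sketch of how such metrics can be computed (the exact definitions are assumptions; the summary only names a sequence-level and a token-level variant):

```python
def token_repetition_accuracy(predicted_ids, reference_ids):
    """Fraction of repetition-area tokens reproduced correctly from the
    compressed memory (token-level metric; definition assumed)."""
    correct = sum(int(p == r) for p, r in zip(predicted_ids, reference_ids))
    return correct / max(len(reference_ids), 1)

def repetition_accuracy(batch_pred, batch_ref):
    """Fraction of examples whose repetition area is reproduced exactly
    (sequence-level metric; definition assumed)."""
    exact = sum(int(list(p) == list(r)) for p, r in zip(batch_pred, batch_ref))
    return exact / max(len(batch_ref), 1)
```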

Final Thoughts

In conclusion, LoMA presents a significant step forward in handling long text sequences in NLP applications. By delivering lossless compression of information, LoMA ensures that efficiency does not come at the expense of content integrity—an essential requirement for advanced language understanding and generation tasks. The ease of integration of LoMA into existing models, coupled with its generalizability and minimal dataset requirement for fine-tuning, makes it a robust and transformative addition to the repertoire of LLM techniques. Future work could explore integrating LoMA at the pretraining stage, potentially elevating the capabilities of LLMs in processing extensive textual data even further.

Authors (2)
  1. Yumeng Wang
  2. Zhenyang Xiao