MEMORY-VQ: Compression for Tractable Internet-Scale Memory (2308.14903v1)

Published 28 Aug 2023 in cs.CL

Abstract: Retrieval augmentation is a powerful but expensive method to make LLMs more knowledgeable about the world. Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora.

Summary

  • The paper introduces a 16x compression method, MEMORY-VQ, that significantly reduces storage overhead for retrieval-augmented models.
  • It applies a vector quantization variational autoencoder (VQ-VAE) to compress each token representation from 8KB to 512 bytes.
  • Experiments show a negligible performance drop on the KILT benchmark, enabling scalable inference for internet-scale data.

MEMORY-VQ: Compression for Tractable Internet-Scale Memory

The paper "MEMORY-VQ: Compression for Tractable Internet-Scale Memory" by Yury Zemlyanskiy et al. introduces a novel approach, MEMORY-VQ, to address the significant storage overhead in memory-based retrieval augmentation methods for LLMs. The work builds on existing models like LUMEN, which improve inference speed through pre-computation but suffer from prohibitive storage demands.

Problem Context

Retrieval augmentation is an effective technique to enhance the factual knowledge of LLMs by providing additional context via relevant text passages. However, it traditionally incurs high computational and storage costs. LUMEN is highlighted as a method that partially pre-computes encoder representations, thus offering faster inference at the cost of greatly increased storage requirements. Specifically, LUMEN's representations require up to 8KB per token, which becomes a substantial burden at internet scale: storing LUMEN token representations for a 1 trillion token corpus requires an impractical 7PB of storage.
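
To make these figures concrete, the storage cost follows directly from the per-token footprint. The sketch below is a back-of-the-envelope calculation using the 8KB-per-token figure quoted above; the Wikipedia-scale token count is an illustrative assumption chosen to roughly match the 30TB figure cited later, not a number taken from the paper.

```python
# Back-of-the-envelope storage estimate for pre-computed memories.
# Assumes the ~8KB-per-token figure quoted above; the Wikipedia-scale token
# count is an illustrative value, not a number from the paper.

BYTES_PER_TOKEN = 8 * 1024              # ~8KB of pre-computed representation per token
TIB, PIB = 2**40, 2**50                 # binary tera/peta units

def memory_bytes(num_tokens: float, compression: int = 1) -> float:
    """Total bytes needed to store pre-computed memories for `num_tokens` tokens."""
    return num_tokens * BYTES_PER_TOKEN / compression

wiki_tokens = 4e9                       # ~4B tokens (illustrative Wikipedia scale)
web_tokens = 1e12                       # 1 trillion tokens

print(f"Wikipedia-scale, uncompressed : {memory_bytes(wiki_tokens) / TIB:.0f} TB")
print(f"Wikipedia-scale, 16x LUMEN-VQ : {memory_bytes(wiki_tokens, 16) / TIB:.0f} TB")
print(f"Trillion-token,  uncompressed : {memory_bytes(web_tokens) / PIB:.1f} PB")
print(f"Trillion-token,  16x LUMEN-VQ : {memory_bytes(web_tokens, 16) / PIB:.2f} PB")
```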

Proposed Solution: MEMORY-VQ

MEMORY-VQ is introduced as a compression method leveraging vector quantization variational autoencoders (VQ-VAE) to significantly reduce storage needs without sacrificing model performance on tasks like those in the KILT benchmark. By achieving a 16x compression ratio, MEMORY-VQ reduces the storage requirement of LUMEN from 30TB to 2TB for Wikipedia-scale data, and from 7PB to 0.5PB for a larger corpus. The methodology involves using product quantization to represent high-dimensional token vectors as integer codes, effectively lowering the data footprint.
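
To illustrate how product quantization turns a token vector into compact integer codes, here is a minimal NumPy sketch. The specific sizes (a 2048-dimensional float32 vector split into 512 subvectors, each quantized against a 256-entry codebook) are assumptions chosen to reproduce the stated 8192-byte to 512-byte reduction, not necessarily the paper's exact configuration, and the codebooks here are random, whereas MEMORY-VQ learns them jointly with the model.

```python
import numpy as np

# Illustrative product quantization of one token representation.
# 2048 float32 dims = 8192 bytes; 512 subvectors * 1-byte code each = 512 bytes (16x).
# Sizes are assumptions matching the summary's numbers; codebooks are random here.

DIM, NUM_GROUPS, CODEBOOK_SIZE = 2048, 512, 256
SUB_DIM = DIM // NUM_GROUPS             # 4 dims per subvector

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(NUM_GROUPS, CODEBOOK_SIZE, SUB_DIM)).astype(np.float32)

def compress(token_vec: np.ndarray) -> np.ndarray:
    """Map a (DIM,) float32 vector to (NUM_GROUPS,) uint8 codes."""
    subvecs = token_vec.reshape(NUM_GROUPS, 1, SUB_DIM)
    dists = np.linalg.norm(subvecs - codebooks, axis=-1)   # (NUM_GROUPS, CODEBOOK_SIZE)
    return dists.argmin(axis=-1).astype(np.uint8)

def decompress(codes: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate (DIM,) vector from the stored codes."""
    return codebooks[np.arange(NUM_GROUPS), codes].reshape(DIM)

token_vec = rng.normal(size=DIM).astype(np.float32)
codes = compress(token_vec)
approx = decompress(codes)
print(token_vec.nbytes, "bytes ->", codes.nbytes, "bytes")  # 8192 -> 512
```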

Technical Approach

The compression is applied through vector quantization, a classical technique in which a high-dimensional vector is replaced by the index of its closest entry in a codebook. With product quantization, each token vector is split into subvectors that are quantized against separate codebooks, reducing each token's representation from 8192 bytes to 512 bytes. The VQ-VAE framework trains the codebooks jointly with the rest of the model, allowing the network to adapt to the changes that compression induces in its representations.
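
The sketch below shows how such a quantization layer can be trained jointly with the rest of the network using the standard VQ-VAE recipe (codebook and commitment losses plus a straight-through estimator). It is a generic VQ-VAE bottleneck rather than the paper's exact LUMEN-VQ layer, and the shapes and commitment weight are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ-VAE bottleneck with a straight-through estimator.

    A sketch of joint codebook training, not the exact LUMEN-VQ layer.
    """
    def __init__(self, codebook_size: int = 256, dim: int = 4, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta  # commitment loss weight (placeholder value)

    def forward(self, x: torch.Tensor):
        # x: (..., dim) continuous subvectors produced by the memory encoder.
        dists = torch.cdist(x.reshape(-1, x.shape[-1]), self.codebook.weight)
        codes = dists.argmin(dim=-1)                       # integer codes to store
        quantized = self.codebook(codes).view_as(x)

        # Codebook loss pulls code vectors toward encoder outputs; commitment
        # loss keeps encoder outputs close to their assigned codes.
        loss = F.mse_loss(quantized, x.detach()) + self.beta * F.mse_loss(x, quantized.detach())

        # Straight-through estimator: gradients flow to x as if quantization were identity.
        quantized = x + (quantized - x).detach()
        return quantized, codes.view(x.shape[:-1]), loss

# Usage: quantize 4-dim subvectors of a batch of token representations (illustrative shapes).
vq = VectorQuantizer()
subvectors = torch.randn(8, 512, 4)        # (tokens, groups, sub_dim)
quantized, codes, vq_loss = vq(subvectors)
```

The straight-through trick lets gradients bypass the non-differentiable nearest-neighbor lookup, which is what allows the codebooks and the rest of the model to adapt to one another during joint training.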

Experimental Results

Experiments show that LUMEN-VQ maintains competitive performance relative to its uncompressed counterpart, with a negligible drop in quality on the KILT benchmark. Compared to baseline approaches such as scaling down model size or partial retrieval strategies, MEMORY-VQ demonstrates a superior trade-off between storage compression and task performance, losing roughly 0.2 percentage points in average match scores at a 16x compression rate.

Implications and Future Work

The introduction of MEMORY-VQ opens up practical applications for retrieval-augmented models across large-scale datasets by addressing the critical scalability issue posed by storage constraints. The proposed approach is particularly relevant for scenarios requiring fast inference over massive document corpora. Future work could explore broadening the applicability of MEMORY-VQ beyond LUMEN to cover a wider array of memory-augmented models. Additionally, investigating more efficient quantization techniques, or hybrid methods that combine other compression strategies, could further optimize the storage-performance trade-off.

In summary, MEMORY-VQ represents a significant contribution to the landscape of scalable LLMs by bridging the gap between high-speed retrieval augmentation and feasible storage requirements.
