- The paper introduces a 16x compression method, MEMORY-VQ, that significantly reduces storage overhead for retrieval-augmented models.
- It applies vector-quantized variational autoencoders (VQ-VAE) to compress token representations from 8KB to 512 bytes.
- Experiments show a negligible performance drop on the KILT benchmark, enabling scalable inference for internet-scale data.
MEMORY-VQ: Compression for Tractable Internet-Scale Memory
The paper "MEMORY-VQ: Compression for Tractable Internet-Scale Memory" by Yury Zemlyanskiy et al. introduces a novel approach, MEMORY-VQ, to address the significant storage overhead in memory-based retrieval augmentation methods for LLMs. The work builds on existing models like LUMEN, which improve inference speed through pre-computation but suffer from prohibitive storage demands.
Problem Context
Retrieval augmentation is an effective technique for enhancing the factual knowledge of LLMs by providing relevant text passages as additional context, but it traditionally incurs high computational cost at inference time. LUMEN is highlighted as a method that partially pre-computes encoder representations of the retrieval corpus, offering faster inference at the cost of drastically increased storage requirements. Specifically, LUMEN's representations require up to 8KB per token, which becomes a severe burden at internet scale. For example, storing LUMEN token representations for a 1-trillion-token corpus would require an impractical 7PB of storage.
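These figures follow directly from the per-token footprint. As a sanity check, here is the back-of-the-envelope arithmetic; the roughly 4-billion-token size assumed for Wikipedia below is an illustrative assumption chosen to reproduce the ~30TB figure cited later, not a number taken from the paper.

```python
# Back-of-the-envelope check of the storage figures (8 KiB per token).
def memory_storage_bytes(num_tokens: int, bytes_per_token: int = 8 * 1024) -> int:
    """Total bytes needed to store pre-computed memories for a corpus."""
    return num_tokens * bytes_per_token

print(memory_storage_bytes(10**12) / 2**50)     # ~7.3 PiB for a 1-trillion-token corpus
print(memory_storage_bytes(4 * 10**9) / 2**40)  # ~30 TiB for a Wikipedia-scale corpus (assumed ~4B tokens)
```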
Proposed Solution: MEMORY-VQ
MEMORY-VQ is introduced as a compression method that leverages vector-quantized variational autoencoders (VQ-VAE) to sharply reduce storage needs without sacrificing model performance on tasks such as those in the KILT benchmark. By achieving a 16x compression ratio, MEMORY-VQ reduces LUMEN's storage requirement from 30TB to 2TB for Wikipedia-scale data, and from 7PB to 0.5PB for the trillion-token corpus above. The method uses product quantization to represent high-dimensional token vectors as compact integer codes, as sketched below.
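To make the 16x figure concrete, here is a minimal product-quantization sketch in NumPy. The sizes used (a 2048-dim float32 memory, 512 subvectors of 4 dims, 256-entry codebooks with 1-byte codes) are illustrative assumptions chosen so that one token's memory shrinks from 8192 bytes to 512 bytes; they are not necessarily the paper's exact configuration, and the helper names are hypothetical.

```python
import numpy as np

# Illustrative product quantization of a single token memory.
d, num_subvectors, codebook_size = 2048, 512, 256
sub_dim = d // num_subvectors  # 4 dimensions per subvector

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(num_subvectors, codebook_size, sub_dim)).astype(np.float32)
token_memory = rng.normal(size=(d,)).astype(np.float32)  # 2048 * 4 bytes = 8192 bytes

def pq_encode(vec):
    """Store only the index of the nearest codebook entry for each subvector."""
    subs = vec.reshape(num_subvectors, 1, sub_dim)
    dists = np.linalg.norm(codebooks - subs, axis=-1)  # (num_subvectors, codebook_size)
    return dists.argmin(axis=-1).astype(np.uint8)      # 512 codes * 1 byte = 512 bytes

def pq_decode(codes):
    """Reconstruct an approximate memory vector from the stored integer codes."""
    return codebooks[np.arange(num_subvectors), codes].reshape(d)

codes = pq_encode(token_memory)
print(token_memory.nbytes, "->", codes.nbytes)  # 8192 -> 512, i.e. 16x compression
reconstruction = pq_decode(codes)               # approximate memory used at inference
```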
Technical Approach
The compression is applied through vector quantization, a classical technique in which a high-dimensional vector is represented by the closest entry in a codebook, reducing each token's representation from 8192 bytes to 512 bytes. The VQ-VAE framework allows these compression layers to be trained jointly with the rest of the model, so the model can adapt to the changes that quantization introduces into its representations.
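For reference, joint training in the standard VQ-VAE formulation (van den Oord et al., 2017), on which MEMORY-VQ builds, augments the task loss with codebook and commitment terms and passes gradients through the discrete bottleneck via the straight-through estimator; the paper may differ in details such as the codebook update rule, so the objective below is the generic form rather than the paper's exact loss:

$$
\mathcal{L} = \mathcal{L}_{\text{task}} + \lVert \operatorname{sg}[z_e(x)] - e \rVert_2^2 + \beta \, \lVert z_e(x) - \operatorname{sg}[e] \rVert_2^2
$$

where $z_e(x)$ is the uncompressed memory representation, $e$ its nearest codebook entry, $\operatorname{sg}[\cdot]$ the stop-gradient operator, and $\beta$ the commitment weight.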
Experimental Results
Experiments show that MEMORY-VQ remains competitive with its uncompressed counterpart, with only a negligible quality drop on the KILT benchmark. Compared to baselines such as scaling down model size or partial retrieval strategies, MEMORY-VQ offers a better trade-off between storage compression and task performance, losing roughly 0.2 percentage points in average match score at a 16x compression rate.
Implications and Future Work
By addressing the scalability bottleneck posed by storage, MEMORY-VQ makes retrieval-augmented models practical for large-scale datasets, particularly in scenarios that require fast inference over massive document corpora. Future work could broaden the applicability of MEMORY-VQ beyond LUMEN to a wider array of memory-augmented models, and could investigate more efficient quantization techniques or hybrid compression strategies to further improve the storage-performance trade-off.
In summary, MEMORY-VQ represents a significant contribution to scalable LLMs by bridging the gap between fast retrieval augmentation and feasible storage requirements.