Exploring YOCO: A Novel Approach to Efficient Large Language Modeling
Introduction to YOCO
The quest for more efficient and powerful LLMs has driven significant architectural innovation over the last few years. Among these innovations, YOCO (You Only Cache Once) offers a distinctive take on managing memory usage and processing speed. YOCO employs a decoder-decoder architecture that caches key-value (KV) pairs only once, in contrast to traditional Transformer decoders, which maintain a separate KV cache in every layer.
Architecture of YOCO
YOCO splits its architecture into two main components, the self-decoder and the cross-decoder (a minimal sketch follows this list):
- Self-decoder: Processes the input sequence and produces a single, shared KV cache. It uses efficient self-attention mechanisms (such as sliding-window attention) that are light on memory, helping relieve the intense demand large models typically place on hardware.
- Cross-decoder: Builds on the self-decoder's output, attending to the pre-computed KV cache through cross-attention. By reusing the cached pairs, it avoids the redundant per-layer recomputation and storage seen in standard decoders.
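To make the split concrete, here is a minimal PyTorch sketch of the decoder-decoder layout. It uses plain multi-head attention everywhere and omits causal masking, positional information, and the efficient attention variants used in the paper's self-decoder; the names (YOCOSketch, SelfDecoderLayer, CrossDecoderLayer, to_kv) are illustrative, not taken from the YOCO implementation.

```python
# Minimal sketch of YOCO's decoder-decoder layout (illustrative, not the
# paper's code). Standard multi-head attention stands in for the efficient
# self-attention used in the real self-decoder.
import torch
import torch.nn as nn


class SelfDecoderLayer(nn.Module):
    """Stands in for YOCO's efficient self-attention layers."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        # Self-attention over the input sequence (causal masking omitted for brevity).
        h = x + self.attn(x, x, x)[0]
        return h + self.ff(h)


class CrossDecoderLayer(nn.Module):
    """Attends to the single shared KV cache instead of building its own."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x, kv_cache):
        # Queries come from this layer; keys/values come from the shared cache.
        h = x + self.attn(x, kv_cache, kv_cache)[0]
        return h + self.ff(h)


class YOCOSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_self=2, n_cross=2):
        super().__init__()
        self.self_decoder = nn.ModuleList(
            SelfDecoderLayer(d_model, n_heads) for _ in range(n_self))
        self.cross_decoder = nn.ModuleList(
            CrossDecoderLayer(d_model, n_heads) for _ in range(n_cross))
        self.to_kv = nn.Linear(d_model, d_model)  # projects to the one shared cache

    def forward(self, x):
        # 1) The self-decoder builds a single global KV cache from the input.
        for layer in self.self_decoder:
            x = layer(x)
        kv_cache = self.to_kv(x)  # cached once, reused by every cross-decoder layer
        # 2) The cross-decoder reuses that cache; no per-layer KV caches are stored.
        for layer in self.cross_decoder:
            x = layer(x, kv_cache)
        return x


if __name__ == "__main__":
    model = YOCOSketch()
    tokens = torch.randn(1, 16, 256)  # (batch, sequence, d_model)
    print(model(tokens).shape)        # torch.Size([1, 16, 256])
```

The structural point to notice is in YOCOSketch.forward: the KV projection happens exactly once, and every cross-decoder layer reads from that same tensor.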
This split not only promotes efficiency but also preserves the behavior of a decoder-only model: the stack as a whole still generates output autoregressively, token by token, which suits generation tasks naturally.
Efficiency Gains
The innovative structure of YOCO allows it to boost efficiency across several fronts:
- Memory Usage: The memory required for storing KV pairs drops substantially because the cache is built once and shared across the cross-decoder layers rather than duplicated in every layer. This cuts GPU memory for the KV cache roughly by a factor proportional to the number of layers (see the back-of-envelope sketch after this list).
- Prefilling Speed: For long inputs, YOCO's architecture enables a form of 'early exit' during prefilling: because the cross-decoder only needs the shared KV cache, the prompt has to pass through the self-decoder alone before generation can begin (sketched after this list). On a 512K-token context, this reduces prefill latency from around 180 seconds for a Transformer optimized with Flash-Decoding and kernel fusion to under 6 seconds.
- Throughput and Serving Capacity: With reduced memory and quicker prefill times, YOCO can handle larger batch sizes and longer token sequences, ultimately improving throughput and the model's capacity to serve more tokens simultaneously.
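As a rough illustration of the memory point above, the following back-of-envelope calculation compares per-layer KV caching with caching once. The model dimensions are made-up round numbers, and the self-decoder's own small, constant-size cache is ignored for simplicity.

```python
# Back-of-envelope comparison of KV-cache memory: per-layer caching versus
# caching once. The layer count, hidden size, and fp16 precision below are
# assumed for illustration, not taken from the YOCO paper.
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_elem=2,
                   cache_every_layer=True):
    # Each cached position stores one key and one value vector of size d_model.
    per_layer = 2 * seq_len * d_model * bytes_per_elem
    return per_layer * (n_layers if cache_every_layer else 1)


if __name__ == "__main__":
    seq_len, n_layers, d_model = 512_000, 32, 4096
    transformer = kv_cache_bytes(seq_len, n_layers, d_model, cache_every_layer=True)
    yoco = kv_cache_bytes(seq_len, n_layers, d_model, cache_every_layer=False)
    print(f"per-layer caching: {transformer / 2**30:.1f} GiB")  # ~250.0 GiB
    print(f"cache once (YOCO): {yoco / 2**30:.1f} GiB")         # ~7.8 GiB
```

With these assumed numbers the per-layer cache needs about 32 times the memory of the single shared cache, which is exactly the layer-count factor mentioned in the bullet above.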
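The 'early exit' during prefilling can be sketched in the same terms, reusing the YOCOSketch class from the earlier snippet: only the self-decoder has to run over the prompt, since the cross-decoder needs nothing but the shared cache until generation begins. The prefill helper below is hypothetical, not part of any published API.

```python
# Sketch of the 'early exit' during prefilling, reusing YOCOSketch from the
# earlier snippet. Only the self-decoder processes the prompt; the
# cross-decoder is not needed until generation starts.
import torch


def prefill(model, prompt_embeddings):
    """Run only the self-decoder over the prompt and return the shared KV cache."""
    x = prompt_embeddings
    for layer in model.self_decoder:   # roughly half of the network's layers
        x = layer(x)
    return model.to_kv(x)              # built once, reused at every decoding step


model = YOCOSketch()                   # class defined in the earlier sketch
prompt = torch.randn(1, 16, 256)       # (batch, prompt length, d_model)
kv_cache = prefill(model, prompt)      # no cross-decoder work during prefill
print(kv_cache.shape)                  # torch.Size([1, 16, 256])
```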
Empirical Performance
In evaluations, YOCO performed on par with strong Transformer baselines and scaled favorably with model size and number of training tokens. Notably, it was extended to context lengths of up to 1 million tokens with near-perfect accuracy on needle-in-a-haystack retrieval, a challenging feat for many current models.
Theoretical and Practical Implications
The YOCO architecture provides a compelling alternative for developing LLMs, especially where memory and speed are bottlenecks. It fits the growing need for more agile and cost-effective models in practical applications, from real-time language understanding to more complex multi-modal tasks where latency and responsiveness are crucial.
Speculations on Future Developments
Deploying YOCO beyond purely text-based models, for instance in tasks involving multi-modal data (images, text, and audio), seems a promising avenue. Additionally, the underlying principles of YOCO could inspire further research into even more memory-efficient designs or specialized hardware implementations that leverage its caching strategy.
As AI research continues to push the boundaries of what's possible with machine learning models, YOCO stands out as a valuable step towards more sustainable and scalable AI technologies. Its innovations in model architecture offer a glimpse into the future directions of AI systems, highlighting an ongoing shift towards optimization and efficiency without compromising on performance.