
Marconi: Prefix Caching for the Era of Hybrid LLMs (2411.19379v3)

Published 28 Nov 2024 in cs.DC, cs.AI, and cs.LG

Abstract: Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in LLM serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.


Summary

  • The paper introduces Marconi, a novel system that optimizes prefix caching for hybrid LLMs by strategically differentiating between input-only and input-output sequences.
  • It employs a FLOP-aware eviction policy that balances cache occupancy with compute savings, achieving token hit rates up to 34.4× higher than prior methods.
  • Evaluation shows that Marconi reduces time-to-first-token (TTFT) by up to 71.1% (617 ms), underscoring its potential to improve efficiency in large-scale LLM deployments.

Marconi: Prefix Caching for the Era of Hybrid LLMs

The paper "Marconi: Prefix Caching for the Era of Hybrid LLMs" addresses the challenges of serving Hybrid LLMs efficiently. The authors present Marconi, a system engineered to support prefix caching within Hybrid LLM architectures. The paper explains why prefix caching is difficult for Hybrid models and proposes admission and eviction policies that recover its computational benefits.

Overview and Motivation

Hybrid models, which interleave Attention layers with Recurrent layers, have emerged as a practical way to handle the long context sequences typical of contemporary LLM tasks. Attention's compute cost grows quadratically with sequence length and its KV cache grows linearly, which limits inference scalability at long contexts. Recurrent layers, specifically State Space Models (SSMs), offer subquadratic compute and maintain a fixed-size state, reducing both compute and memory overhead.
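
As a rough illustration (not from the paper), the sketch below compares the order-of-magnitude prefill cost of an Attention layer, which scales with the square of the sequence length, against an SSM layer, which scales linearly; the layer shapes, state size, and constant factors are assumptions chosen only to show the trend.

```python
# Back-of-the-envelope prefill cost comparison; shapes and constants are
# illustrative assumptions, not taken from the paper.

def attn_layer_flops(seq_len: int, hidden: int) -> float:
    # All-pairs score and value mixing: O(L^2 * d)
    return 2.0 * seq_len**2 * hidden

def ssm_layer_flops(seq_len: int, hidden: int, state: int = 16) -> float:
    # Per-token recurrent state update: O(L * d * N), with a small fixed state size N
    return 2.0 * seq_len * hidden * state

for L in (1_000, 10_000, 100_000):
    ratio = attn_layer_flops(L, 4096) / ssm_layer_flops(L, 4096)
    print(f"L={L:>7,}: attention/SSM per-layer cost ratio ~ {ratio:,.0f}x")
```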

However, these same properties complicate complementary efficiency techniques such as prefix caching. In typical configurations, SSM states are updated in place, which precludes rolling a cached state back to an earlier position in the sequence; as a result, only exact-match cache hits are usable. The cache therefore fills with large, per-sequence entries that offer limited reuse, leading to cache thrashing and underutilization.
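
To make the distinction concrete, here is a minimal sketch (not Marconi's implementation) of a cache lookup over a hybrid entry; `CachedEntry` and its fields are illustrative assumptions. The attention KV cache can be truncated to a partial overlap, but the in-place SSM state is only valid when the cached prefix matches the request exactly.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class CachedEntry:
    tokens: Tuple[int, ...]   # token prefix this entry was computed over
    kv_cache: list            # per-token attention KV tensors (can be truncated)
    ssm_state: object         # fixed-size recurrent state after the *last* cached token

def longest_common_prefix(a, b) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def lookup(entry: CachedEntry, request_tokens) -> Optional[dict]:
    overlap = longest_common_prefix(entry.tokens, request_tokens)
    if overlap == len(entry.tokens) and overlap > 0:
        # Exact match on the cached prefix: both the KV cache and the SSM state
        # are valid, so prefill can skip all `overlap` tokens.
        return {"kv": entry.kv_cache, "ssm": entry.ssm_state, "skip": overlap}
    if overlap > 0:
        # Partial overlap: the attention KV cache could be truncated to `overlap`
        # tokens, but the SSM state was updated in place beyond that point and
        # cannot be rolled back, so the recurrent layers gain nothing (skip = 0).
        return {"kv": entry.kv_cache[:overlap], "ssm": None, "skip": 0}
    return None
```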

Marconi Design

Marconi rethinks cache management for Hybrid LLMs. Central to its design are judicious admission and eviction policies, informed by a taxonomy of potential reuse scenarios and by the tradeoff between the compute a cache hit saves and the memory a cache entry occupies.

Admission Strategy

Marconi identifies candidate cache entries by evaluating redundancy and reuse patterns across requests. Its admission strategy differentiates between "purely input" prefixes, which are shared across requests (e.g., common system prompts or documents), and "input and output" sequences, which are reused mainly when a conversation continues. This distinction lets Marconi cache high-value entries with a high likelihood of reuse rather than indiscriminately storing large SSM states. A radix tree over token sequences captures where incoming requests branch off from previously seen prefixes, so that high-value states are identified and retained; the sketch below illustrates the idea.
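
A minimal sketch of such a radix-tree admission check follows, under the assumption that a state is worth checkpointing at the deepest point where a new request is still on a previously seen path; the class and method names are illustrative, not Marconi's API.

```python
class Node:
    def __init__(self):
        self.children = {}   # next token -> child node
        self.visits = 0      # number of sequences routed through this node

class PrefixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        """Insert a token sequence and return the length of its longest
        previously seen prefix, i.e. the position whose state is worth caching."""
        node, shared_until = self.root, 0
        for i, tok in enumerate(tokens):
            node.visits += 1
            if tok in node.children:          # still on a previously seen path
                shared_until = i + 1
            node = node.children.setdefault(tok, Node())
        node.visits += 1
        return shared_until
```

For instance, two requests that share a system prompt but append different documents both route through the prompt's nodes; the second insertion returns the prompt's length, flagging the state at the end of that shared prefix as the one worth retaining.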

Eviction Strategy

The paper introduces a FLOP-aware eviction mechanism that departs from traditional LRU policies by folding computational savings into the eviction decision. Entries are judged not only on recency but also on how much prefill compute a hit would save relative to the memory the entry occupies, allowing Marconi to retain entries with high compute-to-storage efficiency. This balances cache occupancy against performance, with the largest savings arising for long sequences.
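
The following is a hedged sketch of what FLOP-aware victim selection could look like; the scoring formula, weights, and entry fields are assumptions for illustration and are not Marconi's exact policy.

```python
# Illustrative FLOP-aware eviction: entries with the lowest blend of
# normalized recency and normalized FLOPs-saved-per-byte are evicted first.

import time

def flops_saved(num_tokens, hidden, attn_layers, ssm_layers, state=16):
    # Approximate prefill work skipped on a hit: attention ~ L^2 * d per layer,
    # SSM ~ L * d * N per layer (constant factors omitted).
    return attn_layers * num_tokens**2 * hidden + ssm_layers * num_tokens * hidden * state

def pick_victim(entries, alpha=0.5, now=None):
    """Return the cache entry to evict (lowest blended score)."""
    now = now if now is not None else time.time()

    def normalize(vals):
        lo, hi = min(vals), max(vals)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in vals]

    recency = normalize([-(now - e["last_hit_ts"]) for e in entries])  # newer -> higher
    efficiency = normalize([
        flops_saved(e["num_tokens"], e["hidden"], e["attn_layers"], e["ssm_layers"]) / e["bytes"]
        for e in entries
    ])
    scores = [alpha * r + (1 - alpha) * f for r, f in zip(recency, efficiency)]
    return entries[scores.index(min(scores))]
```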

Evaluation and Results

The evaluation shows that Marconi achieves token hit rates 4.5 to 34.4 times higher than prior state-of-the-art systems such as vLLM in its default configuration. These hit-rate gains translate into TTFT reductions of up to 71.1% (617 ms), underscoring Marconi's utility in real-world serving scenarios where prefill compute dominates.

Implications and Future Directions

The findings have substantial implications for large-scale AI deployments, pointing to the need for more sophisticated cache management in LLM serving infrastructure. As models evolve to support larger context windows and more intricate interaction patterns, systems like Marconi can provide the infrastructure support needed for efficient, scalable AI services.

Looking forward, additional research might focus on optimizing cache admission and eviction policies further, potentially integrating adaptive learning mechanisms to refine reuse predictions based on live traffic patterns. Additionally, the development of cross-architecture compatibility, extending beyond standard Hybrid LLMs, could expand Marconi’s applicability to newer model architectures that may arise in the evolving AI landscape.

This paper thus sets a benchmark for future explorations into efficient serving of complex LLM systems, presenting a well-defined approach to addressing computational and storage inefficiencies in state-of-the-art models.
