Jenga: Effective Memory Management for Serving LLM with Heterogeneity (2503.18292v1)

Published 24 Mar 2025 in cs.DC

Abstract: LLMs are widely used but expensive to run, especially as inference workloads grow. To lower costs, maximizing the request batch size by managing GPU memory efficiently is crucial. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in the embedding dimensions, attention, and access patterns of modern LLM architectures introduces new challenges for memory allocation. In this paper, we present Jenga, a novel memory allocation framework for heterogeneous embeddings in LLMs. Jenga tackles two key challenges: (1) minimizing memory fragmentation when managing embeddings of different sizes, and (2) enabling flexible caching and eviction policies tailored to the specific token-dependency patterns of various layers. Jenga employs a two-level memory allocator, leveraging the least common multiple (LCM) of embedding sizes to optimize memory usage and providing APIs to express layer-specific caching logic to enhance memory reuse. We implement Jenga on vLLM, a state-of-the-art LLM inference engine, and evaluate it with diverse LLMs, datasets, and GPU configurations. Evaluations show that Jenga improves GPU memory utilization by up to 79.6%, and increases serving throughput by up to 4.92x (1.80x on average).

The paper "Jenga: Effective Memory Management for Serving LLM with Heterogeneity" addresses the challenges arising from the heterogeneity in modern LLMs related to memory management during inference. As LLMs evolve, they integrate diverse embeddings and attention mechanisms that make efficient memory management crucial for improving GPU utilization and lowering operational costs in deployment environments.

Key Contributions:

  1. Challenges in Modern LLMs: The paper identifies two primary issues in current LLMs that impact memory management:
    • Memory Fragmentation: Heterogeneous embedding sizes lead to inefficient memory use and fragmentation. Existing memory managers such as PagedAttention assume fixed-size embeddings and uniform dependency patterns, assumptions that no longer hold for newer architectures incorporating vision embeddings, sparse attention layers, and more.
    • Token-dependency Patterns: New architectures demonstrate various token-dependency patterns, making a one-size-fits-all memory allocation approach inadequate. For example, some layers use only a subset of prefix tokens for generating subsequent outputs, demanding specific caching and eviction strategies.
  2. Jenga Framework: The authors propose a novel two-level memory allocation framework:
    • LCM Allocator: It sizes memory allocation units (pages) using the least common multiple (LCM) of the embedding sizes, so a single page size divides evenly into whole entries for every layer type. This enables efficient allocation while reducing fragmentation across a range of heterogeneous embeddings (see the sketch after this list).
    • Customized Caching: APIs are provided to define layer-specific caching logic, allowing models to express their unique token-dependency patterns effectively. This includes differentiating full-prefix dependencies from prefix-subset dependencies in memory management (a small policy sketch also follows the list).
  3. Implementation on vLLM: Jenga is implemented on vLLM, a state-of-the-art LLM inference engine, and shows significant improvements in GPU memory utilization and serving throughput. Compared to existing memory management techniques, the framework improves GPU memory utilization by up to 79.6% and increases serving throughput by up to 4.92x (1.80x on average).
  4. Evaluation: The paper evaluates Jenga across diverse LLMs, datasets, and GPU configurations, demonstrating robust improvements for heterogeneous model architectures. These include gains in both throughput and memory efficiency without compromising latency.
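
To make the two ideas above concrete, here is a minimal, hypothetical Python sketch of an LCM-sized, two-level allocator. The entry sizes, class name, and methods are illustrative assumptions for exposition, not Jenga's or vLLM's actual API.

```python
from math import lcm

# Hypothetical per-layer cache-entry sizes in bytes (illustrative only; real
# sizes depend on head count, head dimension, dtype, and layer type).
LAYER_ENTRY_SIZES = {
    "full_attention": 4096,
    "sliding_window": 1024,
    "vision_encoder": 3072,
}

# First level: pages are sized at the LCM of all entry sizes, so any page can
# be carved into a whole number of entries for any layer type, avoiding
# per-page internal fragmentation from mismatched embedding sizes.
PAGE_SIZE = lcm(*LAYER_ENTRY_SIZES.values())


class TwoLevelAllocator:
    """Toy two-level allocator: LCM-sized pages at the top level,
    per-layer fixed-size slots within each page at the second level."""

    def __init__(self, total_bytes: int):
        self.free_pages = list(range(total_bytes // PAGE_SIZE))
        # page id -> (layer kind, free slot offsets within the page)
        self.pages: dict[int, tuple[str, list[int]]] = {}

    def alloc_slot(self, layer: str) -> tuple[int, int]:
        """Return (page id, byte offset) of a free slot for this layer kind."""
        # Prefer a partially filled page already dedicated to this layer.
        for pid, (kind, free_slots) in self.pages.items():
            if kind == layer and free_slots:
                return pid, free_slots.pop()
        # Otherwise take a fresh page and split it into slots for this layer.
        pid = self.free_pages.pop()
        slots = list(range(0, PAGE_SIZE, LAYER_ENTRY_SIZES[layer]))
        self.pages[pid] = (layer, slots)
        return pid, slots.pop()

    def free_slot(self, pid: int, offset: int) -> None:
        """Return a slot; recycle the page once all of its slots are free."""
        kind, free_slots = self.pages[pid]
        free_slots.append(offset)
        if len(free_slots) * LAYER_ENTRY_SIZES[kind] == PAGE_SIZE:
            del self.pages[pid]
            self.free_pages.append(pid)
```

The layer-specific caching APIs can be pictured along similar lines: each layer kind declares which cached prefix tokens it still depends on, and the allocator is free to evict the rest. The sliding-window case below is a hedged example of a prefix-subset dependency; the function name and window size are assumptions, not the paper's interface.

```python
def reusable_tokens(layer: str, prefix_len: int, window: int = 512) -> range:
    """Which prefix token positions a layer of this kind can still reuse."""
    if layer == "sliding_window":
        # Prefix-subset dependency: only the last `window` tokens are needed.
        return range(max(0, prefix_len - window), prefix_len)
    # Full-prefix dependency: every prefix token may be reused.
    return range(prefix_len)
```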

Practical Implications:

  • Real-World Deployment: Jenga enables cost-effective scalability of LLMs, allowing service providers to handle more traffic with fewer resources, thus reducing overhead costs.
  • Adaptation to Model Evolution: As architectures increasingly incorporate components like visual embeddings and more complex attention mechanisms, Jenga offers a flexible solution for tailored memory management across diverse model architectures.
  • Inference Efficiency: By addressing memory fragmentation and utilizing efficient caching policies, inference efficiency is significantly improved, making it viable for high-traffic, resource-intensive applications.

By approaching LLM memory management with a novel perspective on heterogeneity, Jenga provides a comprehensive solution that harmonizes modern model capabilities with practical deployment needs, illustrating a pathway toward optimized AI system performance in diverse operational settings.

Authors (13)
  1. Chen Zhang (403 papers)
  2. Kuntai Du (14 papers)
  3. Shu Liu (146 papers)
  4. Woosuk Kwon (9 papers)
  5. Xiangxi Mo (12 papers)
  6. Yufeng Wang (43 papers)
  7. Xiaoxuan Liu (21 papers)
  8. Kaichao You (13 papers)
  9. Zhuohan Li (29 papers)
  10. Mingsheng Long (110 papers)
  11. Jidong Zhai (24 papers)
  12. Joseph Gonzalez (35 papers)
  13. Ion Stoica (177 papers)