The paper "Jenga: Effective Memory Management for Serving LLM with Heterogeneity" addresses the memory-management challenges that heterogeneity in modern LLMs creates during inference. As LLMs evolve to integrate diverse embeddings and attention mechanisms, efficient memory management becomes crucial for improving GPU utilization and lowering operational costs in deployment environments.
Key Contributions:
- Challenges in Modern LLMs: The paper identifies two primary issues in current LLMs that complicate memory management:
  - Memory Fragmentation: Heterogeneous embedding sizes lead to wasted memory and fragmentation. Traditional memory managers assume fixed-size embeddings and uniform dependency patterns, assumptions that no longer hold for newer architectures incorporating vision embeddings, sparse attention layers, and other components.
  - Token-Dependency Patterns: New architectures exhibit varied token-dependency patterns, so a one-size-fits-all memory allocation approach is inadequate. For example, some layers attend only to a subset of prefix tokens (such as a sliding window) when generating subsequent outputs, which calls for layer-specific caching and eviction strategies.
- Jenga Framework: The authors propose a two-level memory allocator framework:
  - LCM Allocator: The allocator sizes memory pages using the least common multiple (LCM) of the layers' embedding sizes, so each page divides evenly into whole tokens for every layer type. This keeps allocation efficient and reduces fragmentation across heterogeneous embeddings (see the first sketch after this list).
  - Customized Caching: Jenga exposes APIs for defining layer-specific caching logic, letting models express their unique token-dependency patterns, including distinguishing full-prefix dependencies from prefix-subset dependencies in memory management (see the second sketch after this list).
- Implementation on vLLM: Jenga is implemented on top of vLLM, a state-of-the-art LLM inference engine. Compared to traditional memory management techniques, it improves GPU memory utilization by up to 79.6% and serving throughput by up to 4.92 times.
- Evaluation: The paper evaluates Jenga across a range of LLMs, datasets, and GPU configurations, demonstrating consistent gains in both throughput and memory efficiency for heterogeneous model architectures without compromising latency.
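To make the LCM idea concrete, the following is a minimal sketch of how a page pool could pick a single page size that divides evenly for layers with different per-token cache footprints. It is a simplified, single-level illustration of the intuition (the paper's actual allocator is two-level), and the names `LcmPagePool`, `token_bytes`, and `allocate` are hypothetical, not vLLM's or Jenga's real API.

```python
from functools import reduce
from math import gcd


def lcm_all(values):
    """Least common multiple of a list of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values)


class LcmPagePool:
    """Toy page pool: one page size shared by all layer types.

    Hypothetical sketch, not Jenga's real allocator. `token_bytes` maps a
    layer type (e.g. "full_attn", "vision") to the number of bytes one
    cached token occupies in that layer.
    """

    def __init__(self, token_bytes, total_bytes):
        # Page size = LCM of all per-token sizes, so a page always holds a
        # whole number of tokens no matter which layer uses it.
        self.page_bytes = lcm_all(list(token_bytes.values()))
        self.token_bytes = token_bytes
        self.free_pages = list(range(total_bytes // self.page_bytes))

    def tokens_per_page(self, layer_type):
        return self.page_bytes // self.token_bytes[layer_type]

    def allocate(self, layer_type, num_tokens):
        """Grab enough whole pages to cache `num_tokens` for this layer."""
        per_page = self.tokens_per_page(layer_type)
        need = -(-num_tokens // per_page)  # ceiling division
        if need > len(self.free_pages):
            raise MemoryError("out of KV-cache pages")
        pages, self.free_pages = self.free_pages[:need], self.free_pages[need:]
        return pages


# Example: a text layer stores 2 KiB per token, a vision layer 3 KiB.
# The LCM page size (6 KiB) divides evenly for both, so neither layer
# wastes space inside a page.
pool = LcmPagePool({"full_attn": 2048, "vision": 3072}, total_bytes=1 << 20)
print(pool.page_bytes, pool.allocate("full_attn", 10))
```

The design choice being illustrated is that a common page size which every layer's token size divides into exactly removes internal fragmentation at page boundaries, even when layers cache very differently sized embeddings.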
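The second sketch illustrates the customized-caching idea: each layer declares which prefix positions it still needs, and the memory manager can free cache entries that a layer no longer reads. The classes `LayerPolicy`, `FullPrefix`, and `SlidingWindow` and the method `needed_tokens` are illustrative names invented for this example, not Jenga's published interface.

```python
class LayerPolicy:
    """Hypothetical per-layer caching policy (illustrative only).

    needed_tokens() reports which prefix positions the layer still has to
    keep cached once `seq_len` tokens exist in the sequence.
    """

    def __init__(self, name):
        self.name = name

    def needed_tokens(self, seq_len):
        raise NotImplementedError


class FullPrefix(LayerPolicy):
    """Standard causal attention: every prefix token is still needed."""

    def needed_tokens(self, seq_len):
        return set(range(seq_len))


class SlidingWindow(LayerPolicy):
    """Sliding-window attention: only the last `window` tokens are read,
    so older cache entries for this layer can be evicted early."""

    def __init__(self, name, window):
        super().__init__(name)
        self.window = window

    def needed_tokens(self, seq_len):
        return set(range(max(0, seq_len - self.window), seq_len))


def evictable_per_layer(policies, seq_len):
    """Tokens each layer has cached but no longer needs (free candidates)."""
    return {
        p.name: sorted(set(range(seq_len)) - p.needed_tokens(seq_len))
        for p in policies
    }


# After 10 decoded tokens, the window-4 layer can release positions 0-5,
# while the dense layer must retain its entire prefix.
layers = [FullPrefix("dense_attn"), SlidingWindow("swa_attn", window=4)]
print(evictable_per_layer(layers, seq_len=10))
```

This captures the paper's point that full-prefix and prefix-subset layers need different eviction behavior, which a uniform allocator cannot express.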
Practical Implications:
- Real-World Deployment: Jenga enables cost-effective scalability of LLMs, allowing service providers to handle more traffic with fewer resources, thus reducing overhead costs.
- Adaptation to Model Evolution: As architectures increasingly incorporate components like visual embeddings and more complex attention mechanisms, Jenga offers a flexible solution for tailored memory management across diverse model architectures.
- Inference Efficiency: Addressing memory fragmentation and applying layer-aware caching policies significantly improves inference efficiency, making deployment viable for high-traffic, resource-intensive applications.
By treating heterogeneity as a first-class concern in LLM memory management, Jenga provides a comprehensive solution that aligns modern model capabilities with practical deployment needs, illustrating a pathway toward optimized AI system performance in diverse operational settings.