
LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System (2412.20166v2)

Published 28 Dec 2024 in cs.AR and cs.AI

Abstract: The expansion of LLMs with hundreds of billions of parameters presents significant challenges to computational resources, particularly data movement and memory bandwidth. Long-context LLMs, which process sequences of tens of thousands of tokens, further increase the demand on the memory system as the complexity in attention layers and key-value cache sizes is proportional to the context length. Processing-in-Memory (PIM) maximizes memory bandwidth by moving compute to the data and can address the memory bandwidth challenges; however, PIM is not necessarily scalable to accelerate long-context LLM because of limited per-module memory capacity and the inflexibility of fixed-functional unit PIM architecture and static memory management. In this work, we propose LoL-PIM which is a multi-node PIM architecture that accelerates long context LLM through hardware-software co-design. In particular, we propose how pipeline parallelism can be exploited across a multi-PIM module while a direct PIM access (DPA) controller (or DMA for PIM) is proposed that enables dynamic PIM memory management and results in efficient PIM utilization across a diverse range of context length. We developed an MLIR-based compiler for LoL-PIM extending a commercial PIM-based compiler where the software modifications were implemented and evaluated, while the hardware changes were modeled in the simulator. Our evaluations demonstrate that LoL-PIM significantly improves throughput and reduces latency for long-context LLM inference, outperforming both multi-GPU and GPU-PIM systems (up to 8.54x and 16.0x speedup, respectively), thereby enabling more efficient deployment of LLMs in real-world applications.

Summary

  • The paper introduces LoL-PIM, a multi-node Processing-in-Memory architecture that achieves up to 8.54× throughput improvement in long-context LLM decoding.
  • It employs intra-module token-parallel partitioning and a Direct PIM Access controller to optimize memory bandwidth and reduce inference latency.
  • I/O-aware ping-pong buffering overlaps data transfers with computation, yielding a 4.74× speedup over conventional PIM systems.

Overview of LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System

The rapid evolution of LLMs, particularly those designed for processing long-context sequences, has invigorated research into efficient computation architectures that can handle the extensive resource demands these models impose. The paper "LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System" presents a novel architecture designed to address the challenges of handling extensive token sequences in LLMs, specifically focusing on improving memory bandwidth utilization and reducing latency during inference.
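To make the memory pressure concrete: the KV cache that attention must re-read for every generated token grows linearly with context length. A back-of-the-envelope calculation for a hypothetical 7B-class decoder (32 layers, 32 heads of dimension 128, FP16 storage; these figures are illustrative assumptions, not taken from the paper) is sketched below.

```python
# Illustrative KV-cache sizing for an assumed 7B-class decoder (not from the paper).
num_layers = 32        # transformer blocks (assumed)
num_heads = 32         # attention heads per layer (assumed)
head_dim = 128         # dimension per head (assumed)
bytes_per_elem = 2     # FP16
context_len = 32_768   # 32K-token context

# Both keys and values are cached for every layer, head, and token.
kv_bytes = 2 * num_layers * num_heads * head_dim * context_len * bytes_per_elem
print(f"KV cache per sequence: {kv_bytes / 2**30:.1f} GiB")   # -> 16.0 GiB
```

At 32K tokens, a single sequence's KV cache already rivals the size of the model weights themselves, and every decoding step must stream it through the attention layers, which is why memory bandwidth rather than raw compute dominates long-context decoding.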

Key Contributions

The research introduces LoL-PIM, a multi-node Processing-in-Memory (PIM) architecture for long-context LLM decoding, a workload that demands substantial memory bandwidth because the key-value (KV) cache read by the attention layers grows with the context length. The architecture is built on several key innovations:

  1. PIM-aware Partitioning: The work introduces a workload partitioning strategy, termed intra-module token-parallel partitioning (ITPP), that distributes weight parameters and KV caches across PIM modules. By balancing the workload within each PIM module, ITPP maximizes internal memory bandwidth utilization across a wide range of context lengths, improving on prior partitioning methods (a simplified sketch of the token-parallel idea follows this list).
  2. Dynamic PIM Memory Management: A Direct PIM Access (DPA) controller enables dynamic memory allocation within PIM modules, so capacity can be reallocated as input context lengths vary. This raises the average batch size available to the pipeline-parallel schedule, improving throughput and efficiency over static allocation strategies.
  3. I/O-aware Buffering: With computation moved into memory, I/O data movement into and out of the PIM modules becomes the dominant remaining overhead. LoL-PIM addresses it with a ping-pong (double) buffering scheme that overlaps data transfers with computation, hiding much of the transfer latency (see the buffering sketch after this list).
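As a rough illustration of the token-parallel idea behind ITPP (item 1), the sketch below splits a KV cache across several PIM modules along the token dimension, has each module compute a partial attention result over its slice, and merges the partials exactly using the standard max/normalizer correction. This is a simplified NumPy model of the partitioning concept only; the module count, shapes, and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def partial_attention(q, K, V):
    """One PIM module's attention over the token slice of the KV cache it holds."""
    scores = K @ q / np.sqrt(q.shape[0])      # (tokens_in_slice,)
    m = scores.max()                          # local max for numerical stability
    w = np.exp(scores - m)
    return m, w.sum(), w @ V                  # (local max, normalizer, unnormalized output)

def merge_partials(partials):
    """Combine per-module partial results into the exact global softmax attention."""
    m_glob = max(m for m, _, _ in partials)
    l_glob = sum(l * np.exp(m - m_glob) for m, l, _ in partials)
    o_glob = sum(o * np.exp(m - m_glob) for m, _, o in partials)
    return o_glob / l_glob

# Toy example: a 32K-token KV cache split evenly across 4 hypothetical PIM modules.
rng = np.random.default_rng(0)
d, n_tokens, n_modules = 128, 32_768, 4
q = rng.standard_normal(d)
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))

slices = np.array_split(np.arange(n_tokens), n_modules)   # token-parallel partition
out = merge_partials([partial_attention(q, K[idx], V[idx]) for idx in slices])

# Sanity check against single-device attention (up to floating-point error).
s = K @ q / np.sqrt(d)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(out, ref)
```

Because each module only touches its own token slice, the bandwidth-heavy score and weighted-sum reductions stay local to the module, and only the small per-module partial results need to cross the module boundary.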

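The I/O-aware ping-pong buffering of item 3 is, structurally, double buffering: while one buffer is being consumed by computation, the next chunk of data is transferred into the other, so transfer time hides behind compute time. Below is a minimal software sketch of that handshake; the thread stands in for a DMA/transfer engine, and all names and interfaces are invented for illustration rather than taken from LoL-PIM.

```python
import threading

def ping_pong_pipeline(chunks, transfer, compute):
    """Overlap data transfer with computation using two alternating buffers."""
    buffers = [None, None]                                     # "ping" and "pong"
    filled = [threading.Semaphore(0), threading.Semaphore(0)]  # data ready to compute
    free = [threading.Semaphore(1), threading.Semaphore(1)]    # buffer free to refill

    def producer():
        for i, chunk in enumerate(chunks):
            slot = i % 2
            free[slot].acquire()              # wait until compute has released this buffer
            buffers[slot] = transfer(chunk)   # e.g. move activations into a PIM module
            filled[slot].release()

    t = threading.Thread(target=producer)
    t.start()

    results = []
    for i in range(len(chunks)):
        slot = i % 2
        filled[slot].acquire()                  # wait for the transfer into this buffer
        results.append(compute(buffers[slot]))  # compute here overlaps the transfer
        free[slot].release()                    # ... into the *other* buffer
    t.join()
    return results

# Toy usage: "transfer" and "compute" are stand-ins for DMA and PIM kernels.
print(ping_pong_pipeline(list(range(6)), transfer=lambda c: c, compute=lambda b: b * 2))
```

Python threads only truly overlap when the underlying work releases the GIL, so this captures the scheduling structure rather than real parallelism; in LoL-PIM the same two-buffer handshake is presumably realized in hardware around the PIM modules rather than in software.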
Performance Evaluation

LoL-PIM was evaluated on long-context LLMs ranging from 7 billion to 72 billion parameters. Key findings from the evaluation include:

  • LoL-PIM achieves up to 8.54× improvements in throughput over standard GPU-based architectures and offers a 4.74× speedup against commercial PIM systems when processing sequences of up to 32K tokens.
  • In heterogeneous configurations that combine GPUs and PIM, LoL-PIM outperforms conventional systems by optimizing layer execution and reducing the overhead of memory-bound operations.
  • The system architecture is shown to handle larger batch sizes more effectively due to its dynamic memory management and efficient data partitioning strategies.

Implications and Future Directions

LoL-PIM addresses critical bottlenecks in long-context LLM inference by offering a system architecture that can scale with the increasing demands of modern AI workloads. Its innovations in memory management and workload partitioning highlight a pathway towards more efficient LLM deployment strategies, particularly in contexts where memory bandwidth is a prime concern.

Further advancements could explore extending these techniques to a broader set of applications beyond LLM inference, such as real-time data analytics and broader AI-driven simulations. Additionally, refining model parallelism strategies to further exploit PIM capabilities could yield even greater performance improvements, potentially enabling even larger context windows and models to be efficiently processed within existing hardware constraints. As LLMs continue to evolve, architectures like LoL-PIM will be instrumental in adapting to these changing computational paradigms by offering scalable and efficient solutions.
