PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

Published 21 Feb 2025 in cs.AR, cs.AI, cs.DC, and cs.LG (arXiv:2502.15470v2)

Abstract: LLMs are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) a one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8× and 11.1× speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.

Summary

  • The paper introduces PAPI, a novel architecture that dynamically schedules tasks for optimal LLM decoding.
  • It leverages both compute-optimized and memory-optimized PIM units to efficiently manage heterogeneous workloads.
  • Evaluations show a 1.8× speedup over a state-of-the-art heterogeneous LLM accelerator, an 11.1× speedup over a PIM-only accelerator, and a 3.4× energy-efficiency gain over GPU-based systems.

PAPI: Dynamic Parallelism in LLM Decoding

The paper "PAPI: Exploiting Dynamic Parallelism in LLM Decoding with a Processing-In-Memory-Enabled Computing System" explores advancements in processing-in-memory (PIM) architectures, particularly applied to LLM decoding. This work introduces PAPI, a novel system architecture that improves LLM decoding performance by exploiting dynamic scheduling and characteristic matching of memory-bound and compute-bound tasks on specialized hardware units.

Background and Challenges

LLMs like GPT-3, due to their substantial parameter count, pose significant demands on computational and memory resources during inference, primarily in the decoding phase. Traditional architectures, which rely heavily on GPU implementations, struggle to efficiently manage the dynamically changing workloads typical in real-world applications. These inefficiencies are compounded when using static scheduling approaches that misallocate compute- or memory-bound operations, leading to underutilized resources.

Prior approaches used batching and speculative decoding to parallelize decoding and improve throughput, but typically paired these techniques with static kernel-to-hardware mappings that did not adapt to the dynamic nature of decoding workloads, resulting in suboptimal performance.
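To see why kernel characteristics shift at runtime, consider the arithmetic intensity (FLOPs per byte of data moved) of a fully-connected (FC) layer as the batch size changes. The sketch below is illustrative, not from the paper; the layer sizes and the two-byte (FP16) element width are hypothetical.

```python
# Illustrative (not from the paper): arithmetic intensity of an LLM
# fully-connected (FC) kernel as batch size grows. With batch size B,
# hidden size H, and output size O, an FC layer performs ~2*B*H*O FLOPs
# while moving ~(H*O + B*H + B*O) values (weights plus activations).

def fc_arithmetic_intensity(batch, hidden, out, bytes_per_elem=2):
    """FLOPs per byte for one FC (GEMM) kernel; bytes_per_elem=2 assumes FP16."""
    flops = 2 * batch * hidden * out
    bytes_moved = (hidden * out + batch * hidden + batch * out) * bytes_per_elem
    return flops / bytes_moved

if __name__ == "__main__":
    H = O = 4096  # hypothetical hidden/output dimensions
    for b in (1, 8, 64):
        # Weight traffic dominates at small batches, so intensity grows
        # roughly linearly with B: the same kernel is memory-bound at
        # batch 1 but can become compute-bound at large batch sizes.
        print(f"batch={b:3d}: {fc_arithmetic_intensity(b, H, O):6.1f} FLOP/byte")
```

Because the batch size is a runtime parameter set by user and system demands, the very same FC kernel can cross from the memory-bound to the compute-bound regime during deployment, which is the dynamism that motivates PAPI's online scheduling.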

PAPI Architecture

PAPI is a heterogeneous architecture that combines PIM units with traditional computation-centric processors such as GPUs. Its central feature is dynamic parallelism-aware task scheduling, which maps kernels to the most suitable hardware resources based on their real-time characteristics.

The architecture introduces two types of PIM units:

  1. FC-PIM Units: Optimized for compute-intensive fully-connected (FC) kernels, enhancing computation parallelism while conforming to power constraints (Figure 1(b)).
  2. Attn-PIM Units: Tailored for memory-intensive attention kernels, emphasizing memory throughput because such kernels have low arithmetic intensity.

    Figure 2: Overview of the PAPI computing system, highlighting its dynamic parallelism-aware scheduler.

The scheduler within PAPI dynamically assesses task characteristics such as arithmetic intensity and available parallelism levels. This assessment informs decisions on routing tasks to either PIM units or traditional GPU cores, thus mitigating data movement overheads and optimizing execution efficiency.
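A roofline-style dispatch rule captures the spirit of this decision: compare a kernel's runtime arithmetic intensity against the accelerator's ridge point (peak compute divided by memory bandwidth). The sketch below is a hypothetical illustration; the function names, threshold, and unit labels are not the paper's actual implementation.

```python
# Hypothetical sketch of a roofline-style dispatch rule in the spirit of
# PAPI's dynamic scheduler. The unit names and the ridge-point threshold
# are illustrative assumptions, not the paper's design.

def dispatch(kernel_flops, kernel_bytes, accel_peak_flops, accel_mem_bw):
    """Route one kernel to 'accelerator' or 'pim'.

    kernel_flops / kernel_bytes is the kernel's arithmetic intensity
    (FLOP/byte) measured at runtime; accel_peak_flops / accel_mem_bw is
    the accelerator's ridge point, above which it is compute-bound.
    """
    intensity = kernel_flops / kernel_bytes
    ridge = accel_peak_flops / accel_mem_bw  # compute/memory balance point
    return "accelerator" if intensity >= ridge else "pim"
```

Under such a rule, a low-intensity attention kernel would be routed to the memory-side PIM units, while a large-batch FC kernel whose intensity exceeds the ridge point would run on the computation-centric accelerator, avoiding needless data movement in both cases.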

Performance Evaluation

Extensive evaluations on datasets such as Dolly show that PAPI significantly outperforms existing architectures, achieving up to a 1.8× speedup over a state-of-the-art heterogeneous accelerator at large batch sizes (Figure 3). These gains are primarily attributed to the dynamic scheduling framework, which allocates resources more effectively and reduces execution-time bottlenecks in memory-bound scenarios.

Figure 3: Normalized comparisons of speedup and energy efficiency among four different systems using the Dolly dataset.

Energy Efficiency

A notable advantage of PAPI is its energy efficiency: PIM inherently reduces data movement, a critical contributor to power consumption. Compared to GPU-only solutions, PAPI demonstrates a 3.4× improvement in energy efficiency, underscoring its suitability for large-scale machine learning workloads.

Conclusion

PAPI demonstrates how heterogeneous architectures with dynamic scheduling can address the limitations of traditional LLM inference systems. It highlights the potential of PIM for managing memory-bound workloads and opens opportunities for architectural innovation in AI models with similar characteristics. The implications extend beyond LLMs, offering a reusable template for other workloads facing comparable compute and memory challenges. Future research could pursue finer-grained task parallelism or integrate emerging PIM technologies.
