WaferLLM: Large Language Model Inference at Wafer Scale

Published 6 Feb 2025 in cs.LG, cs.AI, cs.AR, cs.DC, and cs.ET | (2502.04563v3)

Abstract: Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to exploit these accelerators fully. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced as "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves up to 200$\times$ higher accelerator utilization than state-of-the-art methods. Leveraging a wafer-scale accelerator (Cerebras WSE2), WaferLLM delivers GEMV operations 606$\times$ faster and 16$\times$ more energy-efficient than on an NVIDIA A100 GPU. For full LLM inference, WaferLLM achieves 10-20$\times$ speedups over A100 GPU clusters running SGLang and vLLM. These advantages are expected to grow as wafer-scale AI models, software, and hardware continue to mature. WaferLLM is open-sourced at https://github.com/MeshInfra/WaferLLM.

Summary

  • The paper introduces WaferLLM, a novel system leveraging a custom PLMR model to harness wafer-scale parallelism for significant inference throughput improvements.
  • It proposes scalable algorithms, MeshGEMM and MeshGEMV, achieving up to 200x higher accelerator utilization than state-of-the-art baselines and 606x faster GEMV performance than an NVIDIA A100 GPU.
  • The research demonstrates practical innovations such as prefill and decode parallelism and a KV cache shift method to effectively manage non-uniform memory and limited routing resources.

WaferLLM: LLM Inference at Wafer Scale

Introduction

"WaferLLM: LLM Inference at Wafer Scale" introduces WaferLLM, a novel system designed specifically for wafer-scale AI accelerators. Traditional inference systems primarily optimized for GPUs fail to leverage the full potential of wafer-scale integration, which integrates a vast number of AI cores in a mesh-based architecture with large distributed memory and extremely high bandwidth.

Architectural Overview

WaferLLM is underpinned by the PLMR device model, which is tailored to capture unique hardware characteristics of wafer-scale architectures. This model highlights four key properties crucial for efficient system design:

  1. Massive Parallel cores (P): Enabling fine-grained partitioning that scales across hundreds of thousands of cores.
  2. Non-uniform memory access Latency (L): Mitigating the latency variations across extensive NoC hops.
  3. Constrained local Memory (M): Ensuring efficient memory usage given limited on-chip core memory.
  4. Limited Routing resources (R): Carefully managing communication paths within the mesh topology (Figure 1).

    Figure 1: Key components in LLM inference on wafer-scale architecture.
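
The four PLMR properties can be pictured with a small, purely illustrative model. The sketch below is not from the paper: the class name, fields, and example numbers are assumptions chosen to make P, L, M, and R concrete (the real WSE-2 exposes on the order of 850,000 cores, each with tens of kilobytes of local SRAM).

```python
from dataclasses import dataclass

@dataclass
class PLMRDevice:
    """Illustrative sketch of the four PLMR properties (all values are placeholders)."""
    mesh_width: int          # P: cores arranged in a mesh_width x mesh_height grid
    mesh_height: int
    hop_latency_ns: float    # L: per-hop NoC latency; end-to-end latency grows with hop count
    core_memory_bytes: int   # M: local memory available to each core
    max_links_per_core: int  # R: routing resources (concurrent links) per core

    @property
    def num_cores(self) -> int:
        return self.mesh_width * self.mesh_height

    def tile_fits(self, tensor_bytes: int) -> bool:
        """Check whether an evenly partitioned tensor tile fits in one core's memory (M)."""
        return tensor_bytes / self.num_cores <= self.core_memory_bytes

    def worst_case_hops(self) -> int:
        """Manhattan distance across the mesh, a proxy for non-uniform latency (L)."""
        return (self.mesh_width - 1) + (self.mesh_height - 1)

# Hypothetical device: a 750x750 mesh with 48 KB per core (illustrative, not WSE2 specs).
dev = PLMRDevice(mesh_width=750, mesh_height=750, hop_latency_ns=1.0,
                 core_memory_bytes=48 * 1024, max_links_per_core=4)
print(dev.num_cores, dev.worst_case_hops(), dev.tile_fits(7_000_000_000 * 2))
```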

Innovations in Wafer-Scale Parallelism

WaferLLM introduces innovative strategies for achieving wafer-scale parallelism:

  • Prefill Parallelism: Employs mesh-based partitioning to maximize core utilization, replacing GPU-oriented GEMM kernels with a new PLMR-compliant distributed GEMM (Figure 2); a partitioning sketch follows this list.

    Figure 2: Prefill parallelism plan for massive-scale mesh architectures.

  • Decode Parallelism: Utilizes a fine-grained tensor replication strategy to enhance parallelism and minimize communication overhead during the autoregressive token-by-token generation phase (Figure 3).

    Figure 3: Decode parallelism plan, ensuring minimal communication costs.

  • Shift-based KV Cache Management: Proposes a novel KV cache shift method for balanced core utilization, avoiding the skewed per-core load that concatenation-style cache placement causes (Figure 4); a toy balancing sketch follows this list.

    Figure 4: Comparison between KV cache concatenation and KV cache shift methods.
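
As a rough illustration of the prefill-side idea, the sketch below partitions a GEMM across a hypothetical p x p core mesh: every core owns one output tile and accumulates partial products from tiles in its row and column. This is a NumPy simulation written for this summary (the helper name partition_2d and the mesh size are arbitrary), not the paper's implementation, and it ignores the communication schedule entirely.

```python
import numpy as np

def partition_2d(M, p):
    """Split matrix M into a p x p grid of equal tiles (assumes divisibility)."""
    r, c = M.shape
    return [[M[i*r//p:(i+1)*r//p, j*c//p:(j+1)*c//p] for j in range(p)]
            for i in range(p)]

p = 4                                           # hypothetical p x p core mesh
X = np.random.randn(8, 8).astype(np.float32)    # activations (prefill: many tokens)
W = np.random.randn(8, 8).astype(np.float32)    # weight matrix

Xt, Wt = partition_2d(X, p), partition_2d(W, p)

# Core (i, j) owns output tile C[i][j]; it accumulates partial products over k,
# drawing X tiles from its row and W tiles from its column of the mesh.
C = [[sum(Xt[i][k] @ Wt[k][j] for k in range(p)) for j in range(p)] for i in range(p)]

assert np.allclose(np.block(C), X @ W, atol=1e-4)
```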
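The KV-cache idea can likewise be pictured with a toy, one-dimensional balancing sketch. The function names and the quota rule below are assumptions made for illustration; the paper's shift operates over the 2D mesh, but the contrast with naive concatenation (every new entry landing on one core) is the same.

```python
import math
from collections import deque

def kv_concat(cores, new_kv):
    """Naive concatenation: every new entry lands on the last core, so load skews."""
    cores[-1].append(new_kv)

def kv_shift(cores, new_kv):
    """Toy shift-style balancing: insert at one end, then push overflow entries
    toward neighbouring cores whenever a core exceeds the balanced quota."""
    cores[0].appendleft(new_kv)
    quota = math.ceil(sum(map(len, cores)) / len(cores))
    for i in range(len(cores) - 1):
        while len(cores[i]) > quota:
            cores[i + 1].appendleft(cores[i].pop())

cores_a = [deque() for _ in range(4)]
cores_b = [deque() for _ in range(4)]
for t in range(8):
    kv_concat(cores_a, f"kv{t}")
    kv_shift(cores_b, f"kv{t}")

print([len(c) for c in cores_a])  # [0, 0, 0, 8] -> one core holds every entry
print([len(c) for c in cores_b])  # [2, 2, 2, 2] -> entries balanced across cores
```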

Scalable Algorithms for Efficient Inference

WaferLLM advances the field with two scalable algorithm variants tailored for wafer-scale accelerators:

  • MeshGEMM: A distributed GEMM algorithm optimized for minimal communication overhead and efficient memory usage. It leverages cyclic shifting and interleaving to meet the stringent PLMR requirements (Figure 5); a cyclic-shift sketch follows this list.

    Figure 5: Performance comparison of MeshGEMM with other matrix multiplication algorithms.

  • MeshGEMV: A scalable GEMV solution built on a two-way K-tree allreduce algorithm, shortening communication paths and improving latency (Figure 6).

    Figure 6: MeshGEMV's superiority over traditional GEMV implementations.
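
MeshGEMM is described as building on cyclic shifting (as in Cannon's algorithm) together with interleaving. The sketch below simulates only the classic cyclic-shift core on a virtual p x p mesh in NumPy; the interleaving optimization for limited routing resources, and all real on-chip communication, are omitted, so this is an assumption-laden stand-in rather than the paper's kernel.

```python
import numpy as np

def cannon_gemm(A, B, p):
    """Simulate a Cannon-style cyclic-shift GEMM on a virtual p x p core mesh."""
    n = A.shape[0]
    t = n // p  # tile size; assumes square matrices with n divisible by p

    def tile(M, i, j):
        return M[i*t:(i+1)*t, j*t:(j+1)*t].copy()

    # Initial skew: core (i, j) holds A[i, (j+i) % p] and B[(i+j) % p, j].
    a = [[tile(A, i, (j + i) % p) for j in range(p)] for i in range(p)]
    b = [[tile(B, (i + j) % p, j) for j in range(p)] for i in range(p)]
    c = [[np.zeros((t, t), dtype=A.dtype) for _ in range(p)] for _ in range(p)]

    for _ in range(p):
        # Local multiply-accumulate on every core.
        for i in range(p):
            for j in range(p):
                c[i][j] += a[i][j] @ b[i][j]
        # Cyclic shift: A tiles move left along rows, B tiles move up along columns.
        a = [[a[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        b = [[b[(i + 1) % p][j] for j in range(p)] for i in range(p)]

    return np.block(c)

A = np.random.randn(8, 8).astype(np.float32)
B = np.random.randn(8, 8).astype(np.float32)
assert np.allclose(cannon_gemm(A, B, 4), A @ B, atol=1e-4)
```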
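For the decode-phase GEMV, a minimal stand-in for the idea is to split the reduction dimension across cores and then combine the partial output vectors with a tree reduction. The sketch below uses a plain K-ary tree rather than the paper's two-phase K-tree allreduce, and the core count, fan-out, and function name are arbitrary assumptions.

```python
import numpy as np

def mesh_gemv(W, x, num_cores=16, k=4):
    """Sketch of a distributed GEMV: partition the reduction dimension across cores,
    then combine partial results with a K-ary tree reduction (a simplified stand-in
    for the paper's two-phase K-tree allreduce)."""
    partials = [Wc @ xc for Wc, xc in zip(np.array_split(W, num_cores, axis=1),
                                          np.array_split(x, num_cores))]
    # Each round sums groups of up to k partial vectors, so the number of
    # outstanding partials shrinks by a factor of ~k per round (log_k depth).
    while len(partials) > 1:
        partials = [sum(partials[i:i + k]) for i in range(0, len(partials), k)]
    return partials[0]

W = np.random.randn(64, 64).astype(np.float32)
x = np.random.randn(64).astype(np.float32)
assert np.allclose(mesh_gemv(W, x), W @ x, atol=1e-3)
```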

Performance Evaluation

Compared against state-of-the-art baselines such as T10 and Ladder, WaferLLM demonstrates a substantial increase in inference throughput, with marked improvements in both energy efficiency and computation speed:

  • Achieves up to 200x better utilization of wafer-scale accelerators compared to existing systems.
  • Delivers 606x faster GEMV operations than an NVIDIA A100 GPU, while significantly reducing energy costs.
  • Reaches 10-20x end-to-end speedups for full LLM inference over A100 GPU clusters running SGLang and vLLM.

Micro-benchmarks: MeshGEMM yields a 2-3x speedup over leading distributed GEMM algorithms such as SUMMA and Cannon.

Conclusion

WaferLLM effectively harnesses the unique capabilities of wafer-scale accelerators, achieving significant enhancements in LLM inference performance. As wafer-scale computing evolves, WaferLLM's methodologies pave the way for future developments in AI model deployment, emphasizing the role of massive parallelism and distributed architectures. Continued advancements in wafer-scale technology and corresponding software ecosystems are expected to further enhance inference capabilities, making WaferLLM a critical asset in AI computation.
