
Fast Stencil-Code Computation on a Wafer-Scale Processor (2010.03660v1)

Published 7 Oct 2020 in cs.DC

Abstract: The performance of CPU-based and GPU-based systems is often low for PDE codes, where large, sparse, and often structured systems of linear equations must be solved. Iterative solvers are limited by data movement, both between caches and memory and between nodes. Here we describe the solution of such systems of equations on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well. We achieve 0.86 PFLOPS on a single wafer-scale system for the solution by BiCGStab of a linear system arising from a 7-point finite difference stencil on a 600 × 595 × 1536 mesh, achieving about one third of the machine's peak performance. We explain the system, its architecture and programming, and its performance on this problem and related problems. We discuss issues of memory capacity and floating point precision. We outline plans to extend this work towards full applications.

Citations (59)

Summary

  • The paper introduces an innovative use of wafer-scale processors to accelerate stencil-code computations for solving PDEs.
  • The methodology leverages the CS-1’s 380,000 cores and high memory bandwidth to optimize BiCGStab operations on large sparse systems.
  • Performance tests on a 7-point finite difference stencil over a 3D mesh show the CS-1 achieving 0.86 PFLOPS, highlighting its HPC potential.

Fast Stencil-Code Computation on a Wafer-Scale Processor

The paper "Fast Stencil-Code Computation on a Wafer-Scale Processor" explores the application of a unique architectural innovation, a wafer-scale processor, specifically the Cerebras Systems CS-1, to address challenges inherent in solving partial differential equations (PDEs). Traditional CPU and GPU systems struggle with PDEs due to limited memory bandwidth and high communication latency, which are critical factors primarily because these applications usually involve solving large, sparse linear systems using iterative methods. The paper evaluates the efficacy of the Cerebras CS-1 in overcoming these challenges through its high memory bandwidth and reduced communication latency.

Overview of the Wafer-Scale Processor

The CS-1 is a remarkable piece of engineering: the entire processing system is fabricated on a single silicon wafer, circumventing the constraints of traditional chip-based systems. With approximately 380,000 cores and 18 GB of fast on-wafer SRAM, the CS-1 provides very high memory bandwidth and avoids the off-chip data movement that dominates latency on conventional systems. Its architecture uses a 2D mesh interconnection fabric, enabling efficient communication across the wafer's extensive array of processing elements.
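
Putting those two figures together: spread evenly, the on-wafer SRAM works out to roughly 50 KiB per core, which is why problem size is bounded by memory capacity rather than compute. A quick back-of-envelope check (our arithmetic from the numbers above, not a figure quoted by the paper):

```python
# Back-of-envelope: SRAM available per core if the 18 GB of
# on-wafer memory is divided evenly over ~380,000 cores.
cores = 380_000
sram_bytes = 18 * 1024**3            # treating "18 GB" as GiB

per_core_kib = sram_bytes / cores / 1024
print(f"~{per_core_kib:.0f} KiB of SRAM per core")   # ~50 KiB
```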

BiCGStab and Architectural Mapping

The authors implemented the BiCGStab (biconjugate gradient stabilized) method on the CS-1 to solve large sparse linear systems derived from PDEs. They detail how the architecture's features, such as its SIMD capabilities and distributed memory model, were used to map and execute stencil computations efficiently. Central to the implementation was exploiting the CS-1's massive parallelism and carefully orchestrated data movement across its processing cores.
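
For orientation, the BiCGStab iteration itself is compact. The sketch below is the textbook, unpreconditioned algorithm in Python/NumPy, not the authors' CS-1 code; each iteration costs two matrix-vector products (stencil applications) plus a handful of dot products and vector updates, which is exactly the mix of local computation and global communication the mapping has to handle.

```python
import numpy as np

def bicgstab(A, b, x0=None, tol=1e-8, max_iter=1000):
    """Textbook unpreconditioned BiCGStab for A x = b.
    A is anything supporting a matvec via `A @ v` (dense array,
    sparse matrix, or a matrix-free operator)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    r_hat = r.copy()                      # fixed shadow residual
    rho = alpha = omega = 1.0
    p = np.zeros_like(b)
    v = np.zeros_like(b)
    for k in range(max_iter):
        rho_new = r_hat @ r               # dot product: global reduction
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p                         # matvec 1: stencil application
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v
        t = A @ s                         # matvec 2: stencil application
        omega = (t @ s) / (t @ t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho = rho_new
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k + 1               # converged
    return x, max_iter
```

On the CS-1, the matvecs are local stencil updates with neighbor communication over the fabric, while the dot products become fabric-wide reductions; the paper's contribution lies in making both fast at wafer scale.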

The architecture is particularly well suited to BiCGStab because the method is dominated by memory access patterns, stencil-shaped sparse matrix-vector products and global reductions, that map directly onto the CS-1's on-wafer memory and rapid inter-processor communication. These attributes bypass the primary bottlenecks faced by conventional systems, leading to accelerated computation.
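
Concretely, the sparse matrix never needs to be stored in a general format: with a 7-point stencil, each row couples a grid point only to its six axis-aligned neighbors, so the matvec is a purely local update. Below is a matrix-free sketch, using the constant-coefficient 3D Laplacian as a stand-in for the paper's operator (an assumption on our part; the paper's coefficients differ), written so it plugs into the `bicgstab` sketch above.

```python
import numpy as np

def stencil7_apply(u):
    """Matrix-free 7-point stencil matvec on a 3D grid, with
    zero (Dirichlet) values assumed outside the boundary."""
    out = 6.0 * u
    out[1:, :, :]  -= u[:-1, :, :]     # -x neighbor
    out[:-1, :, :] -= u[1:, :, :]      # +x neighbor
    out[:, 1:, :]  -= u[:, :-1, :]     # -y neighbor
    out[:, :-1, :] -= u[:, 1:, :]      # +y neighbor
    out[:, :, 1:]  -= u[:, :, :-1]     # -z neighbor
    out[:, :, :-1] -= u[:, :, 1:]      # +z neighbor
    return out

class Stencil7:
    """Wraps the stencil as a flat operator so `A @ v` works."""
    def __init__(self, shape):
        self.shape = shape
    def __matmul__(self, v):
        return stencil7_apply(v.reshape(self.shape)).ravel()

# Smoke test on a toy grid; the paper's 600 x 595 x 1536 mesh has
# ~5.5e8 unknowns and is sized for the wafer, not a workstation.
shape = (32, 32, 32)
A = Stencil7(shape)
b = np.random.default_rng(0).standard_normal(32 * 32 * 32)
x, iters = bicgstab(A, b, tol=1e-6)
print(iters, np.linalg.norm(A @ x - b))
```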

Performance and Implications

In solving a linear system arising from a 7-point finite difference stencil on a large 3D mesh (600 × 595 × 1536), the CS-1 achieved 0.86 PFLOPS, about one third of the machine's peak capability. This result is significant in the context of high-performance computing (HPC): even the top supercomputers achieve only a small fraction of their floating-point peak on similar problems, held back by the same bandwidth and latency barriers.
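
Taken at face value, those two numbers put the machine's peak in the neighborhood of 2.6 PFLOPS (a derived figure, not one quoted in the paper):

```python
# Implied peak from the quoted figures: 0.86 PFLOPS at ~1/3 of peak.
achieved_pflops = 0.86
fraction_of_peak = 1 / 3
print(f"implied peak ~ {achieved_pflops / fraction_of_peak:.2f} PFLOPS")
```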

Theoretical and Practical Implications

The wafer-scale processor demonstrates a significant advance in handling the sparse matrix computations at the heart of PDE solvers. By showing that such high-throughput computation is achievable for scientific and engineering applications, this work suggests promising future directions, including real-time simulations that were previously unattainable due to resource constraints.

In particular, computational fluid dynamics (CFD) and other fields requiring high-fidelity, computationally intensive simulations may gain new operational efficiencies. Although the wafer's memory size constrains problem scale, the efficiency gains afforded by the architectural paradigm may encourage further investment in larger wafers or multi-wafer systems in the future.
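
A rough working-set count makes the memory constraint concrete. BiCGStab keeps about eight mesh-sized vectors live (x, b, r, the shadow residual, p, v, s, t); on the paper's mesh, at single precision (our assumption, prompted by the abstract's note on floating point precision), that nearly fills the 18 GB of on-wafer SRAM:

```python
# Rough BiCGStab working set on the paper's 600 x 595 x 1536 mesh,
# assuming FP32 vectors (an assumption) and ~8 live vectors.
nx, ny, nz = 600, 595, 1536
points = nx * ny * nz                     # ~5.5e8 unknowns
gib = points * 4 * 8 / 1024**3            # 4 bytes/value, 8 vectors
print(f"{points:,} points -> ~{gib:.1f} GiB")   # ~16.3 GiB of 18 GB SRAM
```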

Future Directions

The paper points to the scalability of the wafer-scale computing approach, especially given prospects for improved fabrication technology and inter-wafer communication strategies. While large applications remain limited by on-wafer memory capacity, continued advances could ease this limitation and make wafer-scale computing indispensable in next-generation HPC systems. Moreover, this research implicitly suggests a convergence of methodologies between traditional numerical computation and emerging frameworks, with potential relevance to AI deployments of similar computational intensity.

Overall, this paper offers compelling evidence of the wafer-scale processor’s capability to mitigate long-standing performance bottlenecks in HPC, potentially reshaping computational paradigms in scientific research and complex modeling tasks.
