
CELLO: Co-designing Schedule and Hybrid Implicit/Explicit Buffer for Complex Tensor Reuse (2303.11499v2)

Published 20 Mar 2023 in cs.DC and cs.AR

Abstract: Tensor algebra accelerators have been gaining popularity for running high-performance computing (HPC) workloads. Identifying optimal schedules for individual tensor operations and designing hardware to run these schedules is an active area of research. Unfortunately, HPC workloads such as Conjugate Gradient often contain operators with skewed shapes, which fundamentally limits the reuse any schedule can leverage. Moreover, the operators form a complex DAG of dependencies, making it challenging to apply simple fusion/pipelining techniques to extract inter-operation reuse. To address these challenges, this work proposes an accelerator, CELLO. CELLO uses a novel on-chip buffer mechanism called CHORD, co-designed with a novel scheduler called SCORE; together they enable identifying and exploiting reuse over complex DAGs of tensor operations. CELLO provides 4x geomean speedup and 4x energy efficiency over state-of-the-art accelerators across HPC workloads.
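
The abstract's point about skewed shapes limiting reuse can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: the function name `arithmetic_intensity` and the example shapes are hypothetical, and it assumes a dense matrix multiply C[M,N] = A[M,K] x B[K,N] in which each operand moves to or from memory exactly once (the best case for any single-operator schedule).

```python
def arithmetic_intensity(m: int, k: int, n: int) -> float:
    """Upper bound on FLOPs per element moved for C[m,n] = A[m,k] @ B[k,n].

    Assumption (illustrative, not from the paper): every element of A, B,
    and C is transferred exactly once, i.e. perfect on-chip reuse.
    """
    flops = 2 * m * k * n                    # one multiply and one add per (m, k, n) triple
    elements_moved = m * k + k * n + m * n   # read A, read B, write C
    return flops / elements_moved


if __name__ == "__main__":
    # A square operator: the reuse bound grows with the matrix dimension.
    print("square 1024 x 1024 x 1024:", arithmetic_intensity(1024, 1024, 1024))
    # A skewed (tall-skinny) operator: the bound is capped by the small dimensions.
    print("skewed 1000000 x 8 x 8:   ", arithmetic_intensity(1_000_000, 8, 8))
```

Under these assumptions the skewed operator's bound stays below 8 FLOPs per element no matter how it is scheduled, while the square operator's bound is in the hundreds. This is the sense in which reuse within a single operator is shape-limited, and why the abstract turns to reuse across operators in the DAG.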
