CELLO: Co-designing Schedule and Hybrid Implicit/Explicit Buffer for Complex Tensor Reuse (2303.11499v2)
Abstract: Tensor algebra accelerators have been gaining popularity for running high-performance computing (HPC) workloads. Identifying optimal schedules for individual tensor operations and designing hardware to run these schedules is an active area of research. Unfortunately, HPC workloads such as Conjugate Gradient often contain operators with skewed shapes, fundamentally limiting the reuse any schedule can leverage. Moreover, the operators form a complex DAG of dependencies, making it challenging to apply simple fusion/pipelining techniques to extract inter-operation reuse. To address these challenges, this work proposes an accelerator called CELLO. CELLO uses a novel on-chip buffer mechanism called CHORD, co-designed with a novel scheduler called SCORE, which together enable identifying and exploiting reuse over complex DAGs of tensor operations. CELLO provides 4x geomean speedup and 4x energy efficiency over state-of-the-art accelerators across HPC workloads.