SISA: A Scale-In Systolic Array for GEMM Acceleration

Published 31 Mar 2026 in cs.AR and cs.AI | (2603.29913v1)

Abstract: The currently dominant AI/ML workloads, such as LLMs, rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.

Summary

  • The paper demonstrates that slab-partitioned systolic arrays can adapt to varied GEMM shapes, enabling up to 8.52× speedup for LLM workloads.
  • It details a flexible fusion and power gating mechanism that reduces energy-delay product by up to 93% for suboptimal matrix dimensions.
  • The design incurs minimal hardware overhead with less than a 5.5% area increase compared to monolithic systolic arrays like TPUv4.

SISA: Scale-In Systolic Array for GEMM Acceleration

Introduction

The proliferation of LLMs and other DL models has established General Matrix-Matrix Multiplication (GEMM) as a fundamental primitive in contemporary AI hardware. Conventional systolic array (SA) architectures—exemplified by monolithic, fixed-size arrays as in TPUv4—exhibit optimal compute efficiency only for square or large, regular GEMM shapes. However, prevalent LLM workloads induce high variability and skewness in matrix dimensions, leading to extensive underutilization of processing elements (PEs). This architectural inefficiency is pronounced in prompt-prefill scenarios, for which token sequence lengths are typically small, producing tall-skinny or short-wide matrices.

SISA (Scale-In Systolic Array) introduces a slab-based architectural paradigm, where a square SA is partitioned into independently schedulable horizontal slabs. Slabs can operate independently, be power-gated, or fused at runtime—enabling robust adaptation to a wide spectrum of GEMM shapes without incurring the memory and area penalties typical of alternative reconfigurable or pod-based SAs.

SISA Architecture

SISA employs a regular $n \times n$ SA partitioned along the vertical axis into $k$ horizontal slabs, each slab encompassing a contiguous subset of rows. Each slab is equipped with slab-local activation and weight buffers. Slabs can be operated in three modes: independently for highly skewed, small-$M$ GEMMs; fused into taller logical slabs for intermediate $M$; or as a fully monolithic array for large-$M$ computations. Output is routed to a global output buffer organized for minimal bandwidth contention. All buffer organizations support double buffering for data movement overlap.

Figure 1: High-level SISA architecture, showing memory hierarchy and the slab fusion mechanism.
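The effect of double buffering on tile latency can be illustrated with a simple analytical model. This is only a sketch: the function names, the uniform per-tile load and compute times, and the steady-state formula are assumptions made for illustration and are not taken from the paper.

```python
def serial_latency(num_tiles: int, load: float, compute: float) -> float:
    """Single buffer: each tile is loaded, then computed, with no overlap."""
    if num_tiles <= 0:
        return 0.0
    return num_tiles * (load + compute)


def double_buffered_latency(num_tiles: int, load: float, compute: float) -> float:
    """Double buffer: the next tile loads while the current tile computes.

    The first load and the last compute cannot be hidden; each step in
    between costs the slower of the two operations.
    """
    if num_tiles <= 0:
        return 0.0
    return load + (num_tiles - 1) * max(load, compute) + compute


if __name__ == "__main__":
    # Hypothetical numbers: 4 tiles, 2 time units to load, 3 to compute.
    print(serial_latency(4, 2.0, 3.0))
    print(double_buffered_latency(4, 2.0, 3.0))
```

Under this model the overlapped schedule approaches `num_tiles * max(load, compute)` as the tile count grows, which is why buffers sized for two tiles are a common choice in systolic designs.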

The fusion and bypass mechanisms in SISA employ lightweight multiplexers for inter-slab forwarding of weights, allowing tiled compositions along the height dimension with minimal critical-path impact. Power-gating is realized at slab granularity, decreasing leakage for underutilized configurations—particularly relevant when real-world workloads have short-token sequences and thus require only subsets of the array.
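The three operating modes can be expressed as a small selection rule over the GEMM height $M$. The sketch below is a hypothetical illustration: `SlabMode`, `select_mode`, the slab count `k = 8`, and the exact thresholds are assumptions; the source only states that small $M$ runs on independent slabs, intermediate $M$ on fused slabs, and large $M$ on the full array.

```python
from enum import Enum


class SlabMode(Enum):
    INDEPENDENT = "independent"  # each slab works on its own tile
    FUSED = "fused"              # several slabs form one taller logical slab
    MONOLITHIC = "monolithic"    # all slabs act as one n x n array


def select_mode(m: int, n: int = 128, k: int = 8) -> SlabMode:
    """Pick an operating mode for a GEMM with m rows on an n x n array of k slabs.

    k = 8 is an assumed value; the 128 x 128 size matches the evaluated array.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    if k <= 0 or n % k != 0:
        raise ValueError("n must be divisible by a positive k")
    slab_height = n // k
    slabs_needed = -(-m // slab_height)  # ceiling division
    if slabs_needed <= 1:
        return SlabMode.INDEPENDENT
    if slabs_needed < k:
        return SlabMode.FUSED
    return SlabMode.MONOLITHIC


if __name__ == "__main__":
    for m in (8, 16, 40, 128, 4096):
        print(m, select_mode(m).value)
```

With these assumed parameters each slab is 16 rows tall, so any GEMM with at most 16 rows stays in independent mode and anything needing all eight slabs is treated as monolithic.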

Scheduling and Tiling

GEMM workloads are decomposed into tiles mapped onto slabs, exposing parallelism across multiple slabs for batched decode or short prompts. The tiling routine statically maps tiles given buffer constraints and slab configuration. For best-case utilization, when $M$ matches a slab's height, all PEs within each slab are engaged in independent tile processing over the $N$ dimension. For taller GEMMs ($M > h_{\text{slab}}$), slabs are fused until the computation fits, trading off parallelism for improved intra-slab data reuse. With $M > n$, the full array is fused; residual computations are processed using slab-aligned, possibly power-gated executions.

This tiling and scheduling logic ensures that SISA retains high utilization across the variable GEMM shapes naturally encountered in LLMs, with minimal hardware complexity and modest buffer overheads compared with aggressive multi-pod or fully reconfigurable designs.
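The fusion rule described in this section can be sketched as a static planning routine. This is a simplified model under stated assumptions, not the paper's scheduler: `SlabPlan`, `plan_pass`, `plan_gemm`, the slab count `k = 8`, the choice to fuse the smallest number of slabs that covers $M$, and the treatment of leftover slabs as power-gated are all hypothetical. Buffer constraints are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlabPlan:
    fused_slabs: int        # slabs per logical group
    parallel_groups: int    # groups that can process different N tiles at once
    gated_slabs: int        # leftover slabs, assumed to be power-gated
    row_utilization: float  # share of rows in one group holding real M rows


def plan_pass(m: int, n: int = 128, k: int = 8) -> SlabPlan:
    """Plan one pass for at most n rows of M on an n x n array with k slabs."""
    if m <= 0:
        raise ValueError("m must be positive")
    if k <= 0 or n % k != 0:
        raise ValueError("n must be divisible by a positive k")
    slab_height = n // k
    rows = min(m, n)
    fused = min(k, -(-rows // slab_height))  # fuse until the rows fit
    groups = k // fused
    gated = k - groups * fused
    return SlabPlan(fused, groups, gated, rows / (fused * slab_height))


def plan_gemm(m: int, n: int = 128, k: int = 8) -> list[SlabPlan]:
    """Split M into full-array passes plus one slab-aligned residual pass."""
    full_passes, residual = divmod(m, n)
    plans = [plan_pass(n, n, k) for _ in range(full_passes)]
    if residual:
        plans.append(plan_pass(residual, n, k))
    return plans


if __name__ == "__main__":
    print(plan_pass(16))   # one slab per group, eight groups in parallel
    print(plan_pass(40))   # three slabs fused, two groups, two slabs idle
    print(plan_gemm(300))  # two full-array passes plus a 44-row residual
```

The model makes the trade-off visible: fusing more slabs lowers `parallel_groups`, so fewer tiles along $N$ proceed concurrently, while each group covers more rows of $M$.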

Empirical Results

SISA was synthesized for a $128 \times 128$ BF16 PE array, targeting technology and on-chip memory budgets comparable to those of TPUv4 and ReDas. The core empirical findings are:

  • Speedup: SISA provides up to 8.52× speedup for small-$M$ GEMMs over a monolithic TPU baseline. This advantage is most significant for interactive LLM workloads with short prompts, where matrix shape mismatch is acute.

    Figure 2: SISA speedup versus TPU baseline across sequence lengths and LLM models.

  • Energy-Delay Product (EDP): In the same regime, SISA delivers up to 93% EDP reduction; the average EDP gain persists across most sequence lengths below […].

    Figure 3: SISA normalized EDP versus TPU. Substantial gains for typical LLM workload sizes.

  • Worst-Case Penalty: For square or large-$M$ GEMMs, SISA's overhead is limited to an 8.47% EDP delta, attributable to buffer multiplexing and minor leakage from additional SRAM and gating logic.
  • Area Efficiency: The architectural overhead, including slab fusion, power gating, and more complex buffering, leads to less than 5.5% total area increase over a baseline TPU implementation. This is markedly below the 70%+ per-PE area and […] per-PE power overhead reported in ReDas for INT8, which only worsens for BF16.
  • Comparison with ReDas: Across […] to […], SISA achieves up to […] speedup versus ReDas. Cases where ReDas outperforms SISA are rare and generally limited to specific intermediate $M$ where ReDas' coarse-granularity reshaping is optimal.

    Figure 4: Speedup of SISA compared to ReDas, emphasizing consistent gains especially for small-batch and skewed GEMMs.

  • Layer-Wise Latency: For Qwen2.5-0.5B, at small $M$ (best case) SISA realizes substantial latency reduction across all GEMM layers, whereas for unfavorably shaped $M$ SISA and ReDas are nearly matched, confirming the adaptability and steady lower-bound latency profile of SISA.

    Figure 5: Latency analysis of Qwen2.5-0.5B layers at best and worst case, weighted by layer occurrence.
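The relationship between the speedup and EDP figures above follows from the definition of energy-delay product, $\text{EDP} = E \times T$. The sketch below works through that arithmetic. The function names are hypothetical, and combining the 8.52× and 93% figures assumes they occur at the same operating point, which the source does not state.

```python
def normalized_edp(energy_ratio: float, speedup: float) -> float:
    """EDP of a design relative to a baseline.

    energy_ratio = E_design / E_baseline
    speedup      = T_baseline / T_design
    Since EDP = E * T, the EDP ratio is energy_ratio / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return energy_ratio / speedup


def edp_reduction(energy_ratio: float, speedup: float) -> float:
    """Fractional EDP reduction; negative values mean an EDP penalty."""
    return 1.0 - normalized_edp(energy_ratio, speedup)


def implied_energy_ratio(reduction: float, speedup: float) -> float:
    """Energy ratio implied by a given EDP reduction and speedup."""
    return (1.0 - reduction) * speedup


if __name__ == "__main__":
    # Same energy, twice as fast: EDP halves.
    print(edp_reduction(1.0, 2.0))
    # If 93% EDP reduction and 8.52x speedup coincided (an assumption),
    # the design would use roughly 0.6x the baseline energy.
    print(round(implied_energy_ratio(0.93, 8.52), 4))
    # An 8.47% EDP delta at equal latency corresponds to a ratio of 1.0847.
    print(normalized_edp(1.0847, 1.0))
```

Because delay enters EDP multiplicatively, a large speedup alone accounts for most of a large EDP reduction, even when the energy per operation changes only moderately.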

Implications and Future Directions

SISA addresses a fundamental architectural inefficiency—SA underutilization—prevalent in LLM-serving scenarios where prompt lengths (and thus matrix outer dimensions) are variable and often much smaller than the array's height. The design integrates lightweight adaptation strategies (slab partitioning, power-gating, efficient fusion) with minimal cost to hardware complexity and area.

Practically, SISA enables accelerator vendors and datacenter operators to provision fewer cycles (and therefore joules) for popular workloads, particularly under realistic QoS constraints with small to moderate batch sizes. The theoretical implication is the demonstration that slab-level parallelism, when carefully orchestrated, can realize finer-grained adaptation to skewed matrix shapes without the prohibitive area/power overheads of pod-based or highly-dynamic designs.

Future development in SA-based accelerators will likely combine SISA-like granularity (slab-based partitioning) with improved interconnect and global control strategies, possibly integrating workload-adaptive reconfiguration in conjunction with memory-side compression or partitioned attestation for multi-tenant regimes. Further extensions to SISA could focus on support for mixed-precision execution, dynamic sparsity exploitation, or further SRAM/DRAM interface optimization to close the remaining efficiency gap in boundary cases.

Conclusion

SISA demonstrates that a modestly-complex slab-partitioned SA can deliver substantial speedup and energy efficiency improvements for GEMM-centric workloads manifesting high matrix-shape skewness, with concrete gains over both monolithic (e.g., TPUv4) and flexible but area-heavy (e.g., ReDas) baselines. Its performance, power, and area profiles underscore its suitability for modern AI inference (especially LLM prefill/decode) deployments where resource underutilization is both common and expensive. The SISA design provides a compelling architectural reference point for next-generation GEMM accelerators prioritizing scalable utilization and minimal overhead across highly dynamic inference workloads (2603.29913).
