
Reconfigurable Systolic Arrays

Updated 5 June 2026
  • Reconfigurable systolic arrays are parallel hardware architectures that dynamically adjust their configuration to maximize processing element utilization and energy efficiency.
  • They incorporate physical partitioning, dynamic dataflow switching, and multi-mode buffers to efficiently map diverse computational kernels such as DNN layers and signal processing tasks.
  • Design methodologies leverage compile-time heuristics, ML-based mapping, and software-defined models to achieve significant improvements in performance, energy efficiency, and reliability.

A reconfigurable systolic array is a parallel hardware architecture that dynamically alters its structure, size, dataflow, or function to optimally map a wide range of computational kernels and data shapes. Unlike traditional fixed-dimensional systolic arrays, which suffer from poor utilization when the problem shape mismatches hardware, reconfigurable systolic arrays incorporate architectural, microarchitectural, or control-level mechanisms to support runtime flexibility. They are designed to maximize processing element (PE) utilization, data reuse, and energy efficiency across diverse workloads such as deep neural networks (DNNs), signal processing, mixed-precision arithmetic, and even high-level algorithmic recurrences.

1. Architectural Principles and Motivation

Conventional systolic arrays employ a rigid 2D mesh of PEs with nearest-neighbor interconnects, optimized for dense matrix multiplication (GEMM) of a static shape. This rigidity leads to underutilization and wasted energy, particularly for pruned DNNs (where many weight slices are zero) and for workloads such as LSTMs and depthwise convolutions whose GEMM dimensions may not match the hardware array size. Reconfigurability addresses these utilization and efficiency bottlenecks by enabling the array to be partitioned, reshaped, or mode-switched at fine granularity to fit the needs of each kernel or layer (Lym et al., 2020, Han et al., 2023).
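The utilization penalty of a shape mismatch can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a simplified model in which two GEMM dimensions are mapped spatially onto the array and partially filled tiles are padded. The function name `pe_utilization` and its parameters are hypothetical and do not come from any of the cited papers.

```python
import math


def pe_utilization(m: int, n: int, rows: int, cols: int) -> float:
    """Fraction of PE slots doing useful work when an m x n problem is
    tiled onto a rows x cols array (assumed model: padded tiles)."""
    tiles = math.ceil(m / rows) * math.ceil(n / cols)
    return (m * n) / (tiles * rows * cols)


# A skinny 32 x 512 problem on a fixed 128 x 128 array wastes most PEs.
print(pe_utilization(32, 512, 128, 128))  # 0.25
# The same 16384 PEs reshaped to 32 x 512 are fully used.
print(pe_utilization(32, 512, 32, 512))   # 1.0
```

Under this model the fixed square array and the reshaped array have the same PE count, so the difference in utilization comes entirely from the shape.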

Design methodologies for reconfigurability span several axes:

  • Physical partitioning: subdivision into subarrays or sub-cores, with reconfigurable interconnect to combine or isolate sections.
  • Topological reshaping: dynamic selection of mesh shape or dimension (e.g., tall, wide, square, 1D chain).
  • Runtime dataflow switching: per-tile or per-wave selection among output-, weight-, or input-stationary dataflows.
  • Elementwise operation switching: embedding non-matrix or nonlinear functions in PEs for tasks such as activation or attention (Sun et al., 2024, Lin et al., 15 Jul 2025).
  • Fault tolerance and redundancy: dynamic allocation of PEs to redundant groups for reliability enhancement (Cherezova et al., 6 Mar 2025).
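The redundancy axis in the last item can be illustrated with a small software model. This is a minimal sketch, assuming that redundant PE groups produce integer results that are compared or voted on. The mode labels `PM`, `DMR`, and `TMR` (performance mode, dual and triple modular redundancy) and the function `resolve` are hypothetical names for illustration, not the FORTALESA implementation.

```python
from typing import Sequence, Tuple


def resolve(mode: str, outputs: Sequence[int]) -> Tuple[int, bool]:
    """Combine redundant PE outputs; returns (value, fault_detected)."""
    if mode == "PM":
        # Performance mode: no redundancy, the single result passes through.
        return outputs[0], False
    if mode == "DMR":
        # Dual redundancy: a mismatch is detected but cannot be corrected.
        a, b = outputs[:2]
        return a, a != b
    if mode == "TMR":
        # Triple redundancy: bitwise majority vote masks one faulty copy.
        a, b, c = outputs[:3]
        voted = (a & b) | (a & c) | (b & c)
        return voted, not (a == b == c)
    raise ValueError(f"unknown mode: {mode}")


print(resolve("TMR", [5, 5, 7]))  # (5, True): fault masked and flagged
```

The trade-off shown here is the usual one: TMR spends three PEs per result to correct a single fault, DMR spends two to detect it, and performance mode spends one with no protection.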

2. Core Microarchitectural Mechanisms

Reconfigurable systolic arrays are realized using various microarchitectural strategies:

  • Multiplexed Interconnects and Switch Networks: Local or semi-global wiring augmented with multiplexers allows subarray boundaries to be traversed or isolated at runtime. For example, FlexSA decomposes a 128×128 array into four 64×64 sub-cores, interconnected through a network of 1:2 MUXes and additional horizontal/vertical bus lines, reconfigured each wave by a single instruction (Lym et al., 2020). ReDas adds roundabout local links and crossbars between adjacent PEs, avoiding long wires while supporting 129 logical shapes on a 128×128 physical grid (Han et al., 2023).
  • Banked and Multi-Mode Buffers: To adapt on-chip SRAM and input/output staging, arrays are bordered by multi-mode buffers whose role (input, weight, output, or idle) can be reassigned dynamically per tile or per dataflow. Buffer partitioning and bandwidth constraints are enforced by simple FSM controllers and banked single-port SRAM layouts, as in ReDas (Han et al., 2023).
  • Dynamic Control and Reconfiguration Interface: Reconfiguration instructions are typically a compact bitmask or mode word, broadcast to the array per tile or per systolic “wave.” The configuration logic sets all mux select lines, local crossbar states, and data buffer roles atomically, usually at a reconfiguration cost of at most one cycle, which is amortized over the much longer kernel compute phase.
  • PE Dual-Mode or Tri-Mode Data Paths: PE datapaths may be extended to support dual operation (e.g., multiply-accumulate or elementwise nonlinear function; forward or accumulate modes) using a small number of mode control signals and additional logic. For nonlinear compute, as in ONE-SA (Sun et al., 2024), select PEs can execute piecewise-linear approximations, with other PEs forwarding streams as needed.
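A compact mode word of the kind described above can be sketched as a packed bit field. The layout here is purely an assumption for illustration (2 bits for the partition mode, 2 bits for the dataflow, and 2 bits for each of four border buffers). The names `Config`, `encode`, and `decode` are hypothetical and do not reflect the instruction format of any cited design.

```python
from enum import IntEnum
from typing import NamedTuple, Tuple


class Partition(IntEnum):
    FULL = 0
    VERTICAL = 1
    HORIZONTAL = 2
    INDEPENDENT = 3


class Dataflow(IntEnum):
    OUTPUT_STATIONARY = 0
    WEIGHT_STATIONARY = 1
    INPUT_STATIONARY = 2


class BufferRole(IntEnum):
    IDLE = 0
    INPUT = 1
    WEIGHT = 2
    OUTPUT = 3


class Config(NamedTuple):
    partition: Partition
    dataflow: Dataflow
    buffer_roles: Tuple[BufferRole, BufferRole, BufferRole, BufferRole]


def encode(cfg: Config) -> int:
    """Pack a configuration into a 12-bit mode word (assumed layout)."""
    word = int(cfg.partition) | (int(cfg.dataflow) << 2)
    for i, role in enumerate(cfg.buffer_roles):
        word |= int(role) << (4 + 2 * i)
    return word


def decode(word: int) -> Config:
    """Unpack a mode word; raises ValueError on an undefined dataflow code."""
    partition = Partition(word & 0b11)
    dataflow = Dataflow((word >> 2) & 0b11)
    roles = tuple(BufferRole((word >> (4 + 2 * i)) & 0b11) for i in range(4))
    return Config(partition, dataflow, roles)


cfg = Config(
    Partition.VERTICAL,
    Dataflow.WEIGHT_STATIONARY,
    (BufferRole.INPUT, BufferRole.WEIGHT, BufferRole.OUTPUT, BufferRole.IDLE),
)
assert decode(encode(cfg)) == cfg
```

Because the whole configuration fits in one word, it can be broadcast and applied in a single step, which is what makes the low reconfiguration cost plausible.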

3. Operating Modes, Dataflows, and Flexibility

A hallmark of advanced reconfigurable systolic arrays is the presence of multiple runtime operating modes, which can be toggled per computational tile/workload batch:

  • Subarray Partitioning: FlexSA, for example, offers four modes: Full-Wave (all sub-cores merged), Vertical Sub-Wave (two tall arrays), Horizontal Sub-Wave (two wide arrays), and Independent Sub-Wave (four independent arrays, minimum spatial reuse) (Lym et al., 2020). The mode is chosen dynamically by a heuristic that seeks to maximize PE utilization and data reuse.
  • Fine-Grained Reshaping: ReDas enables logical grid sizes from 1×508 up to 128×128 by re-routing local PE links; such granularity allows matching PEs exactly to any GEMM or convolutional layer’s input and output dimensions (Han et al., 2023).
  • Multi-Dataflow Support: Dataflow flexibility, meaning the choice among output-, input-, or weight-stationary operation, is essential for maximizing reuse and minimizing on-chip and off-chip memory traffic. Arrays with per-tile dataflow selection can adapt to the optimal movement pattern for each DNN layer (Han et al., 2023).
  • Algorithm-Specific Reconfiguration: Specializations like FSA for attention (embedding reduction and nonlinear ops into the array alongside multiplications) or ArrayFlex (pipeline depth collapsing) extend reconfigurability beyond just GEMM mapping (Lin et al., 15 Jul 2025, Peltekis et al., 2022).
  • Dynamic Fault Tolerance: Structural redundancy, such as runtime switching between performance mode, dual modular redundancy, and triple modular redundancy, is used to adapt reliability in safety-critical deployments, as in FORTALESA (Cherezova et al., 6 Mar 2025).
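The subarray partitioning modes above lend themselves to a simple selection rule. The following is a minimal sketch, not the FlexSA heuristic itself. It assumes a 128×128 array built from four 64×64 sub-cores, takes "tall" to mean 128 rows by 64 columns, models a batch of identical h×w waves, and breaks utilization ties in favor of larger merged arrays (which offer more reuse). The names `MODES`, `utilization`, and `pick_mode` are hypothetical.

```python
import math
from typing import Dict, Tuple

SUB = 64  # assumed sub-core edge length

# mode -> (rows, cols, number of parallel sub-arrays); order encodes tie-break priority
MODES: Dict[str, Tuple[int, int, int]] = {
    "FW": (2 * SUB, 2 * SUB, 1),   # Full-Wave: one merged array
    "HSW": (SUB, 2 * SUB, 2),      # Horizontal Sub-Wave: two wide arrays
    "VSW": (2 * SUB, SUB, 2),      # Vertical Sub-Wave: two tall arrays
    "ISW": (SUB, SUB, 4),          # Independent Sub-Wave: four arrays
}


def utilization(mode: str, h: int, w: int, count: int) -> float:
    """PE utilization for `count` identical h x w waves in the given mode."""
    rows, cols, k = MODES[mode]
    passes = count * math.ceil(h / rows) * math.ceil(w / cols)
    steps = math.ceil(passes / k)  # k sub-arrays work in parallel
    return (count * h * w) / (steps * k * rows * cols)


def pick_mode(h: int, w: int, count: int) -> str:
    """Highest-utilization mode; max() keeps the first (most merged) on ties."""
    return max(MODES, key=lambda m: utilization(m, h, w, count))


print(pick_mode(128, 128, 1))  # FW
print(pick_mode(64, 128, 2))   # HSW
print(pick_mode(64, 64, 4))    # ISW
```

In this model the merged mode wins whenever the wave fills the full array, and the array is split only when a smaller wave would otherwise leave sub-cores idle.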

4. Compilation, Mapping, and Automation

Reconfigurable arrays necessitate compilers and mappers to select the optimal configuration for each kernel or layer of a workload:

  • Compile-Time Mapping Heuristics: FlexSA uses heuristics to tile large matrix multiplications into partitions that best exploit the available modes, preferring Full-Wave, then Horizontal or Vertical Sub-Wave, then Independent Sub-Wave (FW > HSW/VSW > ISW), parameterized by tile size and buffer capacity (Lym et al., 2020). ReDas formalizes configuration as an optimization problem that maximizes utilization subject to array-shape and buffer-fit constraints, then uses analytical models to rank candidates (Han et al., 2023).
  • Polyhedral Space-Time Scheduling: WideSA (Dai et al., 2024) extends the classical polyhedral model to programmatically generate legal systolic mappings (array partition, schedule, tile sizes) for nested uniform recurrences, auto-selecting loop bands to map onto spatial and temporal dimensions of an AIE mesh.
  • ML-Based Online Configuration: SARA/SAGAR implements an integrated neural network (ADAPTNET) whose inference engine (ADAPTNETX) predicts, at runtime, the optimal partition and dataflow for incoming workload parameters. ADAPTNETX configures the PE mux network in <600 cycles, ensuring all mode toggling is overlapped with computation (Samajdar et al., 2021).
  • Portable and Software-Defined Models: Tools such as Cyclotron and the language/compiler for systolic GPU mapping enable high-level declarations of recurrences or projections to be automatically lowered to reconfigurable hardware or software systolic arrays (Rong et al., 2020, Sundram et al., 13 Nov 2025).
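The constrained-optimization view of mapping can be sketched as a brute-force search over logical shapes. This is an illustrative toy, not the ReDas analytical model: it assumes candidate shapes are all factorizations of the PE count, a hypothetical buffer-fit rule in which one tile must hold `rows + cols` operand vectors of reduction depth `k`, and the padded-tile utilization model. All names (`candidate_shapes`, `best_shape`, `buffer_bytes`, `elem_bytes`) are assumptions.

```python
import math
from typing import List, Optional, Tuple


def candidate_shapes(total_pes: int) -> List[Tuple[int, int]]:
    """All rows x cols factorizations of the PE budget."""
    return [(r, total_pes // r) for r in range(1, total_pes + 1) if total_pes % r == 0]


def best_shape(m: int, n: int, k: int, total_pes: int,
               buffer_bytes: int, elem_bytes: int = 1) -> Optional[Tuple[int, int]]:
    """Shape with the highest utilization that satisfies the buffer-fit rule.

    Ties are broken toward the smaller operand footprint (rows + cols).
    Returns None when no shape fits the buffer.
    """
    best, best_key = None, None
    for rows, cols in candidate_shapes(total_pes):
        # Assumed constraint: operands for one tile must fit on chip.
        if (rows + cols) * k * elem_bytes > buffer_bytes:
            continue
        tiles = math.ceil(m / rows) * math.ceil(n / cols)
        util = (m * n) / (tiles * total_pes)
        key = (util, -(rows + cols))
        if best_key is None or key > best_key:
            best, best_key = (rows, cols), key
    return best


# 32 x 512 outputs, reduction depth 256, 16384 PEs, 256 KiB of buffer.
print(best_shape(32, 512, 256, 128 * 128, 256 * 1024))  # (32, 512)
# Halving the buffer rules out the ideal shape and forces a compromise.
print(best_shape(32, 512, 256, 128 * 128, 128 * 1024))  # (64, 256)
```

A real mapper would replace the exhaustive loop with an analytical ranking or a learned predictor, but the structure (enumerate, filter by constraints, rank by objective) is the same.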

5. Practical Implementations and Measured Impact

Empirical studies show reconfigurable systolic arrays deliver substantial improvements in utilization, performance, energy efficiency, and flexibility across various benchmarks and platforms:

  • FlexSA achieves up to 37% greater PE utilization, 1.7× higher on-chip data reuse, and 28% dynamic energy savings for pruned DNN training relative to fixed arrays (Lym et al., 2020).
  • ReDas delivers 4.6× average speedup and 8.3× EDP reduction across a representative set of MLPerf models, and supports 129 logical shapes with only 13% area overhead (Han et al., 2023).
  • SARA/SAGAR’s ML-driven partitioning attains 2.8× (SAGAR) to 3.2× (distributed) speedup over monolithic arrays, with a compute density of 400 GOPS/mm² and energy efficiency of ~8.4 TOPS/W (Samajdar et al., 2021).
  • Bitwise reconfigurable arrays (BitSys) enable runtime-tunable mixed-precision multiplication (1/2/4/8b) for QNN inference with single-digit-nanosecond reconfiguration, achieving up to 3.6× speedup on FPGA vs. 8b-only designs (Liu et al., 26 Feb 2026).
  • Spectrally agile spatial arrays for wireless numerics approach the area and throughput efficiency of specialized FIR/FFT cores when compute-bound, with reconfiguration in 32 ns and only a modest area premium (Rasteh et al., 3 Dec 2025).
  • Fault-tolerant architectures like FORTALESA offer up to 3× speedup (over static TMR designs), 6× resource savings, and layer-adaptive reliability while maintaining high performance via runtime mode switching (Cherezova et al., 6 Mar 2025).
  • Small-granularity clusters of systolic arrays (for Winograd or sparse kernels) retain >90% PE utilization by per-layer reconfiguration of tiling and cluster count (Shi et al., 2018).

6. Limitations, Trade-Offs, and Future Directions

Architectural reconfigurability introduces certain overheads:

  • Hardware complexity: Local crossbars, muxes, and extended PE logic contribute to area overhead (typically 1.3–16%).
  • Timing closure: Deeply pipelined or long-range bypass links, if not carefully segmented, may degrade achievable clock rates.
  • Power/area scalability: Extremely fine-grained reconfigurability (e.g., per-PE in ReDas) must balance wire length, congestion, and control complexities, though local-link approaches mitigate some of these costs.
  • Mapping overhead: Compile-time or ML-based mapping requires sufficient accuracy and speed to keep pace with dynamic workloads, particularly for rapidly switching kernels or data modes.

Emerging research focal points include: embedding more nonlinear and attention-style operators directly into systolic arrays (Lin et al., 15 Jul 2025), tighter integration with manycore and shared-memory paradigms (Mazzola et al., 2024), combining fault tolerance and reconfigurability (Cherezova et al., 6 Mar 2025), and end-to-end workflows that automatically bridge programming models, mapping, and hardware synthesis (Rong et al., 2020, Sundram et al., 13 Nov 2025).

7. Representative Implementations and Comparative Table

A selection of notable reconfigurable systolic array architectures, their core mechanisms, and reported impact is provided below:

Architecture | Reconfiguration Mechanism | Reported Impact
FlexSA (Lym et al., 2020) | MUX-based 4-way sub-core merge/split | +37% PE utilization, 1.7× reuse, −28% energy
ReDas (Han et al., 2023) | Fine-grained local roundabout links | 129 shapes, 4.6× speedup, 13% area overhead
SAGAR (Samajdar et al., 2021) | ML-driven PE cell partition and dataflow | 2.8×–3.2× speedup, 400 GOPS/mm², 99.93% oracle
ArrayFlex (Peltekis et al., 2022) | Per-layer transparent pipeline collapsing | 9–11% latency, 13–23% power savings
FSA (Lin et al., 15 Jul 2025) | Nonlinear/softmax in-PE + dual path | 1.77×–4.83× attention FLOPs/s uplift, ~10% area overhead
BitSys (Liu et al., 26 Feb 2026) | Per-layer bitmask configuration of PE logic | 1.3–3.6× speedup (mixed precision), 3-cycle reconfiguration
FORTALESA (Cherezova et al., 6 Mar 2025) | Mode-switching redundancy, TMR/DMR/PM | 3× speedup, 6× resource savings vs. static TMR

This diversity of approaches highlights the central role of reconfigurability in modern systolic-array compute substrates, spanning from efficient DNN training on pruned and structured sparse models to fault-tolerant and spectrally agile wireless processing. The field continues to evolve towards tighter fusion of algorithm, compiler, and hardware, offering adaptation across performance, reliability, energy, and functional axes.
