Flattened Systolic Array Architecture

Updated 11 March 2026

Flattened systolic arrays are reconfigurable designs that transform 2D grids into 1D or elongated pipelines while preserving native dataflow.
They employ hardware microarchitectural enhancements and software space–time mapping to optimize performance for asymmetric workloads such as GEMM and LSTMs.
In evaluations of ReDas, flattening yields an average 4.6× speedup and an 8.3× reduction in energy-delay product (EDP) on DNN workloads.

A flattened systolic array is a programmable hardware or software architecture in which an originally two-dimensional (or N-dimensional) systolic array is dynamically reconfigured into one-dimensional or highly elongated forms through logical reshaping, space–time transformation, or microarchitectural enhancements. The reconfiguration preserves the native systolic dataflow and maximizes hardware utilization for workloads with strong dimensional imbalance. The term covers hardware schemes such as ReDas, which reshapes the logical topology of a fixed physical array, and FSA, which keeps the array shape but fuses an entire operation sequence into it, as well as software-virtualized models that map linear space–time projections onto generic parallel processors.

1. Definition and Formalism

A flattened systolic array is an instantiation of systolic computing in which the arrangement of processing elements (PEs) is dynamically reconfigured from a canonical grid shape (e.g., $128 \times 128$) into extreme aspect ratios (e.g., $1 \times N$ or $N \times 1$), or, in software, into a single-lane or vectorized schedule via linear mapping of the computational dependence graph. The configuration is typically governed by a projection $\pi: \mathbb{Z}^n \rightarrow \mathbb{Z}$ and a schedule $\theta: \mathbb{Z}^n \rightarrow \mathbb{Z}$, such that each iteration $z \in \mathbb{Z}^n$ is mapped onto PE index $s = \pi(z)$ and time $t = \theta(z)$. The correctness of such flattenings relies on space–time transformation matrices and recurrence equations that preserve causality and avoid PE-time conflicts (Rong et al., 2020).
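
The two correctness conditions named above (causality and absence of PE-time conflicts) can be checked by brute force on a small iteration domain. The following is a minimal sketch, assuming linear maps given as integer coefficient tuples; the helper names `dot` and `check_mapping`, the loop extents, and the dependence vector are illustrative choices, not part of any cited framework. It tests the 1-D convolution mapping $s = c + q$, $t = q$ discussed later in the article against a deliberately conflicting alternative.

```python
from itertools import product

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def check_mapping(pi, theta, extents, deps):
    """Return (causal, conflict_free) for s = pi . z and t = theta . z.

    pi, theta : integer coefficient tuples of the linear maps (assumed form)
    extents   : loop trip counts defining the iteration domain
    deps      : dependence vectors (consumer iteration minus producer)
    """
    # Causality: every dependence must advance time by at least one step.
    causal = all(dot(theta, d) >= 1 for d in deps)
    # Conflict-freedom: no two iterations may share a (PE, time) slot.
    seen = set()
    conflict_free = True
    for z in product(*(range(n) for n in extents)):
        slot = (dot(pi, z), dot(theta, z))
        if slot in seen:
            conflict_free = False
            break
        seen.add(slot)
    return causal, conflict_free

# Iterations (c, q) of a 1-D convolution; the accumulation runs along q.
good = check_mapping(pi=(1, 1), theta=(0, 1), extents=(6, 3), deps=[(0, 1)])
bad = check_mapping(pi=(1, 1), theta=(1, 1), extents=(6, 3), deps=[(0, 1)])
print(good, bad)
```

The second mapping fails because iterations $(0, 1)$ and $(1, 0)$ land on the same PE at the same time step, which is exactly the kind of conflict a valid flattening must rule out.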

For hardware arrays such as ReDas, the flattening is realized via reconfigurable data paths and local interconnects, supporting up to 129 logical shapes for a $128 \times 128$ array (ranging from $1 \times 508$ to $508 \times 1$). Software flattening, as exemplified by T2S-GPU, maps the full N-dimensional schedule onto linear pipelines that exploit GPU SIMD registers (Rong et al., 2020).

2. Microarchitectural Mechanisms for Array Flattening

Reconfigurable Datapath Topologies

ReDas implements flattened systolic forms by partitioning the physical array into four contiguous sub-arrays that are joined at runtime, via short periphery links, into a single logical "ring." Each PE requires only bidirectional short-range connections to its direct neighbors, which avoids long multi-hop wires and contains hardware overhead. The logical shape constraint is defined as:

$\begin{cases} 0 < R_l \le R_p/2 \\ C_l = 4\,(C_p - R_l) \end{cases} \quad \text{or} \quad \begin{cases} 0 < C_l \le C_p/2 \\ R_l = 4\,(R_p - C_l) \end{cases} \quad \text{or} \quad R_l=R_p,\;C_l=C_p$

where $(R_l, C_l)$ is the logical array dimension and $(R_p, C_p)$ is the physical array dimension. For a $128\times128$ array, this formulation yields every flattened and non-square configuration, plus the native square shape (Han et al., 2023).
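
The constraint can be enumerated directly to confirm the shape count quoted earlier. This is a small sketch that simply transcribes the three branches of the constraint; the function name `logical_shapes` is illustrative and not taken from ReDas.

```python
def logical_shapes(R_p, C_p):
    """Enumerate (R_l, C_l) pairs allowed by the three-branch constraint."""
    shapes = {(R_p, C_p)}                    # native, unreshaped array
    for R_l in range(1, R_p // 2 + 1):       # wide shapes: 0 < R_l <= R_p/2
        shapes.add((R_l, 4 * (C_p - R_l)))
    for C_l in range(1, C_p // 2 + 1):       # tall shapes: 0 < C_l <= C_p/2
        shapes.add((4 * (R_p - C_l), C_l))
    return sorted(shapes)

shapes = logical_shapes(128, 128)
print(len(shapes))             # 129
print(shapes[0], shapes[-1])   # (1, 508) (508, 1)
```

For a $128\times128$ array the two reshaped branches contribute 64 shapes each and the native shape adds one more, giving 129 in total.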

PE Microarchitecture

Each PE in ReDas presents four bidirectional ports (north/south/east/west). Input arbitration and operand ordering within the PE are managed through fine-grained crossbars and output multiplexers:

Crossbar $\alpha$: classifies data as operational (to MAC) or bypass (to turn or go straight).
Crossbar $\beta$: sequences the three possible MAC operands into the multiplier–adder.
Output MUX $\gamma$: selects from up to four output streams.

Indicative pseudo-code for one PE cycle is:

inputs ← listen(north, south, east, west)
{opA, opB, pass1, pass2} ← crossbar_α(inputs)          # split MAC operands from bypass traffic
mac_out ← MAC(crossbar_β(opA, opB, stationary_reg))     # order the three MAC operands
{pass1', pass2'} ← corner_unit(pass1, pass2, shape_info)
outputs ← mux_γ({mac_out, pass1', pass2', idle})
drive(north, south, east, west) ← outputs
stationary_reg ← update_if_needed(mac_out)              # OS mode accumulates in place; WS/IS keep the preloaded operand

This structure maintains the correct dataflow for output-stationary (OS), weight-stationary (WS), or input-stationary (IS) mappings, regardless of the logical shape (Han et al., 2023).

3. Logical Shape Flexibility and Dataflow Preservation

Flattened arrays support a family of logical shapes with extreme aspect ratios, including:

$R_l=1\implies C_l=508$ (a $1\times508$ chain: fully "flattened" row)
$C_l=1\implies R_l=508$ (a $508\times1$ chain: fully "flattened" column)

For each logical shape, the mapping of a workload (e.g., a GEMM of $M\times K$ by $K\times N$) onto the array is governed by the bank allocation of on-chip memory (multi-mode buffers) and by tile factors $(M_t, K_t, N_t)$ chosen to match $(R_l, C_l)$ up to double-buffering constraints.

The dataflow is preserved with minor timing adjustments. In OS mode, the execution time per tile is

$T_{\mathrm{exe}} = R_l + (R_l + C_l + M_t - 1) + 4 \times \min(R_l, C_l)$

where the $4\times \min(R_l,C_l)$ term represents the bypass cycles incurred at the turns between sub-arrays in the ringed configuration (Han et al., 2023).
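
To make the arithmetic concrete, the expression can be evaluated for a few logical shapes. The sketch below transcribes the formula as stated; the function name `os_tile_cycles` and the tile factor $M_t = 64$ are hypothetical values chosen only for illustration.

```python
def os_tile_cycles(R_l, C_l, M_t):
    """Per-tile OS execution time, transcribed from the expression above."""
    bypass = 4 * min(R_l, C_l)   # cycles spent at the turns between sub-arrays
    return R_l + (R_l + C_l + M_t - 1) + bypass

for shape in [(1, 508), (64, 256), (508, 1)]:
    print(shape, os_tile_cycles(*shape, M_t=64))
# (1, 508) 577
# (64, 256) 703
# (508, 1) 1084
```

The bypass term stays small for fully flattened shapes, where $\min(R_l, C_l) = 1$, and grows as the logical shape approaches a balanced aspect ratio.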

4. Software Flattened Arrays and Space–Time Mapping

In a software context, a flattened (1-D) systolic array arises by mapping the full computational dependence graph onto a single spatial dimension (PE index $s$) and a scheduled time $t$ via a space–time transformation matrix $M \in \mathbb{Z}^{2\times n}$:

$[s,\, t]^{\mathsf{T}} = M\,z$

with iteration vector $z \in \mathbb{Z}^n$. This is mathematically equivalent to projecting the N-dimensional uniform recurrence equation (URE) onto a 1-D pipeline, as shown for convolutions:

For 1-D convolution, e.g., $Z(c) = \sum_{q=0}^{Q-1} x(c+q)\cdot w(q)$, flattening is achieved via $[s, t] = [c+q,\, q]$; the backbone FMA and shift operations correspond to high-throughput SIMD vector instructions (Rong et al., 2020).
For 2-D convolution, a suitable transformation flattens the $4$-D URE over $(c, r, q, p)$ into $(s, t)$: $s = c + q$, $t = r + P\cdot p$.

On practical GPU platforms, each thread manages a register vector of length $S$ (the logical array dimension, constrained by the SIMD width and available registers), wherein temporal forwarding and register shuffling realize the logical nearest-neighbor communications of the flattened systolic pipeline (Rong et al., 2020).
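
The 1-D convolution mapping $[s, t] = [c+q,\, q]$ can be simulated in plain Python to show how the FMA and shift steps interact. This is a sketch of the schedule only, not T2S-GPU code: a list stands in for the per-thread register vector, a list shift stands in for the register shuffle, and the function names are illustrative.

```python
def conv1d_reference(x, w):
    C = len(x) - len(w) + 1
    return [sum(x[c + q] * w[q] for q in range(len(w))) for c in range(C)]

def conv1d_flattened(x, w):
    """Run Z(c) = sum_q x(c+q) * w(q) on a 1-D array with s = c + q, t = q."""
    Q = len(w)
    C = len(x) - Q + 1
    S = len(x)               # one lane per input sample; lane s keeps x[s]
    acc = [0] * S            # acc[s]: partial sum currently held in lane s
    for t in range(Q):       # time step t broadcasts w[t] to all lanes
        if t > 0:
            acc = [0] + acc[:-1]        # shift partial sums to the next lane
        for s in range(S):              # lanes act in lockstep (SIMD-like)
            c = s - t                   # output index whose sum sits in lane s
            if 0 <= c < C:
                acc[s] += x[s] * w[t]   # multiply-accumulate step
    # After the last step, output c has travelled to lane c + Q - 1.
    return [acc[c + Q - 1] for c in range(C)]

x, w = [1, 2, 3, 4, 5], [1, 0, -1]
print(conv1d_flattened(x, w))   # [-2, -2, -2]
```

Each input sample stays in its lane while partial sums move one lane per time step, so the only communication is between nearest neighbors.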

5. Flattened Arrays in Specialized Workloads

Flattening is particularly advantageous for highly asymmetric workloads, such as matrix-vector multiplies (where $N=1$) or LSTM layers. In ReDas, if a GEMM has $M\gg1$ and $N=1$, mapping it to a $508\times1$ ring (a fully flattened column) keeps nearly all 508 logical rows active, whereas a conventional $128\times128$ shape uses only 1 of its 128 columns. Empirically, DeepSpeech2 LSTM layers achieved an $8.2\times$ speedup in such a configuration. Across the evaluated DNNs, including computer vision models (e.g., TinyYOLO-v2), the mapper prefers extremely flat or modestly stretched aspect ratios for nearly every layer; statically square shapes rarely produce optimal utilization (Han et al., 2023).
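
A first-order occupancy count illustrates the effect. The sketch below assumes an output-stationary mapping in which $M$ is laid along logical rows and the vector dimension (taken here as $N = 1$) along logical columns; the helper `active_pes` and the layer size are hypothetical, and the count ignores timing, buffering, and bypass overheads, so it is not a prediction of the measured speedups.

```python
def active_pes(M, N, R_l, C_l):
    """PEs doing useful work on one output tile (M -> rows, N -> columns)."""
    return min(M, R_l) * min(N, C_l)

M, N = 2048, 1                          # hypothetical matrix-vector layer
square = active_pes(M, N, 128, 128)     # one busy column of the square array
flat = active_pes(M, N, 508, 1)         # the whole flattened column is busy
print(square, flat, flat / square)      # 128 508 3.96875
```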

In the context of attention mechanisms, FSA extends the concept of flattening not by changing physical shape, but by fusing the entire sequence of FlashAttention operations into one deeply pipelined 2D array, eliminating all dataflow detours to vector units and minimizing register/SRAM port contention (Lin et al., 15 Jul 2025).

6. Performance Implications and Quantitative Outcomes

Flattened systolic arrays deliver significant performance and energy efficiency improvements. For hardware flattening, ReDas achieves on average a $4.6\times$ speedup and an $8.3\times$ EDP reduction on real DNN workloads; its two-level PE extension (crossbars and multiplexers) and multi-mode SRAM buffers add minimal area and impose no multi-ported memory requirement (Han et al., 2023). In software, flattened systolic arrays (T2S-GPU) reach up to $159\%$ of the performance of specialized hardware for 1-D convolution and near $100\%$ utilization of SIMD lanes (Rong et al., 2020).

FSA, with modest ($\approx10\%$) area overhead versus a standard $128\times128$ array, reports $4.83\times$ higher attention throughput than TPUv5e and $1.77\times$ higher than AWS NeuronCore-v2, with a mean relative error in fused operations below $10^{-2}$ (Lin et al., 15 Jul 2025). This is achieved by executing all stages of FlashAttention in a unified per-PE pipeline.

A rule of thumb, as observed in ReDas, is that if $\min(M,N)\lesssim R_p/4$ (that is, roughly below $32$ for a $128$-wide array), full flattening should be considered to maximize PE occupancy. For layers matching this criterion, mapper-driven selection of the logical shape nearly always yields $2\times$ to $8\times$ speedups and up to $10\times$ EDP improvement.

7. Practical Mapper Algorithms and Resource Allocation

The optimal flattening and array shape are selected by an analytical, layer-by-layer mapper, which evaluates all feasible logical shapes, dataflows, and memory bank allocations, and scores candidates via cost models of runtime and energy:

for each DNN layer (GEMM of M,K,N):
  for dataflow in {OS, WS, IS}:
    for logical shape (R_l,C_l) satisfying the shape constraint of Section 2:
      derive bank allocation D_sta, D_non
      derive valid tile sizes M_t,K_t,N_t
      prune low-utilization tiles
  prune to ≲2000 candidates
  for each candidate:
    estimate runtime, energy
  pick configuration minimizing EDP or latency

(Han et al., 2023)

No exhaustive enumeration ( $\sim10^{10}$ possibilities) is required; interval sampling in tile-size and buffer allocation space keeps candidate sets tractable.
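
The overall search loop can be sketched in a few lines. This is a heavily simplified illustration, not the ReDas mapper: it covers only the OS dataflow, skips bank allocation and tile-size sampling, and scores candidates with a hypothetical first-order cost (`toy_cycles`: tile count times accumulation steps plus fill/drain skew) rather than the paper's runtime and energy models. The layer size and the `min_util` threshold are likewise illustrative assumptions.

```python
import math

def logical_shapes(R_p, C_p):
    """Shapes allowed by the Section 2 constraint."""
    shapes = {(R_p, C_p)}
    shapes |= {(r, 4 * (C_p - r)) for r in range(1, R_p // 2 + 1)}
    shapes |= {(4 * (R_p - c), c) for c in range(1, C_p // 2 + 1)}
    return sorted(shapes)

def toy_cycles(M, K, N, R_l, C_l):
    """Hypothetical OS cost: tiles x (K accumulation steps + fill/drain skew)."""
    tiles = math.ceil(M / R_l) * math.ceil(N / C_l)
    return tiles * (K + R_l + C_l)

def pick_shape(M, K, N, R_p=128, C_p=128, min_util=0.5):
    """Return (cost, shape) minimizing the toy cost among unpruned shapes."""
    best = None
    for R_l, C_l in logical_shapes(R_p, C_p):
        util = (min(M, R_l) * min(N, C_l)) / (R_l * C_l)
        if util < min_util:              # prune low-utilization candidates
            continue
        cost = toy_cycles(M, K, N, R_l, C_l)
        if best is None or cost < best[0]:
            best = (cost, (R_l, C_l))
    return best

# Hypothetical matrix-vector layer; M is a multiple of 508 to keep tiling exact.
print(pick_shape(2032, 512, 1))                # (4084, (508, 1))
print(toy_cycles(2032, 512, 1, 128, 128))      # 12288 for the square shape
```

Under these assumptions the fully flattened column wins by roughly a factor of three over the square shape; a real mapper would also sweep dataflows, tile sizes, and buffer allocations before ranking candidates.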

References

"ReDas: A Lightweight Architecture for Supporting Fine-Grained Reshaping and Multiple Dataflows on Systolic Array" (Han et al., 2023)
"Systolic Computing on GPUs for Productive Performance" (Rong et al., 2020)
"SystolicAttention: Fusing FlashAttention within a Single Systolic Array" (Lin et al., 15 Jul 2025)
