
Flattened Systolic Array Architecture

Updated 11 March 2026
  • Flattened systolic arrays are reconfigurable designs that transform 2D grids into 1D or elongated pipelines while preserving native dataflow.
  • They employ hardware microarchitectural enhancements and software space–time mapping to optimize performance for asymmetric workloads such as GEMM and LSTMs.
  • Empirical evaluations report notable speedups and energy-efficiency gains; ReDas, for example, averages 4.6× speedup and 8.3× EDP reduction on DNN workloads.

A flattened systolic array is a programmable hardware or software architecture in which an originally two-dimensional (or N-dimensional) systolic array is dynamically reconfigured—through logical re-shaping, space–time transformation, or microarchitectural enhancements—into one-dimensional or highly elongated forms, while preserving the native systolic dataflow and maximizing hardware utilization for workloads exhibiting strong dimensional imbalance. Flattened systolic arrays include both hardware-based schemes such as ReDas and FSA, which mutate physical array topologies, and software-virtualized models using linear space–time projections mapped onto generic parallel processors.

1. Definition and Formalism

A flattened systolic array is an instantiation of systolic computing in which the arrangement of processing elements (PEs) is dynamically reconfigured from a canonical grid shape (e.g., $128 \times 128$) into extreme aspect ratios (e.g., $1 \times N$ or $N \times 1$), or in software, into a single-lane or vectorized schedule via linear mapping of the computational dependence graph. The configuration is typically governed by a projection $\pi: \mathbb{Z}^n \rightarrow \mathbb{Z}$ and a schedule $\theta: \mathbb{Z}^n \rightarrow \mathbb{Z}$, such that the executing iteration $z \in \mathbb{Z}^n$ is mapped onto PE index $s = \pi(z)$ and time $t = \theta(z)$. The correctness of such flattenings relies on space–time transformation matrices and recurrence equations that preserve causality and avoid PE-time conflicts (Rong et al., 2020).
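The two correctness conditions named above can be checked by brute force on a small iteration domain. The following is a minimal sketch, assuming linear $\pi$ and $\theta$ given as integer coefficient vectors, a rectangular iteration domain, and dependence vectors $d$ meaning that iteration $z$ consumes a value produced at $z - d$. The function name `check_mapping` and its arguments are hypothetical and do not come from the cited papers.

```python
from itertools import product

def dot(a, b):
    """Integer dot product of two equal-length tuples."""
    return sum(x * y for x, y in zip(a, b))

def check_mapping(pi, theta, extents, deps):
    """Return (conflict_free, causal) for s = pi . z and t = theta . z.

    conflict_free: no two iterations land on the same (PE, time) pair.
    causal: every dependence d is scheduled at least one step later,
            i.e. theta . d >= 1.
    """
    seen = set()
    conflict_free = True
    for z in product(*(range(e) for e in extents)):
        st = (dot(pi, z), dot(theta, z))
        if st in seen:
            conflict_free = False
            break
        seen.add(st)
    causal = all(dot(theta, d) >= 1 for d in deps)
    return conflict_free, causal

# Iterations (c, q) with an accumulation dependence along q.
print(check_mapping((1, 1), (0, 1), (8, 3), [(0, 1)]))  # s = c + q, t = q
print(check_mapping((1, 0), (1, 0), (8, 3), [(0, 1)]))  # s = c, t = c
```

The first mapping is both conflict-free and causal; the second places every $q$ for a given $c$ on the same PE at the same time, so it fails both checks.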

For hardware arrays, such as ReDas, the flattening is realized via reconfigurable data paths and local interconnects, supporting up to 129 logical shapes for a $128 \times 128$ array (including all forms from $1 \times 508$ through $508 \times 1$). Software flattening, as exemplified in T2S-GPU, allows the full N-dimensional schedule to be mapped onto linear pipelines exploiting GPU SIMD registers (Rong et al., 2020).

2. Microarchitectural Mechanisms for Array Flattening

Reconfigurable Datapath Topologies

ReDas implements flattened systolic forms by partitioning the physical array into four contiguous sub-arrays, each reconfigured at runtime via short periphery links into a single logical "ring." Each PE only requires bidirectional short-range connections to direct neighbors—avoiding long multi-hop wires—thus containing hardware overhead. The logical shape constraint is defined as:

$$\begin{cases} 0 < R_l \le R_p/2 \\ C_l = 4\,(C_p - R_l) \end{cases} \quad \text{or} \quad \begin{cases} 0 < C_l \le C_p/2 \\ R_l = 4\,(R_p - C_l) \end{cases} \quad \text{or} \quad R_l=R_p,\;C_l=C_p$$

where $(R_l, C_l)$ is the logical array dimension and $(R_p, C_p)$ is the physical array dimension. This formulation yields all flattened and non-square configurations for a $128\times128$ array (Han et al., 2023).
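As a quick check of the shape constraint above, the following sketch enumerates every logical shape it admits for a given physical array. The function name `logical_shapes` is hypothetical, and the code assumes even physical dimensions so that $R_p/2$ and $C_p/2$ are integers.

```python
def logical_shapes(r_p, c_p):
    """Enumerate (R_l, C_l) pairs allowed by the logical shape constraint."""
    shapes = set()
    # Wide family: 0 < R_l <= R_p/2, C_l = 4 * (C_p - R_l)
    for r_l in range(1, r_p // 2 + 1):
        shapes.add((r_l, 4 * (c_p - r_l)))
    # Tall family: 0 < C_l <= C_p/2, R_l = 4 * (R_p - C_l)
    for c_l in range(1, c_p // 2 + 1):
        shapes.add((4 * (r_p - c_l), c_l))
    # Native (unreshaped) array
    shapes.add((r_p, c_p))
    return sorted(shapes)

shapes = logical_shapes(128, 128)
print(len(shapes))                            # number of logical shapes
print((1, 508) in shapes, (508, 1) in shapes)  # fully flattened forms
```

For a $128\times128$ array this gives 64 wide shapes, 64 tall shapes, and the native shape, which matches the 129 logical shapes quoted earlier. Note that a shape such as $64\times256$ covers all $16{,}384$ PEs, while the fully flattened $1\times508$ shape covers 508 of them.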

PE Microarchitecture

Each PE in ReDas presents four bidirectional ports (north/south/east/west). Input arbitration and operand ordering within the PE are managed through fine-grained crossbars and output multiplexers:

  • Crossbar $\alpha$: classifies data as operational (to MAC) or bypass (to turn or go straight).
  • Crossbar $\beta$: sequences the three possible MAC operands into the multiplier–adder.
  • Output MUX $\gamma$: selects from up to four output streams.

Indicative pseudocode for the work a PE performs in a given cycle:

```
inputs ← listen(north, south, east, west)
{opA, opB, pass1, pass2} ← crossbar(inputs)
mac_out ← MAC(opA, opB, stationary_reg)
{pass1', pass2'} ← corner_unit(pass1, pass2, shape_info)
outputs ← mux(named={mac_out, pass1', pass2', idle})
drive(north, south, east, west) ← outputs
stationary_reg ← update_if_needed(mac_out)  # WS/IS modes only
```

This lets each PE maintain the correct dataflow for output-stationary (OS), weight-stationary (WS), or input-stationary (IS) mappings, regardless of flattening (Han et al., 2023).

3. Logical Shape Flexibility and Dataflow Preservation

Flattened arrays support a family of logical shapes with extreme aspect ratios, including:

  • $R_l=1\implies C_l=508$ (a $1\times508$ chain: fully "flattened" row)
  • $C_l=1\implies R_l=508$ (a $508\times1$ chain: fully "flattened" column)

For each logical shape, the mapping of a workload (e.g., a GEMM of $M\times K$ by $K\times N$) onto the array is governed by the bank allocation of on-chip memory (multi-mode buffers) and by tile factors $(M_t, K_t, N_t)$ chosen to match $(R_l, C_l)$ up to double-buffering constraints.

The dataflow is preserved with minor timing adjustments. In OS mode, the execution time per tile is

$$T_{\mathrm{exe}} = R_l + (R_l + C_l + M_t - 1) + 4 \times \min(R_l, C_l)$$

where the $4\times \min(R_l,C_l)$ term represents the bypass cycles incurred at the turns between sub-arrays in the ringed configuration (Han et al., 2023).
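To make the per-tile expression concrete, the sketch below evaluates it for two ringed shapes. The function name `os_tile_cycles` and the example tile heights $M_t$ are illustrative assumptions, not values taken from the ReDas evaluation.

```python
def os_tile_cycles(r_l, c_l, m_t):
    """Per-tile OS execution time and its bypass component, in cycles."""
    bypass = 4 * min(r_l, c_l)  # turns between sub-arrays in the ring
    total = r_l + (r_l + c_l + m_t - 1) + bypass
    return total, bypass

# Fully flattened row, 1 x 508, with an assumed tile height M_t = 16
print(os_tile_cycles(1, 508, 16))   # (529, 4)
# Stretched shape 64 x 256 with an assumed tile height M_t = 64
print(os_tile_cycles(64, 256, 64))  # (703, 256)
```

The bypass term grows with the shorter logical side, so the fully flattened shapes pay only 4 bypass cycles per tile, while a $64\times256$ shape pays 256.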

4. Software Flattened Arrays and Space–Time Mapping

In a software context, a flattened (1-D) systolic array arises by mapping the full computational dependence graph into a single dimension (PE index $s$ and scheduled time $t$) via a space–time transformation matrix $M \in \mathbb{Z}^{2\times n}$:

$$[s,\, t]^{\mathsf{T}} = M\,z$$

with iteration block $z \in \mathbb{Z}^n$. This is mathematically equivalent to projecting the N-dimensional uniform recurrence equation (URE) into a 1-D pipeline, as shown for convolutions:

  • For 1-D convolution, e.g., $Z(c) = \sum_{q=0}^{Q-1} x(c+q)\cdot w(q)$, flattening is achieved via $[s, t] = [c+q,\, q]$; the backbone FMA and shift operations correspond to high-throughput SIMD vector instructions (Rong et al., 2020).
  • For 2-D convolution, a suitable transformation flattens the $4$-D URE over $(c, r, q, p)$ into $(s, t)$: $s = c + q$, $t = r + P\cdot p$.
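The 1-D case can be simulated directly. The sketch below executes the schedule $[s, t] = [c+q,\, q]$ step by step and compares it with the defining sum. It is a plain Python model, not T2S-GPU code: the inner loop over `s` stands in for PEs that would run in parallel, and the function names are hypothetical.

```python
def conv1d_direct(x, w):
    """Reference: Z(c) = sum_q x(c + q) * w(q)."""
    Q = len(w)
    C = len(x) - Q + 1
    return [sum(x[c + q] * w[q] for q in range(Q)) for c in range(C)]

def conv1d_flattened(x, w):
    """Execute the same sum under the schedule s = c + q, t = q."""
    Q = len(w)
    C = len(x) - Q + 1
    z = [0] * C
    for t in range(Q):           # time step t = q
        for s in range(len(x)):  # PE index s = c + q (parallel in hardware)
            c = s - t            # output handled by PE s at time t
            if 0 <= c < C:
                z[c] += x[s] * w[t]
    return z

x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
print(conv1d_flattened(x, w))  # [-2, -2, -2]
print(conv1d_direct(x, w))     # [-2, -2, -2]
```

Under this schedule PE $s$ only ever reads $x(s)$, every PE uses the same weight $w(t)$ at step $t$, and the partial sum for $Z(c)$ sits at PE $c + t$, so it moves one PE per step. That neighbor-to-neighbor movement is what a shift or shuffle instruction would implement.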

On practical GPU platforms, each thread manages a register vector of length $S$ (the logical array dimension, constrained by the SIMD width and available registers), wherein temporal forwarding and register shuffling realize the logical nearest-neighbor communications of the flattened systolic pipeline (Rong et al., 2020).

5. Flattened Arrays in Specialized Workloads

Flattening is particularly advantageous for highly asymmetric workloads, such as matrix-vector multiplies (where $K=1$) or LSTM layers. In ReDas, if a GEMM has $M\gg1$, $K=1$, mapping to a $508\times1$ chain (fully flattened column) keeps nearly all rows active, whereas a conventional $128\times128$ shape uses only 1 of its 128 columns. Empirically, DeepSpeech2 LSTM layers achieved $8.2\times$ speedup in such a configuration. Across evaluated DNNs and computer vision models (e.g., TinyYOLO-v2), optimal mappers prefer extremely flat or modestly stretched aspect ratios for nearly every layer; statically square shapes rarely produce optimal utilization (Han et al., 2023).

In the context of attention mechanisms, FSA extends the concept of flattening not by changing physical shape, but by fusing the entire sequence of FlashAttention operations into one deeply pipelined 2D array, eliminating all dataflow detours to vector units and minimizing register/SRAM port contention (Lin et al., 2025).

6. Performance Implications and Quantitative Outcomes

Flattened systolic arrays deliver significant performance and energy efficiency improvements. For hardware flattening, ReDas achieves on average $4.6\times$ speedup and $8.3\times$ EDP reduction on real DNN workloads, with two-level PE extension (crossbars, multiplexers) and multi-mode SRAM buffers adding minimal area and no multi-ported requirements (Han et al., 2023). In software, flat systolic arrays (T2S-GPU) obtain up to $159\%$ of the performance of specialized hardware for 1-D convolution and near $100\%$ utilization of SIMD lanes (Rong et al., 2020).

FSA, with modest ($\approx10\%$) area overhead versus a standard $128\times128$ array, reports $4.83\times$ higher attention throughput than TPUv5e and $1.77\times$ that of AWS NeuronCore-v2, with mean relative error in fused operations $<10^{-2}$ (Lin et al., 2025). This is achieved by executing all stages of FlashAttention in a unified per-PE pipeline.

A rule of thumb, as observed in ReDas, is that if $\min(M,N)\lesssim R_p/4$ (such as $<32$ for a $128$-wide array), full flattening should be considered to maximize PE occupancy. Mapper-driven selection of logical shape nearly always yields $2\times$ to $8\times$ speedups and up to $10\times$ EDP improvement for layers matching these criteria.
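The rule of thumb can be written as a one-line predicate, paired here with a rough occupancy estimate. This is a sketch under stated assumptions: the approximate bound is treated as a strict threshold to match the "$<32$" example, and the occupancy model (one output element per PE, so at most $\min(M, R_l)\cdot\min(N, C_l)$ PEs are busy) is a simplification introduced for illustration, not the ReDas cost model. Both function names are hypothetical.

```python
def should_consider_full_flattening(m, n, r_p=128):
    """Rule of thumb: flatten when min(M, N) is below about R_p / 4."""
    return min(m, n) < r_p / 4

def active_pes_os(m, n, r_l, c_l):
    """Assumed occupancy: one output element per PE in an R_l x C_l shape."""
    return min(m, r_l) * min(n, c_l)

m, n = 4096, 1  # tall, skinny output
print(should_consider_full_flattening(m, n))  # True
print(active_pes_os(m, n, 128, 128))          # 128 PEs busy in the native shape
print(active_pes_os(m, n, 508, 1))            # 508 PEs busy when fully flattened
```

Under this simplified model the flattened column keeps roughly four times as many PEs busy as the native square shape for the same layer.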

7. Practical Mapper Algorithms and Resource Allocation

Selecting the optimal flattening and array shape is handled by an analytical, layer-by-layer mapper that evaluates all feasible logical shapes, dataflows, and memory bank allocations, and scores candidates with cost models of runtime and energy:

```
for each DNN layer (GEMM of M, K, N):
  for dataflow in {OS, WS, IS}:
    for logical shape (R_l, C_l) allowed by the shape constraint of Section 2:
      derive bank allocation D_sta, D_non
      derive valid tile sizes M_t, K_t, N_t
      prune low-utilization tiles
  prune to 2000 candidates
  for each candidate:
    estimate runtime, energy
  pick configuration minimizing EDP or latency
```
(Han et al., 2023)

No exhaustive enumeration ($\sim10^{10}$ possibilities) is required; interval sampling in tile-size and buffer allocation space keeps candidate sets tractable.
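The enumerate, prune, score, and pick structure of such a mapper can be sketched in a few lines. Everything below is a simplified stand-in: it searches logical shapes only (no dataflow, tile-size, or bank-allocation dimensions), and `utilization` and `toy_runtime` are placeholder heuristics invented for this illustration, not the ReDas cost models.

```python
from math import ceil

def logical_shapes(r_p, c_p):
    """Logical shapes allowed by the shape constraint of Section 2."""
    shapes = [(r_l, 4 * (c_p - r_l)) for r_l in range(1, r_p // 2 + 1)]
    shapes += [(4 * (r_p - c_l), c_l) for c_l in range(1, c_p // 2 + 1)]
    shapes.append((r_p, c_p))
    return shapes

def utilization(m, n, r_l, c_l):
    """Placeholder pruning score: useful outputs over PE slots issued."""
    tiles = ceil(m / r_l) * ceil(n / c_l)
    return (m * n) / (tiles * r_l * c_l)

def toy_runtime(m, k, n, r_l, c_l):
    """Placeholder cost, NOT the ReDas model: each R_l x C_l output tile
    streams K operands plus an assumed R_l + C_l fill/drain term."""
    tiles = ceil(m / r_l) * ceil(n / c_l)
    return tiles * (k + r_l + c_l)

def pick_shape(m, k, n, r_p=128, c_p=128, max_candidates=2000):
    """Keep the best-utilized candidates, then minimize the toy runtime."""
    ranked = sorted(logical_shapes(r_p, c_p),
                    key=lambda s: utilization(m, n, *s), reverse=True)
    candidates = ranked[:max_candidates]
    return min(candidates, key=lambda s: toy_runtime(m, k, n, *s))

best = pick_shape(4096, 64, 1)
print(best, toy_runtime(4096, 64, 1, *best), toy_runtime(4096, 64, 1, 128, 128))
```

For a tall, skinny layer the toy model selects a tall logical shape over the native square one; with a real cost model the specific winner would depend on the dataflow, tiling, and buffer terms omitted here.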


References

  • "ReDas: A Lightweight Architecture for Supporting Fine-Grained Reshaping and Multiple Dataflows on Systolic Array" (Han et al., 2023)
  • "Systolic Computing on GPUs for Productive Performance" (Rong et al., 2020)
  • "SystolicAttention: Fusing FlashAttention within a Single Systolic Array" (Lin et al., 2025)
