Flattened Systolic Array Architecture
- Flattened systolic arrays are reconfigurable designs that transform 2D grids into 1D or elongated pipelines while preserving native dataflow.
- They employ hardware microarchitectural enhancements and software space–time mapping to optimize performance for asymmetric workloads such as GEMM and LSTMs.
- Empirical evaluations demonstrate notable speedups and energy efficiency improvements, achieving up to 4.6× performance gains and 8.3× EDP reduction.
A flattened systolic array is a programmable hardware or software architecture in which an originally two-dimensional (or N-dimensional) systolic array is dynamically reconfigured into one-dimensional or highly elongated forms. The reconfiguration may rely on logical reshaping, space–time transformation, or microarchitectural enhancements, and it preserves the native systolic dataflow while maximizing hardware utilization for workloads with strong dimensional imbalance. The term covers hardware-based schemes such as ReDas, which reshapes the array through reconfigurable interconnect, and FSA, which keeps the physical shape but fuses an entire operator pipeline into the array, as well as software-virtualized models that map linear space–time projections onto generic parallel processors.
1. Definition and Formalism
A flattened systolic array is an instantiation of systolic computing in which the arrangement of processing elements (PEs) is dynamically reconfigured from a canonical grid shape into one with an extreme aspect ratio or, in software, into a single-lane or vectorized schedule via a linear mapping of the computational dependence graph. The configuration is typically governed by a projection and a schedule, which together map each executed iteration onto a PE index and a time step. The correctness of such flattenings relies on space–time transformation matrices and recurrence equations that preserve causality and avoid PE-time conflicts (Rong et al., 2020).
For hardware arrays, such as ReDas, the flattening is realized via reconfigurable data paths and local interconnects, supporting up to 129 logical shapes for a array (including all forms through ). Software flattening, as exemplified in T2S-GPU, allows the full N-dimensional schedule to be mapped onto linear pipelines exploiting GPU SIMD registers (Rong et al., 2020).
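The two correctness conditions named above (causality and absence of PE-time conflicts) can be checked mechanically on a small iteration domain. The following is a minimal sketch, not code from T2S-GPU or ReDas; the function name `is_valid_flattening`, the vector form of the projection and schedule, and the toy 1-D convolution domain are all assumptions made for illustration.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_valid_flattening(domain, projection, schedule, dependences):
    """Check a 1-D space-time mapping over a finite iteration domain.

    domain      : iterable of iteration vectors (tuples of ints)
    projection  : vector p, so that PE index = p . i   (hypothetical form)
    schedule    : vector s, so that time step = s . i  (hypothetical form)
    dependences : vectors d where iteration i + d consumes a value produced by i
    """
    # Causality: every consumer must run strictly after its producer.
    if any(dot(schedule, d) < 1 for d in dependences):
        return False
    # No PE-time conflict: two iterations may not share a (PE, time) slot.
    seen = set()
    for i in domain:
        slot = (dot(projection, i), dot(schedule, i))
        if slot in seen:
            return False
        seen.add(slot)
    return True

# Toy 1-D convolution y[i] += w[k] * x[i + k] over iterations (i, k).
domain = list(product(range(6), range(3)))
deps = [(0, 1)]  # the running sum for y[i] is passed from step k to k + 1

print(is_valid_flattening(domain, (0, 1), (1, 1), deps))  # PE = k, t = i + k -> True
print(is_valid_flattening(domain, (0, 1), (0, 1), deps))  # PE = k, t = k     -> False
```

In the first mapping the partial sum moves from PE k to PE k + 1 one time step later, which is the nearest-neighbor communication a systolic pipeline expects. The second mapping fails because all values of i collide on the same PE at the same time.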
2. Microarchitectural Mechanisms for Array Flattening
Reconfigurable Datapath Topologies
ReDas implements flattened systolic forms by partitioning the physical array into four contiguous sub-arrays, each reconfigured at runtime via short periphery links into a single logical "ring." Each PE only requires bidirectional short-range connections to direct neighbors—avoiding long multi-hop wires—thus containing hardware overhead. The logical shape constraint is defined as:
where is the logical array dimension. This formulation yields all flattened and non-square configurations for a array (Han et al., 2023).
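To see why short links between direct neighbors are enough to form an elongated logical array, consider a simplified serpentine folding of a logical chain onto a physical grid. This is an illustrative assumption, not the four-sub-array ring that ReDas implements; `fold_chain` and `max_hop` are hypothetical helper names.

```python
def fold_chain(length, phys_rows, phys_cols):
    """Map logical chain index j to a physical (row, col) in serpentine order."""
    if length > phys_rows * phys_cols:
        raise ValueError("chain does not fit on the physical array")
    placement = []
    for j in range(length):
        row, offset = divmod(j, phys_cols)
        # Even rows run left to right, odd rows right to left, so the chain
        # turns at the array edge instead of jumping back to column 0.
        col = offset if row % 2 == 0 else phys_cols - 1 - offset
        placement.append((row, col))
    return placement

def max_hop(placement):
    """Largest Manhattan distance between consecutive chain elements."""
    return max(abs(r1 - r0) + abs(c1 - c0)
               for (r0, c0), (r1, c1) in zip(placement, placement[1:]))

chain = fold_chain(16, 4, 4)
print(chain[3], chain[4])  # the turn between physical rows 0 and 1: (0, 3) (1, 3)
print(max_hop(chain))      # 1: every logical neighbor is a physical neighbor
```

Because the largest hop is one PE, a 1 × 16 logical chain on this 4 × 4 grid needs no long multi-hop wires; only the PEs at the turns have to redirect data.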
PE Microarchitecture
Each PE in ReDas presents four bidirectional ports (north/south/east/west). Input arbitration and operand ordering within the PE are managed through fine-grained crossbars and output multiplexers:
- Crossbar₁: classifies data as operational (to MAC) or bypass (to turn or go straight).
- Crossbar₂: sequences the three possible MAC operands into the multiplier–adder.
- Output MUX: selects from up to four output streams.
Indicative pseudo-code for the operations a PE performs in a given cycle:
```
inputs ← listen(north, south, east, west)
{opA, opB, pass1, pass2} ← crossbar₁(inputs)
mac_out ← MAC(opA, opB, stationary_reg)
{pass1', pass2'} ← corner_unit(pass1, pass2, shape_info)
outputs ← mux(named={mac_out, pass1', pass2', 'idle'})
drive(north, south, east, west) ← outputs
stationary_reg ← update_if_needed(mac_out)  # WS/IS modes only
```
This lets each PE maintain the correct dataflow for output-stationary (OS), weight-stationary (WS), or input-stationary (IS) mappings, regardless of how the array is flattened (Han et al., 2023).
3. Logical Shape Flexibility and Dataflow Preservation
Flattened arrays support a family of logical shapes with extreme aspect ratios, including:
- (a chain: fully "flattened" row)
- (a chain: fully "flattened" column)
For each logical shape, the mapping of a workload (e.g., a GEMM of by ) onto the array is governed by the bank allocation of on-chip memory (multi-mode buffers) and by tile factors chosen to match up to double-buffering constraints.
The dataflow is preserved with minor timing adjustments: in OS mode, execution time per tile
where the term represents the bypass cycles incurred at the turns between sub-arrays in the ringed configuration (Han et al., 2023).
4. Software Flattened Arrays and Space–Time Mapping
In a software context, a flattened (1-D) systolic array arises by mapping the full computational dependence graph into a single dimension (PE index and scheduled time ) via a space–time transformation matrix :
with iteration block . This is mathematically equivalent to projecting the N-dimensional uniform recurrence equation (URE) into a 1-D pipeline, as shown for convolutions:
- For 1-D convolution, e.g., , flattening is achieved via ; the backbone FMA and shift operations correspond to high-throughput SIMD vector instructions (Rong et al., 2020).
- For 2-D convolution, a suitable transformation flattens the 4-D URE over into : , .
On practical GPU platforms, each thread manages a register vector of length (the logical array dimension, constrained by the SIMD width and available registers), wherein temporal forwarding and register shuffling realize the logical nearest-neighbor communications of the flattened systolic pipeline (Rong et al., 2020).
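The FMA-plus-shift pattern described above can be mimicked in plain Python. The sketch below is not T2S-GPU code: it assumes one list element per SIMD lane, models the register shuffle as a one-position list shift, and uses the hypothetical name `conv1d_flattened`.

```python
def conv1d_flattened(x, w):
    """Compute y[i] = sum_k w[k] * x[i + k] as len(w) steps of (FMA, shift)."""
    n_out = len(x) - len(w) + 1
    lanes = list(x)          # one register per lane; lane i starts with x[i]
    acc = [0.0] * len(x)     # one accumulator per lane
    for wk in w:             # time step k of the flattened pipeline
        # Fused multiply-add on every lane at once (the SIMD "backbone").
        acc = [a + wk * v for a, v in zip(acc, lanes)]
        # Nearest-neighbor shift: lane i now holds what lane i + 1 held.
        lanes = lanes[1:] + [0.0]
    return acc[:n_out]       # lanes past n_out only saw padding

print(conv1d_flattened([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2.0, -2.0, -2.0]
```

After step k, lane i holds x[i + k], so each lane only ever reads from its immediate neighbor. On a GPU the same pattern would be expressed with vector FMA and register shuffle instructions rather than list operations.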
5. Flattened Arrays in Specialized Workloads
Flattening is particularly advantageous for highly asymmetric workloads, such as matrix-vector multiplies (where ) or LSTM layers. In ReDas, if a GEMM has , , mapping to a ring (fully flattened row) enables nearly all rows to be active, compared to only 1 out of 128 columns in a conventional shape. Empirically, DeepSpeech2 LSTM layers achieved speedup in such a configuration. Across evaluated DNNs and computer vision models (e.g., TinyYOLO-v2), optimal mappers prefer extremely flat or modestly stretched aspect ratios for nearly every layer; statically square shapes rarely produce optimal utilization (Han et al., 2023).
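The occupancy argument can be made concrete with a small utilization estimate. This is a sketch under stated assumptions: the output tile is mapped one element per PE (output-stationary style), the physical array is taken to be 128 × 128 (consistent with the 128 columns mentioned above), the flattened logical shape is taken to be 16384 × 1, and the GEMM size is hypothetical.

```python
from math import ceil

def os_utilization(m, n, rows, cols):
    """Fraction of PE slots doing useful work when an M x N output is tiled
    onto a rows x cols array, one output element per PE."""
    tiles = ceil(m / rows) * ceil(n / cols)
    return (m * n) / (tiles * rows * cols)

m, n = 16000, 1  # hypothetical matrix-vector product
print(os_utilization(m, n, 128, 128))   # 0.0078125, i.e. 1 of 128 columns busy
print(os_utilization(m, n, 16384, 1))   # 0.9765625 on the elongated shape
```

The square array keeps a single column busy, while the elongated shape keeps almost every PE busy. This estimate ignores fill and drain latency, which grows with the length of the chain.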
In the context of attention mechanisms, FSA extends the concept of flattening not by changing physical shape, but by fusing the entire sequence of FlashAttention operations into one deeply pipelined 2D array, eliminating all dataflow detours to vector units and minimizing register/SRAM port contention (Lin et al., 15 Jul 2025).
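For orientation, the sequence of operations that FlashAttention performs per query row is the streaming ("online") softmax recurrence sketched below. This is a plain-Python reference of that recurrence only; it says nothing about how FSA assigns the steps to PEs, and the function name and block size are assumptions.

```python
import math

def attention_row_streaming(q, keys, values, block=2):
    """softmax(q . K^T) . V for one query, visiting keys and values block by block."""
    running_max, running_sum = float("-inf"), 0.0
    out = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]  # matmul step
        new_max = max(running_max, max(scores))                     # row maximum
        rescale = math.exp(running_max - new_max)                   # 0.0 on first block
        probs = [math.exp(s - new_max) for s in scores]             # exponentials
        running_sum = running_sum * rescale + sum(probs)
        for d in range(len(out)):                                   # second matmul
            out[d] = out[d] * rescale + sum(p * v[d] for p, v in zip(probs, v_blk))
        running_max = new_max
    return [o / running_sum for o in out]
```

Each block needs two matrix products plus a maximum, an exponential, and a rescale. In a conventional accelerator the latter three typically run on a separate vector unit, which is the detour that a fused array avoids.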
6. Performance Implications and Quantitative Outcomes
Flattened systolic arrays deliver significant performance and energy efficiency improvements. For hardware flattening, ReDas achieves on average speedup and EDP reduction on real DNN workloads, with two-level PE extension (crossbars, multiplexers) and multi-mode SRAM buffers adding minimal area and no multi-ported requirements (Han et al., 2023). In software, flat systolic arrays (T2S-GPU) obtain up to of the performance of specialized hardware for 1-D convolution and near utilization of SIMD lanes (Rong et al., 2020).
FSA, with modest () area overhead versus a standard array, reports higher attention throughput than TPUv5e and that of AWS NeuronCore-v2, with mean relative error in fused operations (Lin et al., 15 Jul 2025). This is achieved by executing all stages of FlashAttention in a unified per-PE pipeline.
A rule of thumb, as observed in ReDas, is that if (such as for a 128-wide array), full flattening should be considered to maximize PE occupancy. Mapper-driven selection of logical shape nearly always yields – speedups and up to EDP improvement for layers matching these criteria.
7. Practical Mapper Algorithms and Resource Allocation
The optimal flattening and array shape are selected by an analytical, layer-by-layer mapper, which evaluates all feasible logical shapes, dataflows, and memory bank allocations, and scores candidates using cost models of runtime and energy:
```
for each DNN layer (GEMM of M,K,N):
  for dataflow in {OS, WS, IS}:
    for logical shape (R_l,C_l) per Eq.(1):
      derive bank-allocation D_sta, D_non
      derive valid tile sizes M_t,K_t,N_t
      prune low-utilization tiles
  prune to ≲2000 candidates
  foreach candidate:
    estimate runtime, energy
  pick configuration minimizing EDP or latency
```
No exhaustive enumeration ( possibilities) is required; interval sampling in tile-size and buffer allocation space keeps candidate sets tractable.
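The search loop above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the candidate shapes are power-of-two factorizations of the PE count rather than the shapes admitted by Eq. (1), the cost model is a toy cycle count (number of tiles times streamed dimension plus fill and drain latency), and bank allocation, tile-size pruning, and energy are omitted.

```python
from math import ceil

def candidate_shapes(total_pes):
    """Power-of-two factorizations rows x cols of the PE count (an assumption)."""
    shapes, rows = [], 1
    while rows <= total_pes:
        if total_pes % rows == 0:
            shapes.append((rows, total_pes // rows))
        rows *= 2
    return shapes

def toy_cycles(dataflow, m, k, n, rows, cols):
    """Toy cost: tile count times (streamed dimension + fill/drain latency)."""
    # Two dimensions are held on the array, the third is streamed through it.
    a, b, streamed = {"OS": (m, n, k), "WS": (k, n, m), "IS": (m, k, n)}[dataflow]
    tiles = ceil(a / rows) * ceil(b / cols)
    return tiles * (streamed + rows + cols)

def map_layer(m, k, n, total_pes=128 * 128):
    """Return (cycles, dataflow, rows, cols) with the lowest toy cycle count."""
    candidates = [
        (toy_cycles(df, m, k, n, r, c), df, r, c)
        for df in ("OS", "WS", "IS")
        for r, c in candidate_shapes(total_pes)
    ]
    return min(candidates)

print(map_layer(4096, 256, 1))                   # (4356, 'OS', 4096, 4)
print(toy_cycles("OS", 4096, 256, 1, 128, 128))  # 16384 on the square shape
```

Even under this crude model, the mapper prefers a strongly elongated shape for a matrix-vector layer, but not the fully flattened one, because fill and drain latency grows with the chain length. A production mapper would replace `toy_cycles` with calibrated runtime and energy models.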
References
- "ReDas: A Lightweight Architecture for Supporting Fine-Grained Reshaping and Multiple Dataflows on Systolic Array" (Han et al., 2023)
- "Systolic Computing on GPUs for Productive Performance" (Rong et al., 2020)
- "SystolicAttention: Fusing FlashAttention within a Single Systolic Array" (Lin et al., 15 Jul 2025)