Flattened Systolic Array Architecture
- Flattened systolic array architecture is a configurable two-dimensional grid of processing elements designed with non-square aspect ratios to match diverse DNN workloads.
- It supports configurable dataflow mappings (output, weight, and input stationary) and optimized memory partitioning to enhance computational throughput and energy efficiency.
- Empirical studies using cycle-accurate simulators reveal that these arrays offer adaptable performance trade-offs, balancing resource use, bandwidth, and runtime for varied deep learning layers.
A flattened systolic array architecture is characterized by a two-dimensional array of processing elements (PEs) designed with an aspect ratio and physical connectivity that depart from the classic square, hierarchical grid, enabling more flexible mapping to diverse deep neural network (DNN) workloads. This architecture emphasizes the explicit tuning of array aspect ratios, reconfigurable partitioning, and dataflow mapping strategies to optimize computational resource utilization, bandwidth, and energy efficiency—all while maintaining or improving throughput across varied application domains. Flattened systolic arrays have become central to the evolution of domain-specific accelerators for DNNs, where the array geometry, PE mapping, buffer sizing, and microarchitectural decisions are chosen holistically to balance trade-offs among hardware overhead, runtime efficiency, memory bandwidth, and workload-dependent PE utilization.
1. Array Geometry, Aspect Ratio, and Flattened Design Space
The defining feature of a flattened systolic array is its break with the classical square (N×N) PE grid. Arrays are instantiated with a non-square (flattened) geometry, where the width and height can be tuned separately:
- The aspect ratio (width:height) is set according to workload hyperparameters, dataflow, and memory/hardware constraints;
- The design space includes, for example, 8×2048, 128×32, and other “wide” or “tall” arrays—either at fixed PE count or under area/power limitations.
This flexible array shape enables designers to target DNN layers or domains whose operand matrices are inherently very rectangular. Evidence from design-space exploration with cycle-accurate simulators such as SCALE-Sim shows that this geometric flexibility is a major determinant of performance and energy, particularly when combined with configurable dataflow mappings. For example, a non-square (flattened) array tailored for output-stationary mapping can outperform a square array in both utilization and bandwidth, depending on workload dimensionality.
Aspect ratio selection is not independent of mapping and memory sizing: scaling-up (increasing the size of a single, possibly flattened, array) and scaling-out (deploying multiple smaller arrays in parallel) yield different trade-offs in bandwidth, synchronization, and utilization (Samajdar et al., 2018). Flattened architectures allow for rigorous exploration of these options.
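As a concrete illustration of this design space, the sketch below enumerates candidate (height, width) pairs at a fixed PE budget. The `candidate_shapes` helper and the power-of-two restriction on heights are illustrative assumptions, not part of SCALE-Sim or any other tool.

```python
# Enumerate candidate (height, width) array shapes at a fixed PE budget.
# Hypothetical design-space helper; not part of any simulator's API.

def candidate_shapes(total_pes: int, min_dim: int = 4):
    """Yield (height, width, aspect ratio) tuples whose height*width equals total_pes."""
    shapes = []
    h = min_dim
    while h * min_dim <= total_pes:      # keeps width >= min_dim as well
        if total_pes % h == 0:
            w = total_pes // h
            shapes.append((h, w, w / h))  # aspect ratio is width:height
        h *= 2                            # power-of-two heights keep mapping simple
    return shapes

if __name__ == "__main__":
    for h, w, ar in candidate_shapes(16384):   # 16K PEs, i.e. a 128x128 equivalent budget
        print(f"{h:5d} x {w:5d}  aspect ratio (w:h) = {ar:g}")
```

Such an enumeration is the natural front end to the scale-up versus scale-out comparison above: each candidate shape can be simulated as a single large array or decomposed into several smaller tiles.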
2. Dataflow Mappings and PE Utilization
Flattened systolic arrays support multiple canonical dataflow strategies, each with unique interactions with array shape and workload patterns:
- Output Stationary (OS): Each PE accumulates a single output pixel; maximizes compute density and reuse but may increase SRAM banking requirements and remapping frequency.
- Weight Stationary (WS): Weights are statically mapped to PEs; reduces buffer complexity but may suffer increased remap costs.
- Input Stationary (IS): Inputs are statically mapped to PEs; convolution reductions align with columns and may suit certain flat array aspect ratios better.
The specific interaction between array aspect ratio and mapping is non-linear. For example, an OS mapping on a highly elongated array might produce superior runtime for layers with high output width, while IS or WS may show best energy/area trade-offs for layers dominated by input or filter bandwidth (Samajdar et al., 2018).
Case studies using hardware simulators indicate that, even at fixed PE count, reshaping the array (e.g., from 32×128 to 128×32) can markedly change PE utilization and runtime. Optimal mapping must therefore be co-selected with the array’s “flatness” for each workload.
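To make the aspect-ratio effect concrete, the sketch below uses a deliberately simplified first-order cost model for an output-stationary GEMM mapping (skewed pipeline fill, K accumulation steps, and drain, repeated per output tile). It is not SCALE-Sim's internal cost model, and the layer dimensions are illustrative.

```python
# First-order runtime/utilization model for an output-stationary mapping of a
# GEMM (M x K) x (K x N) onto an R x C PE array. Simplified, illustrative model:
# not the exact cost model used by SCALE-Sim.
import math

def os_runtime_cycles(M: int, N: int, K: int, R: int, C: int) -> int:
    folds = math.ceil(M / R) * math.ceil(N / C)   # output tiles mapped in sequence
    cycles_per_fold = K + R + C - 2               # K accumulation steps plus skewed fill/drain
    return folds * cycles_per_fold

def pe_utilization(M: int, N: int, K: int, R: int, C: int) -> float:
    macs = M * N * K
    return macs / (os_runtime_cycles(M, N, K, R, C) * R * C)

# Same PE count, different aspect ratios, on a layer with a wide output dimension:
for R, C in [(128, 32), (64, 64), (32, 128)]:
    cyc = os_runtime_cycles(M=64, N=4096, K=576, R=R, C=C)
    print(f"{R}x{C}: {cyc} cycles, utilization = {pe_utilization(64, 4096, 576, R, C):.2f}")
```

Even this toy model reproduces the qualitative observation above: at a constant 4096 PEs, the shape that best matches the output dimensions yields markedly higher utilization.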
3. Memory and System-Level Integration
Memory organization and system integration are fundamental to realizing the benefits of flattened arrays:
- On-chip scratchpad sizing for IFMAP, filter, and OFMAP partitions must be dimensioned to match the chosen array aspect ratio and anticipated PE data reuse. Increasing scratchpad size improves local reuse but yields diminishing returns beyond architectural “knees” specific to the workload (Samajdar et al., 2018).
- Memory controllers and system interconnects are modeled as part of the system-level performance path; the accelerator is often a co-processor with its own DRAM access patterns. Flattened arrays may move more (or less) data per unit time than square arrays, changing DRAM bandwidth and energy requirements non-uniformly.
SCALE-Sim and similar frameworks generate cycle-accurate traffic traces for both SRAM and DRAM, allowing explicit measurement of reconfiguration-induced bandwidth savings (or costs). The “bursty” nature of output transfers is directly affected by both array geometry and mapping, which can shift peak bandwidth requirements and stall behavior.
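A back-of-the-envelope sketch of how scratchpad capacity and average DRAM bandwidth scale with array shape under double buffering is shown below. The sizing formulas, the 8-bit operand width, and the assumption that each operand tile is moved exactly once per fold are illustrative simplifications, not SCALE-Sim's traffic model.

```python
# Rough scratchpad-sizing and DRAM-bandwidth estimate for a double-buffered
# R x C array processing one GEMM fold with reduction length K.

def sizing(R, C, K, bytes_per_word=1, freq_ghz=1.0, fold_cycles=100_000):
    # Double-buffered scratchpads: one tile resident, one tile in flight.
    ifmap_kb  = 2 * R * K * bytes_per_word / 1024
    filter_kb = 2 * K * C * bytes_per_word / 1024
    ofmap_kb  = 2 * R * C * bytes_per_word / 1024
    # DRAM traffic per fold, assuming each operand tile is moved exactly once.
    dram_bytes = (R * K + K * C + R * C) * bytes_per_word
    # Average bandwidth needed to hide that traffic behind fold_cycles of compute.
    seconds = fold_cycles / (freq_ghz * 1e9)
    bw_gbs = dram_bytes / seconds / 1e9
    return ifmap_kb, filter_kb, ofmap_kb, bw_gbs

for shape in [(8, 2048), (128, 128), (2048, 8)]:
    i, f, o, bw = sizing(*shape, K=512)
    print(f"{shape}: ifmap {i:.0f} KB, filter {f:.0f} KB, ofmap {o:.0f} KB, ~{bw:.2f} GB/s")
```

The point of the sketch is the asymmetry: at the same PE count, a wide array inflates the filter partition while a tall array inflates the IFMAP partition, so scratchpad budgets cannot be sized independently of aspect ratio.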
4. Design Trade-Offs and Tuning
A key insight from empirical studies is that first- and second-order trade-offs in flattened systolic arrays are tightly coupled and seldom monotonic:
- Dataflow mapping choice vs. resource complexity (OS may require more complex SRAM banking; IS/WS reduce bank count but increase other remapping costs).
- Memory provision vs. area/power: Performance “knees” suggest over-provisioning memory for flattened arrays is wasteful if utilization stalls are not resolved.
- Compute scaling (shape and size): Scaling-up (increasing a flattened array’s size) and scaling-out (multiple small arrays) can optimize different axes (runtime, energy, bandwidth) according to the mapped application.
No single mapping or array geometry is universally best; the optimal configuration is highly workload-dependent. For example, the most favorable “flat” array for ResNet-50 may be quite different from that for MobileNet, depending on layer shapes and bandwidth bottlenecks (Samajdar et al., 2018).
Table: Example Trade-Offs in Flattened Systolic Arrays
| Array Shape | Best Dataflow | Runtime |
|---|---|---|
| Flat, wide (8×2048) | OS | Low (for wide outputs) |
| Flat, tall (2048×8) | IS or WS | Low (for tall inputs) |
| Square (128×128) | OS/WS/IS | Balanced |
Each entry’s optimality is contingent on the activation and filter dimensions of the layer being mapped and on system bandwidth characteristics.
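The workload dependence can be made concrete with a toy sweep that reuses the first-order output-stationary model sketched in Section 2 to pick the best shape per layer. The GEMM dimensions and candidate shapes below are illustrative, not measured results from any network.

```python
# Toy per-layer shape selection using the first-order OS model sketched above.
# Layer GEMM dimensions are illustrative, not taken from a real network.
import math

def os_cycles(M, N, K, R, C):
    folds = math.ceil(M / R) * math.ceil(N / C)
    return folds * (K + R + C - 2)

layers = {
    "early conv (wide OFMAP)":  dict(M=64,   N=12544, K=147),
    "late conv (many filters)": dict(M=512,  N=196,   K=4608),
    "fully connected":          dict(M=1000, N=1,     K=2048),
}
shapes = [(8, 2048), (32, 512), (128, 128), (512, 32), (2048, 8)]

for name, dims in layers.items():
    best = min(shapes, key=lambda s: os_cycles(**dims, R=s[0], C=s[1]))
    print(f"{name:28s} -> best shape {best[0]}x{best[1]}")
```

Under this model the wide-output layer favors a wide (and in fact intermediate) shape, the filter-heavy layer favors a tall one, and the fully connected layer favors the tallest candidate, mirroring the qualitative pattern in the table.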
5. Empirical Case Study: Tools and Metrics
Hardware simulators such as SCALE-Sim provide exhaustive exploration of the design space. Key features relevant for flattened architectures include:
- Customizable array width and height (non-square support).
- Configurable dataflow (OS, WS, IS) per simulation run.
- Independent scratchpad partitioning and double buffering for latency hiding.
- Cycle-accurate traces of all memory accesses, with output metrics including average cycles per output, SRAM/DRAM bandwidth, and energy estimates (Samajdar et al., 2018).
Simulators enable empirical “what-if” analysis so that a designer can—for a given DNN layer—evaluate how flattened array shape, mapping, and memory partitioning affect system-level performance and energy constraints. These tools provide evidence that intermediate array shapes, rather than solely square or highly elongated ones, may be most advantageous in practice.
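A schematic way to describe one design point in such a sweep is shown below. The dictionary keys mirror the knobs listed above (array shape, dataflow, scratchpad partitioning, double buffering), but they are illustrative names rather than the exact fields of SCALE-Sim's configuration format.

```python
# Schematic description of one design point for a flattened-array experiment.
# Key names are illustrative; consult the simulator's documentation for the
# actual configuration-file fields.
design_point = {
    "array_height": 32,        # rows of PEs
    "array_width": 128,        # columns of PEs (flattened: 1:4 aspect ratio)
    "dataflow": "os",          # one of "os", "ws", "is"
    "ifmap_sram_kb": 64,       # input feature map scratchpad partition
    "filter_sram_kb": 64,      # filter scratchpad partition
    "ofmap_sram_kb": 64,       # output feature map scratchpad partition
    "double_buffering": True,  # hide DRAM latency behind compute
}

# A sweep is then just a list of such dictionaries, one per candidate shape
# at a fixed budget of 4096 PEs:
sweep = [dict(design_point, array_height=h, array_width=4096 // h)
         for h in (8, 32, 128, 512)]
```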
Relevant formulas encapsulate the underlying computation. For a matrix multiplication $O = AB$ with $A \in \mathbb{R}^{M\times K}$ and $B \in \mathbb{R}^{K\times N}$, each output element is

$$O_{m,n} = \sum_{k=1}^{K} A_{m,k}\, B_{k,n},$$

and for a convolution layer with stride $s$, each OFMAP element is

$$\mathrm{OFMAP}_{c,x,y} = \sum_{k}\sum_{i}\sum_{j} \mathrm{IFMAP}_{k,\, x s + i,\, y s + j} \cdot W_{c,k,i,j}.$$

These equations translate the mapped operation into PE-level multiply-accumulate tasks, with the specific bindings (which axis is flattened across the array and which operand is held stationary) determined by the dataflow mapping and the array aspect ratio.
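To connect the convolution form to the GEMM mapping, a convolution layer can be lowered to matrix-multiplication dimensions (im2col-style). The sketch below assumes unit dilation and no padding and is purely illustrative.

```python
# Lower a convolution layer to GEMM dimensions (im2col-style), so that the
# matrix-multiplication formula above applies directly to the array mapping.
# Assumes unit dilation and no padding.

def conv_to_gemm(num_filters, channels, filt_h, filt_w, ifmap_h, ifmap_w, stride=1):
    out_h = (ifmap_h - filt_h) // stride + 1
    out_w = (ifmap_w - filt_w) // stride + 1
    M = num_filters                   # one GEMM row per filter
    K = channels * filt_h * filt_w    # reduction length per output pixel
    N = out_h * out_w                 # one GEMM column per OFMAP pixel
    return M, K, N

# Example: a 3x3, 64-channel, 128-filter layer on a 56x56 IFMAP
M, K, N = conv_to_gemm(128, 64, 3, 3, 56, 56)
print(f"GEMM dims: M={M}, K={K}, N={N}")   # M=128, K=576, N=2916
```

The resulting (M, K, N) triple is exactly what determines whether a wide, tall, or near-square array keeps the PEs busy, which is why aspect ratio, mapping, and layer shape must be considered together.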
6. Implications for Future Accelerators
The findings indicate that flattened systolic arrays, with co-optimized dataflow, memory, and aspect ratio, are essential for achieving efficient, workload-adaptive deep learning acceleration:
- Emerging DNN workloads with highly variable hyperparameters or non-uniform layer shapes demand the flexibility available only in flattened array architectures (Samajdar et al., 2018).
- Optimal performance and energy efficiency cannot be captured by a “one size fits all” approach; instead, the hardware needs to be tunable (through either architectural overlays or runtime reconfiguration) to maximize resource utilization per layer.
- Holistic design and simulation approaches are necessary to precisely tune system integration parameters (e.g., DRAM bandwidth) in the context of flattened array geometries.
A plausible implication is that future array-based accelerators will exploit runtime-adaptive or compiler-synthesized array flattening, leveraging domain knowledge or simulation feedback to select configuration per DNN layer. This aligns with trends in system co-design and hardware-aware deep learning compilation workflows.
7. Summary
Flattened systolic array architectures break from the rigidity of classic, square arrays by enabling proactive, workload-aware tuning of aspect ratio, PE mapping, memory allocation, and system integration. Efficient mapping strategies (OS/WS/IS) combined with flexible array geometry facilitate improved data reuse, reduced memory bandwidth demands, and enhanced PE utilization. Empirical analyses—supported by detailed simulation methodologies—demonstrate that such architectures are critical to achieving scalable, energy-efficient, and high-throughput DNN inference and training. The optimization of flattened arrays must be approached as a multidimensional co-design problem, with close coupling between array shape, mapping, memory, and system parameters, tailored to the evolving diversity of DNN workloads.