Systolic Scan Array (SSA) Overview
- Systolic Scan Array (SSA) is a specialized systolic array designed for efficient, pipelined matrix multiplication, convolution, and sequence analysis.
- It uses a regular 2D mesh of processing elements with local registers to maximize operand reuse and maintain steady-state throughput.
- Advanced design strategies, including diverse dataflow mappings, fault tolerance, and energy-efficient optimizations, drive its use in DNN acceleration and signal processing.
A Systolic Scan Array (SSA) is a specialized variant of the general systolic array concept, designed for efficient, high-throughput, pipelined computation of operations such as matrix multiplication, convolution, and sequence analysis by exploiting localized data movement and parallelism. An SSA consists of a regular (typically 2D) mesh of processing elements (PEs) that communicate through local registers, orchestrating a carefully synchronized dataflow to maximize operand reuse and pipeline utilization. SSAs have found broad utility in domains spanning deep neural network (DNN) acceleration, signal processing, and sequence matching, frequently forming the architectural basis for modern hardware DNN accelerators, bioinformatics engines, and safety-critical computing cores.
1. Architectural Principles and Canonical Structures
The SSA paradigm is defined by several key attributes: local communication, regular interconnect topologies, and a spatiotemporal dataflow matching the target algorithm’s dependency graph. Classical SSAs for matrix multiplication or convolution instantiate each PE with multiply-accumulate (MAC) logic and local registers along the array’s grid, with data (e.g., activations, weights, partial sums) entering from the periphery and propagating synchronously. Each PE updates its state based on local inputs and neighbor transfers per cycle, enabling wavefront-style computation where, after an initial fill latency, the system attains a steady-state throughput. Complex pipelines—such as dual or triple modular redundancy (DMR/TMR) for fault tolerance (Cherezova et al., 6 Mar 2025), “tensor PE” fusions for block-level dot-product acceleration (Liu et al., 2020), or hybrid queue-linked register fabrics (Mazzola et al., 20 Feb 2024)—augment baseline designs for application-specific requirements.
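To make the wavefront behavior concrete, the following minimal Python sketch simulates an output-stationary PE mesh cycle by cycle (each PE holds one result; see the dataflow taxonomy in Section 2). The skewed injection schedule, register names, and latency bookkeeping are illustrative assumptions rather than a reproduction of any cited design.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle sketch of an output-stationary systolic array:
    PE(i, j) holds C[i, j]; A streams in from the left (row i delayed
    by i cycles), B from the top (column j delayed by j cycles), and
    every PE forwards its operands to its right/down neighbor each cycle."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    h = np.zeros((M, N))  # operand registers moving rightward (A values)
    v = np.zeros((M, N))  # operand registers moving downward (B values)
    for t in range(M + N + K - 2):   # fill + steady state + drain
        h[:, 1:] = h[:, :-1].copy()  # neighbor-to-neighbor transfers
        v[1:, :] = v[:-1, :].copy()
        for i in range(M):           # skewed injection at the west edge
            k = t - i
            h[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):           # skewed injection at the north edge
            k = t - j
            v[0, j] = B[k, j] if 0 <= k < K else 0.0
        C += h * v                   # every PE performs one MAC per cycle
    return C

A, B = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Operands injected with skew meet at PE $(i, j)$ exactly when $t = i + j + k$, so after the initial fill each PE accumulates its full dot product over the run, which is the steady-state behavior described above.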
2. Dataflow Strategies and Mathematical Mapping
SSA efficiency is profoundly determined by dataflow mapping. Three canonical strategies have emerged—the weight stationary (WS), input stationary (IS), and output stationary (OS) dataflows (Raja, 29 Oct 2024, Samajdar et al., 2018):
- In the WS flow, weights are preloaded and remain static in the array, while input activations and partial sums are streamed and updated temporally. The mapping is typically:
  - Rows $\mapsto K$, columns $\mapsto N$, time $\mapsto M$ (for an $M \times K \times N$ problem)
  - Cycle count: $\approx 2S_R + S_C + T - 2$, where $S_R \times S_C$ is the occupied spatial extent and $T$ the streamed temporal extent (here $S_R = K$, $S_C = N$, $T = M$)
- IS and OS statically assign inputs or outputs, respectively, leading to alternative mappings and tradeoffs in operand movement and PE utilization.
The operational equation is always a variant of the multiply-accumulate recurrence

$$C_{m,n} = \sum_{k=1}^{K} A_{m,k}\, B_{k,n},$$

with the specific stationarity determining which operands maximize reuse and how pipeline fill/drain boundaries are handled.
Optimizing the assignment of the matrix dimensions $(M, K, N)$ to the spatial and temporal axes of the array is essential for both energy efficiency and throughput. Minimizing the number of stationary-tile reloads for fixed total computation maximizes efficiency—hence, mapping the problem's smallest dimensions to the array's spatial extent yields the lowest total energy (Raja, 29 Oct 2024).
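A toy fold-counting model makes the guideline concrete: treating each reload of the stationary tile as one unit of off-array traffic, the mapping that places the two smallest dimensions on the spatial axes minimizes reloads. The cost proxy and all names below are illustrative assumptions, not the energy model of the cited work.

```python
import math

def stationary_reloads(dims, array, spatial_pair):
    """Count stationary-tile reloads ("folds") when the two dimensions in
    spatial_pair are laid out on a rows x cols array and the remaining
    dimension is streamed temporally."""
    rows, cols = array
    d1, d2 = spatial_pair
    return math.ceil(dims[d1] / rows) * math.ceil(dims[d2] / cols)

dims = {"M": 1024, "K": 64, "N": 64}   # GEMM: C[M,N] = A[M,K] @ B[K,N]
array = (32, 32)                       # 32 x 32 PE array
for pair in [("K", "N"), ("M", "K"), ("M", "N")]:  # WS-, IS-, OS-style choices
    print(pair, "->", stationary_reloads(dims, array, pair), "tile loads")
# ('K', 'N') -> 4 loads; the other two mappings need 64: spatially mapping
# the two smallest dimensions minimizes reloads of the stationary operand.
```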
3. Design Space Exploration and Performance Optimization
Designing an optimal SSA involves multi-dimensional tradeoffs among array granularity, tiling, interconnect topology, and workload mapping.
- Granularity and Tiling: Subdividing large computational problems (e.g., matrix multiplications) to fit the array, balancing utilization and memory bandwidth. Non-divisor tiling factors are now recognized as critical for optimal resource efficiency; restricting to divisor-only tiles yields up to 39% performance loss (Wang et al., 2021), as the pass-count sketch below illustrates.
- Interconnects: For multi-pod (“scale-out”) SSA topologies (Yüzügüler et al., 2022), butterfly networks provide scalable bisection bandwidth and low-latency routes for hundreds of pods, outperforming mesh and crossbar topologies at scale.
- Offline Scheduling and Adaptive Sizing: Fixed-length tiling, aligned to optimal array dimensions (e.g., rectangular subarrays), maximizes pod utilization across CNN and Transformer workloads. Further, dataflow can be dynamically switched in hardware to tailor energy consumption to each network phase (Yüzügüler et al., 2022, Samajdar et al., 2018).
These tradeoffs are ideally explored through automated design frameworks (e.g., Odyssey (Wang et al., 2021)), which employ hybrid search, accurate latency dataflow modeling, and evolutionary mutation outside the limitations of earlier “prune by communication” heuristics.
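A first-order pass-count model shows why divisor-only tiling can be suboptimal, as flagged in the granularity bullet above. The dimensions, array size, and cost model here are illustrative assumptions, not Odyssey's latency model.

```python
import math

def passes(dim, tile):
    """Array passes needed to cover a loop of extent `dim` with tiling
    factor `tile` (one pass per tile, partial tiles included)."""
    return math.ceil(dim / tile)

dim, rows = 33, 32  # hypothetical layer extent and PE-array rows
divisor_tiles = [t for t in range(1, rows + 1) if dim % t == 0]  # [1, 3, 11]
best_divisor = min(divisor_tiles, key=lambda t: passes(dim, t))
best_overall = min(range(1, rows + 1), key=lambda t: passes(dim, t))
print(best_divisor, passes(dim, best_divisor))   # 11 -> 3 passes
print(best_overall, passes(dim, best_overall))   # 17 -> 2 passes (non-divisor)
```

Here the best divisor tile needs three passes while a non-divisor factor finishes in two, a one-third latency reduction of the kind divisor-only search spaces forfeit.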
4. Innovations in Data Movement and Energy Efficiency
Significant architectural research has focused on reducing power and memory bottlenecks through advanced dataflows, operand packing, and customized PE logic.
- Triangular Input Movement: The TrIM architecture implements a triangular path for inputs, reducing total memory accesses relative to state-of-the-art row-stationary approaches and improving energy efficiency (Sestito et al., 5 Aug 2024); its key performance metrics are expressed in terms of the array's parallelism factors and the number of bits per operand.
- Structured and Unstructured Sparsity: Columns in sparse CNN filters are optimally packed (“column combining”) to increase nonzero density per array column, boosting utilization and energy efficiency without significant accuracy loss and requiring retraining on only fractions of the full dataset (Kung et al., 2018); a greedy packing sketch follows this list. For unstructured sparsity, VUSA’s “virtual upscaling” lets each physical row serve more logical columns than it has MAC units, yielding area and power savings (Helal et al., 1 Jun 2025).
- PE-Level Optimizations: Approximate MAC units, implemented through positive/negative partial product logic (PPC/NPPC), produce energy savings of $22\%$ and above with bounded PSNR loss—well-suited for error-resilient applications (Jaswal et al., 31 Aug 2025). Meanwhile, tensor-PE fusions reduce per-MAC accumulator and register overhead (Liu et al., 2020).
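As referenced above, a greedy, conflict-free packing sketch conveys the core of column combining. Real designs additionally admit a bounded number of conflicting nonzeros that are pruned before retraining, which this sketch omits; all names here are illustrative.

```python
import numpy as np

def column_combine(W):
    """Greedy sketch of column combining (after Kung et al., 2018): pack
    sparse filter columns whose nonzero rows do not collide into a single
    physical array column, raising nonzero density per column."""
    groups = []  # each group: (set of occupied rows, list of column indices)
    for j in range(W.shape[1]):
        rows = set(np.nonzero(W[:, j])[0])
        for occupied, cols in groups:
            if occupied.isdisjoint(rows):  # no collision: share the column
                occupied |= rows
                cols.append(j)
                break
        else:
            groups.append((rows, [j]))     # start a new physical column
    return [cols for _, cols in groups]

rng = np.random.default_rng(0)
W = rng.random((16, 12)) * (rng.random((16, 12)) < 0.2)  # ~80% sparse filter
print(len(column_combine(W)), "physical columns for", W.shape[1], "logical ones")
```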
5. Reliability, Fault Tolerance, and Adaptive Execution
With DNN deployment in safety-critical domains, SSAs have evolved to address both transient and permanent fault resilience.
- Run-Time Reconfigurability: FORTALESA introduces three execution modes—baseline (no redundancy), DMR, and TMR—plus four implementation options, efficiently mapping NNs to heterogeneous redundancy levels via vulnerability analysis. Its analytic fault propagation model replaces slow RTL simulation, yielding resource savings versus static TMR alongside significant speedup (Cherezova et al., 6 Mar 2025); a minimal mode-selection sketch follows this list.
- Hierarchical Fault Injection and Model-Based Assessment: SAFFIRA’s URE-based modeling and hierarchical, hardware-aware fault injection (FI) substantially accelerates DNN accelerator reliability assessment relative to RTL-level injection, offering fault-distance metrics for output drift and immediate trajectory mapping for propagated errors (Taheri et al., 5 Mar 2024).
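A minimal sketch of run-time selectable redundancy at single-MAC granularity illustrates the baseline/DMR/TMR distinction referenced above; the bit-flip fault model, function names, and return convention are illustrative assumptions, not FORTALESA's implementation.

```python
from collections import Counter

def redundant_mac(a, b, acc, mode="TMR", faulty=(False, False, False)):
    """One MAC executed with selectable redundancy (integer data assumed
    so the XOR-based fault injection below is well defined).
    Returns (result, error_flag)."""
    def mac(flip):
        r = acc + a * b
        return r ^ 1 if flip else r  # emulate a transient single bit flip

    n = {"baseline": 1, "DMR": 2, "TMR": 3}[mode]
    results = [mac(faulty[i]) for i in range(n)]
    if mode == "baseline":
        return results[0], False
    if mode == "DMR":  # detection only: flag a mismatch, cannot correct
        return results[0], results[0] != results[1]
    # TMR: majority vote masks a single faulty replica and flags the event
    value, count = Counter(results).most_common(1)[0]
    return value, count < 3

print(redundant_mac(3, 4, 5, mode="TMR", faulty=(True, False, False)))
# -> (17, True): the vote masks the faulty replica and reports the fault
```

DMR can only detect the mismatch, while TMR's majority vote both masks and flags it, mirroring the detection-versus-correction tradeoff among the three modes.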
6. Floorplanning, Multi-Core Hybrids, and Emerging Directions
Physical design and flexible deployment undergird the future trajectory of SSAs:
- Asymmetric Floorplanning: By aligning PE aspect ratios to bus widths and switching activity, interconnect power is minimized for ResNet50-like layers, yielding gains in power density critical for chip-scale and edge designs (Peltekis et al., 2023).
- Hybrid Systolic-Manycore Architectures: Embedding “virtual” systolic networks within shared-memory manycore systems via hardware queues and queue-linked registers (QLRs) doubles compute utilization and improves energy efficiency (Mazzola et al., 20 Feb 2024), blending the programmability of CPUs with the efficiency of fixed-function arrays.
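The queue-linked idea can be mimicked in software: below, a chain of threads connected by FIFOs behaves like a 1-D systolic dot-product pipeline, each “virtual PE” popping an operand, applying its stationary weight, and pushing downstream. The threading realization and all names are illustrative assumptions, not the QLR hardware of the cited work.

```python
from queue import Queue
from threading import Thread

def pe(stage_w, q_in, q_out):
    """One 'virtual' PE: pops (x, acc) from its input queue, applies its
    stationary weight, and forwards the result, mimicking a queue-linked
    register chain on a shared-memory manycore."""
    while True:
        item = q_in.get()
        if item is None:          # end-of-stream marker
            q_out.put(None)
            return
        x, acc = item
        q_out.put((x, acc + stage_w * x))

weights = [2.0, 3.0, 5.0]
queues = [Queue() for _ in range(len(weights) + 1)]
for w, qi, qo in zip(weights, queues, queues[1:]):
    Thread(target=pe, args=(w, qi, qo), daemon=True).start()

for x in (1.0, 2.0):
    queues[0].put((x, 0.0))       # inject operands at the chain head
queues[0].put(None)
while (out := queues[-1].get()) is not None:
    print(out)                    # (x, x * sum(weights)): a dot-product pipe
```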
7. Application Domains and Impact
SSAs are foundational in DNN accelerators (serving both vision and LLMs), hardware-accelerated signal processing, genome sequence matching (enabling exact/approximate substring search through pipelined motif tree matching (Rice et al., 2010)), communications (enabling lattice-reduction-aided MIMO via parallelized LLL variants (Wang et al., 2011)), fault-tolerant inference for safety-critical embedded systems, and general-purpose edge AI cores. Their sustained research focus remains on balancing array utilization, memory hierarchy bottlenecks, power density, resilience, and programmability in an era of increasingly heterogeneous and data-intensive workloads.