Sparsity-Aware Output-Channel Dataflow Streaming
- SAOCDS is a streaming accelerator architecture for spiking neural networks that exploits both input and weight sparsity to enable high-throughput, control-free execution.
- It integrates fixed-weight scheduling with the gated one-to-all product algorithm for pipelined, deterministic processing across network layers in cognitive radio systems.
- By using COO-encoded weight storage and fully pipelined output-channel dataflow, SAOCDS achieves doubled throughput and reduced dynamic power on FPGA implementations.
Sparsity-Aware Output-Channel Dataflow Streaming (SAOCDS) refers to a streaming accelerator architecture for spiking neural networks (SNNs) that maximally exploits both input and weight sparsity while maintaining high throughput and low power, particularly in the context of automatic modulation classification (AMC) for cognitive radio systems. Unlike prior array-based and streaming designs, SAOCDS integrates fixed-weight scheduling with the gated one-to-all product (GOAP) algorithm, allowing completely control-free, pipelined execution across SNN layers. Key optimization strategies include coordinate (COO)-encoded weight storage and fully pipelined output-channel dataflow, yielding measurable improvements in energy efficiency, throughput, and resource utilization on FPGA hardware (Yang et al., 6 Jan 2026).
1. Block-Level Architecture
SAOCDS comprises a sequence of pipelined hardware modules, each corresponding to a network layer and directly streaming data to the next—no global scheduler or router is required. Each layer module consists of the following components:
- Input Buffer (IFM Buffer): A compact SRAM bank organized as spike bit-vectors per input channel.
- Matrix-Vector Thresholding Unit (MVTU): Contains processing elements (PEs), which maintain per-output-channel membrane potentials, perform gated accumulation, and implement LIF neuron thresholding via local comparators and reset circuits.
- Weight Memory (COO format): Stores nonzero weights as (row-index, col-index, value) triples, enabling direct iteration over sparse elements.
- Output FIFO: Each PE streams output spikes into the next layer's input buffer through point-to-point FIFOs.
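As a software analogy of these components (class and field names are hypothetical, not from the paper), the COO weight triples and the per-layer streaming organization might be modeled as:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class COOWeight:
    """One nonzero weight as stored in the SAOCDS weight memory."""
    ri: int   # row index: encodes output channel and input channel
    ci: int   # column index: spatial offset into the output pixels
    d: int    # 16-bit signed fixed-point weight value

class LayerModule:
    """One pipelined layer: IFM buffer -> MVTU PEs -> output FIFO."""
    def __init__(self, weights, num_input_channels):
        self.ifm_buffer = {}                                 # spike bit-vectors per input channel
        self.weights = sorted(weights, key=lambda w: w.ri)   # ascending output-channel order
        self.ic = num_input_channels
        self.ofm_fifo = deque()                              # point-to-point FIFO to next layer

    def output_channel(self, w: COOWeight) -> int:
        return w.ri // self.ic                               # oc = floor(RI / IC)

# toy layer with two nonzero weights
layer = LayerModule([COOWeight(ri=5, ci=1, d=3), COOWeight(ri=2, ci=0, d=-1)],
                    num_input_channels=4)
print([layer.output_channel(w) for w in layer.weights])  # → [0, 1]
```

Sorting the triples by `ri` at build time is what lets each PE stream output channels in ascending order without any runtime arbitration.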
Block Diagram

```
IFM Buffer → MVTU (P PEs + per-PE Weight Memory) → OFM FIFO → next-layer IFM Buffer
```
This modularity supports static interconnects without crossbar or bus logic, ensuring deterministic, fully pipelined execution.
2. Dataflow Control and Comparison to Prior Designs
Conventional systolic arrays employ a 2D mesh of reconfigurable PEs, with complex global control and routing for input and weight management, often requiring centralized sparsity control. FINN-style pure streaming architectures, while high-throughput, lack mechanisms for leveraging weight sparsity, resulting in inefficiencies.
SAOCDS introduces output-channel dataflow, wherein iteration proceeds in strict output-channel index order:
- Only nonzero weights (spatial sparsity) and nonzero input spikes (temporal sparsity) are processed.
- Scheduling of reads/writes is statically determined, eliminating dynamic control and arbitration.
- Load balancing across PEs ensures a uniform workload, as each nonzero weight contributes a comparable amount of accumulation work (bounded by its Enable Map) to the output channels.
This design achieves simultaneous exploitation of spatial and temporal sparsity, yielding higher efficiency than prior designs.
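The payoff of processing only nonzero weight/spike pairs can be illustrated with a toy operation count (all figures here are hypothetical, chosen only to show the mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy layer: 64 weight slots at 25% density, 128 input spike slots at 10% firing rate
weights = rng.random(64) < 0.25    # spatial sparsity (nonzero-weight mask)
spikes  = rng.random(128) < 0.10   # temporal sparsity (input-spike mask)

dense_ops = weights.size * spikes.size               # dense accumulator work
goap_ops  = int(weights.sum()) * int(spikes.sum())   # only nonzero weight × spike pairs
print(dense_ops, goap_ops)
```

A dense streaming design (FINN-style) pays `dense_ops` regardless of content; output-channel dataflow with GOAP pays only for the surviving pairs.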
3. GOAP Algorithm and Mathematical Formulation
The GOAP algorithm forms the mathematical backbone of SAOCDS processing. Weights are stored as sparse COO triples:
- $(RI, CI, D)$: row index, column index, and value of each nonzero weight.
- $D$ is the nonzero weight value; $RI$ encodes the output channel $oc = \lfloor RI / IC \rfloor$ and input channel $ic = RI \bmod IC$; and $CI$ is the spatial offset for output pixels.

Enable Map $EM$ specifies valid output indices for each weight. At timestep $t$, input spikes are $I^t[ic][\cdot] \in \{0, 1\}$. The membrane potential for output channel $oc$ is updated:

$$V_{oc}[oi] \leftarrow \alpha\, V_{oc}[oi] + \sum_{nnz:\ \lfloor RI_{nnz}/IC \rfloor = oc}\ \sum_{oi \in EM_{nnz}} D_{nnz}\; I^t\big[ic_{nnz}\big]\big[oi + CI_{nnz}\big],$$

where spike generation follows the LIF rule with soft reset:

$$S_{oc}[oi] = \Theta\big(V_{oc}[oi] - U_{th}\big), \qquad V_{oc}[oi] \leftarrow V_{oc}[oi] - \theta\, S_{oc}[oi],$$

with $\Theta$ the Heaviside step function.
Each output channel's computation proceeds in a statically scheduled loop, with precomputed empty/extra iterations guaranteeing deterministic resource use.
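Because $oc = \lfloor RI/IC \rfloor$ and weights are stored in ascending $RI$ order, the per-output-channel iteration bounds are known at compile time. A minimal Python stand-in for this compile-time pass (array values are illustrative):

```python
from itertools import groupby

IC = 4                            # input channels
ri_indices = [1, 3, 6, 6, 9, 13]  # sorted RI values of the nonzero weights

# Group nonzero-weight slots by output channel: oc = RI // IC.
# Output channels absent from the dict correspond to the precomputed
# empty iterations that keep the loop count fixed.
schedule = {oc: [i for i, _ in group]
            for oc, group in groupby(enumerate(ri_indices), key=lambda p: p[1] // IC)}
print(schedule)  # → {0: [0, 1], 1: [2, 3], 2: [4], 3: [5]}
```

Since the schedule is a pure function of the weight indices, it can be baked into the hardware loop bounds with no runtime control logic.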
4. Dataflow Scheduling and Pipelining
Algorithmic implementation is captured in the following pseudocode, ensuring seamless convolutional layer operation:
```
for t in 0…T−1:                          # time-step loop
    IC_read ← 0
    pre_oc ← −1
    for nnz in 0…NNZ−1:                  # iterate over all nonzero weights
        this_oc ← floor(W[nnz].RI / IC)
        ic ← W[nnz].RI mod IC
        if IC_read < IC:
            I_buf[IC_read] ← read next input channel
            IC_read ← IC_read + 1
        if this_oc ≠ pre_oc:
            V ← DRAM.read(membrane_states[this_oc])
            V ← α·V
        for oi in EM_range:
            if I_buf[ic][oi + W[nnz].CI] == 1:
                V[oi] ← V[oi] + D[nnz]
        if next_oc ≠ this_oc:            # last nonzero weight of this output channel
            S ← (V > U_th)
            V ← V − θ·S
            write OFM channel this_oc ← S
            DRAM.write(membrane_states[this_oc], V)
        pre_oc ← this_oc
```
Empty and extra iterations are precomputed from the RI index distribution, forming a fixed loop count at compile time and eliminating runtime control.
With each layer streaming its output channels in ascending order, inter-layer arbitration is not needed; FIFOs sized at compile time guarantee throughput matching and complete pipeline efficiency.
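A behavioral software model of this scheduling loop, on toy dimensions (sizes and weight values are hypothetical; this is not a cycle-accurate model and elides the DRAM traffic):

```python
import numpy as np

def goap_layer(spikes, weights, IC, OC, W_out, alpha=0.5, u_th=1.0, theta=1.0):
    """Behavioral model of the GOAP output-channel loop.

    spikes : (T, IC, W_in) binary input spike trains
    weights: list of (ri, ci, d) COO triples sorted by ri
    """
    T = spikes.shape[0]
    V = np.zeros((OC, W_out))                 # persistent membrane potentials
    out = np.zeros((T, OC, W_out), dtype=int)
    for t in range(T):
        for oc in range(OC):
            V[oc] *= alpha                    # leak on first touch of this channel
            for ri, ci, d in weights:
                if ri // IC != oc:            # statically scheduled per-channel slice
                    continue
                ic = ri % IC
                for oi in range(W_out):       # EM range
                    if spikes[t, ic, oi + ci]:
                        V[oc, oi] += d        # gated accumulation
            s = (V[oc] > u_th).astype(int)    # LIF thresholding
            V[oc] -= theta * s                # soft reset
            out[t, oc] = s
    return out

T, IC, OC, W_in, W_out = 2, 2, 2, 4, 3
spikes = np.ones((T, IC, W_in), dtype=int)    # all-ones toy input
weights = [(0, 0, 2.0), (2, 1, 0.5)]          # ri=0 → oc0/ic0, ri=2 → oc1/ic0
print(goap_layer(spikes, weights, IC, OC, W_out)[0, 0])  # → [1 1 1]
```

The inner loop touches only the stored nonzero triples, and each output channel is finalized (threshold, reset, write-out) exactly once per time step, mirroring the `next_oc ≠ this_oc` boundary in the pseudocode.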
5. FPGA Implementation and Performance
SAOCDS was implemented on a Xilinx Virtex-7 VC709 board, synthesized with Vivado HLS and Vivado 2020.1. Key hardware features include:
- Data precision: 16-bit signed fixed-point for weights/membrane potentials; 1-bit spikes.
- Resource utilization: LUTs: 82,859 (≈ 21% VC709), FFs: 44,906, BRAM18K: 96.5, DSP48E: 297.
- Clock frequency: MHz, bottlenecked by per-PE gated accumulation.
Optimization strategies employed:
- Fully unrolled accumulation loops (vectorized bit-wise read).
- Local register banking.
- Compile-time flattening of NNZ iteration.
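The "vectorized bit-wise read" can be mimicked in software by packing one channel's spike bit-vector into a single word, so a whole channel is fetched per access and individual spikes are tested by bit masking (the packing scheme below is a hypothetical illustration):

```python
spikes = [1, 0, 1, 1, 0, 0, 1, 0]    # one input channel's spike bit-vector

# pack into one machine word: one "read" fetches the whole channel
packed = sum(bit << i for i, bit in enumerate(spikes))

# gated access: test output-pixel position oi in O(1), no per-bit iteration
oi = 3
print(packed, (packed >> oi) & 1)  # → 77 1
```

In hardware the analogous trick is reading the full spike vector of a channel from SRAM in one cycle and gating the accumulators with its individual bits.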
Measured results on the RadioML 2016 dataset:
| Accelerator | Throughput (MS/s) | Dynamic Power (W) | Accuracy (%) |
|---|---|---|---|
| SAOCDS | 23.5 | 0.473 | 86 |
| FINN-based baseline | 11.45 | 1.146 | 86 |
SAOCDS achieves approximately 2× the throughput at less than 42% of the dynamic power, with identical classification accuracy (Yang et al., 6 Jan 2026).
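The headline ratios follow directly from the table above (a quick arithmetic check):

```python
saocds_tp, saocds_pw = 23.5, 0.473   # throughput (MS/s), dynamic power (W)
finn_tp, finn_pw = 11.45, 1.146      # FINN-based baseline

speedup = saocds_tp / finn_tp        # throughput ratio
power_frac = saocds_pw / finn_pw     # fraction of baseline dynamic power
print(round(speedup, 2), round(power_frac, 2))  # → 2.05 0.41
```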
6. Efficiency Metrics and Scalability
Energy and hardware efficiency metrics for SAOCDS include:
- Energy per sample: 0.473 W / 23.5 MS/s ≈ 20.1 nJ/sample.
- Figure-of-Merit (FoM): ≈ 1,659 μJ/s.
By comparison, the FINN-based prior records FoM ≈ 7,464 μJ/s, demonstrating a 4.5× improvement.
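The per-sample energy figure is straightforward to reproduce from the reported power and throughput:

```python
power_w = 0.473          # dynamic power, W (J/s)
throughput_sps = 23.5e6  # samples per second

energy_per_sample_nj = power_w / throughput_sps * 1e9
print(round(energy_per_sample_nj, 1))  # → 20.1
```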
Scalability considerations:
- GOAP iteration overhead: at high weight density the iteration count approaches that of full-density streaming, but this is mitigated in fully connected (FC) layers by weight masking.
- Empty/extra iteration penalty: at extreme sparsity this can reach ~10% loop overhead, yet remains small compared to the memory/compute savings.
- Larger networks: NNZ scales linearly with channel count/kernels; sustained throughput may require deeper pipelining or multi-bank memories.
- ASIC targets: custom COO packing could further minimize memory bandwidth, potentially exceeding 300 MHz.
- Smaller FPGAs: Resource sharing may constrain parallelism, impacting overall pipeline throughput.
7. Limitations and Future Directions
While SAOCDS demonstrates architectural advantages, several limitations warrant consideration:
- At very high density, the benefits of sparsity-aware iteration lessen due to increased loop control complexity and marginal latency increases versus sliding-window streaming.
- Extreme sparsity introduces nontrivial empty/extra iteration overhead, though overall efficiency remains favorable.
- Potential hardware limitations on compact FPGAs can necessitate architectural alterations, such as resource sharing and reduced parallelism, impacting throughput.
- Future enhancements may include adaptive run-time sparsity tracking or dynamic memory packing on ASIC platforms to push frequency beyond current FPGA design limits.
This suggests that while SAOCDS is well-suited for real-time, energy-constrained edge deployment in AMC settings, further architectural evolution and optimization may extend its applicability across larger SNNs and diverse hardware platforms (Yang et al., 6 Jan 2026).