Sparsity-Aware Output-Channel Dataflow Streaming
- SAOCDS is a streaming accelerator architecture for spiking neural networks that exploits both input and weight sparsity to enable high-throughput, control-free execution.
- It integrates fixed-weight scheduling with the gated one-to-all product algorithm for pipelined, deterministic processing across network layers in cognitive radio systems.
- By using COO-encoded weight storage and fully pipelined output-channel dataflow, SAOCDS achieves doubled throughput and reduced dynamic power on FPGA implementations.
Sparsity-Aware Output-Channel Dataflow Streaming (SAOCDS) refers to a streaming accelerator architecture for spiking neural networks (SNNs) that maximally exploits both input and weight sparsity while maintaining high throughput and low power, particularly in the context of automatic modulation classification (AMC) for cognitive radio systems. Unlike prior array-based and streaming designs, SAOCDS integrates fixed-weight scheduling with the gated one-to-all product (GOAP) algorithm, allowing completely control-free, pipelined execution across SNN layers. Key optimization strategies include coordinate (COO)-encoded weight storage and fully pipelined output-channel dataflow, yielding measurable improvements in energy efficiency, throughput, and resource utilization on FPGA hardware (Yang et al., 6 Jan 2026).
1. Block-Level Architecture
SAOCDS comprises a sequence of pipelined hardware modules, each corresponding to a network layer and directly streaming data to the next—no global scheduler or router is required. Each layer module consists of the following components:
- Input Buffer (IFM Buffer): A compact SRAM bank organized as spike bit-vectors per input channel.
- Matrix-Vector Thresholding Unit (MVTU): Contains processing elements (PEs), which maintain per-output-channel membrane potentials, perform gated accumulation, and implement LIF neuron thresholding via local comparators and reset circuits.
- Weight Memory (COO format): Stores nonzero weights as (row-index, col-index, value) triples, enabling direct iteration over sparse elements.
- Output FIFO: Each PE streams output spikes into the next layer's input buffer through point-to-point FIFOs.
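As a software analogy of these components (class and field names are hypothetical, not from the paper), the COO weight triples and the per-layer streaming organization might be modeled as:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class COOWeight:
    """One nonzero weight as stored in the SAOCDS weight memory."""
    ri: int   # row index: encodes output channel and input channel
    ci: int   # column index: spatial offset into the output pixels
    d: int    # 16-bit signed fixed-point weight value

class LayerModule:
    """One pipelined layer: IFM buffer -> MVTU PEs -> output FIFO."""
    def __init__(self, weights, num_input_channels):
        self.ifm_buffer = {}                                 # spike bit-vectors per input channel
        self.weights = sorted(weights, key=lambda w: w.ri)   # ascending output-channel order
        self.ic = num_input_channels
        self.ofm_fifo = deque()                              # point-to-point FIFO to next layer

    def output_channel(self, w: COOWeight) -> int:
        return w.ri // self.ic                               # oc = floor(RI / IC)

# toy layer with two nonzero weights
layer = LayerModule([COOWeight(ri=5, ci=1, d=3), COOWeight(ri=2, ci=0, d=-1)],
                    num_input_channels=4)
print([layer.output_channel(w) for w in layer.weights])  # → [0, 1]
```

Sorting the triples by `ri` at build time is what lets each PE stream output channels in ascending order without any runtime arbitration.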
Block Diagram

```
IFM Buffer → MVTU (P PEs + per-PE Weight Memory) → OFM FIFO → next-layer IFM Buffer
```
This modularity supports static interconnects without crossbar or bus logic, ensuring deterministic, fully pipelined execution.
2. Dataflow Control and Comparison to Prior Designs
Conventional systolic arrays employ a 2D mesh of reconfigurable PEs, with complex global control and routing for input and weight management, often requiring centralized sparsity control. FINN-style pure streaming architectures, while high-throughput, lack mechanisms for leveraging weight sparsity, resulting in inefficiencies.
SAOCDS introduces output-channel dataflow, wherein iteration proceeds in strict output-channel index order:
- Only nonzero weights (spatial sparsity) and nonzero input spikes (temporal sparsity) are processed.
- Scheduling of reads/writes is statically determined, eliminating dynamic control and arbitration.
- Load balancing across PEs ensures a uniform workload, as each nonzero weight contributes a comparable amount of accumulation work (bounded by its Enable Map) to the output channels.
This design achieves simultaneous exploitation of spatial and temporal sparsity, yielding higher efficiency than prior designs.
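The payoff of processing only nonzero weight/spike pairs can be illustrated with a toy operation count (all figures here are hypothetical, chosen only to show the mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy layer: 64 weight slots at 25% density, 128 input spike slots at 10% firing rate
weights = rng.random(64) < 0.25    # spatial sparsity (nonzero-weight mask)
spikes  = rng.random(128) < 0.10   # temporal sparsity (input-spike mask)

dense_ops = weights.size * spikes.size               # dense accumulator work
goap_ops  = int(weights.sum()) * int(spikes.sum())   # only nonzero weight × spike pairs
print(dense_ops, goap_ops)
```

A dense streaming design (FINN-style) pays `dense_ops` regardless of content; output-channel dataflow with GOAP pays only for the surviving pairs.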
3. GOAP Algorithm and Mathematical Formulation
The GOAP algorithm forms the mathematical backbone of SAOCDS processing. Weights are stored as sparse COO triples:
- $(RI, CI, D)$: row index, column index, and value of each nonzero weight.
- $D$ is the nonzero weight value; $RI$ encodes the output channel $oc = \lfloor RI / IC \rfloor$ and input channel $ic = RI \bmod IC$; and $CI$ is the spatial offset for output pixels.

Enable Map $EM$ specifies valid output indices for each weight. At timestep $t$, input spikes are $I^t[ic][\cdot] \in \{0, 1\}$. The membrane potential for output channel $oc$ is updated:

$$V_{oc}[oi] \leftarrow \alpha\, V_{oc}[oi] + \sum_{nnz:\ \lfloor RI_{nnz}/IC \rfloor = oc}\ \sum_{oi \in EM_{nnz}} D_{nnz}\; I^t\big[ic_{nnz}\big]\big[oi + CI_{nnz}\big],$$

where spike generation follows the LIF rule with soft reset:

$$S_{oc}[oi] = \Theta\big(V_{oc}[oi] - U_{th}\big), \qquad V_{oc}[oi] \leftarrow V_{oc}[oi] - \theta\, S_{oc}[oi],$$

with $\Theta$ the Heaviside step function.
Each output channel's computation proceeds in a statically scheduled loop, with precomputed empty/extra iterations guaranteeing deterministic resource use.
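Because $oc = \lfloor RI/IC \rfloor$ and weights are stored in ascending $RI$ order, the per-output-channel iteration bounds are known at compile time. A minimal Python stand-in for this compile-time pass (array values are illustrative):

```python
from itertools import groupby

IC = 4                            # input channels
ri_indices = [1, 3, 6, 6, 9, 13]  # sorted RI values of the nonzero weights

# Group nonzero-weight slots by output channel: oc = RI // IC.
# Output channels absent from the dict correspond to the precomputed
# empty iterations that keep the loop count fixed.
schedule = {oc: [i for i, _ in group]
            for oc, group in groupby(enumerate(ri_indices), key=lambda p: p[1] // IC)}
print(schedule)  # → {0: [0, 1], 1: [2, 3], 2: [4], 3: [5]}
```

Since the schedule is a pure function of the weight indices, it can be baked into the hardware loop bounds with no runtime control logic.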
4. Dataflow Scheduling and Pipelining
Algorithmic implementation is captured in the following pseudocode, ensuring seamless convolutional layer operation:
```
for t in 0…T−1:                          # time-step loop
    IC_read ← 0
    pre_oc ← −1
    for nnz in 0…NNZ−1:                  # iterate over all nonzero weights
        this_oc ← floor(W[nnz].RI / IC)
        ic ← W[nnz].RI mod IC
        if IC_read < IC:
            I_buf[IC_read] ← read next input channel
            IC_read ← IC_read + 1
        if this_oc ≠ pre_oc:
            V ← DRAM.read(membrane_states[this_oc])
            V ← α·V
        for oi in EM_range:
            if I_buf[ic][oi + W[nnz].CI] == 1:
                V[oi] ← V[oi] + D[nnz]
        if next_oc ≠ this_oc:            # last nonzero weight of this output channel
            S ← (V > U_th)
            V ← V − θ·S
            write OFM channel this_oc ← S
            DRAM.write(membrane_states[this_oc], V)
        pre_oc ← this_oc
```
Empty and extra iterations are precomputed from the RI index distribution, forming a fixed loop count at compile time and eliminating runtime control.
With each layer streaming its output channels in ascending order, inter-layer arbitration is not needed; FIFOs sized at compile time guarantee throughput matching and complete pipeline efficiency.
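A behavioral software model of this scheduling loop, on toy dimensions (sizes and weight values are hypothetical; this is not a cycle-accurate model and elides the DRAM traffic):

```python
import numpy as np

def goap_layer(spikes, weights, IC, OC, W_out, alpha=0.5, u_th=1.0, theta=1.0):
    """Behavioral model of the GOAP output-channel loop.

    spikes : (T, IC, W_in) binary input spike trains
    weights: list of (ri, ci, d) COO triples sorted by ri
    """
    T = spikes.shape[0]
    V = np.zeros((OC, W_out))                 # persistent membrane potentials
    out = np.zeros((T, OC, W_out), dtype=int)
    for t in range(T):
        for oc in range(OC):
            V[oc] *= alpha                    # leak on first touch of this channel
            for ri, ci, d in weights:
                if ri // IC != oc:            # statically scheduled per-channel slice
                    continue
                ic = ri % IC
                for oi in range(W_out):       # EM range
                    if spikes[t, ic, oi + ci]:
                        V[oc, oi] += d        # gated accumulation
            s = (V[oc] > u_th).astype(int)    # LIF thresholding
            V[oc] -= theta * s                # soft reset
            out[t, oc] = s
    return out

T, IC, OC, W_in, W_out = 2, 2, 2, 4, 3
spikes = np.ones((T, IC, W_in), dtype=int)    # all-ones toy input
weights = [(0, 0, 2.0), (2, 1, 0.5)]          # ri=0 → oc0/ic0, ri=2 → oc1/ic0
print(goap_layer(spikes, weights, IC, OC, W_out)[0, 0])  # → [1 1 1]
```

The inner loop touches only the stored nonzero triples, and each output channel is finalized (threshold, reset, write-out) exactly once per time step, mirroring the `next_oc ≠ this_oc` boundary in the pseudocode.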
5. FPGA Implementation and Performance
SAOCDS was implemented on a Xilinx Virtex-7 VC709 board, synthesized with Vivado HLS and Vivado 2020.1. Key hardware features include:
- Data precision: 16-bit signed fixed-point for weights/membrane potentials; 1-bit spikes.
- Resource utilization: LUTs: 82,859 (≈ 21% VC709), FFs: 44,906, BRAM18K: 96.5, DSP48E: 297.
- Clock frequency: MHz, bottlenecked by per-PE gated accumulation.
Optimization strategies employed:
- Fully unrolled accumulation loops (vectorized bit-wise read).
- Local register banking.
- Compile-time flattening of NNZ iteration.
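The "vectorized bit-wise read" can be mimicked in software by packing one channel's spike bit-vector into a single word, so a whole channel is fetched per access and individual spikes are tested by bit masking (the packing scheme below is a hypothetical illustration):

```python
spikes = [1, 0, 1, 1, 0, 0, 1, 0]    # one input channel's spike bit-vector

# pack into one machine word: one "read" fetches the whole channel
packed = sum(bit << i for i, bit in enumerate(spikes))

# gated access: test output-pixel position oi in O(1), no per-bit iteration
oi = 3
print(packed, (packed >> oi) & 1)  # → 77 1
```

In hardware the analogous trick is reading the full spike vector of a channel from SRAM in one cycle and gating the accumulators with its individual bits.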
Measured results on the RadioML 2016 dataset:
| Accelerator | Throughput (MS/s) | Dynamic Power (W) | Accuracy (%) |
|---|---|---|---|
| SAOCDS | 23.5 | 0.473 | 86 |
| FINN-based baseline | 11.45 | 1.146 | 86 |
SAOCDS achieves approximately 2× the throughput at less than 42% of the dynamic power, with identical classification accuracy (Yang et al., 6 Jan 2026).
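The headline ratios follow directly from the table above (a quick arithmetic check):

```python
saocds_tp, saocds_pw = 23.5, 0.473   # throughput (MS/s), dynamic power (W)
finn_tp, finn_pw = 11.45, 1.146      # FINN-based baseline

speedup = saocds_tp / finn_tp        # throughput ratio
power_frac = saocds_pw / finn_pw     # fraction of baseline dynamic power
print(round(speedup, 2), round(power_frac, 2))  # → 2.05 0.41
```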
6. Efficiency Metrics and Scalability
Energy and hardware efficiency metrics for SAOCDS include:
- Energy per sample: 0.473 W / 23.5 MS/s ≈ 20.1 nJ/sample.
- Figure-of-Merit (FoM): ≈ 1,659 μJ/s.
By comparison, the FINN-based prior records FoM ≈ 7,464 μJ/s, demonstrating a 4.5× improvement.
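The per-sample energy figure is straightforward to reproduce from the reported power and throughput:

```python
power_w = 0.473          # dynamic power, W (J/s)
throughput_sps = 23.5e6  # samples per second

energy_per_sample_nj = power_w / throughput_sps * 1e9
print(round(energy_per_sample_nj, 1))  # → 20.1
```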
Scalability considerations:
- GOAP iteration overhead: at high weight density the iteration count approaches that of full-density streaming, but this is mitigated in fully connected (FC) layers by weight masking.
- Empty/extra iteration penalty: at extreme sparsity this can reach ~10% loop overhead, yet remains small compared to the memory/compute savings.
- Larger networks: NNZ scales linearly with channel count/kernels; sustained throughput may require deeper pipelining or multi-bank memories.
- ASIC targets: custom COO packing could further minimize memory bandwidth, potentially exceeding 300 MHz.
- Smaller FPGAs: Resource sharing may constrain parallelism, impacting overall pipeline throughput.
7. Limitations and Future Directions
While SAOCDS demonstrates architectural advantages, several limitations warrant consideration:
- At very high density, the benefits of sparsity-aware iteration lessen due to increased loop control complexity and marginal latency increases versus sliding-window streaming.
- Extreme sparsity introduces nontrivial empty/extra iteration overhead, though overall efficiency remains favorable.
- Potential hardware limitations on compact FPGAs can necessitate architectural alterations, such as resource sharing and reduced parallelism, impacting throughput.
- Future enhancements may include adaptive run-time sparsity tracking or dynamic memory packing on ASIC platforms to push frequency beyond current FPGA design limits.
This suggests that while SAOCDS is well-suited for real-time, energy-constrained edge deployment in AMC settings, further architectural evolution and optimization may extend its applicability across larger SNNs and diverse hardware platforms (Yang et al., 6 Jan 2026).