Hardware-Software Co-Design for Event-Driven SNN Deployment on Low-Cost Neuromorphic FPGAs

Published 24 Apr 2026 in cs.AR | (2604.22179v1)

Abstract: Low-cost FPGA platforms can broaden access to neuromorphic systems research, but current spiking neural network (SNN) workflows remain divided between hardware-first implementations, which are difficult to integrate with PyTorch-style development, and software-first frameworks, which often stop at simulation or GPU execution. This paper presents a semantics-preserving hardware-software co-design framework for the deterministic deployment of PyTorch-defined SNNs to event-driven FPGA execution. A single exported artifact carries weights, thresholds, connectivity descriptors, and grouped time-to-first-spike (TTFS) decoding metadata from software definition to board execution and is reused unchanged by both the software reference and the board runtime. A 10-class MNIST TTFS classifier implemented in the routed 80 MHz design achieves 87.40% accuracy and matches the software reference on all 10,000 test images. The programmable-logic path delivers a service latency of 0.1375 μs/image and an estimated dynamic energy of 31.6 nJ/image, while scope-aware comparisons with matched GPU and CPU baselines keep accelerator-only and system-level measurements distinct. These results show that low-cost event-driven FPGA hardware can provide a direct and reproducible software-to-board path for software-defined SNN models.


Authors (4)

Summary

The paper introduces a unified deployment artifact that preserves PyTorch semantics for both software and FPGA-based event-driven SNN inference.
It demonstrates hardware acceleration with 0.1375 μs latency per image, achieving 7.27×10^6 images/s throughput and 31.6 nJ energy per image.
Benchmark results show the FPGA is 1.79× faster and 933× more energy-efficient than the best GPU configuration (RTX 3080, INT8) while maintaining reproducible 87.40% MNIST accuracy.


Introduction

This paper presents a hardware–software co-design framework that achieves deterministic deployment of PyTorch-defined spiking neural networks (SNNs) on event-driven FPGA platforms, specifically targeting the low-cost PYNQ-Z2 development board. Existing SNN research pipelines are typically fragmented: hardware-first approaches are efficient but not readily integrated with contemporary machine learning software stacks, while software-first approaches built atop frameworks like PyTorch rarely progress beyond simulation or GPU-bound execution. The work addresses this disconnect with a semantics-preserving deployment path built on a single artifact that software and hardware inference both reuse without modification, enabling reproducible, event-driven execution and explicit separation of measurement scopes.

Co-Design Framework and Deployment Path

The central innovation is a PyTorch-aligned deployment flow that exports model parameters, thresholds, connectivity descriptors, and grouped time-to-first-spike (TTFS) decoding metadata into a single artifact. This artifact is consumed identically by the software reference and the FPGA runtime, which eliminates the divergence often seen in software-hardware co-design: model semantics stay consistent throughout the workflow, and PyTorch-like invocation patterns are preserved on the board.
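The single-artifact idea can be illustrated with a minimal sketch. The field names, the JSON encoding, and the SHA-256 fingerprint below are assumptions made for illustration; the paper's actual artifact format is not described in this summary.

```python
import hashlib
import json


def build_artifact(weights, thresholds, connectivity, ttfs_groups):
    """Bundle everything both execution paths need into one dictionary.

    All field names here are hypothetical, not the paper's schema.
    """
    return {
        "weights": weights,
        "thresholds": thresholds,
        "connectivity": connectivity,
        "ttfs_groups": ttfs_groups,
    }


def serialize(artifact):
    """Serialize deterministically so equal artifacts give equal bytes."""
    text = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def fingerprint(blob):
    """Hash the serialized artifact so both paths can check what they loaded."""
    return hashlib.sha256(blob).hexdigest()


# Toy model: 3 inputs feeding 4 output neurons, two classes of two neurons each.
artifact = build_artifact(
    weights=[[1, 0, 2, 0], [0, 3, 0, 1], [2, 2, 1, 1]],
    thresholds=[4, 4, 3, 3],
    connectivity=[{"src": i, "dst": [0, 1, 2, 3]} for i in range(3)],
    ttfs_groups={"0": [0, 1], "1": [2, 3]},
)
blob = serialize(artifact)

# Both consumers load the same bytes; neither edits the artifact.
software_view = json.loads(blob)
board_view = json.loads(blob)
same_bytes = (
    fingerprint(serialize(software_view))
    == fingerprint(serialize(board_view))
    == fingerprint(blob)
)
print("artifact reused unchanged:", same_bytes)
```

Comparing fingerprints on both sides is one simple way to confirm that the software reference and the board runtime really consumed the same exported model.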

The hardware architecture leverages the Xilinx Zynq-7020 SoC's programmable logic for the event router, core groups, connectivity table, grouped TTFS decoder, and latency counters. Up to 2,048 neurons are supported in the default RTL event-processing pipeline, with the system ultimately limited by BRAM occupancy. Timing closure is achieved at 80 MHz, with the hardware path designed specifically for event-driven, TTFS SNN inference.
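As a rough illustration of what a grouped TTFS decoder computes, the sketch below assumes that each class owns a group of output neurons and that the class whose group fires first wins, with ties going to the lowest class index. The function name, the tie-breaking rule, and the handling of silent outputs are assumptions, not details taken from the paper.

```python
def decode_grouped_ttfs(first_spike_times, groups):
    """Return the class whose neuron group contains the earliest first spike.

    first_spike_times: one entry per output neuron, holding the time step of
        its first spike, or None if the neuron never fired.
    groups: groups[c] lists the output-neuron indices assigned to class c.
    Returns None when no output neuron fired at all.
    """
    best_class, best_time = None, None
    for cls, neurons in enumerate(groups):
        times = [
            first_spike_times[n]
            for n in neurons
            if first_spike_times[n] is not None
        ]
        if not times:
            continue  # this class produced no spike
        earliest = min(times)
        # Strict comparison keeps the lowest class index on ties.
        if best_time is None or earliest < best_time:
            best_class, best_time = cls, earliest
    return best_class


groups = [[0, 1], [2, 3], [4, 5]]
times = [7, None, 3, 9, None, 5]
prediction = decode_grouped_ttfs(times, groups)
print("predicted class:", prediction)
```

Because the rule is a pure function of first-spike times and the group table, the same decoding metadata can drive both a software reference and a hardware decoder.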

Figure 1: The architecture bridges PyTorch-style development and neuromorphic FPGA inference with a single deployment artifact and shared semantics for both paths.

System-Path Latency Profiling

A key aspect of the framework is its scope-aware benchmarking. System-level latency is decomposed into software reference execution, spike packing, hardware execution, and synchronization/readback. Accelerator-only latency and host-inclusive latency are reported separately to allow fair cross-platform comparisons. On the PYNQ-Z2, accelerator-only (PL) service latency for one image is $0.1375\,\mu\mathrm{s}$, yielding a throughput of $7.27\times 10^6$ images/s, with a dynamic energy estimate of $31.6\,\mathrm{nJ}$ per image. The design achieves 87.40% accuracy on MNIST and matches the software reference output on all 10,000 test images. Per-image agreement is a stricter criterion than equal aggregate accuracy, because it makes every individual decision traceable to the reference.
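The reported figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; the cycle count and the implied dynamic power are derived here and are not values stated in the source. The power figure also assumes that the energy estimate equals dynamic power times service latency.

```python
CLOCK_HZ = 80e6                # routed clock frequency from the paper
SERVICE_LATENCY_S = 0.1375e-6  # accelerator-only service latency per image
DYNAMIC_ENERGY_J = 31.6e-9     # estimated dynamic energy per image

# One clock period at 80 MHz is 12.5 ns.
cycle_time_ns = 1e9 / CLOCK_HZ

# Service latency expressed in clock cycles (derived, about 11 cycles).
cycles_per_image = round(SERVICE_LATENCY_S * CLOCK_HZ)

# Throughput is the reciprocal of the service latency (about 7.27e6 images/s).
throughput_images_per_s = 1.0 / SERVICE_LATENCY_S

# Implied average dynamic power, assuming energy = power * service latency.
implied_dynamic_power_w = DYNAMIC_ENERGY_J / SERVICE_LATENCY_S

print(f"cycle time:      {cycle_time_ns:.1f} ns")
print(f"cycles/image:    {cycles_per_image}")
print(f"throughput:      {throughput_images_per_s:.3e} images/s")
print(f"implied power:   {implied_dynamic_power_w * 1e3:.0f} mW")
```

The reciprocal relationship reproduces the quoted $7.27\times 10^6$ images/s, which suggests the latency figure is a per-image service interval rather than an end-to-end host latency.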

Figure 2: System-path latency analysis: the breakdown clarifies runtime cost distribution across reference computation, data marshaling, hardware execution, and I/O orchestration.
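One way to keep the two measurement scopes from being mixed up is to store the four stage timings together and derive each scope explicitly. The class below is a hypothetical sketch; its name and fields are assumptions, and the example values are placeholders in arbitrary units, not measurements from the paper.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SystemPathTiming:
    """Per-image stage timings for one run (all in the same unit)."""

    software_reference: float
    spike_packing: float
    hardware_execution: float
    sync_readback: float

    @property
    def accelerator_only(self):
        """Scope 1: time spent in the programmable-logic path only."""
        return self.hardware_execution

    @property
    def system_path(self):
        """Scope 2: all four stages of the decomposed system path."""
        return sum(getattr(self, f.name) for f in fields(self))

    def shares(self):
        """Fraction of the system path attributable to each stage."""
        total = self.system_path
        return {f.name: getattr(self, f.name) / total for f in fields(self)}


# Placeholder values in arbitrary units, chosen only to exercise the code.
timing = SystemPathTiming(
    software_reference=4.0,
    spike_packing=2.0,
    hardware_execution=1.0,
    sync_readback=3.0,
)
print("accelerator-only:", timing.accelerator_only)
print("system path:     ", timing.system_path)
print("shares:          ", timing.shares())
```

Reporting `accelerator_only` and `system_path` as separate named quantities makes it harder to compare an accelerator-only FPGA number against a host-inclusive GPU number by accident.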

Cross-platform benchmarks using the same (exported) model on GPU (RTX 3080, both INT8 and FP32) and CPU (INT8/FP32) baselines reveal that the FPGA implementation is $1.79\times$ faster and $933\times$ more energy-efficient than the best GPU configuration (INT8), while the CPU is slower by more than two orders of magnitude.
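For orientation, the stated ratios can be turned back into approximate per-image GPU figures. This is a back-of-the-envelope sketch: it assumes both ratios are formed from the same per-image metrics as the FPGA numbers (0.1375 μs and 31.6 nJ), which this summary does not state explicitly, so the results are implied values rather than reported measurements.

```python
FPGA_LATENCY_US = 0.1375  # accelerator-only service latency per image
FPGA_ENERGY_NJ = 31.6     # estimated dynamic energy per image
GPU_SPEED_RATIO = 1.79    # FPGA is this many times faster than GPU INT8
GPU_ENERGY_RATIO = 933.0  # FPGA is this many times more energy-efficient


def implied_baseline(candidate_value, ratio):
    """Baseline cost implied by 'candidate is `ratio` times better'."""
    return candidate_value * ratio


# Implied GPU INT8 per-image latency and energy (assumption-laden, see above).
implied_gpu_latency_us = implied_baseline(FPGA_LATENCY_US, GPU_SPEED_RATIO)
implied_gpu_energy_nj = implied_baseline(FPGA_ENERGY_NJ, GPU_ENERGY_RATIO)

print(f"implied GPU latency: {implied_gpu_latency_us:.3f} us/image")
print(f"implied GPU energy:  {implied_gpu_energy_nj / 1e3:.1f} uJ/image")
```

The asymmetry between the two ratios indicates that the advantage over the GPU lies mainly in energy per image rather than in raw speed.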

Robustness to Input Sparsity

To evaluate robustness, the impact of input spike sparsity is systematically quantified. Progressive random spike-drop stress degrades accuracy gradually instead of causing abrupt failures: from 87.40% with no drop, to 86.31% at 25% spike drop, 82.38% at 50%, and 69.74% at 75%. Across five full-test-set runs (50,000 image-run pairs), no prediction mismatches were observed, and performance remained consistent, indicating high determinism and stable runtime characteristics.
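A random spike-drop stress test of this kind can be sketched as follows. The event representation (a list of `(time_step, input_index)` tuples), the independent per-spike drop rule, and the seeded generator are assumptions for illustration; the accuracy table simply restates the figures quoted above.

```python
import random


def drop_spikes(events, drop_fraction, seed):
    """Independently discard each input spike with probability drop_fraction.

    events: list of (time_step, input_index) tuples (hypothetical format).
    A fixed seed makes the stress pattern repeatable across runs.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so 0.0 keeps everything and 1.0 drops everything.
    return [event for event in events if rng.random() >= drop_fraction]


# Accuracies reported in the text, keyed by spike-drop fraction.
reported_accuracy = {0.0: 87.40, 0.25: 86.31, 0.5: 82.38, 0.75: 69.74}
baseline = reported_accuracy[0.0]
accuracy_loss = {
    frac: round(baseline - acc, 2) for frac, acc in reported_accuracy.items()
}

events = [(t, i) for t in range(10) for i in range(20)]
kept = drop_spikes(events, 0.5, seed=0)
print("kept", len(kept), "of", len(events), "spikes")
print("accuracy loss in percentage points:", accuracy_loss)
```

The loss table shows the gradual trend described above: roughly 1 point at 25% drop, 5 points at 50%, and 18 points at 75%.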

Figure 3: Classifier accuracy as a function of controlled spike input sparsity, demonstrating graceful degradation under random spike-drop stress.

Practical and Theoretical Implications

Practically, the proposed framework democratizes neuromorphic hardware research by making event-driven SNN deployment accessible on widely available, affordable FPGA hardware, removing reliance on proprietary digital neuromorphic systems (e.g., Loihi). The use of a unified deployment artifact not only ensures reproducibility and functional equivalence across software and hardware but also streamlines system integration for closed-loop and hardware-in-the-loop experiments. By precisely delineating accelerator-only versus host-inclusive measurements, the work sets a rigorously fair standard for future benchmarking.

Theoretically, the deterministic mapping of PyTorch-defined TTFS SNNs to low-cost digital event-based hardware provides an extensible foundation for elaborating more complex architectures (e.g., convolutional SNNs, deeper multi-layer networks) on BRAM-constrained platforms. The methodology’s strong results under input sparsity stress position it as a practical vehicle for research in robust neuromorphic processing. However, scalability is ultimately limited by on-chip BRAM and synaptic routing overhead; overcoming these will require architectural and routing optimizations.

Future Directions

Opportunities for further research include:

Scaling to complex networks: Extending support to deeper and convolutional SNN architectures within the limits imposed by BRAM and routing fabric.
Hardware optimizations: Implementing more aggressive synapse routing and compression techniques to increase neuron and synapse capacity per design.
Generalization: Porting the flow to other low-cost FPGA families and supporting richer on-chip learning rules.
Application scope: Deploying real-time closed-loop or hardware-in-the-loop neuromorphic systems for edge sensing, control, and robotics.

Conclusion

The presented hardware–software co-design path establishes a reproducible, deterministic workflow for deploying PyTorch-defined, event-driven SNNs onto low-cost FPGA hardware. Full-test-set output agreement, explicit separation of measurement scopes, and high parallel throughput with low dynamic energy position the framework as a reference model for semantics-preserving neuromorphic deployment. The approach offers a rigorously transparent and replicable foundation for both SNN hardware research and applications requiring event-driven, energy-efficient computation.

Reference: "Hardware-Software Co-Design for Event-Driven SNN Deployment on Low-Cost Neuromorphic FPGAs" (2604.22179).
