Embedded FPGA Accelerator for BCPNN

Updated 4 July 2026

The paper introduces an embedded FPGA accelerator for BCPNN that reorganizes Bayesian-Hebbian learning into stream-oriented kernels and burst-based memory access.
It demonstrates significant performance improvements and energy reductions by mapping dense matrix operations into pipelined, parallel structures on FPGA platforms.
The design addresses memory bandwidth and resource constraints, enabling both online learning and inference-only modes for low-power, edge-based applications.

Embedded FPGA acceleration for Bayesian Confidence Propagation Neural Network (BCPNN) denotes a class of hardware–software designs that map the rate-based, Bayesian-Hebbian learning mechanisms of BCPNN onto FPGA dataflow architectures, particularly for low-power and edge deployment. In this literature, BCPNN is treated not as a backpropagation model but as a brain-like neural network whose synaptic parameters are derived from local probability traces, while the accelerator is engineered around streaming movement of traces, weights, and activities through memory hierarchies and pipelined kernels. The line of work spans StreamBrain’s prototype FPGA offload engine on an Intel Stratix V DE5-Net board, a reconfigurable stream-based accelerator on an AMD Xilinx Alveo U55C, and a Zynq UltraScale+ SoC implementation described as the first embedded FPGA accelerator for BCPNN, with online learning and inference-only execution on-device (Podobas et al., 2021, Hafiz et al., 3 Mar 2025, Hafiz et al., 23 Jun 2025).

1. Algorithmic basis of BCPNN and its hardware implications

BCPNN, or Bayesian Confidence Propagation Neural Network, is presented as a brain-like neural network model with a Hebbian, Bayes-derived learning rule. The cited work focuses on the rate-based formulation. The network is organized around hypercolumns (HCUs), which are groups of neurons associated with a variable or feature region, and minicolumns (MCUs), which are units inside an HCU that capture a specific “instance” or variant. In the StreamBrain formulation, BCPNN treats the network as a collection of random variables with a joint distribution,

$p(x_1, x_2, \ldots, x_n, y_1, \ldots, y_m, z_1, \ldots, z_l).$

The model supports unsupervised, semi-supervised, and supervised learning depending on stage, and it includes structural plasticity, meaning sparse connectivity can evolve dynamically (Podobas et al., 2021).

The core state variables are incremental probability estimates or traces. In the hardware-oriented formulations, the model tracks pre-synaptic probability $p_i$, post-synaptic probability $p_j$, and joint probability $p_{ij}$, with synaptic parameters derived from logarithmic expressions. The StreamBrain paper expresses the update mechanism as exponentially weighted statistics,

$C_i \gets (1 - \lambda) C_i + \lambda \langle a_i \rangle,$

$C_j \gets (1 - \lambda) C_j + \lambda \langle a_j \rangle,$

$C_{ij} \gets (1 - \lambda) C_{ij} + \lambda \langle a_i \otimes a_j \rangle,$

followed by bias and weight computation. The later embedded work states the standard parameterization as

$b_j = \log p_j,\qquad w_{ij} = \log\left(\frac{p_{ij}}{p_i p_j}\right),$

while the hidden-layer activation inside an HCU is a softmax over MCU units,

$a_j = \frac{\exp(s_j)}{\sum_m \exp(s_m)}.$
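The update, parameter, and activation equations above can be illustrated with a minimal pure-Python sketch. It is not taken from any of the cited implementations: the function names, the use of a single sample in place of the batch average, and the `eps` floor inside the logarithms are assumptions made for illustration.

```python
import math


def ema(old, new, lam):
    """Exponentially weighted update: (1 - lam) * old + lam * new."""
    return (1.0 - lam) * old + lam * new


def update_traces(c_i, c_j, c_ij, a_i, a_j, lam):
    """Update marginal and joint traces from one pre/post activity pair.

    A single sample stands in for the batch average <.> used in the text.
    """
    new_c_i = [ema(c, a, lam) for c, a in zip(c_i, a_i)]
    new_c_j = [ema(c, a, lam) for c, a in zip(c_j, a_j)]
    new_c_ij = [
        [ema(c_ij[i][j], a_i[i] * a_j[j], lam) for j in range(len(a_j))]
        for i in range(len(a_i))
    ]
    return new_c_i, new_c_j, new_c_ij


def bias_and_weights(p_i, p_j, p_ij, eps=1e-8):
    """b_j = log p_j and w_ij = log(p_ij / (p_i * p_j)).

    eps is an assumed floor that keeps the logarithms finite.
    """
    b = [math.log(max(p, eps)) for p in p_j]
    w = [
        [
            math.log(max(p_ij[i][j], eps) / (max(p_i[i], eps) * max(p_j[j], eps)))
            for j in range(len(p_j))
        ]
        for i in range(len(p_i))
    ]
    return b, w


def softmax(s):
    """Softmax over the support values of the MCUs in one HCU."""
    m = max(s)  # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [x / z for x in e]
```

When the joint probability equals the product of the marginals, the weight is zero, which reflects the interpretation of $w_{ij}$ as a log-ratio measuring departure from statistical independence.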

Structural plasticity is implemented through a dynamic masking mechanism in which each HCU maintains a fixed number of active incoming connections, connections are scored by mutual information, the weakest active connection is silenced, and the strongest silent connection is activated (Podobas et al., 2021, Hafiz et al., 23 Jun 2025).
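The masking step described above can be sketched as a single swap per HCU. This is a hedged illustration, not the published kernel: the function name `swap_connection` is hypothetical, the mutual-information scores are assumed to be precomputed, and the guard that swaps only when the silent candidate outscores the active one is an assumption that the source does not state.

```python
def swap_connection(mask, scores):
    """Return a new connection mask for one HCU after one swap step.

    mask[k] is True when incoming connection k is active.
    scores[k] is an assumed precomputed mutual-information score.
    """
    active = [k for k, on in enumerate(mask) if on]
    silent = [k for k, on in enumerate(mask) if not on]
    new_mask = list(mask)
    if not active or not silent:
        return new_mask  # nothing to swap

    weakest = min(active, key=lambda k: scores[k])
    strongest = max(silent, key=lambda k: scores[k])

    # Assumed guard: swap only if it improves the active set.
    if scores[strongest] > scores[weakest]:
        new_mask[weakest] = False
        new_mask[strongest] = True
    return new_mask
```

Because each swap silences one connection and activates one, the number of active incoming connections per HCU stays fixed, as the text requires.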

These equations have two opposing hardware consequences. On one hand, the StreamBrain study models training cost as dominated by batched matrix multiplies,

$T = \mathcal{O}(n_{cycles} \cdot N_B \cdot B_S \cdot N_H \cdot (N_F + N_O)),$

and argues that BCPNN’s dominant computations are dense batched matrix and outer-product operations: estimation of marginals and co-occurrences, batched activations, weight updates, and bias updates. On the other hand, the embedded FPGA study emphasizes that BCPNN is memory intensive, uses multiple state variables per synapse, requires irregular trace and weight updates, and often stresses off-chip memory bandwidth more than raw compute. A plausible implication is that BCPNN is simultaneously compute-regular enough for pipelining and memory-demanding enough that hierarchy design becomes decisive on embedded devices (Podobas et al., 2021, Hafiz et al., 23 Jun 2025).
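As a rough illustration of how the cost expression scales, the sketch below evaluates the product inside the $\mathcal{O}(\cdot)$. The reading of the symbols ($n_{cycles}$ as passes over the data, $N_B$ as batches, $B_S$ as batch size, $N_H$ as hidden units, $N_F$ as input features, $N_O$ as outputs) is an assumption, and the example sizes are hypothetical rather than taken from the paper.

```python
def relative_training_cost(n_cycles, n_batches, batch_size,
                           n_hidden, n_features, n_outputs):
    """Evaluate n_cycles * N_B * B_S * N_H * (N_F + N_O).

    The result is a unitless operation count, useful only for
    comparing configurations against each other.
    """
    return n_cycles * n_batches * batch_size * n_hidden * (n_features + n_outputs)


# Hypothetical configuration: 784 input features, 10 outputs.
base = relative_training_cost(1, 10, 100, 3000, 784, 10)
wider = relative_training_cost(1, 10, 100, 6000, 784, 10)
```

In this model the cost is linear in the hidden-layer size, so `wider` is exactly twice `base`.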

2. Evolution from HPC framework to embedded accelerator

The earliest system in this sequence is StreamBrain, a domain-specific language described as a Keras-like DSL/API for BCPNN. It is written in Python, with performance-critical kernels factored out into optimized backend implementations. The framework supports a CPU backend using OpenMP and vectorized NumPy/MKL, a GPU backend using CUDA and cuBLAS, an FPGA backend using OpenCL and HLS on Intel FPGA, and an MPI backend for distributed CPU training. Its user workflow follows a conventional high-level pattern with Network, add, fit, and evaluate, while its execution model supports both streaming and batched operation, though the paper focuses on batched execution (Podobas et al., 2021).

The FPGA backend in StreamBrain is a prototype offload engine targeting an Intel Stratix V DE5-Net board. Rather than hand-writing RTL, it uses the Intel OpenCL SDK for FPGA, High-Level Synthesis, FloPoCo-generated custom floating-point operators, and a custom OpenCL library exposing those operators. Its design philosophy is selective acceleration: only the most expensive BCPNN kernels are moved to the FPGA, and the architecture is kept resource-conscious and amenable to future stream-based use (Podobas et al., 2021).

The 2025 Alveo U55C work extends this direction into a reconfigurable stream-based FPGA accelerator implemented with Xilinx Vitis HLS. The kernel supports inference, supervised training, unsupervised training, structural plasticity, and combinations of the above. Its high-level organization preserves the three-population structure—input, hidden, and output—with input-to-hidden and hidden-to-output projection layers, and it explicitly shifts from a monolithic sequential kernel to a dataflow, stream-oriented pipeline (Hafiz et al., 3 Mar 2025).

The Zynq UltraScale+ SoC work then moves the design into an embedded setting. It describes the system as a heterogeneous SoC design with the ARM Processing System (PS) running Ubuntu and orchestrating execution, while the FPGA Programmable Logic (PL) contains the actual BCPNN kernels. The target platform is the ZCU104, and the work is presented as the first embedded FPGA accelerator for BCPNN on a Zynq UltraScale+ SoC using High-Level Synthesis. This progression from a server-side prototype offload engine, to a high-performance Alveo accelerator, to a PS–PL embedded deployment defines the current technical meaning of an embedded FPGA accelerator for BCPNN (Hafiz et al., 23 Jun 2025, Hafiz et al., 3 Mar 2025).

3. Dataflow architecture, memory hierarchy, and stream organization

A central architectural principle across the literature is that BCPNN is best mapped as a stream-oriented pipeline rather than as a sequence of host-mediated kernel launches. In StreamBrain, the FPGA backend accelerates the two main computational kernels, updateMarginals() and updateWeights(), and merges them into one FPGA kernel to reduce area/resource overhead, improve temporal locality, and share data and operators. The pipeline is described in four stages: an address generator that prefetches data from external DDR memory into local block RAM (BRAM), a custom matrix engine that performs the BLAS-3-style matrix-matrix operations, a network probability unit that finalizes probability and marginal updates, and a write-back stage that streams results back to external DDR memory. The memory organization is explicitly two-level: external DDR stores most data, while on-chip BRAM acts as a local working buffer (Podobas et al., 2021).
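The four-stage organization can be mimicked in software with chained generators, where each tile flows through every stage before the next tile is fetched. This is only a behavioral sketch under stated assumptions: Python lists stand in for DDR and BRAM, a per-unit batch mean stands in for the matrix engine, and all function names are hypothetical.

```python
def address_generator(traces, acts, tile):
    """Stage 1: fetch fixed-size tiles from 'DDR' (plain lists) into a local buffer."""
    for start in range(0, len(traces), tile):
        yield start, traces[start:start + tile], acts[start:start + tile]


def matrix_engine(stream):
    """Stage 2: stand-in for the matrix engine; reduces each unit's batch to a mean."""
    for start, c_tile, a_tile in stream:
        means = [sum(batch) / len(batch) for batch in a_tile]
        yield start, c_tile, means


def probability_unit(stream, lam):
    """Stage 3: finalize the trace update with an exponentially weighted average."""
    for start, c_tile, means in stream:
        yield start, [(1.0 - lam) * c + lam * m for c, m in zip(c_tile, means)]


def write_back(stream, ddr_out):
    """Stage 4: stream the updated tiles back to 'DDR'."""
    for start, tile in stream:
        ddr_out[start:start + len(tile)] = tile


def run_pipeline(traces, acts, tile=2, lam=0.5):
    """Chain the four stages; acts[k] holds the batch of activities for unit k."""
    out = [0.0] * len(traces)
    stream = address_generator(traces, acts, tile)
    stream = matrix_engine(stream)
    stream = probability_unit(stream, lam)
    write_back(stream, out)
    return out
```

The generators are lazy, so only one tile is resident in the "on-chip" stages at any time, which mirrors the two-level memory organization in a simplified way.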

The Alveo U55C design generalizes this into a stream-based accelerator using FIFO channels and Vitis HLS dataflow. The authors first built a sequential baseline and then replaced array-based movement with streams, partitioning data into fixed-size segments that flow through FIFO streams between pipeline stages. HLS dataflow allows each stage to begin as soon as partial input data is available, with carefully tuned FIFO depths to avoid deadlock. The paper reports that this optimization gave roughly a 70% performance improvement over the initial sequential version. It further introduces HBM spread-memory mapping for large arrays in the input-hidden projection: arrays were split into four segments, each segment mapped to a separate HBM channel, data was read via 512-bit burst reads, and the outputs were merged into packets of 64 floating-point values for the input-hidden projection and 16 floating-point values for the hidden-output projection. The paper describes this access strategy as yielding about a 64× latency reduction in the relevant access pattern (Hafiz et al., 3 Mar 2025).
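The packet sizes follow from the bus width: one 512-bit word carries 16 single-precision values, and four channels read in parallel give 64 values per merged packet. The sketch below models that arithmetic in plain Python. The ordering of channel words inside a packet and the function names are assumptions; the paper's exact interleaving is not given in the text above.

```python
FLOATS_PER_WORD = 512 // 32  # 16 single-precision values per 512-bit word
CHANNELS = 4                 # one array segment per HBM channel


def split_into_segments(array, channels=CHANNELS):
    """Split an array into equal contiguous segments, one per channel."""
    seg_len = len(array) // channels
    if seg_len * channels != len(array):
        raise ValueError("array length must be a multiple of the channel count")
    return [array[c * seg_len:(c + 1) * seg_len] for c in range(channels)]


def merged_packets(segments, word=FLOATS_PER_WORD):
    """Read one word from every segment per step and merge them into a packet.

    Assumed ordering: channel 0 word first, then channel 1, and so on.
    """
    seg_len = len(segments[0])
    for offset in range(0, seg_len, word):
        packet = []
        for seg in segments:
            packet.extend(seg[offset:offset + word])
        yield packet


segments = split_into_segments(list(range(256)))
packets = list(merged_packets(segments))
```

With 256 values, each of the four segments holds 64 values, and the merge produces four packets of 64 values each.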

The embedded Zynq design adapts the same principles to tighter area and bandwidth constraints. The host-side C++ program loads datasets from Micro SD into DDR memory, prepares FPGA buffers, and triggers kernel execution through AXI control. The accelerator pulls data from DDR using 256-bit AXI4 memory-mapped burst transfers, which can move either 8 single-precision floats or 16 half-precision values per cycle, converts those data into streams, and distributes them among subkernels through AXI-stream FIFO channels. Parallelism is introduced at the inter-kernel level and inside each kernel through loop unrolling, but the full online-learning kernel caps the unroll factor at 4 to stay within area and memory limits. Less time-critical variables such as inputs, labels, random values, and outputs are bundled onto a single AXI interface, while more performance-critical signals get separate bundles. For the input-to-hidden weight matrix, which must be accessed by multiple subkernels, the design uses multiple AXI bundles to enable burst fetching and parallel streams, avoiding deadlock and deep FIFO requirements (Hafiz et al., 23 Jun 2025).
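The values-per-transfer figures follow directly from the 256-bit width. As a small check, the sketch below uses Python's `struct` module to pack one 32-byte beat in each precision; the helper names are hypothetical and the little-endian layout is an assumption.

```python
import struct

BEAT_BYTES = 256 // 8  # one 256-bit transfer is 32 bytes


def values_per_beat(element_bits):
    """Number of elements that fit in one 256-bit transfer."""
    return 256 // element_bits


def pack_beat_fp32(values):
    """Pack exactly 8 single-precision values into one beat."""
    return struct.pack("<8f", *values)


def pack_beat_fp16(values):
    """Pack exactly 16 half-precision values into one beat ('e' is IEEE binary16)."""
    return struct.pack("<16e", *values)


def unpack_beat_fp16(beat):
    """Recover the 16 half-precision values from one beat."""
    return list(struct.unpack("<16e", beat))
```

Both packers return 32 bytes, so halving the element width doubles the number of values delivered per transfer without changing the bus.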

Taken together, these systems show a stable architectural pattern: host-side orchestration, burst-oriented off-chip memory access, on-chip buffering, FIFO-based streaming between subkernels, and selective parallelism constrained by BRAM and routability. This suggests that the phrase “embedded FPGA accelerator for BCPNN” refers less to a single kernel template than to a recurring memory-centric dataflow strategy (Podobas et al., 2021, Hafiz et al., 3 Mar 2025, Hafiz et al., 23 Jun 2025).

4. Online learning, inference specialization, and numerical precision

The full online-learning accelerator implements more than feedforward inference. In the embedded Zynq design, the full kernel supports synaptic trace updates, probability estimation from those traces, bias update, weight update, and handling of structural plasticity or sparse connectivity metadata where applicable. Because the full kernel stores multiple large synaptic traces required for plasticity updates, it requires more memory and more interface complexity than the inference-only kernel. The paper also notes the use of Vitis HLS unsafe-math optimizations to make floating-point arithmetic cheaper and faster in hardware, with minimal reported accuracy impact on the tested workloads (Hafiz et al., 23 Jun 2025).

The inference-only kernel is a specialized version of the full BCPNN accelerator. It removes all synaptic trace updates and bias/weight update operations, which are only needed during online learning. The embedded paper attributes four effects to this specialization: much lower resource use, more parallelism, lower memory pressure, and higher throughput. In particular, the inference-only kernel supports a parallelism factor of 8 in FP32 and 16 in FP16 or mixed precision, whereas the full kernel is limited to a factor of 4. The trained-parameter flow mirrors this separation: training produces a binary parameter file containing weights, biases, sparse indices, and constants, and that file is transferred once into FPGA buffer memory for deployment (Hafiz et al., 23 Jun 2025).

Precision handling is a major differentiator between FPGA and CPU or GPU implementations. StreamBrain emphasizes variable-precision arithmetic on FPGA using FloPoCo-generated custom operators for add/sub, multiply, division, and logarithm. It explores reduced-precision IEEE-754-like formats (BF28, BF24, BF20, BF16, BF15, and BF14), described as IEEE-754 single-precision variants with reduced mantissa. The paper reports operating frequencies of roughly 198–252 MHz across these formats, finds that reduced precision generally lowers resource usage, and observes that DSP utilization drops sharply from roughly 70% or more to about 10% for BF16 and below; the authors interpret this as smaller operators being synthesized into logic rather than DSP blocks. Because BCPNN requires logarithms and divisions in addition to multiply–accumulate operations, precision effects extend beyond conventional MAC units (Podobas et al., 2021).
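The effect of a reduced mantissa can be emulated in software by clearing low-order mantissa bits of a single-precision value. The sketch below assumes that a format named BFn has n bits in total, with 1 sign bit and the 8-bit single-precision exponent, and that excess bits are truncated rather than rounded; neither detail is confirmed by the text above, and the function name is hypothetical.

```python
import struct


def truncate_to_bits(x, total_bits):
    """Emulate an assumed BF<total_bits> format by truncating a float32 mantissa.

    Keeps the sign bit, the 8 exponent bits, and (total_bits - 9) mantissa bits.
    """
    mantissa_bits = total_bits - 9
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("total_bits must be between 9 and 32")
    drop = 23 - mantissa_bits
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    raw &= ~((1 << drop) - 1) & 0xFFFFFFFF  # clear the dropped low bits
    (y,) = struct.unpack("<f", struct.pack("<I", raw))
    return y
```

Under these assumptions BF16 keeps 7 mantissa bits, so `1 + 2**-7` survives truncation while `1 + 2**-8` collapses to 1.0; BF14 keeps only 5 bits, which is consistent with the heavier degradation reported for the narrowest formats.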

The embedded Zynq work studies FP32, FP16, and mixed precision. FP32 is the baseline and fetches 8 values per 256-bit burst. FP16 doubles the memory-side parallelism relative to FP32 by carrying 16 half-precision values per burst; for IEEE 754 half precision, the representable range is approximately $\pm 65504$, with a smallest positive normal value of $2^{-14} \approx 6.1 \times 10^{-5}$ (subnormals extend down to $2^{-24} \approx 6.0 \times 10^{-8}$). The mixed-precision configuration uses fixed-point Q3.12 format for data storage and FP16 for computationally sensitive accumulations; Q3.12 means 4 integer bits and 12 fractional bits, giving a range of approximately $\pm 8$ and a minimum increment of $2^{-12} \approx 2.44 \times 10^{-4}$. The paper concludes that FP16 generally gives the best overall balance, while mixed precision can reduce resources further but may cause noticeable accuracy loss on harder datasets (Hafiz et al., 23 Jun 2025).
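The Q3.12 storage format can be sketched as a 16-bit signed integer scaled by $2^{12}$. The code below is an illustration only: two's-complement encoding with the sign counted among the 4 integer bits, round-to-nearest conversion, and saturation on overflow are assumptions, and the helper names are hypothetical.

```python
FRAC_BITS = 12
SCALE = 1 << FRAC_BITS   # 4096 steps per unit
Q_MIN = -(1 << 15)       # most negative 16-bit code
Q_MAX = (1 << 15) - 1    # most positive 16-bit code


def to_q3_12(x):
    """Quantize a real value to a Q3.12 integer code, saturating at the limits."""
    q = round(x * SCALE)
    return max(Q_MIN, min(Q_MAX, q))


def from_q3_12(q):
    """Convert a Q3.12 integer code back to a float."""
    return q / SCALE
```

Under these assumptions the representable values run from -8 to just under 8 in steps of $2^{-12}$, so a value such as 100.0 saturates at the largest code rather than wrapping around.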

5. Performance on server-class and accelerator-card platforms

StreamBrain establishes the broader HPC baseline for BCPNN. Evaluations were run on a Cray XC40 CPU cluster, HPC2N CPU/GPU systems, an A100 GPU system, and an FPGA system with Intel DE5-Net Stratix V. For MNIST, the paper reports that training the full dataset with StreamBrain can take ~10 seconds on NVIDIA A100 or ~4 minutes on a server-class Xeon CPU, with GPU speedup over CPU ranging roughly 7.75× to 65×. Inference throughput is reported as 28k–87k images/s for streaming or single-image inference and up to 350k images/s for large batches on GPU. Average test accuracy across implementations is above 95%, the hybrid approach with a BCPNN hidden layer and SGD output layer reaches 97.5%, and the reported average in that hybrid setting is 97.77%. The same framework trained to about 95.5% MNIST accuracy faster than a comparable PyTorch MLP on A100, with 10.5 s for StreamBrain versus 33.94 ± 1.04 s for PyTorch. On STL-10, the paper states that it is the first demonstration of BCPNN on STL-10-sized networks; for A100, BCPNN training time is 178.2 ± 0.1 s with 34.8 ± 4.9% accuracy, compared with 100.2 ± 0.43 s and 42.2 ± 0.12% for a PyTorch MLP of similar size. With MPI+OpenMP on STL-10 and batch size 512, strong scaling capped around 2.7× on one system and reached 5.25× at 8 nodes on another, with comparable accuracy even at large batch size (Podobas et al., 2021).

The Alveo U55C work focuses on direct FPGA-versus-GPU comparisons under a stream-based architecture. It reports that the proposed accelerator is between 1.3× and 5.3× faster than an Nvidia A100 GPU while at the same time consuming between 2.62× and 3.19× less power and 5.8× and 16.5× less energy without any degradation in performance. On its MNIST model, inference latency is 1.495 ms on GPU and 0.280 ms on FPGA, training latency is 1.497 ms on GPU and 0.422 ms on FPGA, and structural plasticity latency is 1.520 ms on GPU and 0.508 ms on FPGA. For Pneumonia, the corresponding inference latencies are 1.633 ms and 0.504 ms; for Breast, 1.541 ms and 0.540 ms. The paper also reports power of 83.2 W for the GPU and 27.0 W for the FPGA on Model 1, 89.8 W and 28.1 W on Model 2, and 68.4 W and 26.1 W on Model 3 (Hafiz et al., 3 Mar 2025).
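The reported speedup, power, and energy ranges are linked by simple arithmetic: with roughly constant average power, energy is power times time, so the energy ratio is the product of the speedup and the power ratio. The sketch below checks this for MNIST inference, assuming that the Model 1 power figures correspond to the MNIST latencies; that pairing is an assumption based on the ordering in the text.

```python
def gpu_to_fpga_ratios(gpu_ms, fpga_ms, gpu_w, fpga_w):
    """Return (speedup, power ratio, energy ratio) of GPU relative to FPGA.

    Assumes energy = average power * latency on both platforms.
    """
    speedup = gpu_ms / fpga_ms
    power_ratio = gpu_w / fpga_w
    return speedup, power_ratio, speedup * power_ratio


# MNIST inference latencies with the Model 1 power figures (assumed pairing).
speedup, power_ratio, energy_ratio = gpu_to_fpga_ratios(1.495, 0.280, 83.2, 27.0)
```

This yields a speedup of about 5.3, a power ratio of about 3.1, and an energy ratio of about 16.5, which agrees with the upper ends of the ranges quoted above.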

Resource and frequency behavior on the U55C underscore the cost of supporting learning and structural plasticity. For Model 1, inference uses 174,400 LUTs, 257,462 FFs, 550 DSPs, 327.5 BRAMs, and runs at 200 MHz; training uses 454,024 LUTs, 546,419 FFs, 3,573 DSPs, 437.5 BRAMs, and runs at 150 MHz; the structural-plasticity version uses 475,074 LUTs, 574,657 FFs, 3,765 DSPs, 473.5 BRAMs, and runs at 147.3 MHz. The paper’s roofline-style analysis reports an estimated compute capability of 288.77 GFLOP/s and HBM peak bandwidth of 460 GB/s, but also states that none of the models reaches the theoretical peak because of resource usage below the assumed 80% ideal, algorithmic dependencies, and FIFO, BRAM, and routing constraints (Hafiz et al., 3 Mar 2025).
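The two peak figures define a standard roofline, in which attainable throughput is the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The sketch below applies that general model to the reported numbers; the arithmetic-intensity values used as inputs are hypothetical, since the per-model intensities are not given in the text above.

```python
PEAK_GFLOPS = 288.77  # estimated compute capability reported for the U55C design
PEAK_GBPS = 460.0     # HBM peak bandwidth


def attainable_gflops(intensity_flop_per_byte):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(PEAK_GFLOPS, PEAK_GBPS * intensity_flop_per_byte)


# Intensity at which the design moves from memory-bound to compute-bound.
ridge_point = PEAK_GFLOPS / PEAK_GBPS
```

The ridge point is about 0.63 FLOP per byte, so a kernel performing less work than that per byte moved is bandwidth-limited under this model regardless of how many DSPs are available.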

6. Embedded SoC results, limitations, and research directions

The embedded Zynq study evaluates the accelerator on the AMD Xilinx ZCU104 board with an XCZU7EV device under Ubuntu 22.04. The host program is C++/OpenCL, the FPGA kernel is written in C++ and synthesized with Vitis 2023.2 HLS, and the design uses 256-bit AXI bursts. Clock frequencies are tuned to avoid timing violations: 120 MHz for FP32 and 130 MHz for FP16 or mixed precision. Latency is measured with gettimeofday() around full kernel execution, including memory transfers, power is measured with the onboard INA226 sensor at 10 ms intervals, and accuracy is measured on held-out test data (Hafiz et al., 23 Jun 2025).
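The measurement procedure can be approximated by wrapping a call with a wall-clock timer and integrating fixed-interval power samples. This sketch is not the authors' harness: `time.perf_counter()` stands in for `gettimeofday()`, the rectangle-rule integration is an assumption, and the function names are hypothetical.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) using a wall-clock timer."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def energy_joules(power_samples_w, interval_s=0.010):
    """Integrate power samples taken every interval_s seconds (rectangle rule)."""
    return sum(power_samples_w) * interval_s


def mean_power_w(power_samples_w):
    """Average power over the sampled window."""
    return sum(power_samples_w) / len(power_samples_w)
```

For example, ten samples of 5 W at 10 ms spacing integrate to 0.5 J over a 100 ms window.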

On MNIST, the full online-learning kernel on FPGA achieves 55.4 ms training latency versus 145.5 ms on ARM, 17.8 ms inference latency versus 36.7 ms on ARM, and 94.3% test accuracy versus 94.6% on ARM. For inference-only MNIST, the FPGA achieves 3.4 ms versus 37.8 ms on ARM, corresponding to 11.12× speedup, with 94.6% accuracy. For inference-only Pneumonia, the reported figures are 18.8 ms on FPGA versus 309.1 ms on ARM, 16.45× speedup, and 86.2% accuracy on both platforms. For inference-only Breast Cancer, the figures are 30.3 ms on FPGA versus 531.9 ms on ARM, 17.56× speedup, and 84.0% accuracy on both platforms. The paper reports energy savings for inference-only kernels relative to ARM of 91.1% board energy and 89.0% execution energy on MNIST, 93.7% board and 87.7% execution on Pneumonia, and 94.1% board and 87.8% execution on Breast Cancer. For the full MNIST kernel, the savings are 60.9% board energy and 4.4% execution energy (Hafiz et al., 23 Jun 2025).
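The speedup figures can be recomputed from the reported latencies as the ratio of ARM to FPGA time. The short sketch below does so for the inference-only kernels; small differences from the published values are expected because the latencies quoted above are already rounded.

```python
# (ARM latency in ms, FPGA latency in ms) for the inference-only kernels.
INFERENCE_LATENCIES_MS = {
    "MNIST": (37.8, 3.4),
    "Pneumonia": (309.1, 18.8),
    "Breast Cancer": (531.9, 30.3),
}


def speedup(arm_ms, fpga_ms):
    """Speedup of the FPGA kernel relative to the ARM baseline."""
    return arm_ms / fpga_ms


recomputed = {
    name: round(speedup(arm, fpga), 2)
    for name, (arm, fpga) in INFERENCE_LATENCIES_MS.items()
}

# Full online-learning kernel on MNIST, for comparison.
full_training_speedup = speedup(145.5, 55.4)
full_inference_speedup = speedup(36.7, 17.8)
```

The recomputed inference-only ratios land within about 0.01 of the reported 11.12×, 16.45×, and 17.56×, while the full online-learning kernel gains roughly 2.6× in training and 2.1× in inference, which shows how much of the acceleration comes from removing the learning state.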

Resource utilization on the embedded device sharply distinguishes online learning from inference. For the full MNIST kernel, the design uses approximately 49.40% LUT, 39.88% FF, 31.37% DSP, and 76.28% BRAM. The inference-only kernels are substantially lighter: MNIST inference uses 18.00% LUT, 11.19% FF, 13.43% DSP, and 25.16% BRAM; Pneumonia inference uses 18.53% LUT, 11.56% FF, 13.48% DSP, and 39.58% BRAM; Breast inference uses 18.45% LUT, 11.20% FF, 13.48% DSP, and 51.44% BRAM. The same paper studies scaling on Pneumonia by varying HCU, MCU, and connectivity sparsity, reporting that reducing HCU from 30 to 10 cuts latency substantially, up to about 66%, reducing MCU also helps though less dramatically, and increasing sparsity can reduce energy, but aggressive sparsification harms accuracy sharply. Importantly, FPGA resource usage stays nearly constant across these scaling experiments because the stream-based architecture fixes the parallelization width and FIFO structure (Hafiz et al., 23 Jun 2025).

Several limitations recur across the literature. StreamBrain accelerates only the two heaviest kernels rather than the whole BCPNN pipeline, uses a Stratix V board from 2010, evaluates only batched mode despite a streaming-oriented architecture, and finds that BF16 is usable with limited loss whereas BF15 and BF14 degrade heavily. The Alveo U55C study identifies host-device transfer overhead, structural plasticity overhead on smaller datasets, BRAM limits for larger inputs, and the fact that not all configurations hit peak roofline performance. The embedded Zynq study shows that the full online-learning kernel is BRAM-heavy because of synaptic traces and multiple AXI interfaces, and that mixed precision can cause noticeable accuracy loss on Pneumonia and Breast Cancer. The explicitly suggested future directions include extending FPGA backends toward streaming inputs from devices like cameras, exploring other low-precision number systems such as Posits, building toward ASIC acceleration, and leveraging future BF16-capable hardware (Podobas et al., 2021, Hafiz et al., 3 Mar 2025, Hafiz et al., 23 Jun 2025).

The combined record supports a precise interpretation of the field. The embedded FPGA accelerator for BCPNN is not merely a port of a brain-like model onto configurable logic; it is a family of designs in which local Bayesian-Hebbian learning, sparse modular structure, and online trace maintenance are reorganized into stream-oriented kernels, burst-based memory access, and precision-tunable arithmetic. This suggests that the central research problem is not whether BCPNN can run on FPGA, but how much of its learning dynamics can be retained while respecting the BRAM, bandwidth, and power constraints of embedded systems (Hafiz et al., 23 Jun 2025).