
Floorline Performance Model in Neuromorphic Accelerators

Updated 4 December 2025
  • Floorline Performance Model is a framework that characterizes and bounds neuromorphic accelerator performance based on per-core maximum synaptic, compute, and traffic loads.
  • The model employs a two-stage optimization methodology using sparsity-aware training and floorline-informed partitioning to pinpoint bottlenecks and enhance performance.
  • Empirical validations on platforms like Loihi 2 demonstrate significant speedup and energy improvements by transitioning workloads among memory-bound, compute-bound, and traffic-bound states.

A floorline performance model is a theoretical and empirical framework for characterizing, bounding, and optimizing execution time and energy efficiency in neuromorphic accelerators. In contrast to conventional roofline models, which primarily address memory and compute ceilings, the floorline model exposes three competing bottleneck states—memory-bound, compute-bound, and traffic-bound—and leverages per-core maximum statistics for synaptic, compute, and message loads to predict actual performance. The model enables precise allocation of optimization effort and informs a two-stage methodology for workload restructuring on real neuromorphic hardware (Yik et al., 26 Nov 2025).

1. Theoretical Basis and Bottleneck States

Neuromorphic accelerators, composed of arrays of neurocores, execute ML inference using event-driven dataflows and spatially expanded architectures that co-locate memory and computation. In each timestep, every core carries out three main operations:

  • Synaptic operations (synops): Retrieving weights from local memory and performing accumulations.
  • Neuron activation computations (act-comp): Applying activation nonlinearities and updating internal neuronal state.
  • NoC message traffic (msg): Emitting sparse activation spikes over the on-chip network.

Because all relevant state (weights, neuron activations, messages) resides on-chip, the cost of each operation may be comparable, and any of the three may dominate step time. This yields three distinct bottleneck states:

  • M1. Memory-bound (synops-bound): The slowest core dictates the timestep duration due to its peak synop load. Time scales with the maximum per-core synaptic operations.
  • M2. Compute-bound: When architectural choices or sparsity reduce synop load, the bottleneck can shift to neuron activation computations.
  • M3. Traffic-bound: Design choices (e.g., high core utilization or layer partitioning) can provoke NoC congestion; synchronization waits for the most spike-active core.

A critical observation is that per-core maximum (not global aggregate) synop, act-comp, or spike loads control which bottleneck state is active.

2. Mathematical Formulation of Performance Bounds

Let each neurocore $c$ perform $S_c$ synops, $C_c$ activation computations, and $M_c$ outgoing spike messages per timestep. Define the maximum intensities:

$$S_{\max} = \max_c S_c, \qquad C_{\max} = \max_c C_c, \qquad M_{\max} = \max_c M_c$$

Architectural peak rates are:

  • $B_{\rm mem}$: peak synop bandwidth (synops/sec)
  • $F_{\rm peak}$: peak neuron-compute rate (activations/sec)
  • $B_{\rm noc}$: peak NoC bandwidth (spikes/sec)

Lower bounds on timestep time:

  • Memory-bound: $T_{\rm mem} = \dfrac{S_{\max}}{B_{\rm mem}}$
  • Compute-bound: $T_{\rm comp} = \dfrac{C_{\max}}{F_{\rm peak}}$
  • Traffic-bound: $T_{\rm noc} = \dfrac{M_{\max}}{B_{\rm noc}}$

Actual performance is:

$$T_{\rm step} = \max(T_{\rm mem},\, T_{\rm comp},\, T_{\rm noc})$$

Due to barrier synchronization, the highest of these three sets the realized timestep duration.
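The bounds above translate directly into code. A minimal sketch, assuming per-core load lists and peak rates are given (all names here are illustrative, not from the paper's software):

```python
# Floorline lower bounds for one timestep, driven by per-core maxima.

def floorline_step_time(synops, acts, msgs, b_mem, f_peak, b_noc):
    """Return (T_mem, T_comp, T_noc, T_step) for one timestep.

    synops, acts, msgs: per-core loads S_c, C_c, M_c for the timestep.
    b_mem, f_peak, b_noc: peak synop, compute, and NoC rates.
    """
    t_mem = max(synops) / b_mem    # S_max / B_mem
    t_comp = max(acts) / f_peak    # C_max / F_peak
    t_noc = max(msgs) / b_noc      # M_max / B_noc
    # Barrier synchronization: the slowest bound sets the realized step time.
    return t_mem, t_comp, t_noc, max(t_mem, t_comp, t_noc)
```

Note that the per-core maxima, not the sums, enter each bound, matching the observation in Section 1.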

3. Floorline Visualization and Interpretation

The floorline plot visualizes per-core synop intensity $S_{\max}$ on the x-axis versus measured or predicted timestep time $T$ on the y-axis, typically on log–log axes. The main analytic boundaries are:

  • Memory-bound slope: $T = S_{\max}/B_{\rm mem}$ (linear in $S_{\max}$; slope +1 in log–log, i.e., 45°)
  • Compute floor: as $S_{\max} \to 0$, $T$ plateaus at $T = C_{\max}/F_{\rm peak}$
  • Traffic ceiling: workloads above both boundaries are set by $T_{\rm noc}$

Each workload-adapted network, once mapped and partitioned, yields a point $(S_{\max},\, T_{\rm step})$. The point's position relative to the slope and floor indicates which bottleneck is active and which mapping, partitioning, or sparsity transformation is likely to yield performance improvement.
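Reading a point off the floorline plot amounts to comparing the three lower bounds. A hedged sketch of that classification (labels and function name are illustrative):

```python
# Classify which floorline regime a mapped workload occupies, given its
# three lower bounds T_mem, T_comp, T_noc (same units for all three).

def bottleneck_state(t_mem, t_comp, t_noc):
    """Return the active bottleneck: the regime whose bound is largest."""
    bounds = {"memory-bound": t_mem, "compute-bound": t_comp, "traffic-bound": t_noc}
    return max(bounds, key=bounds.get)
```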

4. Analytical and Empirical Model Validation

Analytical modeling predicts scaling trends for synops, activation computes, and message loads as functions of activation sparsity $m$, weight sparsity $w$, network width $N$, and number of cores $C$:

  • Synops: $O(m w N^2 / C)$
  • Activation computes: $O(N / C)$
  • Messages (for a downstream layer with $C'$ cores): $O(m N C')$
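These proportionalities can be expressed as a small helper. Constant factors are hardware-dependent and omitted, so this sketch captures scaling only:

```python
# Analytic per-core load scaling (proportionality only; prefactors omitted).

def per_core_loads(m, w, n, cores, cores_next):
    """Scaling of per-core loads with activation sparsity m, weight sparsity w,
    layer width n, core count 'cores', and downstream core count 'cores_next'."""
    synops = m * w * n**2 / cores   # O(m w N^2 / C)
    acts = n / cores                # O(N / C)
    msgs = m * n * cores_next       # O(m N C')
    return synops, acts, msgs
```

The form makes the levers visible: sparsity attacks synops and messages, while partitioning attacks synops and activation computes but inflates downstream message fan-out.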

Microbenchmarks on three neuromorphic platforms—Brainchip AKD1000 (80 cores), Synsense Speck (9 cores), and Intel Loihi 2 (120 cores)—were used for quantitative parameter calibration and validation. Benchmarks sweep network sparsity, partitioning, and core mapping:

  • Workloads trace the analytic memory-bound line $T \propto S_{\max}$ until $S_{\max}$ is reduced to the point where the compute floor $C_{\max}/F_{\rm peak}$ dominates.
  • Aggressive partitioning further lowers $C_{\max}$ and the compute floor, but increases power.
  • High core utilization may push workloads above the slope–floor envelope into the traffic-bound regime; strided placement heuristics (on Loihi 2) restore memory-bound scaling.
  • These empirical findings yielded precise parameters for $B_{\rm mem}$, $F_{\rm peak}$, and $B_{\rm noc}$, confirming the model's fidelity.
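One plausible form such calibration could take: on the memory-bound slope, $T \approx S_{\max}/B_{\rm mem}$, so $B_{\rm mem}$ can be fit as a least-squares slope through the origin. This is a sketch under that assumption; the paper's actual fitting procedure is not reproduced here:

```python
# Calibrate a peak rate B from sweep points on a T = S/B line.
# Minimizing sum((T_i - S_i/B)^2) over 1/B gives B = sum(S^2) / sum(S*T).

def fit_bandwidth(s_max_samples, t_samples):
    """Least-squares fit of B in T = S/B through the origin."""
    num = sum(s * s for s in s_max_samples)
    den = sum(s * t for s, t in zip(s_max_samples, t_samples))
    return num / den
```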

5. Floorline-Guided Two-Stage Optimization Methodology

Optimization follows a two-stage workflow:

Stage 1: Sparsity-Aware Training

  • Apply activation and synop-count regularizers targeting spike-based sparsity (e.g., transformed $\ell_1$ on AKD1000 and PilotNet, synop-count penalties on Speck).
  • Employ one-shot pruning with fine-tuning for Loihi 2.
  • Adjust per-layer sparsity schedules to balance core loads, preventing per-core bottleneck conditions.
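For illustration, a transformed $\ell_1$ penalty of the common form $(a+1)|x|/(a+|x|)$ could be summed over activations. This is a sketch of that regularizer family, not the exact per-platform loss used in the paper:

```python
# Transformed-l1 sparsity penalty (illustrative form, not the paper's exact loss).

def transformed_l1(activations, a=1.0):
    """Sum the transformed-l1 penalty (a+1)|x| / (a + |x|) over activations.

    The shape parameter a interpolates between an l1-like norm (large a)
    and an l0-like nonzero count (small a).
    """
    return sum((a + 1.0) * abs(x) / (a + abs(x)) for x in activations)
```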

Stage 2: Floorline-Informed Partitioning & Mapping

  • For workloads on the memory-bound slope, partition the core with peak $S_c$; for workloads on the compute floor, partition the core with peak $C_c$.
  • For traffic-bound workloads, remap using “strided” core assignment to reduce $M_{\max}$.
  • Retain only transformations that lower $T_{\rm step}$, with explicit backtracking to avoid unnecessary power escalation.
  • Stop once the workload lies on the floorline envelope for the given network and hardware.
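The retain-or-backtrack loop above can be sketched generically, assuming candidate transformations and a step-time evaluator are supplied (both are illustrative stand-ins for the hardware-specific steps):

```python
# Greedy Stage-2 loop: apply candidate transformations, keep only those that
# lower T_step (implicitly backtracking otherwise), stop when none helps.

def optimize(workload, candidates, step_time, max_iters=20):
    """Return (workload, T_step) after greedy floorline-guided refinement."""
    best_t = step_time(workload)
    for _ in range(max_iters):
        improved = False
        for transform in candidates:
            trial = transform(workload)
            t = step_time(trial)
            if t < best_t:          # retain only if T_step drops
                workload, best_t = trial, t
                improved = True
        if not improved:            # on the floorline envelope: stop
            break
    return workload, best_t
```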

6. Case Study: PilotNet Optimization on Loihi 2

Applying the methodology to the PilotNet CNN on Loihi 2 yields concrete improvements:

  • Hardware limits: $B_{\rm mem} = 200\times10^6$ synops/s, $F_{\rm peak} = 50\times10^6$ activations/s, $B_{\rm noc} = 100\times10^6$ spikes/s.
  • Initial trained network: $S_{\max}^{(0)} = 1\times10^6$, $C_{\max}^{(0)} = 0.5\times10^6$, $M_{\max}^{(0)} = 0.4\times10^6$.
  • Computed step times:
    • $T_{\rm mem}^{(0)} = 5$ ms
    • $T_{\rm comp}^{(0)} = 10$ ms
    • $T_{\rm noc}^{(0)} = 4$ ms
    • $T_{\rm step}^{(0)} = 10$ ms (compute-bound)
  • Partitioning onto more cores reduces $T_{\rm step}$ stepwise: the workload moves from compute-bound to traffic-bound, then remapping returns it to the compute floor, achieving $T_{\rm step}^{(3)} \approx 2.5$ ms, a $4\times$ speedup, with energy per step improved roughly $3\times$.
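The initial-state arithmetic in this case study can be checked directly (values taken from the text above):

```python
# Case-study numbers: hardware peak rates and initial per-core maxima.
b_mem, f_peak, b_noc = 200e6, 50e6, 100e6
s_max, c_max, m_max = 1e6, 0.5e6, 0.4e6

t_mem = s_max / b_mem                # 1e6 / 200e6  = 5 ms
t_comp = c_max / f_peak              # 0.5e6 / 50e6 = 10 ms
t_noc = m_max / b_noc                # 0.4e6 / 100e6 = 4 ms
t_step = max(t_mem, t_comp, t_noc)   # 10 ms: the network starts compute-bound
```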

These results demonstrate stepwise optimization along the floorline axes, moving from compute-bound to traffic-bound and finally to an improved compute-bound state via partitioning and core remapping.

7. Significance and Generalization

The floorline performance model extends roofline analysis principles to account for neuromorphic architectures featuring tightly interleaved on-chip memory, computation, and message traffic. By positioning workloads in $(S_{\max}, T)$ space, practitioners can immediately identify bottlenecks and allocate optimization effort (sparsity, partitioning, remapping) to approach theoretical performance bounds. The two-stage optimization regime yields multi-fold improvements in runtime and energy efficiency for diverse workloads across current neuromorphic architectures and is extensible to future chip designs (Yik et al., 26 Nov 2025).
