Floorline Performance Model in Neuromorphic Accelerators
- Floorline Performance Model is a framework that characterizes and bounds neuromorphic accelerator performance based on per-core maximum synaptic, compute, and traffic loads.
- The model informs a two-stage optimization methodology, combining sparsity-aware training with floorline-informed partitioning, that pinpoints bottlenecks and improves performance.
- Empirical validations on platforms like Loihi 2 demonstrate significant speedup and energy improvements by transitioning workloads among memory-bound, compute-bound, and traffic-bound states.
A floorline performance model is a theoretical and empirical framework for characterizing, bounding, and optimizing execution time and energy efficiency in neuromorphic accelerators. In contrast to conventional roofline models, which primarily address memory and compute ceilings, the floorline model exposes three competing bottleneck states—memory-bound, compute-bound, and traffic-bound—and leverages per-core maximum statistics for synaptic, compute, and message loads to predict actual performance. The model enables targeted allocation of optimization effort and informs a two-stage methodology for workload restructuring on real neuromorphic hardware (Yik et al., 26 Nov 2025).
1. Theoretical Basis and Bottleneck States
Neuromorphic accelerators, composed of arrays of neurocores, execute ML inference using event-driven dataflows and spatially expanded architectures that co-locate memory and computation. In each timestep, every core carries out three main operations:
- Synaptic operations (synops): Retrieving weights from local memory and performing accumulations.
- Neuron activation computations (act-comp): Applying activation nonlinearities and updating internal neuronal state.
- NoC message traffic (msg): Emitting sparse activation spikes over the on-chip network.
Because all relevant state (weights, neuron activations, messages) resides on-chip, the cost of each operation may be comparable, and any of the three may dominate step time. This yields three distinct bottleneck states:
- M1. Memory-bound (synops-bound): The slowest core dictates the timestep duration due to its peak synop load. Time scales with the maximum per-core synaptic operations.
- M2. Compute-bound: When architectural choices or sparsity reduce synop load, the bottleneck can shift to neuron activation computations.
- M3. Traffic-bound: Design choices (e.g., high core utilization or layer partitioning) can provoke NoC congestion; synchronization waits for the most spike-active core.
A critical observation is that per-core maximum (not global aggregate) synop, act-comp, or spike loads control which bottleneck state is active.
2. Mathematical Formulation of Performance Bounds
Let core $i$ perform $s_i$ synops, $a_i$ activation computations, and $m_i$ outgoing spike messages per timestep. Define the per-core maximum intensities $S_{\max} = \max_i s_i$, $A_{\max} = \max_i a_i$, and $M_{\max} = \max_i m_i$.
Architectural peak rates are:
- $R_{\mathrm{syn}}$: peak synop bandwidth (synops/sec)
- $R_{\mathrm{act}}$: peak neuron-compute rate (activations/sec)
- $R_{\mathrm{msg}}$: peak NoC bandwidth (spikes/sec)
Lower bounds on timestep time:
- Memory-bound: $T_{\mathrm{mem}} = S_{\max} / R_{\mathrm{syn}}$
- Compute-bound: $T_{\mathrm{comp}} = A_{\max} / R_{\mathrm{act}}$
- Traffic-bound: $T_{\mathrm{traffic}} = M_{\max} / R_{\mathrm{msg}}$
Actual performance is bounded by
$$T_{\mathrm{step}} \geq \max\left(T_{\mathrm{mem}},\, T_{\mathrm{comp}},\, T_{\mathrm{traffic}}\right).$$
Due to barrier synchronization, the largest of these three bounds sets the realized timestep duration.
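These bounds are straightforward to operationalize. The following minimal Python sketch, with assumed data structures and names (none of this is the paper's tooling), computes the three bounds from per-core load counters and reports the active state:

```python
# Minimal sketch (assumed names, not the paper's tooling): derive the
# three floorline bounds from per-core load counters per timestep.
from dataclasses import dataclass

@dataclass
class CoreLoad:
    synops: int     # synaptic operations this timestep
    act_comps: int  # neuron activation computations this timestep
    msgs: int       # outgoing spike messages this timestep

def bottleneck(loads, r_syn, r_act, r_msg):
    """Per-core *maxima* (not aggregates) decide the state: barrier
    synchronization makes every core wait for the slowest one."""
    bounds = {
        "memory-bound":  max(c.synops for c in loads) / r_syn,
        "compute-bound": max(c.act_comps for c in loads) / r_act,
        "traffic-bound": max(c.msgs for c in loads) / r_msg,
    }
    state = max(bounds, key=bounds.get)
    return state, bounds[state]  # active state, lower bound on T_step

# e.g.: bottleneck([CoreLoad(900, 40, 3), CoreLoad(200, 50, 7)], 1e9, 1e8, 1e7)
# -> ("memory-bound", 9e-07)
```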
3. Floorline Visualization and Interpretation
The floorline plot visualizes per-core synop intensity on the x-axis versus measured or predicted timestep time on the y-axis, typically as a log–log plot. The main analytic boundaries are:
- Memory-bound slope: $T = S_{\max}/R_{\mathrm{syn}}$ (linear in $S_{\max}$; slope $+1$ in log–log, i.e., 45°)
- Compute floor: as $S_{\max} \to 0$, $T$ plateaus at $A_{\max}/R_{\mathrm{act}}$
- Traffic ceiling: workloads above both boundaries are set by $T = M_{\max}/R_{\mathrm{msg}}$
Each workload-adapted network, once mapped and partitioned, yields a point $(S_{\max}, T_{\mathrm{step}})$. The point's position relative to the slope and floor indicates precisely which bottleneck is active and which mapping, partitioning, or sparsity transformation is likely to yield a performance improvement.
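For illustration, the envelope can be drawn directly from the two analytic boundaries. The sketch below uses matplotlib with made-up rates and loads; every numeric value is a placeholder, not a measured figure:

```python
# Illustrative floorline plot: slope-+1 memory-bound line plus flat
# compute floor on log-log axes. All numbers are invented placeholders.
import numpy as np
import matplotlib.pyplot as plt

r_syn, r_act = 1e9, 1e8                   # assumed peak rates
a_max = 1e4                               # assumed per-core peak activation load

s = np.logspace(2, 7, 200)                # per-core max synop intensity (x-axis)
t_mem = s / r_syn                         # memory-bound line, slope +1
t_floor = np.full_like(s, a_max / r_act)  # compute floor, flat

plt.loglog(s, t_mem, label=r"memory-bound: $S_{max}/R_{syn}$")
plt.loglog(s, t_floor, "--", label=r"compute floor: $A_{max}/R_{act}$")
plt.loglog(s, np.maximum(t_mem, t_floor), lw=2, label="floorline envelope")
plt.xlabel(r"per-core max synops per timestep, $S_{max}$")
plt.ylabel("timestep time (s)")
plt.legend()
plt.show()
```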
4. Analytical and Empirical Model Validation
Analytical modeling predicts scaling trends for synops, activation computes, and message loads as functions of activation sparsity $s_a$, weight sparsity $s_w$, network width $N$, and number of cores $C$ (a code sketch of these trends follows this list). For a fully connected layer of width $N$ split evenly across $C$ cores:
- Synops: $S_{\max} \propto (1 - s_a)(1 - s_w)\, N^2 / C$
- Activation computes: $A_{\max} \propto N / C$
- Messages (for a downstream layer mapped to $C'$ cores): $M_{\max} \propto (1 - s_a)\,(N/C)\, C'$
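A compact way to encode these trends is per-layer load accounting. The sketch below assumes an evenly partitioned fully connected layer; the function and its constants are illustrative, not taken from the paper:

```python
# Hedged sketch of the analytic scaling trends for one fully connected
# layer of width N split evenly across C cores. Constants are assumed.
def per_core_loads(N, C, C_next, act_sparsity, wgt_sparsity):
    """Approximate per-core max loads (S_max, A_max, M_max).

    Each core holds N/C neurons with N inputs each; only non-zero
    activations and non-zero weights trigger synops, and each emitted
    spike is delivered to the C_next cores of the downstream layer.
    """
    dense_a = 1.0 - act_sparsity
    dense_w = 1.0 - wgt_sparsity
    s_max = dense_a * dense_w * N * N / C   # synops ~ (1-s_a)(1-s_w) N^2 / C
    a_max = N / C                           # one state update per neuron
    m_max = dense_a * (N / C) * C_next      # msgs ~ (1-s_a)(N/C) C'
    return s_max, a_max, m_max

# e.g.:
print(per_core_loads(N=1024, C=16, C_next=8, act_sparsity=0.9, wgt_sparsity=0.5))
```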
Microbenchmarks on three neuromorphic platforms—Brainchip AKD1000 (80 cores), Synsense Speck (9 cores), and Intel Loihi 2 (120 cores)—were used for quantitative parameter calibration and validation. Benchmarks sweep network sparsity, partitioning, and core mapping:
- Workloads trace the analytic memory-bound line until $S_{\max}$ is reduced to the point where the compute floor dominates.
- Aggressive partitioning further lowers $S_{\max}$ and the compute floor $A_{\max}/R_{\mathrm{act}}$, but increases power.
- High core utilization can push workloads above the slope–floor envelope into the traffic-bound regime; strided placement heuristics (on Loihi 2) restore memory-bound scaling.
- These empirical findings yielded calibrated values for $R_{\mathrm{syn}}$, $R_{\mathrm{act}}$, and $R_{\mathrm{msg}}$, confirming the model's fidelity (a calibration sketch follows this list).
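Calibration then reduces to fitting one peak rate per regime from sweeps that isolate a single bound. A minimal sketch, with hypothetical measurements:

```python
# Sketch of peak-rate calibration: fit R from T = load / R using an
# origin-forced least-squares slope. Measurements here are hypothetical.
import numpy as np

def fit_peak_rate(loads, step_times):
    """Minimize ||T - slope * load||^2 with zero intercept; R = 1/slope."""
    loads = np.asarray(loads, dtype=float)
    step_times = np.asarray(step_times, dtype=float)
    slope = (loads @ step_times) / (loads @ loads)
    return 1.0 / slope

# e.g. sweep synop load while act-comp and message loads stay negligible:
r_syn = fit_peak_rate([1e4, 1e5, 1e6], [1.2e-5, 1.1e-4, 1.05e-3])
print(f"calibrated R_syn ~ {r_syn:.3g} synops/s")
```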
5. Floorline-Guided Two-Stage Optimization Methodology
Optimization follows a two-stage workflow:
Stage 1: Sparsity-Aware Training
- Apply activation and synop-count regularizers targeting spike-based sparsity (e.g., Transformed $\ell_1$ penalties for AKD1000 and PilotNet, synop-count penalties on Speck); a minimal penalty sketch follows this list.
- Employ one-shot pruning with fine-tuning for Loihi 2.
- Adjust per-layer sparsity schedules to balance core loads, preventing per-core bottleneck conditions.
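A minimal sketch of such penalties, assuming a PyTorch training loop; the exact regularizer forms and coefficients used on each platform are not reproduced here:

```python
# Hedged Stage-1 sketch (assumed forms, not the paper's exact losses):
# penalties that push activation sparsity up and expected synops down.
import torch

def activation_l1(acts, lam=1e-4):
    """L1 penalty on layer activations; fewer spikes lowers S_max and M_max."""
    return lam * sum(a.abs().mean() for a in acts)

def synop_count_penalty(acts, fan_outs, lam=1e-6):
    """Weight each layer's activity by its fan-out to approximate synops."""
    return lam * sum(f * a.abs().mean() for a, f in zip(acts, fan_outs))

# usage inside a training step:
#   loss = task_loss + activation_l1(layer_acts) \
#        + synop_count_penalty(layer_acts, fan_outs)
```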
Stage 2: Floorline-Informed Partitioning & Mapping
- For workloads on the memory-bound slope, partition the core with peak synop load $s_i$; for workloads on the compute floor, partition the core with peak activation load $a_i$.
- For traffic-bound workloads, remap using a “strided” core assignment to relieve NoC congestion and reduce the impact of $M_{\max}$.
- Retain only transformations that lower $T_{\mathrm{step}}$, with explicit backtracking to avoid unnecessary power escalation.
- Stop once the workload lies on the floorline envelope for the given network and hardware (a loop sketch follows this list).
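The stage amounts to a greedy loop with backtracking. In the sketch below, `measure`, `split_peak_core`, and `strided_remap` are assumed helper callables, not the paper's API:

```python
# Hedged sketch of the Stage-2 loop: repeatedly transform the core that
# sets the active bound, keeping only changes that lower T_step.
def floorline_optimize(mapping, measure, split_peak_core, strided_remap,
                       max_iters=20):
    state, t_step = measure(mapping)             # -> (bottleneck state, time)
    for _ in range(max_iters):
        if state == "memory-bound":
            candidate = split_peak_core(mapping, metric="synops")
        elif state == "compute-bound":
            candidate = split_peak_core(mapping, metric="act_comps")
        else:                                    # traffic-bound
            candidate = strided_remap(mapping)   # relieve NoC congestion
        new_state, new_t = measure(candidate)
        if new_t >= t_step:                      # backtrack: no step-time win
            break                                # workload is on the envelope
        mapping, state, t_step = candidate, new_state, new_t
    return mapping, t_step
```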
6. Case Study: PilotNet Optimization on Loihi 2
Applying the methodology to the PilotNet CNN on Loihi 2 yields concrete improvements:
- Hardware limits: calibrated peak rates $R_{\mathrm{syn}}$ (synops/s), $R_{\mathrm{act}}$ (activations/s), and $R_{\mathrm{msg}}$ (spikes/s) from the Loihi 2 microbenchmarks.
- Initial trained network: per-core maxima $S_{\max}$, $A_{\max}$, and $M_{\max}$ measured after Stage 1 sparsification.
- Computed step-time bounds:
  - $T_{\mathrm{mem}} = S_{\max}/R_{\mathrm{syn}}$
  - $T_{\mathrm{comp}} = A_{\max}/R_{\mathrm{act}}$
  - $T_{\mathrm{traffic}} = M_{\max}/R_{\mathrm{msg}}$
  - $T_{\mathrm{step}} = \max(T_{\mathrm{mem}}, T_{\mathrm{comp}}, T_{\mathrm{traffic}}) = T_{\mathrm{comp}}$ (compute-bound)
- Partitioning across more cores lowers $A_{\max}$ and shifts the workload from compute-bound to traffic-bound; strided remapping then returns it to the compute floor, achieving a 4× speedup in step time and roughly a 3× improvement in energy per step.
These results demonstrate stepwise optimization along the floorline axes, moving from compute-bound to traffic-bound and back to an improved compute-bound operating point via partitioning and core remapping.
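Because the measured Loihi 2 figures are platform-specific, the walk-through below uses invented numbers purely to illustrate this trajectory; none of the values are the paper's:

```python
# Hypothetical walk-through (all numbers invented for illustration; they
# are not the paper's measured PilotNet values).
r_syn, r_act, r_msg = 1e9, 1e8, 1e7     # assumed peak rates

def t_step(s_max, a_max, m_max):
    bounds = {"memory": s_max / r_syn, "compute": a_max / r_act,
              "traffic": m_max / r_msg}
    state = max(bounds, key=bounds.get)
    return state, bounds[state]

print(t_step(5e5, 2e5, 1e3))  # ('compute', 0.002)   initial: compute-bound
print(t_step(1e5, 4e4, 8e3))  # ('traffic', 0.0008)  over-partitioned: traffic-bound
print(t_step(1e5, 5e4, 2e3))  # ('compute', 0.0005)  strided remap: 4x faster overall
```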
7. Significance and Generalization
The floorline performance model extends roofline analysis principles to account for neuromorphic architectures featuring tightly interleaved on-chip memory, computation, and message traffic. By positioning workloads in $(S_{\max}, T_{\mathrm{step}})$ space, practitioners can immediately identify bottlenecks and allocate optimization effort (sparsity, partitioning, remapping) to approach theoretical performance bounds. The two-stage optimization regime yields multi-fold improvements in runtime and energy efficiency for diverse workloads across current neuromorphic architectures and is extensible to future chip designs (Yik et al., 26 Nov 2025).