SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

Published 13 May 2026 in q-bio.NC and cs.LG | (2605.12992v1)

Abstract: Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation $r$ between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large-scale benchmark for causal, autoregressive spike-count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non-diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing-statistics constraints (region $\Delta R^2 = 0.018$ above the firing-statistics covariates). It also exposes a sub-Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL-on-output-rates distillation for ANN-to-SNN transfer in this Poisson count domain.


Authors (9)

Summary

The paper introduces a benchmark for one-step-ahead neural spike-count forecasting using strict autoregressive and causal constraints.
It decomposes performance across population rate, spatial patterns, and cosine similarity to reveal region-specific predictability and biophysical noise floors.
Deep sequence models show significant gains over linear baselines, though challenges remain with sub-Poisson neurons and knowledge distillation for SNNs.

SpikeProphecy: Establishing the Benchmark for Autoregressive Neural Population Forecasting

Introduction

The paper "SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting" (2605.12992) addresses a fundamental gap in computational neuroscience and neuroengineering: the lack of standardized, large-scale, and reproducible benchmarks for forecasting future neural population activity from past dynamics in high-density electrophysiological datasets. While neural population models have seen rapid advances via architectures derived from language and time-series modeling, evaluation practices remain limited—often relying on a single scalar metric that obscures both population-level and feature-specific modeling challenges. This work establishes the first benchmark and evaluation protocol specifically for autoregressive spike-count forecasting, leveraging two large Neuropixels datasets to systematically compare a range of modern sequence models and introduce a decomposed suite of evaluation metrics tailored to neural population activity.

Benchmark Construction and Evaluation Protocol

SpikeProphecy operationalizes the task as one-step-ahead forecasting of binned spike counts for all simultaneously recorded neurons in a session, using recent spike history as the only input and enforcing both strict autoregressivity and causality. The primary datasets derive from Steinmetz et al. 2019 and the International Brain Laboratory (IBL) Repeated Site initiative, together comprising roughly 89,800 neurons across 105 recording sessions that span multiple brain regions, labs, and experimental setups.
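To make the causal, one-step-ahead setup concrete, here is a minimal sketch of how (history, target) pairs could be cut from a binned count matrix. The function name `make_causal_pairs`, the `context_len` parameter, and the synthetic Poisson data are assumptions for illustration; the benchmark's actual bin width, context length, and data loader are not given in this summary.

```python
import numpy as np

def make_causal_pairs(counts, context_len):
    """Cut a (T, N) count matrix into causal training pairs.

    X[i] holds bins i .. i+context_len-1 and y[i] holds bin i+context_len,
    so every target lies strictly after the history used to predict it.
    """
    counts = np.asarray(counts)
    n_bins = counts.shape[0]
    if n_bins <= context_len:
        raise ValueError("need more bins than the context length")
    n_pairs = n_bins - context_len
    # Row i of idx selects the context_len bins that precede target i.
    idx = np.arange(n_pairs)[:, None] + np.arange(context_len)[None, :]
    X = counts[idx]              # shape (n_pairs, context_len, N)
    y = counts[context_len:]     # shape (n_pairs, N)
    return X, y

# Synthetic stand-in for one session: 50 time bins, 4 neurons.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 4))
X, y = make_causal_pairs(counts, context_len=10)
print(X.shape, y.shape)
```

Because each target bin is excluded from its own history window, a model trained on these pairs cannot see the bin it is asked to predict.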

The evaluation protocol forms the core methodological advance. Performance is decomposed into three axes:

Population Rate $r$ ($r_\mathrm{pop}$): Captures fidelity in modeling the overall temporal envelope of population activity.
Spatial Pattern $r$ ($r_\mathrm{spatial}$): Quantifies cross-neuron pattern fidelity at each timepoint.
Cosine Similarity: Assesses magnitude-invariant alignment of predicted and true responses.
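The three axes can be sketched in a few lines of NumPy. The definitions below are one plausible reading of the descriptions above, not the paper's exact formulas: $r_\mathrm{pop}$ is taken as the Pearson correlation between the population-averaged predicted and true traces, $r_\mathrm{spatial}$ as the mean over timepoints of the across-neuron Pearson correlation, and cosine similarity as the mean per-timepoint cosine between predicted and true population vectors. The helper names and the synthetic data are assumptions.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two 1-D arrays; NaN if either is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else float("nan")

def decompose(pred, true):
    """pred, true: (T, N) arrays of predicted and observed counts."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    # Temporal axis: correlate the population-averaged traces over time.
    r_pop = pearson(pred.mean(axis=1), true.mean(axis=1))
    # Spatial axis: correlate across neurons within each bin, then average.
    r_spatial = float(np.nanmean([pearson(p, t) for p, t in zip(pred, true)]))
    # Magnitude-invariant axis: per-bin cosine of the population vectors.
    dots = (pred * true).sum(axis=1)
    norms = np.linalg.norm(pred, axis=1) * np.linalg.norm(true, axis=1)
    ok = norms > 0
    cosine = float(np.mean(dots[ok] / norms[ok]))
    return {"r_pop": r_pop, "r_spatial": r_spatial, "cosine": cosine}

# Synthetic data: a shared temporal envelope times neuron-specific rates.
rng = np.random.default_rng(0)
envelope = 1.0 + 0.5 * np.sin(np.linspace(0, 6 * np.pi, 200))
base_rates = np.linspace(1.0, 10.0, 20)
true = rng.poisson(envelope[:, None] * base_rates[None, :])

scaled = decompose(0.5 * true, true)        # right pattern, wrong magnitude
swapped = decompose(true[:, ::-1], true)    # right envelope, wrong neurons
print(scaled)
print(swapped)
```

In this toy case, a prediction that assigns activity to the wrong neurons keeps a perfect population-rate score while its spatial score collapses, which is the kind of structure a single aggregate correlation would hide.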

This decomposition is shown empirically to uncover structure, such as region-specific predictability hierarchies and biophysical noise floors, that is masked by the aggregate Pearson correlation frequently used in prior work.

Figure 1: Overview of the SpikeProphecy benchmark—showing per-architecture results, parameter/accuracy Pareto front, metric decompositions, per-neuron score spread, and session-level rate traces.

Model Suite and Baseline Analysis

Seven primary architectures are compared using this protocol, all with harmonized optimization and training schedules:

Diagonal SSMs: Mamba, HGRN2, LRU.
Non-diagonal SSM: GatedDeltaNet.
Transformer (causal attention).
LSTM (classical baseline).
RSynaptic SNN (event-driven, neuromorphic).

Comprehensive linear controls (autoregressive GLM, population GLM with ridge regularization) calibrate the difficulty of the task and sensitivity to information leakage or overfitting.
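As a rough illustration of what a ridge-regularized population control involves, the sketch below fits a linear map from lagged population counts to the next bin and evaluates it on a chronologically later split. This is a simplified linear-Gaussian stand-in, not a Poisson GLM with a log link, and the function names, lag count, regularization strengths, and synthetic data are all assumptions rather than the paper's settings.

```python
import numpy as np

def lagged_design(counts, n_lags):
    """Stack the previous n_lags bins of all neurons into one feature row."""
    n_bins = counts.shape[0]
    rows = [counts[t - n_lags:t].ravel() for t in range(n_lags, n_bins)]
    return np.asarray(rows, dtype=float), counts[n_lags:].astype(float)

def fit_ridge(X, Y, alpha):
    """Closed-form ridge: W = (Xc^T Xc + alpha I)^-1 Xc^T Yc, plus intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b

# Synthetic session: 300 bins, 5 neurons, 3 lags of history.
rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(300, 5))
X, Y = lagged_design(counts, n_lags=3)

# Chronological split so validation bins come strictly after training bins.
X_tr, Y_tr, X_va, Y_va = X[:200], Y[:200], X[200:], Y[200:]

fits = {}
for alpha in (1e-3, 1e3):
    W, b = fit_ridge(X_tr, Y_tr, alpha)
    val_mse = float(np.mean((X_va @ W + b - Y_va) ** 2))
    fits[alpha] = (W, b, val_mse)
    print(f"alpha={alpha:g}  ||W||={np.linalg.norm(W):.4f}  val MSE={val_mse:.4f}")
```

The feature dimension grows as lags times neurons, so with hundreds of neurons the design matrix quickly becomes wider than the number of training bins, which is the regime where the regularization strength matters most.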

Key Empirical Findings

Decomposition Surfaces Brain Region Predictability Hierarchy

Applying the decomposed metrics across eight Allen CCF functional brain regions reveals a significant, reproducible hierarchy in forecastability. An ANCOVA controlling for firing statistics (log rate and Fano factor) finds that region assignment explains a small but highly significant increment in explained variance ($\Delta R^2 = 0.018$ above covariates, total $R^2 = 0.275$, $p < 10^{-77}$). This ranking is stable across all architectures: motor cortex and midbrain exhibit the most predictable short-timescale activity, while hippocampal and limbic regions are substantially harder to forecast within the provided context window.
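The $\Delta R^2$ reported above is the kind of quantity obtained by comparing two nested regressions: one on the firing-statistics covariates alone and one that adds region indicators. The sketch below shows that comparison on synthetic data. The function names, effect sizes, and simulated neurons are assumptions for illustration and do not reproduce the paper's analysis or numbers.

```python
import numpy as np

def r_squared(design, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), design])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    centered = y - y.mean()
    return float(1.0 - (resid @ resid) / (centered @ centered))

def region_delta_r2(y, covariates, region_ids):
    """Increment in R^2 from adding region dummies to the covariate model."""
    regions = np.unique(region_ids)
    # One indicator column per region, dropping the first as the reference.
    dummies = (region_ids[:, None] == regions[None, 1:]).astype(float)
    r2_cov = r_squared(covariates, y)
    r2_full = r_squared(np.column_stack([covariates, dummies]), y)
    return r2_cov, r2_full, r2_full - r2_cov

# Synthetic per-neuron scores with a small built-in region effect.
rng = np.random.default_rng(0)
n = 2000
log_rate = rng.normal(1.0, 0.5, n)
fano = rng.gamma(4.0, 0.3, n)
region = rng.integers(0, 8, n)
region_effect = np.linspace(-0.1, 0.1, 8)
y = 0.2 * log_rate - 0.1 * fano + region_effect[region] + rng.normal(0, 0.3, n)

r2_cov, r2_full, delta = region_delta_r2(
    y, np.column_stack([log_rate, fano]), region
)
print(f"covariates only: {r2_cov:.3f}  with region: {r2_full:.3f}  delta: {delta:.3f}")
```

If the reported total $R^2 = 0.275$ refers to the full model, the covariates alone would account for about $0.275 - 0.018 = 0.257$, so most of the explained variance comes from firing statistics and region adds a small increment on top.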

Figure 2: ANCOVA-adjusted per-neuron $r$ across eight functional brain regions, highlighting reproducible hierarchy and sub-Poisson floor.

Sub-Poisson Evaluation Floor and Metric Limitation

A substantial fraction (28%) of recorded neurons are sub-Poisson ($\mathrm{FF}<1$), exhibiting highly regular, oscillator-like firing. These neurons define a hard floor for model performance: their mean $r$ values are […], independent of architecture. The floor reflects both irreducible biophysical variability and the harshness of Pearson correlation in low-variance regimes, which motivates Fano-stratified metric reporting.
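A minimal sketch of Fano-stratified reporting follows, assuming the Fano factor is computed per neuron as the variance of binned counts divided by their mean and that neurons are split at $\mathrm{FF} = 1$. The function names are hypothetical, and the per-neuron scores in the example are placeholders chosen for illustration, not results from the paper.

```python
import numpy as np

def fano_factor(counts):
    """Per-neuron variance/mean of a (T, N) count matrix; NaN if silent."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    safe_mean = np.where(mean > 0, mean, 1.0)
    return np.where(mean > 0, var / safe_mean, np.nan)

def group_mean(values, mask):
    """Mean of the selected values, or NaN when the group is empty."""
    return float(values[mask].mean()) if mask.any() else float("nan")

def stratify_by_fano(per_neuron_score, ff, threshold=1.0):
    """Report a per-neuron score separately for sub-Poisson neurons."""
    sub = ff < threshold          # NaN Fano factors fall in neither group
    rest = ff >= threshold
    return {
        "frac_sub_poisson": float(np.mean(sub)),
        "sub_poisson_mean": group_mean(per_neuron_score, sub),
        "other_mean": group_mean(per_neuron_score, rest),
    }

# Two synthetic neurons: one regular (binomial), one bursty (negative binomial).
rng = np.random.default_rng(0)
n_bins = 5000
regular = rng.binomial(10, 0.5, n_bins)            # theoretical FF = 0.5
bursty = rng.negative_binomial(2, 0.2, n_bins)     # theoretical FF = 5
counts = np.column_stack([regular, bursty])
ff = fano_factor(counts)

scores = np.array([0.05, 0.40])   # placeholder scores, not results from the paper
report = stratify_by_fano(scores, ff)
print(ff, report)
```

Reporting the two groups separately keeps a low score on regular-firing neurons from being read as a model failure when it partly reflects how little variance there is to correlate against.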

Linear versus Deep Modeling Regimes

Linear baselines (GLMs) either fail to generalize because of within-session nonstationarity (autoregressive GLM, […]) or overfit catastrophically in high-dimensional input regimes (population GLM, […] on validation unless aggressively regularized). Deep sequence models (Mamba, HGRN2, GatedDeltaNet, Transformer, LRU), in contrast, form a statistically indistinguishable cluster ([…]–[…]) and show […]–[…] performance gains over linear models on valid splits, with the SNN and LSTM trailing consistently.

Negative Result for Output KL Distillation

Contrary to established practice in ANN-to-SNN knowledge transfer for classification tasks, output KL-divergence-based distillation does not improve SNN forecasting performance. Standalone SNNs are maximally efficient at smaller depths; soft-label teacher rates introduce no “dark knowledge” benefit in real-valued Poisson regression, and deeper network architectures degrade sharply.
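To show what a KL term on output rates looks like in a Poisson count setting, the sketch below uses the closed-form divergence between two Poisson distributions, $\mathrm{KL}(\mathrm{Pois}(\lambda_t)\,\|\,\mathrm{Pois}(\lambda_s)) = \lambda_t \log(\lambda_t/\lambda_s) - \lambda_t + \lambda_s$, mixed with a standard Poisson negative log-likelihood on the observed counts. The exact loss used in the paper is not given in this summary, so the mixing `weight`, the `eps` stabilizer, and the function names are assumptions.

```python
import numpy as np

def poisson_nll(rate, counts, eps=1e-8):
    """Poisson negative log-likelihood, dropping the log(k!) constant."""
    return float(np.mean(rate - counts * np.log(rate + eps)))

def poisson_kl(teacher_rate, student_rate, eps=1e-8):
    """Mean KL(Pois(teacher) || Pois(student)) over all rate entries."""
    t = teacher_rate + eps
    s = student_rate + eps
    return float(np.mean(t * np.log(t / s) - t + s))

def distill_loss(student_rate, counts, teacher_rate, weight=0.5):
    """Blend the data term with a pull toward the teacher's rates."""
    data_term = poisson_nll(student_rate, counts)
    teacher_term = poisson_kl(teacher_rate, student_rate)
    return (1.0 - weight) * data_term + weight * teacher_term

# Toy check with hand-picked rates (not model outputs).
teacher = np.full((4, 3), 2.0)
student = np.full((4, 3), 1.0)
counts = np.full((4, 3), 2.0)

kl_same = poisson_kl(teacher, teacher)
kl_diff = poisson_kl(teacher, student)
loss = distill_loss(student, counts, teacher)
print(kl_same, kl_diff, loss)
```

Unlike a softmax over classes, a Poisson rate is a single number per neuron and bin, so the teacher's output carries no distribution over alternatives beyond that rate. This is consistent with the text's observation that soft-label rates bring no "dark knowledge" benefit here.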

Ceiling and Rollout Behavior

Empirical oracle ceiling analysis indicates that current models capture 74% of the achievable signal, with the remainder attributable to irreducible noise. Longer rollouts of predictions reveal rapid degradation in per-neuron $r$ for all ANN architectures, while SNNs decline more slowly at long horizons, consistent with architectural differences in error propagation.
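The rollout setting can be sketched as a loop that feeds each prediction back into the input window, so errors made at early horizons are carried into later ones. The `rollout` function and the toy `mean_step` model below are hypothetical stand-ins; any trained one-step forecaster with the same window-in, vector-out shape could take the place of `mean_step`.

```python
import numpy as np

def rollout(step_fn, history, horizon):
    """Autoregressive rollout from a (L, N) history window.

    step_fn maps a (L, N) window to an (N,) prediction for the next bin.
    Each prediction replaces the oldest bin in the window, so later steps
    condition on the model's own outputs rather than on observed counts.
    """
    window = np.array(history, dtype=float)
    preds = []
    for _ in range(horizon):
        nxt = step_fn(window)
        preds.append(nxt)
        window = np.vstack([window[1:], nxt])   # slide the window forward
    return np.stack(preds)                      # shape (horizon, N)

def mean_step(window):
    """Toy one-step model: predict the average of the current window."""
    return window.mean(axis=0)

history = np.array([[0.0, 0.0],
                    [2.0, 4.0]])
preds = rollout(mean_step, history, horizon=3)
print(preds)
# [[1.   2.  ]
#  [1.5  3.  ]
#  [1.25 2.5 ]]
```

Scoring `preds` against held-out counts separately at each horizon would give the kind of degradation curve described above.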

Figure 3: Degradation of autoregressive rollout across all baselines, highlighting per-neuron instability at longer horizons and increased SNN robustness.

Figure 4: Oracle ceiling analysis demonstrating sub-Poisson neuron floor, neuronal $r$ versus achievable ceiling, and model efficiency distribution.

Practical and Theoretical Implications

The protocol demonstrates that effective deployment of neural forecasting models—e.g., in closed-loop BCIs—requires moving beyond aggregate metrics. Application domains differ in sensitivity to population-level versus neuron-specific dynamics; decomposed metrics enable appropriate model selection for use cases such as brain-state classification, BCI gating, or targeted stimulation. The substantial gap between the predictability of motor versus non-motor regions highlights critical challenges for modeling complex, internally generated activity (e.g., hippocampus) at physiologically realistic timescales. Observations of distinctly nonlinear gains in deep models, the persistence of biophysical evaluation floors, and the ineffectiveness of standard ANN-to-SNN distillation methods collectively chart a nuanced path for future method development—including latent-state modeling, dispersion-aware loss formulations (e.g., Conway–Maxwell–Poisson), and architecture design tailored to the spatiotemporal structure of neural data.

Release and Reproducibility

All code, processed data (with links to underlying raw repositories), and trained model checkpoints have been released for reproducibility, including a pip-installable evaluation toolkit and a comprehensive leakage-audit suite. This facilitates both robust cross-laboratory comparison and extension to new modeling approaches, including foundation-model pretraining for strictly causal forecasting.

Conclusion

SpikeProphecy defines a rigorous, accessible, and auditable standard for autoregressive neural population forecasting on large-scale, real electrophysiological data. Its decomposed metrics and stratified reporting capture performance axes critically relevant for both scientific understanding and neurotechnology deployment, substantively advancing the benchmark-driven development paradigm in computational neuroscience and AI modeling of biological populations. The released dataset and toolkit lower entry barriers and establish a reproducibility baseline, signaling a practical shift in evaluation methodologies for large-scale neural sequence modeling.
