Fully Spiking Variational Autoencoders

Updated 18 June 2026

FSVAE and ESVAE are spiking neural network-based variational autoencoders that use binary spike trains and LIF neurons for efficient, event-driven generative modeling.
FSVAE employs an autoregressive Bernoulli latent process while ESVAE utilizes a reparameterizable Poisson rate model to improve image quality and reduce parameters.
Both models achieve energy-efficient, low-latency inference on neuromorphic hardware, demonstrating robust performance on benchmarks like MNIST, CIFAR10, and CelebA.

Fully Spiking Variational Autoencoders (FSVAE) and Efficient Spiking Variational Autoencoders (ESVAE) implement the variational autoencoder (VAE) paradigm entirely within the domain of spiking neural networks (SNNs). These architectures address the challenge of generative modeling on event-driven, neuromorphic-compatible hardware, achieving high-fidelity image generation with extremely low power consumption and latency. By replacing conventional continuous latent variables and floating-point sampling with spiking-based methods, FSVAE and ESVAE enable VAE inference and learning to proceed end-to-end using only binary spike trains, LIF dynamics, and hardware-feasible sampling procedures (Zhan et al., 2023, Kamata et al., 2021).

1. Architectural Foundations

Both FSVAE and ESVAE construct all VAE components (encoder, latent sampler for posterior and prior, and decoder) using standard SNN layers and Leaky Integrate-and-Fire (LIF) neurons. Image data are temporally encoded as spike trains across $T$ time steps, e.g., $T=16$.

Encoder: Implements a deep spiking convolutional stack. Both models use four convolutional layers for MNIST, Fashion-MNIST, and CIFAR10 and five for CelebA, each followed by temporal-domain batch normalization (tdBN) and LIF neurons. The output is one spike vector per time step: $x_t^E \in \{0,1\}^C$, with $C=128$.
Decoder: Mirrors the encoder architecture but uses transposed convolutional layers. It produces binary spike-train outputs $\hat{x}_{1:T}$, which are converted to a continuous-valued image by integrating the output spikes on a membrane potential.
Temporal Input Encoding: Each input pixel is spike-coded across all $T$ time steps in proportion to its intensity; this converts analog images to temporally distributed binary events suitable for SNN processing.
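
The encoding and LIF dynamics described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Bernoulli rate coding in `encode_pixels`, the fully connected layer shape, the threshold `v_th`, the leak factor `decay`, and the hard reset are all assumptions chosen for illustration.

```python
import random

def encode_pixels(pixels, T, rng):
    """Rate-code intensities in [0, 1] as T binary spike frames.

    Assumption: each pixel fires independently at every step with
    probability equal to its intensity.
    """
    return [[1 if rng.random() < p else 0 for p in pixels] for _ in range(T)]

def lif_layer(spike_frames, weights, v_th=1.0, decay=0.5):
    """Run one fully connected LIF layer over a spike train.

    weights[j][i] connects input i to output neuron j. v_th and decay
    are illustrative values, not taken from the papers.
    """
    n_out = len(weights)
    v = [0.0] * n_out                       # membrane potentials
    out_frames = []
    for frame in spike_frames:
        out = []
        for j in range(n_out):
            # Inputs are binary, so the synaptic current is the sum of
            # the weights whose presynaptic neuron fired.
            current = sum(w for w, s in zip(weights[j], frame) if s)
            v[j] = decay * v[j] + current   # leak, then integrate
            fired = 1 if v[j] >= v_th else 0
            if fired:
                v[j] = 0.0                  # hard reset after a spike
            out.append(fired)
        out_frames.append(out)
    return out_frames

rng = random.Random(0)
frames = encode_pixels([0.0, 0.5, 1.0], T=16, rng=rng)
weights = [[0.6, 0.6, 0.6], [0.0, 0.0, 1.2]]
spikes = lif_layer(frames, weights)
# Firing rate of each output neuron over the T steps.
rates = [sum(col) / len(spikes) for col in zip(*spikes)]
```

Here a pixel of intensity 0 never fires and a pixel of intensity 1 fires at every step, so the second output neuron, driven only by the brightest pixel with a weight above threshold, spikes at every step.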

Model	Encoder-Decoder Topology	Input Encoding Method
FSVAE	Stacked SNN convolution, FC, SNN, transposed conv	Direct pixel-to-spike coding
ESVAE	Conv-32C3 → ... → FC-128 → (sampling) → decoder	Spike-repetition; collapse for rates

2. Latent Variable Construction

FSVAE: Autoregressive Bernoulli Process

FSVAE overcomes the incompatibility between continuous Gaussian latents (as in ANNs) and binary SNN activity by using a learned, autoregressive Bernoulli process in the latent space (Kamata et al., 2021):

Latent Posterior & Prior: Both are realized as three-layer fully connected SNNs, operating over $T$ timesteps, autoregressively sampling a vector $z_{t} \in \{0,1\}^{C}$ per timestep.
Random-Select Sampling: For each latent block, $k$ parallel channels are generated and one is selected at random, which yields a sample from Bernoulli($\pi_t$) for each dimension.
MMD Divergence: Matching is performed on the sequence of Bernoulli rates using a kernel defined over the postsynaptic potential evolution.
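
A minimal sketch of the random-select step follows. The flat layout of the `k * C` candidate spikes and the function name `random_select` are assumptions made for illustration; the sketch only shows why picking one of `k` binary channels uniformly at random gives a Bernoulli sample whose probability is the fraction of those channels that fired.

```python
import random

def random_select(channel_spikes, k, rng):
    """Pick one of k candidate spikes for each latent dimension.

    channel_spikes is a flat list of length k * C holding binary spikes;
    entries [c*k : (c+1)*k] are the k candidates for dimension c
    (an assumed layout).
    """
    assert len(channel_spikes) % k == 0
    C = len(channel_spikes) // k
    z = []
    for c in range(C):
        block = channel_spikes[c * k:(c + 1) * k]
        z.append(block[rng.randrange(k)])   # uniform choice among the k
    return z

rng = random.Random(0)
# k = 4 candidates for each of C = 2 dimensions: 3 of 4 fire in the
# first block, none fire in the second.
candidates = [1, 1, 1, 0,   0, 0, 0, 0]
samples = [random_select(candidates, 4, rng) for _ in range(4000)]
freq0 = sum(s[0] for s in samples) / len(samples)   # close to 3/4
```

Because the selection needs only a random index and a multiplexer, no floating-point sampling is involved.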

ESVAE: Reparameterizable Poisson Rate Model

ESVAE introduces a Poisson-rate–based latent representation with direct rate parameterization (Zhan et al., 2023):

Latent Poisson Posterior: Each neuron's spike count over the $T$ time steps yields a firing rate $r_q = \frac{1}{T}\sum_{t=1}^{T} x_t^E$; the latent posterior is modeled as a Poisson spiking process with rate $r_q$.
Latent Poisson Prior: Generated via a bottleneck ANN that outputs a prior rate vector $r_p$, so the prior is a Poisson spiking process with rate $r_p$.
Reparameterizable Sampling: To generate a spike train $z_{1:T}$ from a rate $r$, sample $u_t \sim \mathcal{U}(0,1)$ at each time step and set $z_t = 1$ if $u_t < r$ (otherwise $z_t = 0$). A surrogate gradient (straight-through estimator) lets gradients pass through the sampling step during training.
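
The forward pass of this sampling scheme can be sketched as follows, assuming the uniform-noise thresholding named in the text. The helper names `firing_rates` and `sample_spike_train` are hypothetical, and the surrogate gradient is omitted because it requires an autograd framework.

```python
import random

def firing_rates(spike_frames):
    """Per-neuron firing rate: spike count divided by the number of steps."""
    T = len(spike_frames)
    return [sum(col) / T for col in zip(*spike_frames)]

def sample_spike_train(rates, T, rng):
    """Draw a binary spike train with the given per-neuron rates.

    At each step a neuron fires when fresh uniform noise falls below
    its rate, so the randomness is separate from the rate parameter.
    """
    return [[1 if rng.random() < r else 0 for r in rates] for _ in range(T)]

rng = random.Random(0)
target = [0.0, 0.25, 1.0]
z = sample_spike_train(target, T=4000, rng=rng)
estimated = firing_rates(z)   # close to the target rates
```

Keeping the noise outside the rate parameter is what makes the step reparameterizable: in training, the comparison would be paired with a surrogate gradient with respect to the rate.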

Model	Latent Variable Family	Sampling Mechanism
FSVAE	Autoregressive Bernoulli	Random-select from SNN channels
ESVAE	Inhomogeneous Poisson	Uniform noise with STE surrogate

3. Training Objectives and Optimization

Both FSVAE and ESVAE maximize the evidence lower bound (ELBO), tailored for spike trains:

ELBO Formulation:

$$\log p(x) \ge \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right)$$

Standard terms are adapted:
- Reconstruction: Mean-squared error (MSE) between the input $x$ and the reconstruction $\hat{x}$ is used for spike-train outputs.
- Divergence: The usual KL divergence is replaced by a maximum mean discrepancy (MMD) loss, reflecting the structure and compatibility of spike-based or rate-based latents:

$$\mathrm{MMD}^2(q, p) = \mathbb{E}_{z, z' \sim q}\left[k(z, z')\right] + \mathbb{E}_{z, z' \sim p}\left[k(z, z')\right] - 2\,\mathbb{E}_{z \sim q,\, z' \sim p}\left[k(z, z')\right]$$

In FSVAE, $k$ is a postsynaptic potential-based kernel over time; in ESVAE, a radial basis function (RBF) kernel is used over rates.
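
As an illustration of the rate-based variant, the sketch below computes a biased (V-statistic) estimate of the squared MMD between two sets of firing-rate vectors with an RBF kernel. The bandwidth `sigma`, the function names, and the toy rate vectors are assumptions; the papers' exact estimator and kernel parameters are not reproduced here.

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """RBF kernel between two rate vectors (sigma is an assumed bandwidth)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(xs, ys, kernel=rbf_kernel):
    """Biased estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(A, B):
        return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy firing-rate vectors standing in for posterior and prior samples.
posterior_rates = [[0.1, 0.2], [0.2, 0.1]]
prior_rates = [[0.8, 0.9], [0.9, 0.8]]

same = mmd2(posterior_rates, posterior_rates)   # identical samples
diff = mmd2(posterior_rates, prior_rates)       # mismatched samples
```

The estimate is zero when the two samples coincide and grows as the rate distributions move apart, which is the quantity the divergence term penalizes.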

Optimizer: Both employ AdamW. ESVAE uses an elevated learning rate for the bottleneck to optimize the prior rate mapping.

4. Experimental Evaluation

Both models are benchmarked on MNIST, Fashion-MNIST, CIFAR10, and CelebA, with images resized to a dataset-dependent resolution. Metrics include reconstruction loss (MSE), Inception Score, Fréchet Inception Distance (FID), and Fréchet Autoencoder Distance (FAD):

FSVAE reports improved or matched scores against ANN-VAEs and baseline SNN-VAEs: lower MSE and FID across datasets, with a notable reduction in the number of multiplications during inference relative to the ANN-VAE (Kamata et al., 2021).
ESVAE achieves state-of-the-art SNN-based image generation, e.g., on CIFAR10 it lowers the MSE from 0.066 to 0.045 (roughly a one-third reduction) and reduces FID from 175.5 to 127.0 compared to FSVAE (Zhan et al., 2023).
Robustness: ESVAE is temporally stable. Shuffling spikes along the time axis leads to a negligible increase in reconstruction loss, whereas FSVAE exhibits higher degradation. Perturbing the latent spikes with noise yields only a minor performance loss in ESVAE.
Distributional Consistency: High overlap between training and sampled latent firing-rate histograms confirms effective alignment with MMD regularization.
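
One way to read the temporal-shuffle result is that any representation depending only on firing rates is unchanged when time steps are permuted. The sketch below illustrates that invariance on a toy spike train; it does not reproduce the reported experiment, and the helper names are hypothetical.

```python
import random

def firing_rates(spike_frames):
    """Per-neuron firing rate: spike count divided by the number of steps."""
    T = len(spike_frames)
    return [sum(col) / T for col in zip(*spike_frames)]

def shuffle_time(spike_frames, rng):
    """Return a copy of the spike train with its time steps permuted."""
    shuffled = [list(frame) for frame in spike_frames]
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
# Toy latent spike train: 16 time steps, two neurons.
train = [[1 if rng.random() < p else 0 for p in (0.2, 0.7)] for _ in range(16)]
shuffled = shuffle_time(train, rng)

rates_before = firing_rates(train)
rates_after = firing_rates(shuffled)   # identical to rates_before
```

A latent code that carries information in the precise ordering of spikes, as an autoregressive process can, has no such guarantee.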

5. Algorithmic and Representational Ablations

Ablation experiments in both models highlight the importance of latent construction:

Poisson vs. Autoregressive Bernoulli: ESVAE’s explicit Poisson latent formulation eliminates the need for an autoregressive SNN network in the latent stage, reducing parameter count and improving sample quality. FSVAE’s autoregressive Bernoulli requires auxiliary SNN latents but is compatible with hardware-sparse sampling.
Encoder Information Retention: When the encoder is frozen and an ANN classifier is trained on the latent firing-rate vector, ESVAE yields higher accuracy (e.g., 53.6% on CIFAR10 vs. 46.7% for FSVAE), indicating that its latent rates retain more of the input's features.
Loss Function Components: MMD regularization (especially combined with postsynaptic potential kernels) is critical for generative quality; substituting with KL or removing these terms significantly degrades FID.

6. Implications for Neuromorphic Hardware and Future Directions

FSVAE and ESVAE are conceived for direct deployment on neuromorphic hardware (e.g., Intel Loihi, IBM TrueNorth):

Event-Driven Efficiency: Binary spike trains, event-based layers, and spike-based random selection align with neuromorphic hardware features such as built-in RNGs, sparse multiplexers, and massively parallel, low-power computation.
Energy and Latency: Inference requires only a small number of time steps, enabling real-time, sub-millisecond image generation. Projected energy requirements are substantially lower than those of conventional ANN-based implementations, making the models suitable for battery-powered edge applications.
Parameter Economy and Scalability: ESVAE’s Poisson design reduces the number of auxiliary parameters and enables more interpretable, robust latent representations, suggesting a direction for further reductions in resource use and improved robustness in hardware implementations.
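
The source of the event-driven savings can be illustrated by counting operations in a single dense layer. With binary spike inputs, the weighted sum reduces to accumulating the weights of the inputs that fired, so no multiplications are needed and silent inputs cost nothing. This is only an operation count under assumed names and toy weights, not an energy model of any specific chip.

```python
def dense_ann(x, weights):
    """Conventional dense layer: one multiply per weight."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    multiplies = len(weights) * len(x)
    return out, multiplies

def dense_spiking(spikes, weights):
    """Event-driven dense layer: accumulate weights of active inputs only."""
    active = [i for i, s in enumerate(spikes) if s]
    out = [sum(row[i] for i in active) for row in weights]
    accumulates = len(weights) * len(active)   # one per active synapse
    return out, accumulates

weights = [[0.5, -1.0, 2.0, 0.25],
           [1.5, 0.0, -0.5, 1.0]]
spikes = [1, 0, 0, 1]                          # two of four inputs fire

ann_out, multiplies = dense_ann(spikes, weights)
snn_out, accumulates = dense_spiking(spikes, weights)
```

Both layers produce the same output for a binary input, but the spiking version performs 4 accumulations where the conventional one performs 8 multiplications, and the gap widens as activity becomes sparser.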

A plausible implication is that the transition from autoregressive Bernoulli to Poisson-rate latents represents a substantial step toward scalable, interpretable, and hardware-efficient VAE models for event-driven computation. These advances facilitate high-quality generative modeling on ultra-low–power edge devices while maintaining or exceeding the image quality of traditional neural architectures.

7. Summary of Key Differences and Comparative Outcomes

Aspect	FSVAE (Kamata et al., 2021)	ESVAE (Zhan et al., 2023)
Latent Variable	Autoregressive Bernoulli SNN	Explicit Poisson-rate, no AR network
Sampling	Random-select from SNN outputs	Reparameterizable Poisson trick
Loss Divergence	MMD+PSP kernel on spike trains	MMD on rate vectors (RBF kernel)
Hardware Compatibility	Fully event-driven, RNG supported	Fully event-driven, low parameter
Image Quality	SOTA for SNNs, competitive with ANN	Improved; lower MSE, FID
Robustness	Susceptible to temporal noise	Robust to temporal/manifold noise

Both FSVAE and ESVAE exemplify the integration of generative modeling and neuromorphic computing, enabling VAEs to operate at the intersection of probabilistic learning and energy-efficient event-driven hardware (Zhan et al., 2023, Kamata et al., 2021).

References

Kamata et al. (2021). Fully Spiking Variational Autoencoder.

Zhan et al. (2023). ESVAE: An Efficient Spiking Variational Autoencoder with Reparameterizable Poisson Spiking Sampling.
