Fully Spiking Variational Autoencoders
- FSVAE and ESVAE are spiking neural network-based variational autoencoders that use binary spike trains and LIF neurons for efficient, event-driven generative modeling.
- FSVAE employs an autoregressive Bernoulli latent process while ESVAE utilizes a reparameterizable Poisson rate model to improve image quality and reduce parameters.
- Both models achieve energy-efficient, low-latency inference on neuromorphic hardware, demonstrating robust performance on benchmarks like MNIST, CIFAR10, and CelebA.
Fully Spiking Variational Autoencoders (FSVAE) and Efficient Spiking Variational Autoencoders (ESVAE) implement the variational autoencoder (VAE) paradigm entirely within the domain of spiking neural networks (SNNs). These architectures address the challenge of generative modeling on event-driven, neuromorphic-compatible hardware, achieving high-fidelity image generation with extremely low power consumption and latency. By replacing conventional continuous latent variables and floating-point sampling with spiking-based methods, FSVAE and ESVAE enable VAE inference and learning to proceed end-to-end using only binary spike trains, LIF dynamics, and hardware-feasible sampling procedures (Zhan et al., 2023, Kamata et al., 2021).
1. Architectural Foundations
Both FSVAE and ESVAE construct all VAE components (encoder, latent sampler for posterior and prior, and decoder) using standard SNN layers and Leaky Integrate-and-Fire (LIF) neurons. Image data are temporally encoded as spike trains across time steps.
- Encoder: Implements a deep spiking convolutional stack. Both models use four convolutional layers for MNIST, Fashion-MNIST, and CIFAR10 and five for CelebA, each followed by temporal-domain batch normalization (tdBN) and LIF neurons. The output is a binary spike vector at each time step.
- Decoder: Mirrors the encoder architecture but uses transposed convolutional layers. It produces binary spike trains, which are read out as continuous-valued outputs by integrating the membrane potential over the spikes.
- Temporal Input Encoding: Each input pixel is spike-coded across all time steps in proportion to its intensity; this converts analog images to temporally distributed binary events suitable for SNN processing.
| Model | Encoder-Decoder Topology | Input Encoding Method |
|---|---|---|
| FSVAE | Stacked SNN convolution, FC, SNN, transposed conv | Direct pixel-to-spike coding |
| ESVAE | Conv-32C3 → ... → FC-128 → (sampling) → decoder | Spike-repetition; collapse for rates |
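The two building blocks named in this section, intensity-proportional spike coding and LIF dynamics, can be illustrated with a minimal NumPy sketch. It is not either paper's implementation: the Bernoulli rate coding, the hard reset, and the `decay` and `v_th` values are assumptions chosen for illustration, and `rate_encode` and `lif_forward` are hypothetical helper names.

```python
import numpy as np

def rate_encode(image, num_steps, rng):
    """Bernoulli rate coding (assumed scheme): a pixel with intensity p in
    [0, 1] emits a spike at each time step with probability p."""
    image = np.asarray(image, dtype=np.float64)
    u = rng.random((num_steps,) + image.shape)  # u in [0, 1)
    return (u < image).astype(np.float64)

def lif_forward(currents, decay=0.25, v_th=0.5):
    """Discrete-time LIF neurons with hard reset.
    currents: array of shape (T, ...) holding the input current per step.
    Returns a binary spike array of the same shape."""
    v = np.zeros(currents.shape[1:])
    spikes = np.zeros_like(currents)
    for t in range(currents.shape[0]):
        v = decay * v + currents[t]   # leaky integration of the input
        fired = v >= v_th             # threshold crossing emits a spike
        spikes[t] = fired
        v = np.where(fired, 0.0, v)   # reset membrane potential after a spike
    return spikes

rng = np.random.default_rng(0)
image = np.array([[0.0, 0.5], [0.9, 1.0]])
spike_train = rate_encode(image, num_steps=16, rng=rng)  # shape (16, 2, 2)
out_spikes = lif_forward(spike_train)                    # same shape, binary
```

Averaging `spike_train` over the time axis recovers an approximation of the original intensities, which is the sense in which the code is proportional to pixel intensity.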
2. Latent Variable Construction
FSVAE: Autoregressive Bernoulli Process
FSVAE overcomes the incompatibility between continuous Gaussian latents (as in ANNs) and binary SNN activity by using a learned, autoregressive Bernoulli process in the latent space (Kamata et al., 2021):
- Latent Posterior & Prior: Both are realized as three-layer fully connected SNNs that operate across the time steps and autoregressively sample one latent vector per time step.
- Random-Select Sampling: For each latent dimension, several parallel channels are generated and one is selected uniformly at random, which amounts to drawing a Bernoulli sample whose probability is the fraction of active channels for that dimension.
- MMD Divergence: The posterior and prior are matched on their sequences of Bernoulli rates, using a kernel defined over the evolution of the postsynaptic potential.
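The random-select step can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the FSVAE code: the `(C, k)` layout (C latent dimensions, k candidate channels each) and the function name `random_select` are hypothetical. Picking one of k binary channels uniformly at random returns 1 with probability equal to the fraction of active channels, which is what makes the result a Bernoulli sample.

```python
import numpy as np

def random_select(channel_spikes, rng):
    """channel_spikes: binary array of shape (C, k), i.e. k candidate
    channels for each of C latent dimensions. Returns a length-C binary
    vector obtained by picking one channel per dimension uniformly."""
    num_dims, num_channels = channel_spikes.shape
    idx = rng.integers(0, num_channels, size=num_dims)
    return channel_spikes[np.arange(num_dims), idx]

rng = np.random.default_rng(0)
channel_spikes = np.array([
    [1, 1, 1, 1],   # all channels active   -> always 1
    [0, 0, 0, 0],   # no channel active     -> always 0
    [1, 0, 1, 0],   # half of them active   -> 1 with probability 0.5
])
sample = random_select(channel_spikes, rng)

# Empirical check: repeated draws approach the per-dimension active fraction.
draws = np.stack([random_select(channel_spikes, rng) for _ in range(4000)])
empirical = draws.mean(axis=0)
```

Because the operation is only an index lookup driven by a random integer, it needs no floating-point sampling, which fits the hardware motivation described in the text.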
ESVAE: Reparameterizable Poisson Rate Model
ESVAE introduces a Poisson-rate–based latent representation with direct rate parameterization (Zhan et al., 2023):
- Latent Poisson Posterior: Each latent neuron's spike count over the time steps yields a firing rate; the posterior is modeled as a Poisson process with these rates.
- Latent Poisson Prior: Generated via a bottleneck ANN that outputs the prior firing rates.
- Reparameterizable Sampling: To generate a spike train $z_t$ from a rate $r$, sample uniform noise $u_t \sim \mathcal{U}(0, 1)$ at each time step and set $z_t = 1$ if $u_t < r$, otherwise $z_t = 0$. A surrogate gradient (straight-through estimator) keeps training end-to-end differentiable.
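The forward pass of this rate-to-spike sampling can be sketched in NumPy as below. It is a simplified illustration rather than the ESVAE implementation: rates are assumed to lie in [0, 1], the helper names are hypothetical, and the gradient handling is omitted because NumPy has no autodiff. In a training framework the non-differentiable comparison would be bypassed with a straight-through style surrogate.

```python
import numpy as np

def sample_spike_train(rates, num_steps, rng):
    """Draw a binary spike train of shape (num_steps, D) from per-neuron
    firing rates in [0, 1]: a spike is emitted whenever uniform noise
    falls below the rate. All randomness lives in the noise u, so the
    sample is a deterministic function of (rates, u)."""
    rates = np.asarray(rates, dtype=np.float64)
    u = rng.random((num_steps,) + rates.shape)  # u_t ~ U[0, 1)
    return (u < rates).astype(np.float64)

def spike_train_to_rate(spikes):
    """Collapse a spike train back to firing rates by averaging over time."""
    return spikes.mean(axis=0)

rng = np.random.default_rng(0)
rates = np.array([0.0, 0.25, 0.75, 1.0])
z = sample_spike_train(rates, num_steps=2000, rng=rng)
recovered = spike_train_to_rate(z)  # close to `rates` for large num_steps
```

Separating the noise from the rate parameter is the reparameterization idea: the rate enters only through the comparison, so a surrogate gradient can be attached to that single operation.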
| Model | Latent Variable Family | Sampling Mechanism |
|---|---|---|
| FSVAE | Autoregressive Bernoulli | Random-select from SNN channels |
| ESVAE | Inhomogeneous Poisson | Uniform noise with straight-through surrogate |
3. Training Objectives and Optimization
Both FSVAE and ESVAE maximize the evidence lower bound (ELBO), tailored for spike trains:
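- ELBO Formulation: In its standard form (generic notation), the bound is

$$
\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right),
$$

where $z$ denotes the latent spike train. The standard terms are adapted as follows:
  - Reconstruction: The mean-squared error (MSE) between the input $x$ and the reconstruction $\hat{x}$ is used for spike-train outputs.
  - Divergence: The usual KL divergence is replaced by a maximum mean discrepancy (MMD) loss, which suits spike-based or rate-based latents:

$$
\mathrm{MMD}^2(q, p) = \mathbb{E}_{z, z' \sim q}\left[k(z, z')\right] + \mathbb{E}_{z, z' \sim p}\left[k(z, z')\right] - 2\,\mathbb{E}_{z \sim q,\, z' \sim p}\left[k(z, z')\right].
$$

In FSVAE, $k$ is a postsynaptic potential-based kernel over time; in ESVAE, a radial basis function (RBF) kernel is used over rates.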
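For the rate-based case, an MMD term with an RBF kernel can be estimated from two batches of firing-rate vectors. The sketch below uses the simple biased estimator and a fixed bandwidth `sigma`; both are assumptions for illustration, as are the synthetic batches, and the papers' exact estimator and kernel settings may differ.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Pairwise RBF kernel matrix between rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dist / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples."""
    return (
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )

rng = np.random.default_rng(0)
# Synthetic stand-ins for batches of latent firing-rate vectors.
posterior_rates = rng.uniform(0.0, 1.0, size=(256, 8))
prior_same = rng.uniform(0.0, 1.0, size=(256, 8))     # same distribution
prior_shifted = rng.uniform(0.5, 1.0, size=(256, 8))  # mismatched rates

mmd_self = mmd2(posterior_rates, posterior_rates)
mmd_same = mmd2(posterior_rates, prior_same)
mmd_shifted = mmd2(posterior_rates, prior_shifted)
```

The estimate is near zero when both batches come from the same distribution and grows when the prior rates are shifted, which is the signal the regularizer uses to pull the two latent distributions together.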
- Optimizer: Both employ AdamW. ESVAE uses an elevated learning rate for the bottleneck to optimize the prior rate mapping.
4. Experimental Evaluation
Both models are benchmarked on MNIST, Fashion-MNIST, CIFAR10, and CelebA, with images resized to fixed resolutions. Metrics include reconstruction loss (MSE), Inception Score, Fréchet Inception Distance (FID), and Fréchet Autoencoder Distance (FAD):
- FSVAE reports improved or matched scores against ANN-VAEs and baseline SNN-VAEs: lower MSE and FID across datasets, with a notable reduction in the number of multiplications during inference relative to an ANN-VAE (Kamata et al., 2021).
- ESVAE achieves state-of-the-art SNN-based image generation, e.g., on CIFAR10 it lowers the MSE by roughly a third (0.045 vs. 0.066) and reduces FID from 175.5 to 127.0 compared to FSVAE (Zhan et al., 2023).
- Robustness: ESVAE demonstrates temporal stability: shuffling spikes along the time axis causes a negligible increase in reconstruction loss, whereas FSVAE degrades more. Perturbing the latent spikes with noise causes only a minor performance loss in ESVAE.
- Distributional Consistency: High overlap between training and sampled latent firing-rate histograms confirms effective alignment with MMD regularization.
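One way to see why a rate-based latent tolerates temporal shuffling is that permuting the time axis leaves every neuron's firing rate unchanged, while an order-sensitive readout can change. The sketch below is only an illustration of that property, not the papers' evaluation protocol: the shared permutation, the exponentially filtered trace, and the helper names are assumptions.

```python
import numpy as np

def shuffle_time(spikes, rng):
    """Permute the time axis (axis 0) with one permutation shared by all
    neurons (assumed shuffling scheme)."""
    perm = rng.permutation(spikes.shape[0])
    return spikes[perm]

def filtered_trace(spikes, decay=0.5):
    """Order-sensitive readout: exponentially filtered spike trace at the
    final time step, loosely analogous to a postsynaptic potential."""
    trace = np.zeros(spikes.shape[1])
    for s in spikes:
        trace = decay * trace + s
    return trace

rng = np.random.default_rng(0)
spikes = (rng.random((16, 10)) < 0.3).astype(np.float64)  # (T, neurons)
shuffled = shuffle_time(spikes, rng)

# Firing rates depend only on spike counts, so they survive the shuffle.
rate_before = spikes.mean(axis=0)
rate_after = shuffled.mean(axis=0)

# The filtered trace weights recent spikes more, so it may differ.
trace_before = filtered_trace(spikes)
trace_after = filtered_trace(shuffled)
```

A model whose latent code is summarized by rates sees the same input after shuffling, whereas a model that depends on spike timing, such as an autoregressive sampler, does not.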
5. Algorithmic and Representational Ablations
Ablation experiments in both models highlight the importance of latent construction:
- Poisson vs. Autoregressive Bernoulli: ESVAE’s explicit Poisson latent formulation eliminates the need for an autoregressive SNN in the latent stage, reducing parameter count and improving sample quality. FSVAE’s autoregressive Bernoulli process requires auxiliary latent SNNs but remains compatible with sparse, hardware-friendly sampling.
- Encoder Information Retention: When the encoder is frozen and an ANN classifier is trained on the latent rate vector, ESVAE yields higher accuracy (e.g., 53.6% on CIFAR10 vs. 46.7% for FSVAE), indicating that its latents retain more of the input's features.
- Loss Function Components: MMD regularization (especially combined with postsynaptic potential kernels) is critical for generative quality; substituting with KL or removing these terms significantly degrades FID.
6. Implications for Neuromorphic Hardware and Future Directions
FSVAE and ESVAE are conceived for direct deployment on neuromorphic hardware (e.g., Intel Loihi, IBM TrueNorth):
- Event-Driven Efficiency: Binary spike trains, event-based layers, and spike-based random selection align with neuromorphic hardware features such as built-in RNGs, sparse multiplexers, and massively parallel, low-power computation.
- Energy and Latency: Inference runs over a small number of time steps, enabling real-time, sub-millisecond image generation. Projected energy requirements are lower than those of conventional ANN-based implementations, making the models suitable for battery-powered edge applications.
- Parameter Economy and Scalability: ESVAE’s Poisson design reduces the number of auxiliary parameters and enables more interpretable, robust latent representations, suggesting a direction for further reductions in resource use and improved robustness in hardware implementations.
A plausible implication is that the transition from autoregressive Bernoulli to Poisson-rate latents represents a substantial step toward scalable, interpretable, and hardware-efficient VAE models for event-driven computation. These advances facilitate high-quality generative modeling on ultra-low–power edge devices while maintaining or exceeding the image quality of traditional neural architectures.
7. Summary of Key Differences and Comparative Outcomes
| Aspect | FSVAE (Kamata et al., 2021) | ESVAE (Zhan et al., 2023) |
|---|---|---|
| Latent Variable | Autoregressive Bernoulli SNN | Explicit Poisson-rate, no AR network |
| Sampling | Random-select from SNN outputs | Reparameterizable Poisson trick |
| Loss Divergence | MMD+PSP kernel on spike trains | MMD on rate vectors (RBF kernel) |
| Hardware Compatibility | Fully event-driven, RNG supported | Fully event-driven, low parameter |
| Image Quality | SOTA for SNNs, competitive with ANN-VAEs | Improved; lower MSE, FID |
| Robustness | Susceptible to temporal noise | Robust to temporal/manifold noise |
Both FSVAE and ESVAE exemplify the integration of generative modeling and neuromorphic computing, enabling VAEs to operate at the intersection of probabilistic learning and energy-efficient event-driven hardware (Zhan et al., 2023, Kamata et al., 2021).