
Spike-and-Slab Sparse Coding (S3C)

Updated 25 February 2026
  • Spike-and-Slab Sparse Coding (S3C) is a probabilistic latent variable model that combines spike-and-slab priors with directed sparse coding to control sparsity and amplitude.
  • It employs a structured variational EM procedure with parallel, GPU-friendly updates to efficiently learn features and decompose signals.
  • S3C achieves competitive performance in low-label and transfer learning scenarios, demonstrating state-of-the-art accuracy in image classification tasks.

Spike-and-Slab Sparse Coding (S3C) is a probabilistic latent variable model combining spike-and-slab priors with a directed sparse coding architecture. It forms a highly regularized framework for unsupervised feature learning and signal decomposition, enabling decoupled control over sparsity and magnitude of latent activations. S3C has been demonstrated to provide state-of-the-art feature representations, especially in low-label and transfer-learning regimes, and admits scalable variational inference procedures well-suited for GPU acceleration (Goodfellow et al., 2012).

1. Generative Model: Architecture and Priors

S3C models observed data vectors $x \in \mathbb{R}^D$ as generated by $N$ latent "spike-and-slab" units. For each factor $i = 1, \dots, N$:

  • Spike prior: Each spike variable $z_i \in \{0,1\}$ is drawn independently from a Bernoulli distribution,

$$p(z_i=1) = \sigma(b_i), \qquad p(z_i=0) = 1-\sigma(b_i),$$

with $\sigma(\cdot)$ denoting the logistic sigmoid and $b_i$ a learned bias.

  • Slab prior: Given $z_i$, the real-valued slab $h_i \in \mathbb{R}$ is Gaussian:

$$p(h_i \mid z_i) = \begin{cases} \mathcal{N}(h_i \mid \mu_i, \alpha_i^{-1}) & \text{if } z_i=1 \\ \mathcal{N}(h_i \mid 0, \alpha_i^{-1}) & \text{if } z_i=0 \end{cases}$$

where $\mu_i$ is the slab mean (when the spike is active) and $\alpha_i$ is the slab precision.

  • Observation model: The visible data is generated as

$$p(x \mid z, h) = \prod_{d=1}^D \mathcal{N} \left( x_d \mid W_{d:}(z \circ h), \ \beta_d^{-1} \right)$$

where $W \in \mathbb{R}^{D \times N}$ is a dictionary, $z \circ h$ denotes the elementwise product, and $\beta_d$ is the noise precision (often isotropic or diagonal).

The full joint is

$$p(x, z, h) = \prod_{i=1}^N \sigma(b_i)^{z_i} [1-\sigma(b_i)]^{1-z_i} \, \mathcal{N}(h_i \mid z_i\mu_i, \alpha_i^{-1}) \; \prod_{d=1}^D \mathcal{N}(x_d \mid W_{d:}(z\circ h), \beta_d^{-1})$$

The spike variable $z_i$ gates the contribution of $h_i$ to the reconstruction, yielding strict control over sparsity, while the slab provides amplitude modulation (Goodfellow et al., 2012).
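As a concrete illustration, ancestral sampling from this generative model is straightforward. The sketch below is a minimal NumPy rendering under the shape conventions stated above; the function name and argument layout are illustrative, not from the original papers.

```python
import numpy as np

def sample_s3c(W, b, mu, alpha, beta, rng=None):
    """Ancestral sampling from the S3C generative model (illustrative sketch).

    W:     (D, N) dictionary
    b:     (N,) spike biases;     mu:   (N,) slab means
    alpha: (N,) slab precisions;  beta: (D,) observation noise precisions
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.random(b.shape) < 1.0 / (1.0 + np.exp(-b))  # spike: z_i ~ Bernoulli(sigma(b_i))
    h = rng.normal(z * mu, np.sqrt(1.0 / alpha))        # slab:  h_i ~ N(z_i mu_i, 1/alpha_i)
    x = rng.normal(W @ (z * h), np.sqrt(1.0 / beta))    # obs:   x_d ~ N(W_{d:}(z o h), 1/beta_d)
    return x, z, h
```

Note that only the product $z \circ h$ enters the reconstruction, so an inactive unit ($z_i = 0$) contributes nothing regardless of its slab value.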

2. Approximate Inference: Structured Variational EM

Exact posterior inference for $p(z, h \mid x)$ is intractable due to the explaining-away interactions among spikes. S3C employs a structured mean-field variational posterior of the form

$$q(z, h) = \prod_{i=1}^N q_i(z_i, h_i)$$

where $q_i(z_i, h_i)$ tightly couples each spike-slab pair but factorizes across $i$. The optimal form is given by

$$q_i(z_i) = \hat z_i^{z_i}(1-\hat z_i)^{1-z_i}, \qquad q_i(h_i \mid z_i) = \mathcal{N}\Bigl(h_i \mid z_i\hat h_i, (\alpha_i + z_i W_{:i}^\top \beta W_{:i})^{-1}\Bigr)$$

with $\hat z_i \in (0,1)$ and $\hat h_i \in \mathbb{R}$ as variational parameters.

Fixed-point updates for these parameters are:

  • Slab-mean update:

$$\hat h_i = \frac{ \mu_i \alpha_i + W_{:i}^\top \beta r_i }{ \alpha_i + W_{:i}^\top \beta W_{:i} }$$

  • Spike-probability update:

$$\hat z_i = \sigma\left( W_{:i}^\top \beta r_i \hat h_i + b_i - \frac{1}{2} \alpha_i (\hat h_i - \mu_i)^2 - \frac{1}{2} \log(\alpha_i + W_{:i}^\top \beta W_{:i}) + \frac{1}{2} \log \alpha_i \right)$$

with residual $r_i = x - \sum_{j \ne i} W_{:j} \hat z_j \hat h_j$ (Goodfellow et al., 2012). Updates employ parallelization, damping, and clipping for numerical stability, enabling fully vectorized GPU implementations (Goodfellow et al., 2012).
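Read literally, both fixed-point formulas vectorize over all units at once. The NumPy sketch below is a direct transcription under the diagonal-$\beta$ convention, with two simplifications relative to the paper: damping and clipping are omitted, and both updates reuse the residual computed at the start of the sweep.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def e_step_sweep(x, W, b, mu, alpha, beta, z_hat, h_hat):
    """One fully parallel sweep of the S3C fixed-point updates (sketch).

    x: (D,) data; W: (D, N); b, mu, alpha, z_hat, h_hat: (N,); beta: (D,).
    """
    wbw = ((W ** 2) * beta[:, None]).sum(axis=0)       # W_{:i}^T beta W_{:i}, shape (N,)
    recon = W @ (z_hat * h_hat)                        # full reconstruction
    # W_{:i}^T beta r_i, with r_i = x - recon + W_{:i} z_i h_i (leave-one-out residual)
    Wbr = W.T @ (beta * (x - recon)) + wbw * z_hat * h_hat
    h_new = (mu * alpha + Wbr) / (alpha + wbw)         # slab-mean update
    z_new = sigmoid(Wbr * h_new + b                    # spike-probability update,
                    - 0.5 * alpha * (h_new - mu) ** 2  # using the freshly updated h
                    - 0.5 * np.log(alpha + wbw)
                    + 0.5 * np.log(alpha))
    return z_new, h_new
```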

3. Learning: Parameter Estimation via Variational EM

Parameters $\theta = \{W, \beta, b, \mu, \alpha\}$ are learned by maximizing the variational lower bound (evidence lower bound, ELBO) via variational EM:

  • E-step: Run the above fixed-point updates to obtain variational parameters $\{\hat z, \hat h\}$ for each data point.
  • M-step: Maximize the expected complete-data log-likelihood

$$Q(\theta) = \mathbb{E}_q \left[ \log p(x, z, h \mid \theta) \right]$$

Closed-form updates exist for $W$, $\beta$, $b$, $\mu$, and $\alpha$, though in practice small gradient steps are often preferred for stability (Goodfellow et al., 2012).

  • $W$ is updated (with column normalization) via:

$$W_{:i} \propto \sum_n \beta \bigl[ x^{(n)} - \sum_{j\neq i} W_{:j} \hat z_j^{(n)} \hat h_j^{(n)} \bigr] \hat z_i^{(n)} \hat h_i^{(n)}$$

Analogous analytic updates are provided for noise, biases, and slab parameters.
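Under the same shape conventions, the per-column proportionality above might be implemented as follows. This is a hypothetical sketch: the batch layout, the in-place mutation, and the unit-norm choice for column normalization are assumptions, not details from the source.

```python
import numpy as np

def m_step_W_column(i, X, W, beta, Z, H):
    """Coordinate M-step update for dictionary column i (illustrative sketch).

    X: (n, D) batch of data points; W: (D, N) dictionary (mutated in place);
    beta: (D,) noise precisions; Z, H: (n, N) variational parameters.
    """
    S = Z * H                                     # (n, N) expected codes z o h
    R = X - S @ W.T + np.outer(S[:, i], W[:, i])  # (n, D) leave-one-out residuals
    col = (beta * R * S[:, [i]]).sum(axis=0)      # sum_n beta o r_i^(n) * (z_i h_i)^(n)
    W[:, i] = col / (np.linalg.norm(col) + 1e-12) # normalize the column
    return W
```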

The E- and M-steps are alternated until convergence. Convergence in the E-step typically requires only a small number of parallel iterations (Goodfellow et al., 2012).

4. Computational Scalability and Parallel Inference

S3C's GPU-adapted variational inference is based on fully parallel updating of all spike and slab parameters, with per-variable damping and sign-flip clipping. Each E-step iteration consists of batched matrix-vector operations and non-linearities, decomposing into parallelizable BLAS calls (Goodfellow et al., 2012). The algorithmic structure is:

  • Initialize $\hat z_i \gets \sigma(b_i)$ and $\hat h_i \gets \mu_i$.
  • For $K$ iterations:

    1. Compute $\hat h_i^*$ for all $i$ in parallel; apply clipping and damping.
    2. Compute $\hat z_i^*$ for all $i$ in parallel; apply damping.

This approach allows scaling to thousands of latent factors (up to $N = 8000$ demonstrated), tens of millions of image patches, and large-batch feature extraction (Goodfellow et al., 2012).
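The damped outer loop can be sketched generically. Here `sweep` stands for any function returning the undamped fixed-point targets, and the damping coefficient and clipping threshold are illustrative choices, not the paper's exact schedule.

```python
import numpy as np

def damped_inference(x, z0, h0, sweep, K=20, rho=0.5, clip=3.0):
    """Run K damped parallel E-step iterations (generic sketch).

    sweep(x, z, h) -> (z_star, h_star) computes the undamped fixed-point
    targets; rho damps both updates, and clip bounds the slab step to
    guard against large oscillations (heuristic stand-in for the
    sign-flip clipping described in the text).
    """
    z, h = z0.copy(), h0.copy()
    for _ in range(K):
        z_star, h_star = sweep(x, z, h)
        step = np.clip(h_star - h, -clip, clip)  # bound the slab move
        h = h + rho * step                       # damped slab update
        z = (1 - rho) * z + rho * z_star         # damped spike update
    return z, h
```

With damping, each parameter moves only a fraction of the way toward its fixed-point target per iteration, trading a few extra iterations for stability under fully parallel updates.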

5. Applications: Feature Discovery and Classification Performance

S3C is principally used as an unsupervised feature learner for image classification, transfer learning, and semi-supervised learning scenarios (Goodfellow et al., 2012). The standard processing pipeline on images is:

  1. Extract normalized, whitened patches (e.g., $6 \times 6$).
  2. Run S3C variational inference per patch to obtain $\mathbb{E}_q[z]$ activations.
  3. Pool activations spatially on a coarse grid (e.g., $3 \times 3$), yielding high-dimensional feature vectors.
  4. Train a linear SVM on pooled features for classification.
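Step 3 of the pipeline above, spatial pooling on a coarse grid, can be sketched as follows. Sum-pooling over equal grid cells is an assumption here; the source does not specify the exact pooling function or cell boundaries.

```python
import numpy as np

def pool_activations(act_map, grid=3):
    """Spatially pool per-patch E_q[z] activations on a coarse grid (sketch).

    act_map: (H, W, N) activations over the patch grid.
    Returns a (grid*grid*N,) feature vector, sum-pooling each cell.
    """
    H, W, N = act_map.shape
    rows = np.array_split(np.arange(H), grid)   # split patch rows into cells
    cols = np.array_split(np.arange(W), grid)   # split patch columns into cells
    feats = [act_map[np.ix_(r, c)].sum(axis=(0, 1)) for r in rows for c in cols]
    return np.concatenate(feats)
```

The pooled vectors from all images are then stacked into a design matrix for the linear SVM in step 4.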

On CIFAR-10, S3C with $3 \times 3$ pooling and $N = 1600$ factors achieved $78.3\% \pm 0.9\%$ accuracy, competitive with state-of-the-art sparse coding ($\approx 78.8\%$) and outperforming spike-and-slab RBMs ($76.7\% \pm 0.9\%$). On the "self-taught" Transfer-Learning Challenge, S3C won the competition with $48.6\%$ accuracy using only 120 labels and 100,000 unlabeled samples (Goodfellow et al., 2012). In low-label regimes, S3C outperforms both raw-pixel and logistic baselines due to its flexible regularization.

6. Relation to Other Models

S3C combines gated continuous latents (from sparse coding) with the explicit spike-and-slab prior (from spike-and-slab RBMs), providing independent control of sparsity (via $b_i$) and scale (via $\mu_i$, $\alpha_i$). As a directed model, S3C features a tractable partition function, avoiding the intractability of undirected models like RBMs and enabling efficient variational inference. The structured variational E-step captures some, though not all, posterior dependencies ("explaining-away" among spikes), surpassing fully factored mean-field approaches in tasks such as source separation and denoising (Sheikh et al., 2012, Lücke et al., 2011).

In contrast, MAP or greedy algorithms often employ convex relaxations (e.g., LASSO) or combinatorial support selection (as in adaptive ADMM methods), but do not model the full latent uncertainty structure of S3C (Bayisa et al., 2018).

7. Extensions and Empirical Observations

Empirical studies show the truncated EM approach—where the posterior is truncated to the most probable spike patterns—outperforms factored variational inference, particularly under high noise or for highly non-orthogonal dictionaries, due to its better approximation of multi-modal and correlated posterior mass (Sheikh et al., 2012). S3C continues to improve with increased latent dimensionality, unlike standard factored methods where performance often saturates or degrades.

Experiments on source separation, denoising, and image classification consistently demonstrate the value of the spike-and-slab framework in inducing both accurate and highly sparse representations (Goodfellow et al., 2012, Sheikh et al., 2012). The model's GPU-friendly inference and scalability enable applications to modern large-scale recognition and transfer-learning challenges (Goodfellow et al., 2012).
