SC-VAE: Sparse Compression Variational Autoencoder
- SC-VAE is a model that merges sparse coding with variational autoencoder frameworks, using learned ISTA to enforce sparsity in latent representations.
- It employs a fixed orthogonal dictionary and an $\ell_1$ (Laplace) sparse prior, enabling robust reconstruction and controllable feature manipulation across diverse data modalities.
- Experimental results demonstrate superior performance in terms of reconstruction metrics and compression efficiency on image and point cloud tasks.
Sparse Compression Variational Autoencoder (SC-VAE) encompasses a family of models that integrate sparse data representations with variational autoencoder (VAE) frameworks. The principal SC-VAE variant leverages learned sparse coding, operationalized via trainable iterative shrinkage-thresholding algorithms, to produce interpretable latent structures and superior quantitative performance for a range of data modalities, including natural images and point clouds. Two representative instantiations are: (1) the Sparse Coding-based VAE with Learned ISTA for image tasks, and (2) the Sparse Tensor-based VAE with sparse convolutions for point cloud attribute compression.
1. Model Formulation and Generative Framework
The SC-VAE paradigm enforces sparsity in latent representations while maintaining compatibility with deep generative modeling. For image data (Xiao et al., 2023), the model is defined as follows:
- Generative Model: Given a sparse latent code $z$, the decoder reconstructs the input with an isotropic Gaussian likelihood:
$$p_\phi(x \mid z) = \mathcal{N}\!\left(x;\, \mathrm{Dec}_\phi(Dz),\, \sigma^2 I\right),$$
where $D$ is a fixed orthogonal dictionary (typically a DCT basis), and $\sigma^2$ is the observation variance.
- Inference Model: The deterministic encoder produces a feature $u = \mathrm{Enc}_\psi(x)$, then applies learned ISTA (LISTA) to estimate the sparse code:
$$z = \mathrm{LISTA}(u;\, W, S, \theta),$$
yielding an approximate posterior concentrated at the LISTA output, $q(z \mid x) = \delta\!\left(z - \mathrm{LISTA}(\mathrm{Enc}_\psi(x))\right)$.
- Sparse Prior: An $\ell_1$ (Laplace) prior is imposed on the codes:
$$p(z) \propto \exp\!\left(-\lambda \|z\|_1\right).$$
For point cloud attribute compression (Wang et al., 2022), SC-VAE utilizes sparse tensors representing point attributes, with sparse convolutions forming the encoder and decoder.
2. Sparse Coding and Learnable ISTA Algorithms
Sparse coding aims to represent high-dimensional vectors as a sparse linear combination of dictionary atoms. In SC-VAE (Xiao et al., 2023):
- Sparse Coding Objective (per feature vector $u$):
$$\min_z \; \tfrac{1}{2}\|u - Dz\|_2^2 + \lambda \|z\|_1.$$
- Learnable ISTA (LISTA): The sparse coding problem is solved by unrolling ISTA for $K$ iterations:
$$z^{(k+1)} = \mathrm{soft}_{\theta^{(k)}}\!\left(Wu + Sz^{(k)}\right), \qquad k = 0, \dots, K-1,$$
with learnable parameters $W$, $S$, and thresholds $\theta^{(k)}$. The initial code is $z^{(0)} = 0$.
Algorithm 1: Learnable ISTA (LISTA)
Input: feature $u$; parameters $W$, $S$, $\{\theta^{(k)}\}$; iteration count $K$
- Initialize $z^{(0)} = 0$
- For $k = 0$ to $K-1$: $z^{(k+1)} = \mathrm{soft}_{\theta^{(k)}}\!\left(Wu + Sz^{(k)}\right)$
- Return $z^{(K)}$
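The unrolled update above can be sketched in NumPy. The dictionary size and threshold schedule here are illustrative; with the classical ISTA initialization ($W = D^\top$, $S = I - D^\top D$), the learnable solver reduces to plain ISTA:

```python
import numpy as np

def soft_threshold(v, theta):
    # Proximal operator of the L1 norm: shrink each entry toward zero by theta.
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista(u, W, S, thetas):
    # Unrolled learnable ISTA: z^{k+1} = soft(W u + S z^k, theta_k), with z^0 = 0.
    z = np.zeros(S.shape[0])
    for theta in thetas:
        z = soft_threshold(W @ u + S @ z, theta)
    return z

# Toy sparse coding problem with an orthogonal dictionary (illustrative sizes).
rng = np.random.default_rng(0)
m = 16
D = np.linalg.qr(rng.standard_normal((m, m)))[0]   # orthogonal: D @ D.T = I
z_true = np.zeros(m)
z_true[2], z_true[7] = 1.5, -2.0
u = D @ z_true

# Classical ISTA initialization of the learnable parameters.
W = D.T
S = np.eye(m) - D.T @ D
z = lista(u, W, S, thetas=[0.1] * 5)
```

For an orthogonal dictionary the fixed point is the soft-thresholded analysis coefficients, so `z` recovers the support of `z_true` with entries shrunk by the threshold; learning $W$, $S$, and $\theta^{(k)}$ lets the unrolled network reach good codes in far fewer iterations than ISTA.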
3. Network Architectures and Computational Aspects
For image SC-VAE (Xiao et al., 2023):
- Encoder/Decoder: Follows the VQGAN backbone: stacked residual convolution, GroupNorm, Swish, and down-/up-sampling blocks.
- Dictionary: Fixed DCT basis, tied across layers.
- Training: Adam optimizer, batch size $16$, $10$ epochs.
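The fixed DCT dictionary can be built explicitly. A minimal sketch (illustrative size $n = 8$) constructing the orthonormal DCT-II basis and relying on its orthogonality:

```python
import numpy as np

def dct_dictionary(n):
    # Orthonormal DCT-II basis: row k is the k-th cosine atom.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] *= np.sqrt(1.0 / n)   # DC atom scaling
    D[1:, :] *= np.sqrt(2.0 / n)  # AC atom scaling
    return D

D = dct_dictionary(8)
```

Orthogonality ($DD^\top = I$) is what makes the fixed dictionary attractive: for an exactly orthogonal $D$, one soft-thresholding step on $D^\top u$ already solves the per-token sparse coding problem, and every atom remains usable (no dead codebook entries).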
For point cloud SC-VAE (Wang et al., 2022):
- Encoder: Six sparse 3D convolutions mapping the $3$ input channels (RGB) to $128$ features, yielding a bottleneck of $128$ features per occupied voxel.
- Decoder: Mirrors encoder with transposed convolutions.
- Entropy Model: Hyper-encoder/decoder operating on sparse tensors, context model for Laplace parameter estimation.
- Computational Complexity: Runtime and memory scale linearly with the number of points.
4. Training Objectives and Variational Framework
SC-VAE for images (Xiao et al., 2023) uses a composite loss:
- Pixel-level Reconstruction: $\mathcal{L}_{\mathrm{rec}} = \|x - \hat{x}\|_2^2$
- Latent-space Sparse Coding (averaged over the $N$ tokens): $\mathcal{L}_{\mathrm{sc}} = \frac{1}{N}\sum_{i=1}^{N}\left(\tfrac{1}{2}\|u_i - Dz_i\|_2^2 + \lambda\|z_i\|_1\right)$
- Total Objective: $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{sc}}$
This objective approximates the negative ELBO: the reconstruction term corresponds to $-\log p_\phi(x \mid z)$ under the Gaussian likelihood, and the $\ell_1$ penalty to $-\log p(z)$ under the Laplace prior.
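The composite objective can be sketched as follows; the relative weighting `beta` between the pixel and latent terms is a hypothetical knob added here for illustration, not a value from the paper:

```python
import numpy as np

def sc_vae_loss(x, x_hat, U, Z, D, lam=0.1, beta=1.0):
    # Pixel-level reconstruction term: mean squared error over pixels.
    rec = np.mean((x - x_hat) ** 2)
    # Latent sparse-coding term, averaged over tokens:
    # (1/2)||u_i - D z_i||^2 + lam * ||z_i||_1 for each token i.
    resid = U - Z @ D.T
    sc = np.mean(0.5 * np.sum(resid ** 2, axis=1) + lam * np.sum(np.abs(Z), axis=1))
    return rec + beta * sc

# Sanity check: exact reconstruction leaves only the L1 penalty.
D = np.eye(2)
Z = np.array([[1.0, -1.0]])
U = Z @ D.T
loss = sc_vae_loss(np.zeros(4), np.zeros(4), U, Z, D, lam=0.1)
```

Here `U` stacks the encoder features $u_i$ row-wise and `Z` their sparse codes; in training, gradients flow through both terms into the encoder, decoder, and LISTA parameters.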
For point cloud SC-VAE (Wang et al., 2022):
- ELBO: $\log p(x) \geq \mathbb{E}_{q(y \mid x)}\!\left[\log p(x \mid y)\right] - D_{\mathrm{KL}}\!\left(q(y \mid x)\,\|\,p(y)\right)$
- Rate-Distortion: $\mathcal{L} = R + \lambda D$, where $R$ is the estimated bitrate of the quantized latents under the Laplace entropy model and $D$ is the mean squared attribute distortion.
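The rate term can be estimated by integrating the Laplace model over each quantization bin. A sketch with hypothetical per-element parameters `mu` and `b` (in the actual codec these are predicted by the hyper-encoder/decoder and context model):

```python
import numpy as np

def laplace_cdf(x, mu, b):
    # CDF of the Laplace distribution with location mu and scale b.
    return 0.5 + 0.5 * np.sign(x - mu) * (1.0 - np.exp(-np.abs(x - mu) / b))

def rate_bits(y_hat, mu, b):
    # Probability mass of each quantized latent over its bin [y - 0.5, y + 0.5],
    # then total bits = -sum log2 p (clamped for numerical safety).
    p = laplace_cdf(y_hat + 0.5, mu, b) - laplace_cdf(y_hat - 0.5, mu, b)
    return -np.sum(np.log2(np.maximum(p, 1e-12)))

def rd_loss(x, x_hat, y_hat, mu, b, lam=0.01):
    # Rate-distortion objective: estimated bitrate plus lambda-weighted MSE.
    dist = np.mean((x - x_hat) ** 2)
    return rate_bits(y_hat, mu, b) + lam * dist

bits = rate_bits(np.zeros(4), 0.0, 1.0)
```

Latents that fall in the tails of the predicted Laplace get little probability mass and therefore cost more bits, which is exactly the pressure that drives the encoder toward compressible representations.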
5. Experimental Evaluation
SC-VAE demonstrates strong empirical performance across tasks.
- Image Reconstruction (FFHQ, ImageNet) (Xiao et al., 2023):
| Model | PSNR | SSIM | LPIPS | rFID |
|---|---|---|---|---|
| SC-VAE (FFHQ) | 34.92 | 0.9497 | 0.0080 | 4.21 |
| SC-VAE (ImageNet) | 38.40 | 0.9688 | 0.0070 | 0.71 |
- Qualitative Results: SC-VAE preserves fine details (leaves, textures) better than VQGAN/RQ-VAE and generalizes to out-of-distribution inputs.
- Attribute Manipulation: Varying sparse code components yields controlled changes in pose, lighting, style; smooth interpolation leads to coherent morphing.
- Unsupervised Segmentation: K-means clustering on patch-level codes leads to accurate segmentation (IoU up to 81.2%).
- Ablations (number of unrolled LISTA iterations $K$):

| $K$ | PSNR | Sparsity |
|---|---|---|
| 1 | 27.3 | 89.5% |
| 5 | 31.13 | 71.9% |
| 16 | 31.41 | 74.9% |
| 25 | drops | denser |
- Compression for Point Clouds (Wang et al., 2022):
| Baseline | BD-BR Reduction | BD-PSNR Gain (Y) |
|---|---|---|
| TMC13v6 | 24% | +0.97 dB |
| RAHT | 34% | +1.38 dB |
Qualitative results confirm fewer blocking artifacts and smooth reconstructions in the point cloud modality.
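The unsupervised segmentation result above rests on clustering patch-level codes with k-means. A minimal Lloyd's-algorithm sketch over synthetic "patch codes" (the real pipeline clusters the LISTA codes of image patches):

```python
import numpy as np

def kmeans(codes, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign each code to its nearest center,
    # then recompute centers as cluster means.
    rng = np.random.default_rng(seed)
    centers = codes[rng.choice(len(codes), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(codes[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = codes[labels == j].mean(axis=0)
    return labels

# Two well-separated groups of synthetic patch codes.
codes = np.vstack([np.zeros((10, 4)), 5.0 * np.ones((10, 4))])
labels = kmeans(codes, k=2)
```

Because sparse codes for visually similar patches share support patterns, simple distance-based clustering on them can separate semantic regions without labels.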
6. Mitigating Posterior and Codebook Collapse
SC-VAE addresses common VAE failures:
- Posterior Collapse: The encoder produces features that must admit sparse reconstruction. The sparse penalty and multi-stage training preclude trivial decoder behavior.
- Codebook Collapse (VQ-VAEs): Fixed orthogonal dictionary (e.g., DCT) ensures no dead atoms, and differentiable thresholding further mitigates collapse.
A plausible implication is that SC-VAE provides a principled middle ground between continuous VAEs (dense Gaussian codes) and discrete VAEs (one-hot quantization), producing interpretable, well-behaved latent spaces.
7. Applicability and Significance
SC-VAE yields disentangled sparse codes amenable to downstream tasks such as:
- Image Generation and Morphing: Direct, interpretable code manipulation.
- Unsupervised Clustering and Segmentation: Patch-wise codes facilitate clustering (e.g., spectral clustering, k-means), outperforming prior sparse/quantized VAEs in IoU for medical and natural images.
- Robustness: Graceful degradation under additive Gaussian noise.
- Compression: For point cloud attributes, SC-VAE outperforms G-PCC v6 and RAHT, offers competitive visual quality with G-PCC v14, and runs in real time on commodity hardware.
SC-VAE constitutes an end-to-end learned codec and generative representation framework, combining sparse coding principles with deep variational modeling, and operationalized via differentiable solvers such as LISTA for scalable, interpretable, and robust latent representations (Xiao et al., 2023, Wang et al., 2022).