SC-VAE: Sparse Compression Variational Autoencoder
- SC-VAE is a model that merges sparse coding with variational autoencoder frameworks, using learned ISTA to enforce sparsity in latent representations.
- It employs a fixed orthogonal dictionary and an $\ell_1$ (Laplace) sparse prior, enabling robust reconstruction and controllable feature manipulation across diverse data modalities.
- Experimental results demonstrate superior performance in terms of reconstruction metrics and compression efficiency on image and point cloud tasks.
Sparse Compression Variational Autoencoder (SC-VAE) encompasses a family of models that integrate sparse data representations with variational autoencoder (VAE) frameworks. The principal SC-VAE variant leverages learned sparse coding, operationalized via trainable iterative shrinkage-thresholding algorithms, to produce interpretable latent structures and superior quantitative performance for a range of data modalities, including natural images and point clouds. Two representative instantiations are: (1) the Sparse Coding-based VAE with Learned ISTA for image tasks, and (2) the Sparse Tensor-based VAE with sparse convolutions for point cloud attribute compression.
1. Model Formulation and Generative Framework
The SC-VAE paradigm enforces sparsity in latent representations while maintaining compatibility with deep generative modeling. For image data (Xiao et al., 2023), the model is defined as follows:
- Generative Model: Given a sparse latent code $z$, the decoder reconstructs the input with an isotropic Gaussian likelihood:
$$p_\phi(x \mid z) = \mathcal{N}\!\left(x;\, \mathrm{Dec}_\phi(Dz),\, \sigma^2 I\right),$$
where $D$ is a fixed orthogonal dictionary (typically a DCT basis), and $\sigma^2$ is the observation variance.
- Inference Model: The deterministic encoder produces a feature $u = \mathrm{Enc}_\psi(x)$, then applies learned ISTA (LISTA) to estimate the sparse code:
$$z = \mathrm{LISTA}(u;\, W, S, \theta),$$
yielding an approximate posterior concentrated at the LISTA output, $q(z \mid x) = \delta\!\left(z - \mathrm{LISTA}(\mathrm{Enc}_\psi(x))\right)$.
- Sparse Prior: An $\ell_1$ (Laplace) prior is imposed on the codes:
$$p(z) \propto \exp\!\left(-\lambda \|z\|_1\right).$$
For point cloud attribute compression (Wang et al., 2022), SC-VAE utilizes sparse tensors representing point attributes, with sparse convolutions forming the encoder and decoder.
2. Sparse Coding and Learnable ISTA Algorithms
Sparse coding aims to represent high-dimensional vectors as a sparse linear combination of dictionary atoms. In SC-VAE (Xiao et al., 2023):
- Sparse Coding Objective (per feature vector $u$):
$$\min_z \; \tfrac{1}{2}\|u - Dz\|_2^2 + \lambda \|z\|_1.$$
- Learnable ISTA (LISTA): The sparse coding problem is solved by unrolling ISTA for $K$ iterations:
$$z^{(k+1)} = \mathrm{soft}_{\theta^{(k)}}\!\left(Wu + Sz^{(k)}\right), \qquad k = 0, \dots, K-1,$$
with learnable parameters $W$, $S$, and thresholds $\theta^{(k)}$. The initial code is $z^{(0)} = 0$.
Algorithm 1: Learnable ISTA (LISTA)
Input: feature $u$; parameters $W$, $S$, $\{\theta^{(k)}\}$; iteration count $K$
- Initialize $z^{(0)} = 0$
- For $k = 0$ to $K-1$: $z^{(k+1)} = \mathrm{soft}_{\theta^{(k)}}\!\left(Wu + Sz^{(k)}\right)$
- Return $z^{(K)}$
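The unrolled update above can be sketched in NumPy. The dictionary size and threshold schedule here are illustrative; with the classical ISTA initialization ($W = D^\top$, $S = I - D^\top D$), the learnable solver reduces to plain ISTA:

```python
import numpy as np

def soft_threshold(v, theta):
    # Proximal operator of the L1 norm: shrink each entry toward zero by theta.
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista(u, W, S, thetas):
    # Unrolled learnable ISTA: z^{k+1} = soft(W u + S z^k, theta_k), with z^0 = 0.
    z = np.zeros(S.shape[0])
    for theta in thetas:
        z = soft_threshold(W @ u + S @ z, theta)
    return z

# Toy sparse coding problem with an orthogonal dictionary (illustrative sizes).
rng = np.random.default_rng(0)
m = 16
D = np.linalg.qr(rng.standard_normal((m, m)))[0]   # orthogonal: D @ D.T = I
z_true = np.zeros(m)
z_true[2], z_true[7] = 1.5, -2.0
u = D @ z_true

# Classical ISTA initialization of the learnable parameters.
W = D.T
S = np.eye(m) - D.T @ D
z = lista(u, W, S, thetas=[0.1] * 5)
```

For an orthogonal dictionary the fixed point is the soft-thresholded analysis coefficients, so `z` recovers the support of `z_true` with entries shrunk by the threshold; learning $W$, $S$, and $\theta^{(k)}$ lets the unrolled network reach good codes in far fewer iterations than ISTA.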
3. Network Architectures and Computational Aspects
For image SC-VAE (Xiao et al., 2023):
- Encoder/Decoder: Follows the VQGAN backbone: stacked residual convolution, GroupNorm, Swish, and down-/up-sampling blocks.
- Dictionary: Fixed DCT basis, tied across layers.
- Training: Adam optimizer, batch size $16$, $10$ epochs.
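The fixed DCT dictionary can be built explicitly. A minimal sketch (illustrative size $n = 8$) constructing the orthonormal DCT-II basis and relying on its orthogonality:

```python
import numpy as np

def dct_dictionary(n):
    # Orthonormal DCT-II basis: row k is the k-th cosine atom.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] *= np.sqrt(1.0 / n)   # DC atom scaling
    D[1:, :] *= np.sqrt(2.0 / n)  # AC atom scaling
    return D

D = dct_dictionary(8)
```

Orthogonality ($DD^\top = I$) is what makes the fixed dictionary attractive: for an exactly orthogonal $D$, one soft-thresholding step on $D^\top u$ already solves the per-token sparse coding problem, and every atom remains usable (no dead codebook entries).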
For point cloud SC-VAE (Wang et al., 2022):
- Encoder: Six sparse 3D convolutions mapping the $3$ input channels (RGB) to $128$ features, yielding a bottleneck of $128$ features per occupied voxel.
- Decoder: Mirrors encoder with transposed convolutions.
- Entropy Model: Hyper-encoder/decoder operating on sparse tensors, context model for Laplace parameter estimation.
- Computational Complexity: Runtime and memory scale linearly with the number of points.
4. Training Objectives and Variational Framework
SC-VAE for images (Xiao et al., 2023) uses a composite loss:
- Pixel-level Reconstruction: $\mathcal{L}_{\mathrm{rec}} = \|x - \hat{x}\|_2^2$
- Latent-space Sparse Coding (averaged over the $N$ tokens): $\mathcal{L}_{\mathrm{sc}} = \frac{1}{N}\sum_{i=1}^{N}\left(\tfrac{1}{2}\|u_i - Dz_i\|_2^2 + \lambda\|z_i\|_1\right)$
- Total Objective: $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{sc}}$
This objective approximates the negative ELBO: the reconstruction term corresponds to $-\log p_\phi(x \mid z)$ under the Gaussian likelihood, and the $\ell_1$ penalty to $-\log p(z)$ under the Laplace prior.
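The composite objective can be sketched as follows; the relative weighting `beta` between the pixel and latent terms is a hypothetical knob added here for illustration, not a value from the paper:

```python
import numpy as np

def sc_vae_loss(x, x_hat, U, Z, D, lam=0.1, beta=1.0):
    # Pixel-level reconstruction term: mean squared error over pixels.
    rec = np.mean((x - x_hat) ** 2)
    # Latent sparse-coding term, averaged over tokens:
    # (1/2)||u_i - D z_i||^2 + lam * ||z_i||_1 for each token i.
    resid = U - Z @ D.T
    sc = np.mean(0.5 * np.sum(resid ** 2, axis=1) + lam * np.sum(np.abs(Z), axis=1))
    return rec + beta * sc

# Sanity check: exact reconstruction leaves only the L1 penalty.
D = np.eye(2)
Z = np.array([[1.0, -1.0]])
U = Z @ D.T
loss = sc_vae_loss(np.zeros(4), np.zeros(4), U, Z, D, lam=0.1)
```

Here `U` stacks the encoder features $u_i$ row-wise and `Z` their sparse codes; in training, gradients flow through both terms into the encoder, decoder, and LISTA parameters.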
For point cloud SC-VAE (Wang et al., 2022):
- ELBO: $\log p(x) \geq \mathbb{E}_{q(y \mid x)}\!\left[\log p(x \mid y)\right] - D_{\mathrm{KL}}\!\left(q(y \mid x)\,\|\,p(y)\right)$
- Rate-Distortion: $\mathcal{L} = R + \lambda D$, where $R$ is the estimated bitrate of the quantized latents under the Laplace entropy model and $D$ is the mean squared attribute distortion.
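The rate term can be estimated by integrating the Laplace model over each quantization bin. A sketch with hypothetical per-element parameters `mu` and `b` (in the actual codec these are predicted by the hyper-encoder/decoder and context model):

```python
import numpy as np

def laplace_cdf(x, mu, b):
    # CDF of the Laplace distribution with location mu and scale b.
    return 0.5 + 0.5 * np.sign(x - mu) * (1.0 - np.exp(-np.abs(x - mu) / b))

def rate_bits(y_hat, mu, b):
    # Probability mass of each quantized latent over its bin [y - 0.5, y + 0.5],
    # then total bits = -sum log2 p (clamped for numerical safety).
    p = laplace_cdf(y_hat + 0.5, mu, b) - laplace_cdf(y_hat - 0.5, mu, b)
    return -np.sum(np.log2(np.maximum(p, 1e-12)))

def rd_loss(x, x_hat, y_hat, mu, b, lam=0.01):
    # Rate-distortion objective: estimated bitrate plus lambda-weighted MSE.
    dist = np.mean((x - x_hat) ** 2)
    return rate_bits(y_hat, mu, b) + lam * dist

bits = rate_bits(np.zeros(4), 0.0, 1.0)
```

Latents that fall in the tails of the predicted Laplace get little probability mass and therefore cost more bits, which is exactly the pressure that drives the encoder toward compressible representations.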
5. Experimental Evaluation
SC-VAE demonstrates strong empirical performance across tasks.
- Image Reconstruction (FFHQ, ImageNet) (Xiao et al., 2023):
| Model | PSNR | SSIM | LPIPS | rFID |
|---|---|---|---|---|
| SC-VAE (FFHQ) | 34.92 | 0.9497 | 0.0080 | 4.21 |
| SC-VAE (ImageNet) | 38.40 | 0.9688 | 0.0070 | 0.71 |
- Qualitative Results: SC-VAE preserves fine details (leaves, textures) better than VQGAN/RQ-VAE and generalizes to out-of-distribution inputs.
- Attribute Manipulation: Varying sparse code components yields controlled changes in pose, lighting, style; smooth interpolation leads to coherent morphing.
- Unsupervised Segmentation: K-means clustering on patch-level codes leads to accurate segmentation (IoU up to 81.2%).
- Ablations (number of unrolled LISTA iterations $K$):

| $K$ | PSNR | Sparsity |
|---|---|---|
| 1 | 27.3 | 89.5% |
| 5 | 31.13 | 71.9% |
| 16 | 31.41 | 74.9% |
| 25 | drops | denser |
- Compression for Point Clouds (Wang et al., 2022):
| Baseline | BD-BR Reduction | BD-PSNR Gain (Y) |
|---|---|---|
| TMC13v6 | 24% | +0.97 dB |
| RAHT | 34% | +1.38 dB |
Qualitative results confirm fewer blocking artifacts and smooth reconstructions in the point cloud modality.
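The unsupervised segmentation result above rests on clustering patch-level codes with k-means. A minimal Lloyd's-algorithm sketch over synthetic "patch codes" (the real pipeline clusters the LISTA codes of image patches):

```python
import numpy as np

def kmeans(codes, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign each code to its nearest center,
    # then recompute centers as cluster means.
    rng = np.random.default_rng(seed)
    centers = codes[rng.choice(len(codes), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(codes[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = codes[labels == j].mean(axis=0)
    return labels

# Two well-separated groups of synthetic patch codes.
codes = np.vstack([np.zeros((10, 4)), 5.0 * np.ones((10, 4))])
labels = kmeans(codes, k=2)
```

Because sparse codes for visually similar patches share support patterns, simple distance-based clustering on them can separate semantic regions without labels.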
6. Mitigating Posterior and Codebook Collapse
SC-VAE addresses common VAE failures:
- Posterior Collapse: The encoder produces features that must admit sparse reconstruction. The sparse penalty and multi-stage training preclude trivial decoder behavior.
- Codebook Collapse (VQ-VAEs): Fixed orthogonal dictionary (e.g., DCT) ensures no dead atoms, and differentiable thresholding further mitigates collapse.
A plausible implication is that SC-VAE provides a principled middle ground between continuous VAEs (dense Gaussian codes) and discrete VAEs (one-hot quantization), producing interpretable, well-behaved latent spaces.
7. Applicability and Significance
SC-VAE yields disentangled sparse codes amenable to downstream tasks such as:
- Image Generation and Morphing: Direct, interpretable code manipulation.
- Unsupervised Clustering and Segmentation: Patch-wise codes facilitate clustering (e.g., spectral clustering, k-means), outperforming prior sparse/quantized VAEs in IoU for medical and natural images.
- Robustness: Graceful degradation under additive Gaussian noise.
- Compression: For point cloud attributes, SC-VAE outperforms G-PCC v6 and RAHT, offers competitive visual quality with G-PCC v14, and runs in real time on commodity hardware.
SC-VAE constitutes an end-to-end learned codec and generative representation framework, combining sparse coding principles with deep variational modeling, and operationalized via differentiable solvers such as LISTA for scalable, interpretable, and robust latent representations (Xiao et al., 2023, Wang et al., 2022).