PatchVAE: Patch-Based VAE
- The paper introduces a patch-based bottleneck that decomposes images into discrete occurrence maps and continuous appearance codes.
- The architecture uses a convolutional backbone with two prediction heads, one modeling part appearance and one modeling part occurrence.
- Experiments on datasets like CIFAR-100 and ImageNet show improved recognition performance, highlighting a trade-off between generative fidelity and discriminative power.
PatchVAE is a variational auto-encoder (VAE) framework designed for unsupervised representation learning by modeling images at the patch or part level, rather than encoding entire images into global latent variables. PatchVAE introduces a patch-based bottleneck that encourages representations to focus on recurring, semantically meaningful patterns within images. Each image is encoded as a small, fixed set of mid-level parts, with each part captured by a discrete spatial occurrence map and a continuous appearance code. This architectural and probabilistic structure leads to features that are more effective for downstream recognition tasks compared to standard VAE formulations, while requiring only reconstruction-based training (Gupta et al., 2020).
1. Architectural Components and Posterior Factorization
PatchVAE diverges from canonical VAEs by decomposing each image into “parts,” where each part is defined by two variables:
- A continuous appearance code sampled once per part per image (shared across locations).
- A discrete occurrence map over a regular spatial grid of locations.
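Given an input image $x$, a convolutional backbone $\phi$ produces a feature map $f = \phi(x)$. Two prediction heads output the Bernoulli probability maps $\pi^{k}(l)$ (for occurrence) and Gaussian parameters $(\mu^{k}, \sigma^{k})$ (for appearance) for each part $k \in \{1, \dots, K\}$ at each location $l$.
At sampling time, for each part $k$, appearance codes $z_{\text{app}}^{k} \sim \mathcal{N}\big(\mu^{k}, \operatorname{diag}((\sigma^{k})^{2})\big)$ and occurrence samples $z_{\text{occ}}^{k}(l) \sim \mathrm{Bern}(\pi^{k}(l))$ are drawn. The patch-latent feature map per part is $z_{\text{patch}}^{k}(l) = z_{\text{occ}}^{k}(l)\, z_{\text{app}}^{k}$, and all $K$ such maps are concatenated along the channel dimension to yield $z_{\text{patch}}$. A lightweight deconvolutional decoder $g_{\theta}$ maps $z_{\text{patch}}$ to the reconstructed image $\hat{x} = g_{\theta}(z_{\text{patch}})$.
The encoder posterior thus factorizes as
$$q(z_{\text{app}}, z_{\text{occ}} \mid x) = \prod_{k=1}^{K} \Big[ q(z_{\text{app}}^{k} \mid x) \prod_{l} q(z_{\text{occ}}^{k}(l) \mid x) \Big],$$
while the decoder implements
$$p_{\theta}(x \mid z_{\text{app}}, z_{\text{occ}}) = p_{\theta}(x \mid z_{\text{patch}}).$$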
This formulation enforces a mid-level, part-based representation, with independence assumptions for tractable modeling and sampling (Gupta et al., 2020).
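The composition of per-part latent maps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `sample_part` and `compose_patch_latents`, the hard (non-relaxed) Bernoulli draw, and the toy sizes are all hypothetical choices made for the example.

```python
import random

def sample_part(mu, sigma, occ_probs, rng):
    """Sample one part: a single appearance vector and a binary occurrence grid.

    mu, sigma : lists of length d (assumed Gaussian parameters of the appearance code)
    occ_probs : H x W nested list of Bernoulli probabilities
    """
    # One appearance code per part per image, shared by every location.
    z_app = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    # Independent Bernoulli draw at each grid location.
    z_occ = [[1.0 if rng.random() < p else 0.0 for p in row] for row in occ_probs]
    return z_app, z_occ

def compose_patch_latents(parts):
    """Stack all parts along the channel axis: block k holds z_occ^k(l) * z_app^k."""
    channels = []
    for z_app, z_occ in parts:
        for a in z_app:  # one channel per appearance dimension
            channels.append([[o * a for o in row] for row in z_occ])
    return channels  # shape: (K * d) x H x W

rng = random.Random(0)
K, d, H, W = 3, 2, 4, 4  # toy sizes, chosen only for illustration
parts = [
    sample_part([0.0] * d, [1.0] * d, [[0.25] * W for _ in range(H)], rng)
    for _ in range(K)
]
z_patch = compose_patch_latents(parts)
print(len(z_patch), len(z_patch[0]), len(z_patch[0][0]))  # 6 4 4
```

Every location where a part's occurrence sample is zero contributes zeros in all of that part's channels, so the decoder only sees appearance information where the part is active.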
2. Objective Function and Patch-Level Bottleneck
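PatchVAE is trained via a VAE-style negative evidence lower bound (ELBO), extended to the local, part-based latent structure:

$$\mathcal{L}(x) = -\mathbb{E}_{q(z_{\text{app}}, z_{\text{occ}} \mid x)}\big[\log p_{\theta}(x \mid z_{\text{app}}, z_{\text{occ}})\big] + \beta\, \mathrm{KL}\big(q(z_{\text{app}}, z_{\text{occ}} \mid x) \,\|\, p(z_{\text{app}}, z_{\text{occ}})\big).$$

The KL term decomposes analytically given the modeling independence:

$$\mathrm{KL}\big(q(z_{\text{app}}, z_{\text{occ}} \mid x) \,\|\, p(z_{\text{app}}, z_{\text{occ}})\big) = \sum_{k=1}^{K} \Big[ \mathrm{KL}\big(q(z_{\text{app}}^{k} \mid x) \,\|\, \mathcal{N}(0, I)\big) + \sum_{l} \mathrm{KL}\big(q(z_{\text{occ}}^{k}(l) \mid x) \,\|\, \mathrm{Bern}(\pi_{\text{prior}})\big) \Big],$$

where $z_{\text{app}}^{k}$ is the appearance code of part $k$, $z_{\text{occ}}^{k}(l)$ is its occurrence at location $l$, and $\mathrm{Bern}(\pi_{\text{prior}})$ is a sparsity-inducing Bernoulli prior. Separate weights $\beta_{\text{app}}$, $\beta_{\text{occ}}$ may be used for the appearance and occurrence terms. The patch-level bottleneck arises from two factors: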
- The appearance code for each part is sampled once and shared across locations.
- The occurrence map is forced to be sparse, encouraging only a few active part codes per image.
This bottleneck incentivizes the emergence of semantic, style-like part codes and filtering out low-level noise, focusing the representation on recurring patterns such as object parts (Gupta et al., 2020).
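Both KL terms have closed forms, which is what makes the penalty cheap to compute. The sketch below assumes a diagonal Gaussian posterior against a standard normal prior for appearance and a scalar Bernoulli prior for occurrence; the function names and default weights are hypothetical, and the code is illustrative rather than the authors' implementation.

```python
import math

def kl_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def kl_bernoulli(q, p):
    """KL( Bern(q) || Bern(p) ) for probabilities strictly inside (0, 1)."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def patch_kl(app_params, occ_probs, prior, beta_app=1.0, beta_occ=1.0):
    """Weighted sum of appearance and occurrence KL terms over all parts.

    app_params : list of (mu, sigma) pairs, one per part
    occ_probs  : list of H x W probability grids, one per part
    prior      : scalar Bernoulli prior probability (a small value favors sparse maps)
    """
    kl_app = sum(kl_gaussian_to_standard_normal(mu, sigma) for mu, sigma in app_params)
    kl_occ = sum(kl_bernoulli(q, prior) for grid in occ_probs for row in grid for q in row)
    return beta_app * kl_app + beta_occ * kl_occ

# A posterior that equals the prior incurs no penalty.
print(patch_kl([([0.0, 0.0], [1.0, 1.0])], [[[0.1, 0.1]]], prior=0.1))  # 0.0
```

With a small prior probability, every location whose occurrence probability rises above the prior adds to the penalty, which is how the sparsity pressure on the occurrence maps shows up numerically.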
3. Weighted Reconstruction and Training Protocols
PatchVAE introduces a foreground-weighted reconstruction loss to direct model capacity towards semantically salient, high-detail regions. The unweighted term
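$$\mathcal{L}_{\text{rec}}(x, \hat{x}) = \sum_{p} \ell(x_{p}, \hat{x}_{p})$$
can be replaced by
$$\mathcal{L}_{w}(x, \hat{x}) = \sum_{p} w_{p}\, \ell(x_{p}, \hat{x}_{p}),$$
where $p$ indexes pixels, $\ell$ is the per-pixel reconstruction error between the image $x$ and its reconstruction $\hat{x}$, and $w_{p}$ is the normalized local gradient magnitude, accentuating textured and foreground areas.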
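A gradient-weighted reconstruction loss of this kind can be sketched as follows. Several details here are assumptions rather than facts from the source: forward finite differences stand in for the gradient operator, squared error stands in for the per-pixel reconstruction error, weights are normalized to sum to one, and the function names are hypothetical.

```python
def gradient_weights(image):
    """Per-pixel weights proportional to finite-difference gradient magnitude.

    image : H x W nested list of grayscale intensities.
    Weights sum to 1; a perfectly flat image falls back to uniform weights.
    """
    h, w = len(image), len(image[0])
    mags = []
    for i in range(h):
        row = []
        for j in range(w):
            # Forward differences, clamped at the image border.
            dx = image[i][min(j + 1, w - 1)] - image[i][j]
            dy = image[min(i + 1, h - 1)][j] - image[i][j]
            row.append((dx * dx + dy * dy) ** 0.5)
        mags.append(row)
    total = sum(sum(row) for row in mags)
    if total == 0.0:
        return [[1.0 / (h * w)] * w for _ in range(h)]
    return [[m / total for m in row] for row in mags]

def weighted_reconstruction_loss(x, x_hat, weights):
    """Sum over pixels of w_p * (x_p - x_hat_p)^2 (squared error is an assumption)."""
    return sum(
        wt * (a - b) ** 2
        for xr, hr, wr in zip(x, x_hat, weights)
        for a, b, wt in zip(xr, hr, wr)
    )

# Toy image with a vertical edge between columns 1 and 2.
x = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
w = gradient_weights(x)

x_hat_edge = [[0.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]  # error next to the edge
x_hat_flat = [[1.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]  # same error in a flat region
print(weighted_reconstruction_loss(x, x_hat_edge, w))  # 0.5
print(weighted_reconstruction_loss(x, x_hat_flat, w))  # 0.0
```

The same unit error is penalized at the edge and ignored in the flat region, which is the intended effect of steering capacity towards textured areas. A practical variant would likely mix in a uniform floor so that flat regions are not ignored entirely; that choice is not specified in the text above.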
PatchVAE is evaluated on CIFAR-100, MIT Indoor67, Places205, and ImageNet (with images resized to a fixed resolution). Patch extraction operates on convolutional feature maps (the backbone output), never on raw pixel grids. Unsupervised pretraining uses the Adam optimizer (learning rate 1e-4, batch size 128) with dataset-specific epoch counts and schedules for the Relaxed-Bernoulli temperature. For recognition benchmarks, supervised fine-tuning freezes a varying number of initial convolutional layers while training the added fully connected layers with SGD (momentum 0.9, learning rate dropped every 30 epochs) (Gupta et al., 2020).
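The Relaxed-Bernoulli temperature mentioned above controls how close the continuous occurrence samples are to hard 0/1 values. The sketch below uses the standard binary Concrete reparameterization (logistic noise added to the logit, then a tempered sigmoid). The exponential schedule in `temperature_at` and its default values are purely hypothetical, since the text only states that the schedule is dataset-specific.

```python
import math
import random

def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def relaxed_bernoulli_sample(prob, temperature, rng):
    """Draw a value in [0, 1] from a binary Concrete (Relaxed-Bernoulli) distribution.

    prob must lie strictly inside (0, 1); temperature must be positive.
    """
    # Clamp the uniform draw away from 0 and 1 to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    logits = math.log(prob) - math.log1p(-prob)
    noise = math.log(u) - math.log1p(-u)  # Logistic(0, 1) noise
    return _sigmoid((logits + noise) / temperature)

def temperature_at(epoch, start=1.0, end=0.1, num_epochs=100):
    """Hypothetical exponential decay from `start` to `end` over `num_epochs`."""
    frac = min(epoch, num_epochs) / num_epochs
    return start * (end / start) ** frac

rng = random.Random(0)
for epoch in (0, 50, 100):
    tau = temperature_at(epoch)
    draws = [relaxed_bernoulli_sample(0.3, tau, rng) for _ in range(5)]
    print(epoch, round(tau, 3), [round(v, 2) for v in draws])
```

At high temperature the samples spread over the unit interval and gradients flow easily; as the temperature falls, samples concentrate near 0 and 1 and behave like the discrete occurrence variables they approximate.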
4. Recognition Performance and Quantitative Results
Following unsupervised pretraining, the decoder is removed and a classifier is placed atop the pretrained backbone. Three fine-tuning “freeze-schedules” are employed:
- Freeze only Conv1, fine-tune rest.
- Freeze Conv1–3, fine-tune Conv4–5 + classifier.
- Freeze all except the final classifier.
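The three schedules above amount to choosing how many leading convolutional layers stay frozen. A small configuration sketch makes the mapping explicit; the layer names, the schedule keys, and the assumption of a five-layer convolutional backbone are hypothetical and chosen only to mirror the list.

```python
# Hypothetical layer names for a five-layer convolutional backbone.
CONV_LAYERS = ["conv1", "conv2", "conv3", "conv4", "conv5"]

# Number of leading conv layers kept frozen under each schedule.
FREEZE_SCHEDULES = {"conv1": 1, "conv1-3": 3, "conv1-5": 5}

def trainable_layers(schedule):
    """Return the layer names that receive gradient updates for a schedule."""
    n_frozen = FREEZE_SCHEDULES[schedule]
    return CONV_LAYERS[n_frozen:] + ["classifier"]

for name in FREEZE_SCHEDULES:
    print(name, "->", trainable_layers(name))
```

The more layers are frozen, the more directly the resulting accuracy reflects the quality of the pretrained features rather than what fine-tuning can recover.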
Table 1 gives top-1 accuracy (%) on CIFAR100/Indoor67/Places205; each column names the layers frozen during fine-tuning:
| Model | Conv1 frozen | Conv1–3 frozen | Conv1–5 frozen |
|---|---|---|---|
| β-VAE | 44.1 | 39.7 | 28.6 |
| β-VAE + weighted loss | 44.9 | 40.3 | 28.3 |
| PatchVAE | 43.1 | 38.6 | 28.7 |
| PatchVAE + weighted loss | 43.8 | 40.4 | 30.6 |
| BiGAN | 47.7 | 41.9 | 31.6 |
| ImageNet-supervised pretraining | 56.0 | 55.0 | 54.4 |
On ImageNet (ResNet-18 backbone):
| Model | Top-1 (%) | Top-5 (%) |
|---|---|---|
| β-VAE | 44.5 | 69.7 |
| PatchVAE | 47.0 | 71.7 |
| β-VAE + weighted loss | 47.3 | 71.8 |
| PatchVAE + weighted loss | 47.9 | 72.5 |
| Supervised ImageNet | 61.4 | 83.8 |
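On ImageNet, PatchVAE surpasses the β-VAE baseline (47.0 vs. 44.5 top-1); with the weighted loss both models improve and PatchVAE stays ahead (47.9 vs. 47.3). In Table 1 the picture is mixed: PatchVAE trails β-VAE when only Conv1 is frozen, but matches or exceeds it when all convolutional layers are frozen (30.6 vs. 28.3 with the weighted loss). It approaches, without matching, the adversarially trained BiGAN, despite relying purely on reconstruction loss (Gupta et al., 2020).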
5. Qualitative Analysis, Part Semantics, and Trade-offs
PatchVAE's occurrence maps highlight consistent mid-level semantics. For instance, certain parts consistently activate on round objects (e.g., fruit, wheels) or heads and faces in both CIFAR100 and ImageNet. Cropped regions with high occurrence probability for individual parts exhibit semantic consistency (e.g., groupings such as “car-windows” or “animal-ears”), all discovered unsupervised.
Part appearance swapping (substituting the appearance code $z_{\text{app}}^{k}$ of a part $k$ between images while fixing the occurrence maps) transfers the stylistic content of part $k$ to novel contexts, demonstrating the compositional and disentangled properties of the learned codes.
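The swap itself is a small operation on the latent code. The sketch below represents each part as an (appearance vector, occurrence grid) pair and exchanges only the appearance vector of one part between two images. The data layout and the function names are hypothetical, and decoding the swapped latents back to pixels is omitted.

```python
def part_latent(z_app, z_occ):
    """Per-part latent map: channel c at location (i, j) is z_occ[i][j] * z_app[c]."""
    return [[[o * a for o in row] for row in z_occ] for a in z_app]

def swap_appearance(parts_a, parts_b, k):
    """Return copies of two part lists with part k's appearance code exchanged.

    Each element of parts_a / parts_b is a (z_app, z_occ) pair.
    Occurrence maps stay with their original image.
    """
    new_a, new_b = list(parts_a), list(parts_b)
    new_a[k] = (parts_b[k][0], parts_a[k][1])
    new_b[k] = (parts_a[k][0], parts_b[k][1])
    return new_a, new_b

# Two toy images, each with two parts on a 1 x 2 grid.
parts_a = [([1.0, 2.0], [[1.0, 0.0]]), ([3.0, 4.0], [[0.0, 1.0]])]
parts_b = [([9.0, 8.0], [[0.0, 1.0]]), ([7.0, 6.0], [[1.0, 1.0]])]

new_a, new_b = swap_appearance(parts_a, parts_b, k=0)
print(part_latent(*new_a[0]))  # [[[9.0, 0.0]], [[8.0, 0.0]]]
```

Image A now carries image B's appearance for part 0, but only at the locations where part 0 was active in image A, which is why the swap changes style while leaving layout intact.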
There is a documented generative-discriminative trade-off: PatchVAE underperforms β-VAE in reconstruction quality (measured by PSNR, FID, and SSIM), indicating that it sacrifices image generation fidelity for more discriminative representations. This is consistent with the patch-based bottleneck’s effect of emphasizing repeatable, semantically meaningful structures over low-level noise (Gupta et al., 2020).
6. Limitations and Prospects for Extension
PatchVAE employs a fixed number $K$ of parts per image; this inflexibility may impair adaptation to varying image complexity. Potential avenues for overcoming this limitation include placing a prior over the number of parts, for example a Poisson prior or a nonparametric stick-breaking process.
Currently, the part appearance code is unimodal Gaussian. Mixture or hierarchical models could better capture multi-modal or structured part appearances. Further, extending PatchVAE to handle hierarchical patch arrangements or to operate over temporal patches in video is proposed as a pathway to even richer representations.
PatchVAE is a patch-based approach to unsupervised representation learning. By using a sparse, part-structured latent code with shared appearance and spatial occurrence, it improves recognition accuracy over β-VAE baselines in several settings while training purely on reconstruction, at some cost to generative quality (Gupta et al., 2020).