Supervised Encoding Quantizers (SEQ)
- SEQ is a supervised learning framework that integrates discrete representation learning and interpretable clustering to yield semigenerative, graph-structured latent spaces.
- Its modular design, comprising an encoder, a k-means-based quantizer, and a decoder, facilitates high cluster purity and controllable style interpolation.
- The approach offers improved interpretability and low-complexity classification but requires careful tuning of the cluster count to balance performance and computational cost.
Supervised Encoding Quantizers (SEQ) constitute a supervised learning framework that combines discrete representation learning with interpretable clustering and semigenerative capabilities. SEQ departs from the classical paradigm, where features are directly mapped to label probabilities, by explicitly clustering encoded features and leveraging quantization to yield both interpretable graph-structured representations and controllable style interpolation. Discrete cluster assignments correspond to "styles" or sub-classes, providing semantic structure and transparency to the latent space. The approach was introduced and developed in "Supervised Encoding for Discrete Representation Learning" (Le et al., 2019).
1. Model Architecture and Key Components
SEQ comprises three core modules: encoder, quantizer, and decoder, supported optionally by a classification head.
- Encoder $f_\theta$: A feed-forward or convolutional neural network mapping each input $x$ to an embedding $z = f_\theta(x) \in \mathbb{R}^d$. Typical architectures include MLPs of depth 2 or 4 ("LAE-2", "LAE-4") or a convolutional frontend plus MLP ("CAE-4"). Initial encoder training uses a conventional cross-entropy objective with a softmax classification layer attached.
- Quantizer: Once encoder parameters are fixed, all embeddings are clustered using $K$-means, yielding cluster centers $\{c_k\}_{k=1}^{K}$. Quantization is performed by hard assignment:
$$q(z) = \arg\min_{k \in \{1,\dots,K\}} \lVert z - c_k \rVert_2 .$$
The cluster assignment can be represented by the one-hot vector $e_{q(z)} \in \{0,1\}^{K}$.
- Decoder $g_\phi$: A symmetric network (MLP or deconv+MLP) that reconstructs the input from latent embeddings or cluster centers: $\hat{x} = g_\phi(z)$ or $\hat{x} = g_\phi(c_{q(z)})$. Decoder training minimizes mean squared error (MSE) between $x$ and $\hat{x}$.
- Style-mixing via Convex Combination: Embeddings $z_a, z_b$ from different clusters (or the same) can be linearly interpolated as $z_\alpha = \alpha z_a + (1-\alpha) z_b$ with $\alpha \in [0,1]$, enabling smooth traversal and "style transfer" in latent space.
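The hard-assignment step of the quantizer can be illustrated with a minimal sketch. It assumes embeddings and centers are plain tuples of floats and uses Euclidean distance; the function name `quantize` and the toy codebook values are hypothetical and not taken from the reference work.

```python
import math

def quantize(z, centers):
    """Return (index of the nearest center, one-hot code) for embedding z."""
    dists = [math.dist(z, c) for c in centers]
    k = min(range(len(centers)), key=dists.__getitem__)
    one_hot = [1 if j == k else 0 for j in range(len(centers))]
    return k, one_hot

# Toy 2-D codebook with K = 3 centers (illustrative values only).
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
k, code = quantize((3.2, 0.5), centers)
print(k, code)
```

Indices here are 0-based, so the embedding above is assigned to the second center.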
2. Training Workflow and Losses
SEQ employs a three-stage training protocol, with the option for joint optimization.
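- Encoder Pre-training: Encoder $f_\theta$ (and the softmax layer $W$) are optimized via the standard classification loss:
$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c},$$
where $p_i = \mathrm{softmax}\big(W f_\theta(x_i)\big)$ and $y_i$ is the one-hot label of $x_i$.
- Quantizer Fitting: Fix $\theta$, compute $z_i = f_\theta(x_i)$. Run $K$-means to minimize:
$$\mathcal{L}_{\mathrm{km}} = \sum_{i=1}^{N} \min_{k} \lVert z_i - c_k \rVert_2^2 .$$
Assign each $z_i$ to its nearest cluster $q(z_i)$.
- Decoder Training: With $\theta$ fixed, train $g_\phi$ by minimizing reconstruction loss:
$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - g_\phi(z_i) \rVert_2^2 ,$$
or, for strictly quantized codes, using $g_\phi(c_{q(z_i)})$ in place of $g_\phi(z_i)$.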
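The quantizer-fitting stage can be sketched with plain Lloyd's iterations. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are tuples of floats, initial centers are sampled from the data, and the names `kmeans` and `objective` as well as the toy points are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        # Assignment step: nearest center for every embedding.
        new_assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, new_assign) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_assign == assign:  # assignments stopped changing
            break
        assign = new_assign
    return centers, assign

def objective(points, centers, assign):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assign))

# Two well-separated toy blobs standing in for encoder embeddings.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
centers, assign = kmeans(points, k=2)
print(assign, objective(points, centers, assign))
```

Lloyd's algorithm only reaches a local minimum of the objective, so practical runs typically use several restarts or a more careful initialization.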
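Optionally, a classification head $h_\psi$ can be trained on the one-hot quantized codes $e_{q(z_i)}$ with a loss:
$$\mathcal{L}_{\mathrm{q}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \big[h_\psi(e_{q(z_i)})\big]_c .$$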
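A joint end-to-end objective is possible:
$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_1\,\mathcal{L}_{\mathrm{km}} + \lambda_2\,\mathcal{L}_{\mathrm{rec}},$$
combining the classification, clustering, and reconstruction losses, with hyperparameters $\lambda_1, \lambda_2$ controlling trade-offs. In practice, the original scheme optimizes each stage sequentially rather than jointly.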
SEQ Training Pipeline (Pseudocode)
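Input: labeled data $\{(x_i, y_i)\}_{i=1}^{N}$, cluster count $K$
1. Train the encoder $f_\theta$ with a softmax layer on $(x_i, y_i)$ using cross-entropy.
2. Freeze $f_\theta$; compute $z_i = f_\theta(x_i)$; run $K$-means on $\{z_i\}$ to obtain centers $c_1, \dots, c_K$ and assignments $q(z_i)$.
3. Train the decoder $g_\phi$ to minimize the MSE between $x_i$ and $g_\phi(z_i)$ (or $g_\phi(c_{q(z_i)})$ for strictly quantized codes).
4. Optionally, train a classification head on the one-hot codes $e_{q(z_i)}$.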
3. Clustering Structure and Interpretability
SEQ’s quantizer induces a discrete cluster graph in latent space where each codebook center $c_k$ forms a node. Edges may be constructed between nodes whose centers are within a threshold: $\lVert c_j - c_k \rVert_2 < \tau$. Due to the supervised nature of training, each cluster node predominantly contains samples of a single class, and distinct clusters often correspond to sub-styles or sub-classes within a label.
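A thresholded cluster graph of this kind can be built as in the following minimal sketch. It assumes Euclidean distance between centers and a strict inequality; the function name `cluster_graph`, the parameter name `tau`, and the toy centers are hypothetical.

```python
import math
from itertools import combinations

def cluster_graph(centers, tau):
    """Adjacency sets linking centers whose Euclidean distance is below tau."""
    adj = {k: set() for k in range(len(centers))}
    for j, k in combinations(range(len(centers)), 2):
        if math.dist(centers[j], centers[k]) < tau:
            adj[j].add(k)
            adj[k].add(j)
    return adj

# Toy codebook: centers 0 and 1 are close, center 2 is isolated.
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
graph = cluster_graph(centers, tau=1.5)
print(graph)
```

A smaller threshold yields a sparser graph; a threshold above the largest pairwise distance connects every pair of nodes.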
Cluster quality and alignment with semantic labels are assessed using clustering purity
$$\mathrm{purity} = \frac{1}{N}\sum_{k=1}^{K} \max_{c}\,\big|\{\, i : q(z_i) = k,\ y_i = c \,\}\big| ,$$
as well as metrics such as normalized mutual information (NMI) and adjusted Rand index (ARI). In the reference work, purity is reported as the percentage of samples whose cluster’s majority-vote label matches the true label.
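Purity as described here, the fraction of samples whose cluster's majority-vote label matches their true label, can be computed as in this minimal sketch. The function name `purity` and the toy assignments and labels are illustrative, not data from the reference work.

```python
from collections import Counter, defaultdict

def purity(assignments, labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for k, y in zip(assignments, labels):
        by_cluster[k][y] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(labels)

# Toy example: cluster 0 mixes two classes, clusters 1 and 2 are pure.
assignments = [0, 0, 0, 1, 1, 2, 2, 2]
labels      = [7, 7, 1, 1, 1, 3, 3, 3]
score = purity(assignments, labels)
print(score)  # 7 of 8 samples match their cluster majority -> 0.875
```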
Empirically, clusters reveal interpretable "styles": for example, the digit '1' clusters into thin-upright, slanted, and fat-base, and fashion "bags" cluster by handle type.
4. Decoder and Style Interpolation
The decoder $g_\phi$ enables not only reconstruction but also style interpolation:
- Within-cluster interpolation: For embeddings $z_a, z_b$ within the same cluster, convex combinations
$$z_\alpha = \alpha z_a + (1-\alpha) z_b, \qquad \alpha \in [0,1],$$
decoded to $g_\phi(z_\alpha)$ yield images retaining the same "style" due to local convexity.
- Between-cluster interpolation: Given $z_A$ (cluster A) and $z_B$ (cluster B) from the same class, interpolations $g_\phi\big(\alpha z_A + (1-\alpha) z_B\big)$ produce a smooth morph between distinct styles.
- Interpolation assessment: The quality of interpolated reconstructions may be quantified by
7
or visual inspection. Outputs exhibit sharp, semantically meaningful morphs, without the blurring typical in VAEs or GANs.
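The interpolation procedure can be sketched as follows. This is a minimal illustration in which embeddings are lists of floats and the decoder is a hypothetical callable; the default identity decoder is only a stand-in for a trained network, and the names `interpolate` and `interpolation_path` are assumptions.

```python
def interpolate(z_a, z_b, alpha):
    """Convex combination alpha * z_a + (1 - alpha) * z_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z_a, z_b)]

def interpolation_path(z_a, z_b, steps, decoder=lambda z: z):
    """Decode `steps` evenly spaced points from z_b (alpha=0) to z_a (alpha=1)."""
    alphas = [i / (steps - 1) for i in range(steps)]
    return [decoder(interpolate(z_a, z_b, a)) for a in alphas]

# Toy 2-D embeddings; the identity decoder returns the latent points themselves.
path = interpolation_path([0.0, 2.0], [4.0, 0.0], steps=5)
print(path)
```

`steps` must be at least 2 so that both endpoints are included. With a real decoder, each element of `path` would be a reconstructed image rather than a latent vector.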
5. Experimental Results
Experiments were conducted on the MNIST and Fashion-MNIST datasets, comparing SEQ to DEC, IDEC, DCEC, and CAE-ℓ₂ baselines.
| Dataset (metric) | DEC | IDEC | DCEC | CAE-ℓ₂ | SEQ (k-means) |
|---|---|---|---|---|---|
| MNIST (purity, %) | 86.55 | 88.06 | 88.97 | 95.11 | 99.74 (±0.046) |
Classification accuracy (test set) increases with cluster count $K$: on MNIST, moving $K$ from 10 to 120 raises accuracy from ~92% to ~99%, and on Fashion-MNIST from ~84% to ~91.8%. The encoder’s softmax head provides a performance upper bound (~99.4% MNIST, ~92.2% Fashion-MNIST). Deeper networks (LAE-4 vs. LAE-2) and convolutional encoders (CAE-4) further improve results. On MNIST, accuracy improves monotonically with $K$ up to the largest cluster count reported.
Qualitative evaluations reveal that clusters correspond to visually and semantically distinct sub-styles within classes; interpolated samples remain sharp and do not display VAE-like or GAN-like blurring artifacts.
6. Advantages, Limitations, and Sensitivity
SEQ provides several benefits over classical supervised classifiers:
- Interpretability: Clusters correspond to discrete, visually meaningful style modes.
- Low Complexity Classification: Post-training, prediction reduces to nearest-centroid lookup.
- Fine-grained Generative Control: The architecture enables controlled style generation and morphing via convex interpolation.
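Nearest-centroid prediction can be sketched as a two-step lookup. The sketch assumes each cluster is labeled by the majority vote of its training members, which is one simple way to attach classes to centers and not necessarily the reference work's exact procedure; all names and toy values are hypothetical.

```python
import math
from collections import Counter

def cluster_labels(assignments, labels, k):
    """Majority-vote class label for each of the k clusters (None if empty)."""
    votes = [Counter() for _ in range(k)]
    for a, y in zip(assignments, labels):
        votes[a][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def predict(z, centers, center_labels):
    """Label of the center nearest to embedding z (Euclidean distance)."""
    nearest = min(range(len(centers)), key=lambda j: math.dist(z, centers[j]))
    return center_labels[nearest]

# Toy setup: two clusters carry class "1" (two styles), one carries class "7".
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
center_labels = cluster_labels(
    [0, 0, 1, 1, 2, 2, 2], ["1", "1", "7", "7", "1", "1", "7"], k=3
)
pred = predict((0.3, 3.5), centers, center_labels)
print(center_labels, pred)
```

At inference time the cost is one encoder pass plus $K$ distance computations, which is where the low-complexity claim comes from.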
Limitations include reliance on staged training, since the quantizer is not inherently end-to-end differentiable. The choice of cluster count $K$ is critical and governs the trade-off between computational cost and cluster purity. There is no explicit regularization on inter-cluster distances, and, in a joint scheme, improper weighting can lead to collapsed or poorly separated clusters.
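Ablation studies indicate that deeper and convolutional encoders yield higher accuracy and more meaningful style clusters, with cluster count $K$ selected so that the quantization accuracy $\mathrm{Acc}_{q}(K)$ stays above the encoder’s baseline $\mathrm{Acc}_{\mathrm{enc}}$ minus a small tolerance $\epsilon$, i.e. $\mathrm{Acc}_{q}(K) \ge \mathrm{Acc}_{\mathrm{enc}} - \epsilon$.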
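The tolerance rule for choosing the cluster count can be sketched as a simple search. Picking the smallest qualifying cluster count is an assumption made here to limit computational cost; the function name and the accuracy values are hypothetical placeholders, not reported results.

```python
def select_cluster_count(acc_by_k, encoder_acc, eps):
    """Smallest K whose quantized accuracy is at least encoder_acc - eps."""
    for k in sorted(acc_by_k):
        if acc_by_k[k] >= encoder_acc - eps:
            return k
    return None  # no candidate meets the tolerance

# Hypothetical validation accuracies, for illustration only.
acc_by_k = {10: 0.90, 20: 0.95, 40: 0.97, 80: 0.985}
chosen = select_cluster_count(acc_by_k, encoder_acc=0.99, eps=0.01)
print(chosen)
```

Tightening `eps` pushes the selection toward larger cluster counts, which is the cost-versus-accuracy trade-off discussed above.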
7. Broader Implications and Open Questions
A plausible implication is that SEQ bridges the gap between interpretable symbolic representations and high-accuracy deep representations, especially in domains where style diversity or subclass semantics matter. Open questions concern seamless end-to-end training of quantizable representations (potentially via soft or straight-through assignment) and transferable adaptation of SEQ to other modalities and hierarchical clustering tasks. The need for a principled choice of $K$ and robust regularization of inter-cluster structure remains a central issue in extending the framework (Le et al., 2019).