Supervised Encoding Quantizers (SEQ)

Updated 26 May 2026
  • SEQ is a supervised learning framework that integrates discrete representation learning and interpretable clustering to yield semigenerative, graph-structured latent spaces.
  • Its modular design—comprising an encoder, a k-means based quantizer, and a decoder—facilitates high cluster purity and controllable style interpolation.
  • The approach offers improved interpretability and low-complexity classification but requires careful tuning of the cluster count to balance performance and computational cost.

Supervised Encoding Quantizers (SEQ) constitute a supervised learning framework that combines discrete representation learning with interpretable clustering and semigenerative capabilities. SEQ departs from the classical paradigm, where features are directly mapped to label probabilities, by explicitly clustering encoded features and leveraging quantization to yield both interpretable graph-structured representations and controllable style interpolation. Discrete cluster assignments correspond to "styles" or sub-classes, providing semantic structure and transparency to the latent space. The approach was introduced and developed in "Supervised Encoding for Discrete Representation Learning" (Le et al., 2019).

1. Model Architecture and Key Components

SEQ comprises three core modules: encoder, quantizer, and decoder, supported optionally by a classification head.

  • Encoder $f_\phi$: A feed-forward or convolutional neural network mapping each input $x \in \mathbb{R}^D$ to an embedding $z = f_\phi(x) \in \mathbb{R}^d$. Typical architectures include MLPs of depth 2 or 4 ("LAE-2", "LAE-4") or a convolutional frontend plus MLP ("CAE-4"). Initial encoder training uses a conventional cross-entropy objective with a softmax classification layer $W$ attached.
  • Quantizer $Q_C$: Once the encoder parameters $\phi$ are fixed, all embeddings $\{z_i\}$ are clustered using $k$-means, yielding cluster centers $C = \{c_1, \ldots, c_K\}$. Quantization is performed by hard assignment:

$$Q_C(z) = \arg\min_{k \in \{1, \ldots, K\}} \|z - c_k\|_2^2$$

The cluster assignment can be represented by the one-hot vector $q \in \{0, 1\}^K$ with $q_k = 1$ exactly when $k = Q_C(z)$.

  • Decoder $g_\theta$: A symmetric network (MLP or deconv+MLP) that reconstructs the input from latent embeddings or cluster centers: $\hat{x} = g_\theta(z)$. Decoder training minimizes the mean squared error (MSE) between $x$ and $\hat{x}$.
  • Style-mixing via Convex Combination: Embeddings $z_i, z_j$ from different clusters (or the same) can be linearly interpolated as $z_\alpha = \alpha z_i + (1 - \alpha) z_j$ with $\alpha \in [0, 1]$, enabling smooth traversal and "style transfer" in latent space.
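The hard-assignment rule and its one-hot encoding can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`squared_distance`, `quantize`, `one_hot`) and the toy two-dimensional codebook are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z, centers):
    """Hard assignment: index of the center nearest to embedding z."""
    return min(range(len(centers)), key=lambda k: squared_distance(z, centers[k]))


def one_hot(index, num_clusters):
    """One-hot code of length num_clusters with a 1 at position index."""
    return [1 if k == index else 0 for k in range(num_clusters)]


# Toy codebook (hypothetical values, chosen only for illustration).
centers = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
z = [3.2, 0.5]

k = quantize(z, centers)
code = one_hot(k, len(centers))
print(k, code)
```

Indices here are zero-based, so the embedding above is assigned to the second center.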

2. Training Workflow and Losses

SEQ employs a three-stage training protocol, with the option for joint optimization.

  1. Encoder Pre-training: The encoder $f_\phi$ and the softmax layer $W$ are optimized via the standard classification loss:

$$\mathcal{L}_{\mathrm{cls}}(\phi, W) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i)$$

where $p(y \mid x) = \mathrm{softmax}(W f_\phi(x))_y$.

  2. Quantizer Fitting: Fix $\phi$, compute $z_i = f_\phi(x_i)$. Run $k$-means to minimize:

$$\mathcal{L}_{\mathrm{km}}(C) = \sum_{i=1}^{N} \min_{k \in \{1, \ldots, K\}} \|z_i - c_k\|_2^2$$

Assign each $z_i$ to its nearest cluster center $c_{Q_C(z_i)}$.

  3. Decoder Training: With $\phi$ and $C$ fixed, train the decoder $g_\theta$ by minimizing the reconstruction loss:

$$\mathcal{L}_{\mathrm{rec}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \|x_i - g_\theta(z_i)\|_2^2$$

or, for strictly quantized codes, using $g_\theta(c_{Q_C(z_i)})$ in place of $g_\theta(z_i)$.

Optionally, a classification head $h_\psi$ can be trained on the one-hot quantized codes $q_i$ of $Q_C(z_i)$ with a loss:

$$\mathcal{L}_{\mathrm{q}}(\psi) = -\frac{1}{N} \sum_{i=1}^{N} \log h_\psi(q_i)_{y_i}$$

A joint end-to-end objective is possible:

$$\mathcal{L}_{\mathrm{joint}} = \mathcal{L}_{\mathrm{cls}} + \lambda_1 \mathcal{L}_{\mathrm{km}} + \lambda_2 \mathcal{L}_{\mathrm{rec}}$$

with hyperparameters $\lambda_1, \lambda_2$ controlling trade-offs. In practice, the original scheme does not train jointly and instead optimizes each stage sequentially.

SEQ Training Pipeline (Pseudocode)

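    Input: labeled data {(x_i, y_i)}, cluster count K
    1. Train the encoder and its softmax layer by minimizing cross-entropy on (x_i, y_i).
    2. Freeze the encoder; compute embeddings z_i; fit k-means with K clusters on {z_i} to obtain the centers C.
    3. Train the decoder by minimizing the MSE between x_i and its reconstruction from z_i (or from the nearest center).
    4. Optionally, train a classification head on the one-hot cluster codes.
    Output: encoder, centers C, decoder (and optional head)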
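The quantizer-fitting stage is the only part of the pipeline that needs no neural network, so it can be sketched on its own. The following is a minimal Lloyd-style $k$-means in plain Python, assuming the embeddings have already been computed by a frozen encoder. The function names and the toy data are hypothetical, and the evenly spaced initialization is a simplification chosen to keep the example deterministic; practical $k$-means usually relies on random or k-means++ initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(z, centers):
    """Index of the center closest to z."""
    return min(range(len(centers)), key=lambda k: squared_distance(z, centers[k]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def fit_kmeans(embeddings, num_clusters, max_iters=100):
    """Lloyd's algorithm: alternate hard assignment and center updates."""
    n = len(embeddings)
    # Deterministic, evenly spaced initialization (an assumption of this sketch).
    centers = [list(embeddings[i * n // num_clusters]) for i in range(num_clusters)]
    for _ in range(max_iters):
        assignments = [nearest(z, centers) for z in embeddings]
        new_centers = []
        for k in range(num_clusters):
            members = [z for z, a in zip(embeddings, assignments) if a == k]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(mean(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    assignments = [nearest(z, centers) for z in embeddings]
    return centers, assignments


# Toy embeddings forming two well-separated groups (hypothetical values).
embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centers, assignments = fit_kmeans(embeddings, num_clusters=2)
print(centers, assignments)
```

In a full pipeline the returned centers would serve as the codebook for quantization, and a library implementation of $k$-means would normally replace this loop.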

3. Clustering Structure and Interpretability

SEQ's quantizer induces a discrete cluster graph in latent space in which each codebook center $c_k$ forms a node. Edges may be constructed between nodes whose centers lie within a threshold distance of each other: $\|c_i - c_j\|_2 < \tau$. Due to the supervised nature of training, each cluster node predominantly contains samples of a single class, and distinct clusters often correspond to sub-styles or sub-classes within a label.
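The thresholded cluster graph can be built directly from the codebook. The sketch below assumes Euclidean distance between centers and a user-chosen threshold (called `tau` here); the function name `build_cluster_graph` and the toy centers are hypothetical.

```python
import math
from itertools import combinations


def build_cluster_graph(centers, tau):
    """Return undirected edges (i, j), i < j, between centers closer than tau."""
    edges = []
    for i, j in combinations(range(len(centers)), 2):
        if math.dist(centers[i], centers[j]) < tau:
            edges.append((i, j))
    return edges


# Toy codebook with two tight pairs of centers (hypothetical values).
centers = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.5, 5.0]]
edges = build_cluster_graph(centers, tau=1.5)
print(edges)
```

A small threshold leaves most nodes isolated, while a large one connects everything, so the threshold would likely need tuning against the typical spacing of the centers.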

Cluster quality and alignment with semantic labels are assessed using clustering purity

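$$\mathrm{Purity} = \frac{1}{N} \sum_{k=1}^{K} \max_{j} \left| \{\, i : Q_C(z_i) = k,\ y_i = j \,\} \right|$$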

as well as metrics such as normalized mutual information (NMI) and adjusted Rand index (ARI). In the reference work, purity is reported as the percentage of samples whose cluster’s majority-vote label matches the true label.
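Purity as described here (the fraction of samples whose cluster's majority label matches their own label) is straightforward to compute from cluster assignments and true labels. The following is a small sketch; the function name and the toy labels are made up for illustration and do not come from the reference work.

```python
from collections import Counter, defaultdict


def cluster_purity(assignments, labels):
    """Fraction of samples that carry the majority label of their cluster."""
    counts_by_cluster = defaultdict(Counter)
    for k, y in zip(assignments, labels):
        counts_by_cluster[k][y] += 1
    # For each cluster, count the samples belonging to its most frequent label.
    majority_total = sum(c.most_common(1)[0][1] for c in counts_by_cluster.values())
    return majority_total / len(labels)


# Toy example: 8 samples in 3 clusters; one sample in cluster 0 is mislabeled.
assignments = [0, 0, 0, 1, 1, 1, 2, 2]
labels = [7, 7, 1, 1, 1, 1, 3, 3]
purity = cluster_purity(assignments, labels)
print(purity)
```

Multiplying the result by 100 gives the percentage form used when purity is reported.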

Empirically, clusters reveal interpretable "styles": for example, the digit '1' clusters into thin-upright, slanted, and fat-base, and fashion "bags" cluster by handle type.

4. Decoder and Style Interpolation

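The decoder $g_\theta$ enables not only reconstruction but also style interpolation:

  • Within-cluster interpolation: For embeddings $z_1, z_2$ within the same cluster, convex combinations

$$z_\alpha = \alpha z_1 + (1 - \alpha) z_2, \quad \alpha \in [0, 1]$$

decoded to $\hat{x}_\alpha = g_\theta(z_\alpha)$ yield images retaining the same "style" due to local convexity: each nearest-center (Voronoi) cell is convex, so $z_\alpha$ stays in the cluster shared by $z_1$ and $z_2$.

  • Between-cluster interpolation: Given $z_A$ (cluster A) and $z_B$ (cluster B) from the same class, interpolations $z_\alpha = \alpha z_A + (1 - \alpha) z_B$ produce a smooth morph between distinct styles.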
  • Interpolation assessment: The quality of interpolated reconstructions may be quantified by a reconstruction error computed on the decoded interpolants, or assessed by visual inspection. Outputs exhibit sharp, semantically meaningful morphs, without the blurring typical of VAEs or GANs.
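The latent-space side of interpolation can be checked without a decoder. The sketch below, using hypothetical names and toy values, builds convex combinations of two embeddings and records which cluster each interpolant falls into. Because nearest-center cells are convex, a path between two embeddings of the same cluster never leaves that cluster, whereas a path between different clusters crosses from one to the other. In a full system each interpolant would then be passed through the trained decoder.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z, centers):
    """Index of the center nearest to embedding z."""
    return min(range(len(centers)), key=lambda k: squared_distance(z, centers[k]))


def interpolate(z_a, z_b, alpha):
    """Convex combination alpha * z_a + (1 - alpha) * z_b."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(z_a, z_b)]


# Toy codebook and embeddings (hypothetical values).
centers = [[0.0, 0.0], [4.0, 0.0]]
z1, z2 = [0.5, 1.0], [1.5, -2.0]   # both nearest to center 0
z3 = [3.5, 0.0]                    # nearest to center 1

alphas = [i / 10 for i in range(11)]

# Within-cluster path: every interpolant stays in cluster 0.
within = [quantize(interpolate(z1, z2, a), centers) for a in alphas]

# Between-cluster path: starts at z3 (alpha = 0) and ends at z1 (alpha = 1).
between = [quantize(interpolate(z1, z3, a), centers) for a in alphas]

print(within)
print(between)
```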

5. Experimental Results

Experiments were conducted on the MNIST and Fashion-MNIST datasets, comparing SEQ to DEC, IDEC, DCEC, and CAE-$\ell_2$ baselines.

Dataset / metric    DEC     IDEC    DCEC    CAE-ℓ₂   SEQ (k-means)
MNIST purity (%)    86.55   88.06   88.97   95.11    99.74 (±0.046)

Classification accuracy (test set) increases with the cluster count $K$: on MNIST, moving $K$ from 10 to 120 raises accuracy from ~92% to ~99%, and on Fashion-MNIST from ~84% to ~91.8%. The encoder's softmax head provides a performance upper bound (~99.4% MNIST, ~92.2% Fashion-MNIST). Deeper networks (LAE-4 vs. LAE-2) and convolutional encoders (CAE-4) further enhance results. On MNIST, SEQ's accuracy improves monotonically over this range of $K$.

Qualitative evaluations reveal that clusters correspond to visually and semantically distinct sub-styles within classes; interpolated samples remain sharp and do not display VAE-like or GAN-like blurring artifacts.

6. Advantages, Limitations, and Sensitivity

SEQ provides several benefits over classical supervised classifiers:

  • Interpretability: Clusters correspond to discrete, visually meaningful style modes.
  • Low Complexity Classification: Post-training, prediction reduces to nearest-centroid lookup.
  • Fine-grained Generative Control: The architecture enables controlled style generation and morphing via convex interpolation.
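Nearest-centroid prediction can be illustrated as follows. This sketch assumes each cluster is labeled by majority vote over the training samples assigned to it, which is one simple way to map clusters to classes and is not necessarily the mapping used in the original work. All names and toy values are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z, centers):
    """Index of the center nearest to embedding z."""
    return min(range(len(centers)), key=lambda k: squared_distance(z, centers[k]))


def majority_labels(assignments, labels, num_clusters):
    """Majority-vote class label per cluster (None for an empty cluster)."""
    votes = [Counter() for _ in range(num_clusters)]
    for k, y in zip(assignments, labels):
        votes[k][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


def predict(z, centers, cluster_to_label):
    """Classify by looking up the label of the nearest center."""
    return cluster_to_label[quantize(z, centers)]


# Toy codebook: clusters 0 and 1 are two "styles" of class 1, cluster 2 is class 7.
centers = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
train_assignments = [0, 0, 1, 1, 2, 2, 2]
train_labels = [1, 1, 1, 1, 7, 7, 7]

cluster_to_label = majority_labels(train_assignments, train_labels, len(centers))
print(cluster_to_label)
print(predict([3.5, 0.2], centers, cluster_to_label))
print(predict([0.3, 3.0], centers, cluster_to_label))
```

At inference time the cost is one encoder pass plus $K$ distance computations and a table lookup.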

Limitations include reliance on staged training, since the hard-assignment quantizer is not inherently end-to-end differentiable. The choice of the cluster count $K$ is critical and governs the trade-off between computational cost and cluster purity. There is no explicit regularization on inter-cluster distances, and, in a joint scheme, improper weighting can lead to collapsed or poorly separated clusters.

Ablation studies indicate that deeper and convolutional encoders yield higher accuracy and more meaningful style clusters, with the cluster count $K$ selected to maintain the quantization accuracy $\mathrm{Acc}_Q(K)$ above the encoder's baseline $\mathrm{Acc}_{\mathrm{enc}}$ minus a small tolerance $\epsilon$, i.e. $\mathrm{Acc}_Q(K) \geq \mathrm{Acc}_{\mathrm{enc}} - \epsilon$.
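One way to apply this selection rule is to sweep candidate cluster counts and keep the smallest one whose quantized accuracy is within the tolerance of the encoder's baseline. Choosing the smallest qualifying count is an assumption of this sketch (motivated by keeping lookup cost low), and the accuracy values below are invented for illustration only; they are not results from the paper.

```python
def select_cluster_count(accuracy_by_k, encoder_accuracy, tolerance):
    """Smallest K whose quantized accuracy is within `tolerance` of the encoder.

    Returns None when no candidate satisfies the criterion.
    """
    for k in sorted(accuracy_by_k):
        if accuracy_by_k[k] >= encoder_accuracy - tolerance:
            return k
    return None


# Illustrative (made-up) validation accuracies for a sweep over K.
accuracy_by_k = {10: 0.90, 20: 0.95, 40: 0.97, 80: 0.985}

strict = select_cluster_count(accuracy_by_k, encoder_accuracy=0.99, tolerance=0.01)
loose = select_cluster_count(accuracy_by_k, encoder_accuracy=0.99, tolerance=0.05)
print(strict, loose)
```

A tighter tolerance pushes the selection toward larger codebooks, which reflects the cost versus accuracy trade-off discussed above.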

7. Broader Implications and Open Questions

A plausible implication is that SEQ bridges the gap between interpretable symbolic representations and high-accuracy deep representations, especially in domains where style diversity or subclass semantics matter. Open questions concern seamless end-to-end training of quantizable representations (potentially via soft or straight-through assignment) and transferable adaptation of SEQ to other modalities and hierarchical clustering tasks. The need for a principled choice of $K$ and robust regularization of inter-cluster structure remains a central issue in extending the framework (Le et al., 2019).
