Molecular VQ-VAE for Discrete Representations

Updated 4 July 2026

The paper introduces a molecular VQ-VAE that replaces continuous embeddings with discrete codebooks, enabling high-fidelity reconstruction of molecular graphs.
It defines how quantization granularity—whether for whole graphs, individual nodes, or local 3D environments—affects representation fidelity and chemical interpretability.
Results demonstrate that discrete latent representations improve reconstruction and generative tasks, despite challenges like ordering dependence and codebook utilization.

VQ-VAE for molecule representation denotes a class of molecular autoencoding methods in which continuous encoder outputs are mapped to a finite codebook, producing discrete latent symbols that serve as molecular representations for reconstruction, generation, or downstream prediction. In recent arXiv work, this idea appears in graph-native form as node-aligned latent sequences for molecular graphs, in atom- and bond-level tokenizers for latent diffusion, and in atom-centered structural alphabets for 3D geometry (Zheng et al., 2 Dec 2025). Across these variants, the central technical question is not merely whether the latent space is discrete, but what molecular entity is quantized (whole graphs, nodes, bonds, or local geometric environments) and how much chemical and structural information survives that quantization (Noravesh et al., 1 May 2026).

1. Discrete molecular latents and their contrast with continuous VAEs

A molecular VQ-VAE replaces a continuous Gaussian bottleneck with a discrete codebook and nearest-neighbor assignment. In the graph setting, the encoder produces continuous node-level latents $\mathbf{Z}^e=\{\mathbf{z}_i^e\}_{i=1}^N$; each latent is quantized against the codebook entries $\mathbf{c}_k$ by

$k_i = \arg\min_k \|\mathbf{z}_i^e - \mathbf{c}_k\|_2^2,$

and the molecule is represented by the discrete index sequence

$K=(k_1,\dots,k_N).$

The corresponding training objective combines node and edge reconstruction with codebook and commitment terms,

$\begin{aligned} \mathcal{L}_{\text{VQ-VAE}} &= \lambda_{\text{node}} \mathcal{L}_{\text{node}} + \lambda_{\text{edge}} \mathcal{L}_{\text{edge}} \\ &\quad + \|\mathrm{sg}(\mathbf{Z}^e) - \mathbf{Z}^q\|_2^2 + \beta \|\mathbf{Z}^e - \mathrm{sg}(\mathbf{Z}^q)\|_2^2 . \end{aligned}$

This is the clearest graph-native formulation of VQ-VAE for molecules in the supplied literature (Zheng et al., 2 Dec 2025).
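The quantization rule and the two codebook terms above can be illustrated with a minimal sketch in plain Python. The toy latents, the four-entry codebook, and the value `beta=0.25` are assumptions chosen for illustration, not values from the paper, and the stop-gradient operator $\mathrm{sg}(\cdot)$ has no counterpart here because no gradients are computed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each node latent z_i^e to the index k_i of its nearest code c_k."""
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def vq_terms(latents, codebook, indices, beta=0.25):
    """Return (codebook_term, commitment_term) summed over nodes.

    In the forward pass both terms share the same squared distance; they
    differ only in which side sg(.) freezes during backpropagation, which
    this gradient-free sketch cannot express.
    """
    dist = sum(sq_dist(z, codebook[k]) for z, k in zip(latents, indices))
    return dist, beta * dist


# Toy example: three node latents and a four-entry codebook (made-up values).
Z_e = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.95]]
C = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

K = quantize(Z_e, C)
codebook_term, commitment_term = vq_terms(Z_e, C, K)
print(K)  # [0, 1, 2]
```

The index list `K` plays the role of the discrete sequence $K=(k_1,\dots,k_N)$; in a trained model the decoder would receive the selected codebook rows rather than the raw latents.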

The significance of this discrete bottleneck is best understood by contrast with several closely related but non-VQ molecular autoencoders. The graph-based model in "A Graph VAE and Graph Transformer Approach to Generating Molecular Graphs" uses a continuous Gaussian latent $z\in\mathbb{R}^c$ , trained with node reconstruction, KL regularization, an auxiliary property-prediction loss, and later an edge-reconstruction term; it explicitly has no discrete latent codes, no codebook embeddings, no nearest-neighbor vector quantization, no commitment loss, and no tokenized latent graph representation (Mitton et al., 2021). Likewise, the GRU-based SMILES VAE in "Re-balancing Variational Autoencoder Loss for Molecule Sequence Generation" uses a unit Gaussian prior, latent dimension $56$, and a re-balanced $\beta<1$ objective to counter posterior collapse, but remains a continuous-latent sequence model (Yan et al., 2019). The Transformer-VAE for unseen-structure generation also retains a Gaussian latent and uses $\beta=0.01$ to keep the latent informative rather than ignored by the decoder (Yoshikai et al., 2024).

These contrasts delimit the molecular VQ-VAE regime sharply. Discrete latents are not defined by graph inputs alone, nor by decoder outputs over categorical atoms or bonds, but by an explicit quantization step between encoder and decoder. This suggests that many papers relevant to molecular representation learning contribute architectural components or training diagnostics without themselves being VQ-VAEs.

2. What is quantized: graphs, graph elements, and local 3D environments

The most consequential design choice in molecular VQ-VAE systems is the granularity of quantization. Recent work spans three distinct representations: node-context embeddings in graphs, separate atom and bond embeddings, and atom-centered local 3D descriptors.

Paper	Quantized unit	Discrete representation
GVT (Zheng et al., 2 Dec 2025)	Ordered node-context embedding	Node-aligned code sequence $K=(k_1,\dots,k_N)$
VQ-SAD (Noravesh et al., 1 May 2026)	Atom embedding and bond embedding	Separate atom and bond code assignments
Mol-StrucTok (Gao et al., 2024)	Atom-centered 14D local 3D descriptor	Atom-level structural token

In "Graph VQ-Transformer (GVT)," each molecule is represented as a graph $\mathbf{G}=(\mathbf{X},\mathbf{E})$, reordered by canonical Reverse Cuthill-McKee (RCM) before encoding. The encoder is a Graph Transformer over node features, edge features, Laplacian positional encodings, and adjacency. After the final encoder layer, node and edge information are fused by mean-aggregating incoming edge features and concatenating them with the final node embedding before a linear projection produces $\mathbf{z}_i^e$. Quantization is therefore per node, but each code represents a context-aware node state rather than an atom type alone (Zheng et al., 2 Dec 2025).
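The node and edge fusion step described for GVT can be sketched as follows. This is only an illustration under stated assumptions: the function name `fuse_node_and_edges`, the dictionary layout for edge features, and the zero vector used for nodes without incoming edges are all hypothetical, and the final linear projection is omitted.

```python
def fuse_node_and_edges(node_states, edge_states, edge_dim):
    """Concatenate each node state with the mean of its incoming edge features.

    node_states: list of per-node vectors.
    edge_states: dict mapping a directed pair (src, dst) to an edge vector.
    edge_dim:    length of every edge vector.
    """
    fused = []
    for i, h in enumerate(node_states):
        incoming = [e for (src, dst), e in edge_states.items() if dst == i]
        if incoming:
            agg = [sum(e[d] for e in incoming) / len(incoming)
                   for d in range(edge_dim)]
        else:
            agg = [0.0] * edge_dim  # assumption: isolated nodes get zeros
        fused.append(list(h) + agg)
    return fused


# Toy path graph 0 - 1 - 2 with two-dimensional edge features.
nodes = [[1.0], [2.0], [3.0]]
edges = {
    (0, 1): [1.0, 0.0], (1, 0): [1.0, 0.0],
    (1, 2): [0.0, 1.0], (2, 1): [0.0, 1.0],
}
fused = fuse_node_and_edges(nodes, edges, edge_dim=2)
```

Here the middle node receives the average of both edge vectors, so its fused state carries information about its bonding context as well as its own features.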

In "VQ-SAD: Vector Quantized Structure Aware Diffusion For Molecule Generation," the tokenizer acts on graph elements rather than whole-graph latents. Molecules are treated as graphs of atoms and bonds, and the model learns separate atom and bond codebooks. Atoms and bonds are encoded independently, quantized by nearest-neighbor lookup, and decoded back to atom and bond representations. The resulting representation is not a graph-level bottleneck, not a subgraph vocabulary, and not a fingerprint; it is a tokenization of individual nodes and edges, later used as the discrete state space for a diffusion model (Noravesh et al., 1 May 2026).

In "Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates," the quantized object is neither a graph-level embedding nor a tokenized string fragment. Each atom is described by a 14-dimensional local structural descriptor with two parts: one encodes local spherical coordinates, and the other encodes bond lengths and pairwise bond angles to the four nearest neighbors. The VQ-VAE assigns each atom a structural code, turning local 3D environments into a discrete structural alphabet (Gao et al., 2024).

This variation in quantization unit determines what the learned code means. In GVT, the code is an abstract local graph feature aligned to an ordered atom position. In VQ-SAD, the code is a contextualized atom or bond state used as a tokenizer for diffusion. In Mol-StrucTok, the code is an atom-level local geometric token that augments a 2D line notation.

3. Core architectural patterns

Despite different representational units, recent molecular VQ-VAE systems share a common encoder–quantizer–decoder template and then diverge in how they make the discrete representation usable for downstream generative modeling.

GVT uses a Graph Transformer encoder and a Graph Transformer decoder. The encoder updates node and edge states jointly using graph structure and Laplacian positional encodings. The decoder applies Rotary Positional Embeddings (RoPE) to the quantized node states, predicts initial edge features from node pairs with a shared MLP, and then iteratively updates nodes and edges through Graph Transformer layers before reconstructing node features and symmetrized adjacency/edge-type outputs. The latent sequence is variable-length, with one code per node, so molecular size maps directly to token-sequence length (Zheng et al., 2 Dec 2025).

VQ-SAD adopts a lighter tokenizer specification but makes a different design decision: the VQ-VAE is pretrained first and then frozen. Its atom path encodes each atom to a continuous embedding, quantizes that embedding by nearest-neighbor lookup in the atom codebook, and decodes the selected code back to an atom representation; the bond path applies the analogous procedure to each bond with a separate codebook. The losses are standard VQ-VAE terms plus a scaled cosine reconstruction term. The frozen tokenizer is then treated as a fixed symbolic interface for the diffusion model (Noravesh et al., 1 May 2026).
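The exact form of the scaled cosine reconstruction term is not recoverable from the text here. As a hedged illustration, the sketch below assumes a form commonly used in masked graph autoencoders, $(1-\cos(\mathbf{x},\hat{\mathbf{x}}))^{\gamma}$; the exponent `gamma=2.0` and the function name are assumptions, not values taken from the VQ-SAD paper.

```python
import math


def scaled_cosine_error(x, x_hat, gamma=2.0):
    """Return (1 - cos(x, x_hat)) ** gamma for one pair of feature vectors."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in x_hat))
    return (1.0 - dot / norm) ** gamma


# A reconstruction pointing in the same direction incurs no penalty,
# regardless of its magnitude; an orthogonal one incurs the full penalty.
aligned = scaled_cosine_error([1.0, 0.0], [2.0, 0.0])
orthogonal = scaled_cosine_error([1.0, 0.0], [0.0, 3.0])
```

Because the term depends only on direction, it is insensitive to feature scale, and exponents above one down-weight reconstructions that are already nearly aligned.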

Mol-StrucTok uses a deliberately compact tokenizer: a 3-layer MLP encoder and a 2-layer MLP decoder with a fixed hidden dimension, latent embedding dimension, and structural vocabulary size. Its VQ-VAE objective follows the standard reconstruction-plus-commitment pattern, with the codebook loss replaced in implementation by exponential moving average updates following VQ-VAE-2 / Razavi et al. The code is then fused with atom identity into an expanded token vocabulary for GPT-2 or injected as an additional embedding in Graphormer (Gao et al., 2024).
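The exponential moving average update that stands in for the codebook loss can be sketched in simplified form. This is a generic EMA codebook step, not the Mol-StrucTok implementation: the function name, the `decay` and `eps` values, and the toy data are assumptions, and the Laplace smoothing of counts used in some implementations is omitted.

```python
def ema_update(codebook, counts, sums, latents, indices, decay=0.99, eps=1e-5):
    """One EMA codebook step; updates counts, sums, and codebook in place."""
    dim = len(codebook[0])
    batch_counts = [0] * len(codebook)
    batch_sums = [[0.0] * dim for _ in codebook]
    # Accumulate how many latents chose each code and their vector sum.
    for z, k in zip(latents, indices):
        batch_counts[k] += 1
        for d in range(dim):
            batch_sums[k][d] += z[d]
    # Blend batch statistics into the running statistics, then renormalize.
    for k in range(len(codebook)):
        counts[k] = decay * counts[k] + (1 - decay) * batch_counts[k]
        for d in range(dim):
            sums[k][d] = decay * sums[k][d] + (1 - decay) * batch_sums[k][d]
            codebook[k][d] = sums[k][d] / (counts[k] + eps)
    return codebook


# Toy step: both latents are assigned to code 0; code 1 is unused.
codebook = [[0.0, 0.0], [1.0, 1.0]]
counts = [1.0, 1.0]
sums = [[0.0, 0.0], [1.0, 1.0]]
ema_update(codebook, counts, sums,
           latents=[[0.2, 0.0], [0.4, 0.0]], indices=[0, 0],
           decay=0.5, eps=0.0)
```

In this toy step code 0 moves toward the latents assigned to it, while the unused code 1 keeps its position because its running sum and count decay at the same rate.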

A common misconception is that molecule representation with VQ-VAE necessarily means learning one discrete code per molecule. The supplied literature shows three other regimes: per-node quantization, per-atom/per-bond quantization, and per-atom local-geometry quantization. This suggests that code granularity is a first-order modeling choice rather than a secondary implementation detail.

4. Reconstruction fidelity, generation, and downstream use

The practical value of a molecular VQ-VAE depends on whether the discrete representation is faithful enough to reconstruct the molecule and regular enough to support a second-stage generator or predictor.

GVT makes reconstruction fidelity its primary representational claim. It describes its node-aligned discrete latents as "high-fidelity discrete latents" because they support near-perfect graph reconstruction. Reported 0-error reconstruction rates cover DGAE on QM9 and ZINC250k, GVT without RoPE on QM9 and GuacaMol, and full GVT with RoPE on ZINC250k and GuacaMol. A decoder-only GPT-like Transformer is then trained autoregressively over the latent sequence.

On ZINC250k, the paper reports Validity, Uniqueness, and FCD; the validity value in its results table differs from the value stated later in the text. On MOSES it reports Validity, Unique, Novel, Filters, FCD, SNN, and Scaf. On GuacaMol it reports Validity, Unique, Novel, KL div, and FCD. The paper interprets the comparatively weaker NSPDK on ZINC250k as evidence that the discrete sequence captures local chemistry better than global topology (Zheng et al., 2 Dec 2025).

Mol-StrucTok uses VQ-VAE not as a graph tokenizer but as a 3D structural alphabet. Its learned codes support both autoregressive 3D generation and property prediction. In generation, each atom token is paired with a structural token from the expanded vocabulary, and GPT-2 is trained over the resulting token sequence.

Generation speed is reported in samples/s on one A100 and compared with EDM / GeoLDM, against which Mol-StrucTok is reported to be faster. PoseBusters-style validity checks include All Atoms Connected, Reasonable Bond Angles, Reasonable Bond Lengths, Double Bond Flatness, Reasonable Internal Energy, and No Internal Steric Clash. For conditional generation, MAEs are reported on six QM9 properties. In property prediction, adding Mol-StrucTok embeddings to Graphormer improves QM9 MAE on three reported properties (Gao et al., 2024).

VQ-SAD places the tokenizer inside a neuro-symbolic diffusion framework. Its evidence for better representation is modest but consistent: on QM9 unconditional generation and on ZINC250k, the paper compares SAD with VQ-SAD on Valid, Uni, FCD, and NSPDK. The most representation-specific result is collision rate, for which Table 4 compares MELD with VQ-SAD on QM9 and ZINC250k. The paper argues that the tokenizer reduces state clashing by preventing contextually different atoms from being treated as the same one-hot input state (Noravesh et al., 1 May 2026).

5. Continuous-latent baselines and transferable design ideas

Molecular VQ-VAE research in the supplied corpus is inseparable from continuous-latent baselines, because several design principles originate outside explicit quantization and remain directly transferable.

The 2021 graph VAE–Transformer model is not a VQ-VAE, but it contributes two ideas with clear relevance to discrete-latent design: graph-native encoding/decoding and property-aligned latent supervision. Molecules are represented directly as node-feature matrices and edge-feature tensors, encoded by GraphSAGE plus DIFFPOOL into a continuous graph latent $z\in\mathbb{R}^c$, with an auxiliary property head. Edge generation is delegated to a position-free Transformer with a node encoding MLP and a valency mask. Because the model separates node reconstruction from edge prediction and supervises the latent space with molecular properties, it serves as a strong source of graph-VAE design ideas and latent-interpretability strategies, even though it contains no graph-level discrete codebook, no quantization, and no commitment objective (Mitton et al., 2021).

The sequence-based continuous VAEs clarify a second issue: latent utilization. "Re-balancing Variational Autoencoder Loss for Molecule Sequence Generation" argues that teacher forcing causes an underestimated reconstruction loss, making KL pressure relatively too strong and driving posterior collapse. The paper reports that the reconstruction loss measured with teacher forcing quickly drops far below the loss measured without it. Under a vanilla annealed VAE, mutual information decreases, reconstruction accuracy degrades, and validity becomes almost perfect; under the re-balanced objective with a capped $\beta<1$, mutual information and reconstruction accuracy are retained (Yan et al., 2019). Although a VQ-VAE does not use the same KL-to-prior mechanism, the paper explicitly frames codebook collapse or low code usage as the discrete-latent analogue of latent underutilization.

The Transformer-VAE for unseen structure generation contributes architectural techniques that are directly portable to sequence-based VQ-VAE design. It uses randomized SMILES as input and canonical SMILES as output, pools the Transformer encoder memory by concatenating a start-token state, mean pooling, and max pooling into a fixed latent representation, and conditions the decoder by adding the latent vector to every token embedding. It reports that latent dimensionality can be reduced for both MOSES and ZINC-15 without major reconstruction loss, and uses the mean latent vector as a compact descriptor for XGBoost-based property prediction. This suggests that a discrete codebook model over SMILES can inherit not only the latent-variable objective, but also heteroencoding and decoder-conditioning mechanisms that force the representation to capture molecular identity rather than surface syntax (Yoshikai et al., 2024).
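The pooling and decoder-conditioning steps described above can be sketched in a few lines. This is an illustrative simplification: the function names are hypothetical, `memory[0]` is assumed to hold the start-token state, and the projection that maps the pooled vector to the latent size is left out.

```python
def pool_memory(memory):
    """Concatenate start-token state, mean pooling, and max pooling.

    memory: list of per-token hidden vectors, with memory[0] assumed to be
    the start-token position.
    """
    dim = len(memory[0])
    start = list(memory[0])
    mean = [sum(h[d] for h in memory) / len(memory) for d in range(dim)]
    peak = [max(h[d] for h in memory) for d in range(dim)]
    return start + mean + peak


def condition_embeddings(token_embeddings, latent):
    """Add the same latent vector to every decoder token embedding."""
    return [[e + z for e, z in zip(tok, latent)] for tok in token_embeddings]


memory = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
pooled = pool_memory(memory)

conditioned = condition_embeddings([[0.0, 0.0], [1.0, 1.0]], [0.5, -0.5])
```

The pooled vector has a fixed length regardless of sequence length, which is what allows a variable-length SMILES string to be summarized as a single latent before decoding.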

These continuous-latent papers correct a recurrent misunderstanding: replacing a Gaussian latent with a discrete codebook does not by itself guarantee informative molecular representations. Decoder strength, teacher forcing, latent conditioning, and auxiliary property supervision remain central.

6. Limitations, unresolved issues, and scope of the field

The recent molecular VQ-VAE literature establishes feasibility rather than closure. Several limitations recur across papers, and they define the current research frontier.

A first limitation is dependence on ordering or serialization. GVT relies heavily on canonical RCM node ordering and further shows that ordering alone is insufficient: without RoPE, reconstruction collapses on complex datasets such as GuacaMol, and it recovers once RoPE is added. The latent sequence is therefore not intrinsically permutation-invariant; it becomes useful only after a chosen graph serialization is made decodable by architecture (Zheng et al., 2 Dec 2025). Mol-StrucTok similarly depends on a sequence traversal and a reference-selection procedure; the paper explicitly shows that 2D topology-based reference selection is more robust than 1D or 3D alternatives under descriptor noise, as measured by RMSD-threshold success rates at a fixed noise scale (Gao et al., 2024).

A second limitation is the incompleteness of representation diagnostics. GVT does not provide a deeper codebook usage analysis such as perplexity, dead-code statistics, or formal semantic interpretation of code entries. VQ-SAD does not report tokenizer reconstruction accuracy, codebook utilization statistics, perplexity, or sensitivity to codebook size and embedding dimension; it also does not explicitly discuss codebook collapse, dead entries, EMA updates, or usage-balancing regularization. The paper claims that the large discrete code space yields more balanced atom and bond types, but no explicit code-frequency histogram or entropy analysis is reported (Noravesh et al., 1 May 2026). This suggests that current molecular VQ-VAE results are stronger on downstream generation metrics than on mechanistic understanding of the discrete vocabulary itself.
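The diagnostics named above are inexpensive to compute from code assignments alone. The sketch below shows one common way to measure codebook perplexity, dead codes, and utilization; the function name and the toy assignment list are assumptions for illustration and do not come from any of the cited papers.

```python
import math
from collections import Counter


def codebook_diagnostics(indices, codebook_size):
    """Summarize code usage from a flat list of assigned code indices."""
    counts = Counter(indices)
    total = len(indices)
    # Shannon entropy of the empirical code distribution, in nats.
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return {
        "perplexity": math.exp(entropy),           # effective number of codes
        "dead_codes": codebook_size - len(counts),  # entries never selected
        "utilization": len(counts) / codebook_size,
    }


# Toy assignments: four of eight codes are used, each equally often.
stats = codebook_diagnostics([0, 0, 1, 1, 2, 2, 3, 3], codebook_size=8)
```

A perplexity far below the codebook size, or a large dead-code count, would indicate that the nominal vocabulary overstates the number of distinct states the tokenizer actually uses.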

A third limitation concerns the level of chemical abstraction captured by the codes. GVT’s codes are abstract local graph features rather than explicit motifs; its oxygen example shows that several chemically similar oxygens can share the same code, while disambiguation is deferred to decoder order and attention. VQ-SAD quantizes atoms and bonds rather than explicit motifs or functional groups, so its rhetoric about a symbolic vocabulary should be interpreted as contextualized graph-element states rather than fragment-level tokens. Mol-StrucTok’s tokens are local 3D environments, not long-range conformational states; fine geometry is discretized approximately, and coordinate recovery still requires sequential reconstruction plus topology-aware optimization (Zheng et al., 2 Dec 2025).

Finally, not every bottlenecked molecular autoencoder belongs to the VQ-VAE family. "MolQAE: Quantum Autoencoder for Molecular Representation Learning" uses a quantum bottleneck over amplitude-encoded SMILES token-frequency distributions, but it has no discrete codebook, no nearest-neighbor lookup, and no commitment loss. It is therefore a useful contrast case: molecular compression can be achieved by other bottleneck mechanisms, but those mechanisms should not be conflated with vector quantization (Pan et al., 3 May 2025).

Taken together, the recent record defines VQ-VAE for molecule representation as a family of discrete-latent molecular tokenizers rather than a single canonical architecture. Graph-native node tokenizers, atom–bond tokenizers, and local 3D structural alphabets all satisfy that description, but they optimize different trade-offs among fidelity, locality, ordering dependence, downstream model compatibility, and chemical interpretability. The strongest current results come from settings where the discrete representation is close enough to lossless to support a powerful second-stage model, yet the main unresolved issue remains whether those codes can be made simultaneously faithful, chemically interpretable, and robustly utilized.