
Graph VQ-Transformer: Hierarchical VQ for Graphs

Updated 17 December 2025
  • The paper introduces a hierarchical vector quantization layer in graph autoencoders to compress and discretize node representations effectively.
  • It employs an annealing-based softmax sampling strategy to improve codebook utilization, prevent code collapse, and enhance topology recovery.
  • Empirical results across diverse benchmarks demonstrate superior performance in link prediction and node classification compared to established baselines.

Graph VQ-Transformer refers to a class of graph autoencoder models that incorporate vector quantization (VQ) mechanisms to discretize node or subgraph representations. The approach aims to capture complex graph topologies in a compressed, discrete latent space. The "Hierarchical Vector Quantized Graph Autoencoder" (HQA-GAE) is a recently introduced framework that systematically inserts a vector quantization layer, with a hierarchical codebook and annealing-based code selection, between the encoder and decoders of a standard graph autoencoder pipeline. Designed for self-supervised learning on graphs, this architecture directly addresses limitations of prior work such as codebook underutilization and sparsity of the codebook space, and demonstrates superior performance across multiple graph learning benchmarks (Zeng et al., 17 Apr 2025).

1. Model Architecture

The HQA-GAE framework augments the standard Graph Autoencoder (GAE) by integrating a hierarchical vector quantization layer. The architecture is as follows:

  • Input: Graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$ with node features $\{\mathbf{x}_i\}_{i=1}^N$.
  • Encoder ($E$): Utilizes any Graph Neural Network (GNN) (e.g., GCN, GraphSAGE, GAT) to map features $\mathbf{x}_i$ to embeddings $\mathbf{h}_i \in \mathbb{R}^d$.
  • VQ Layer: Implements a two-layer codebook. Each $\mathbf{h}_i$ is quantized to its nearest vector $\mathbf{e}_{1,i}$ in the first codebook; subsequently, each $\mathbf{e}_{1,i}$ is mapped to a second-layer meta-code $\mathbf{e}_{2,i}$.
  • Node Decoder ($D_{\mathrm{node}}$): Reconstructs node features from $\mathbf{e}_{1,i}$, typically via a GAT decoder.
  • Edge Decoder ($D_{\mathrm{edge}}$): Reconstructs adjacency values from $\mathbf{h}_i$, using an MLP and inner product.

This hierarchical quantization allows the model to exploit both fine and coarse code structures, promoting effective compression and representation of graph topologies.
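
A minimal PyTorch Geometric sketch of this pipeline is given below. The class names, layer choices (a single GCNConv encoder, a GATConv node decoder, a linear edge decoder), codebook sizes, and the deterministic nearest-neighbour assignment are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, GATConv


class HierarchicalVQ(nn.Module):
    """Two-level codebook: node embeddings -> first-layer codes -> meta-codes (sketch)."""

    def __init__(self, dim, num_codes=512, num_meta=32):
        super().__init__()
        self.codebook1 = nn.Embedding(num_codes, dim)   # first-layer codewords e_{1,k}
        self.codebook2 = nn.Embedding(num_meta, dim)    # second-layer meta-codes e_{2,l}

    def forward(self, h):
        # Nearest-neighbour assignment to the first-layer codebook (deterministic variant;
        # the paper uses annealed softmax sampling during training, see Section 3).
        d1 = torch.cdist(h, self.codebook1.weight)      # (N, M) distances
        idx1 = d1.argmin(dim=1)
        e1 = self.codebook1(idx1)
        # Each selected first-layer code is mapped to its nearest meta-code.
        d2 = torch.cdist(e1, self.codebook2.weight)
        idx2 = d2.argmin(dim=1)
        e2 = self.codebook2(idx2)
        # Straight-through estimator so gradients flow back to the encoder.
        e1_st = h + (e1 - h).detach()
        return e1_st, e1, e2, idx1, idx2


class HQAGAESketch(nn.Module):
    """Encoder -> hierarchical VQ -> node decoder (from codes) and edge decoder (from h)."""

    def __init__(self, in_dim, hid_dim=256):
        super().__init__()
        self.encoder = GCNConv(in_dim, hid_dim)         # any GNN encoder (GCN/SAGE/GAT)
        self.vq = HierarchicalVQ(hid_dim)
        self.node_decoder = GATConv(hid_dim, in_dim)    # reconstructs features from e_{1,i}
        self.edge_mlp = nn.Linear(hid_dim, hid_dim)     # edge decoder: MLP + inner product

    def forward(self, x, edge_index):
        h = self.encoder(x, edge_index).relu()
        e1_st, e1, e2, idx1, idx2 = self.vq(h)
        x_hat = self.node_decoder(e1_st, edge_index)    # node feature reconstruction
        z = self.edge_mlp(h)
        edge_logits = (z[edge_index[0]] * z[edge_index[1]]).sum(-1)  # adjacency scores
        return x_hat, edge_logits, h, e1, e2
```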

2. Vector Quantization and Hierarchical Codebook Design

Vector quantization in HQA-GAE goes beyond traditional VQ-VAE by introducing two levels of discrete encoding:

  • First-Layer Codebook: A set $\{\mathbf{e}_{1,k}\}_{k=1}^M$ compresses $N$ node embeddings down to $M \ll N$ codewords. Nodes with similar code assignments are forced to differentiate during reconstruction, sharpening topological expressivity.
  • Second-Layer (Meta-Code) Codebook: Comprising $C < M$ meta-codes $\{\mathbf{e}_{2,\ell}\}_{\ell=1}^C$, this layer clusters first-layer codes, capturing semantic or structural relationships between codewords. The clustering objective is

$$\max_{\{\mathbf{e}_{2,j}\},\,\{S_j\}} \sum_{j=1}^C \sum_{i \in S_j} \mathrm{sim}(\mathbf{e}_{1,i}, \mathbf{e}_{2,j}),$$

analogous to k-means assignment, and implemented by assignment-update alternation and straight-through gradients.
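
The following sketch illustrates one such assignment-update alternation over the two codebooks; the function name, the cosine-similarity choice, and the single-step loop are assumptions for illustration (the paper additionally propagates gradients through the quantization via the straight-through estimator).

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def update_meta_codes(codebook1, codebook2, num_steps=1):
    """One k-means-style alternation (illustrative sketch):
    assign each first-layer codeword to its most similar meta-code, then move each
    meta-code to the mean of its assigned members.
    codebook1: (M, d) first-layer codewords, codebook2: (C, d) meta-codes."""
    e1 = F.normalize(codebook1, dim=-1)              # e_{1,k}, unit-normalized for cosine sim
    e2 = codebook2.clone()                           # e_{2,l}
    for _ in range(num_steps):
        sim = e1 @ F.normalize(e2, dim=-1).T         # sim(e_{1,i}, e_{2,j}), shape (M, C)
        assign = sim.argmax(dim=1)                   # cluster membership S_j
        for j in range(e2.size(0)):
            members = e1[assign == j]
            if members.numel() > 0:                  # leave empty clusters unchanged
                e2[j] = members.mean(dim=0)
    return e2, assign
```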

This hierarchical structure addresses codebook sparsity by regularizing code relationships and providing a coarse structural prior on the discrete latent space.

3. Annealing-Based Code Selection Strategy

Standard VQ assigns codes via an $\arg\max$ over code similarities, leading to "winner-take-all" utilization and significant codebook underutilization. HQA-GAE introduces annealing-based softmax sampling:

  • Softmax Sampling: For embedding $\mathbf{h}_i$, the code selection probability is defined as

$$p_{i,j}(t) = \frac{\exp(s_{i,j}/T(t))}{\sum_{k=1}^M \exp(s_{i,k}/T(t))},$$

where $s_{i,j} = \mathrm{sim}(\mathbf{h}_i, \mathbf{e}_{1,j})$, and $T(t)$ is the temperature at epoch $t$.

  • Annealing Schedule: The temperature decays via $T(t) = \max(\gamma T(t-1), \epsilon)$, with $0 < \gamma < 1$ and $\epsilon > 0$. Early high temperature ensures uniform exploration and broad code utilization; later low temperature sharpens to near-deterministic selections.

This scheme improves codebook utilization, alleviates code collapse, and optimizes downstream graph representation quality.
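
A minimal sketch of this annealed sampling, assuming cosine similarity for $s_{i,j}$ and illustrative values for $\gamma$ and $\epsilon$:

```python
import torch
import torch.nn.functional as F


def anneal_temperature(T_prev, gamma=0.9, eps=1e-2):
    """T(t) = max(gamma * T(t-1), eps): geometric decay with a floor."""
    return max(gamma * T_prev, eps)


def sample_codes(h, codebook, T):
    """Temperature-controlled stochastic code selection (sketch).
    High T -> near-uniform sampling over codes; low T -> near-argmax selection."""
    sim = F.normalize(h, dim=-1) @ F.normalize(codebook, dim=-1).T   # s_{i,j}, shape (N, M)
    probs = F.softmax(sim / T, dim=-1)                               # p_{i,j}(t)
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)        # one code per node
    return codebook[idx], idx


# Usage over training epochs (illustrative):
# T = 1.0
# for epoch in range(num_epochs):
#     e1, idx1 = sample_codes(h, codebook1_weight, T)
#     T = anneal_temperature(T)
```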

4. Mathematical Objectives and Training

HQA-GAE jointly optimizes several loss functions:

  • $\mathcal{L}_{\mathrm{NodeRec}}$ (node feature reconstruction): cosine-based, $1 - \frac{\mathbf{x}_i^\top \hat{\mathbf{x}}_i}{\|\mathbf{x}_i\|\,\|\hat{\mathbf{x}}_i\|}$.
  • $\mathcal{L}_{\mathrm{EdgeRec}}$ (edge/adjacency reconstruction): negative sampling on inner products $\mathbf{h}_i^\top \mathbf{h}_j$.
  • $\mathcal{L}_{\mathrm{vq1}}$ (first-layer VQ commitment/assignment): $\|\mathrm{sg}[\mathbf{e}_{1,i}] - \mathbf{h}_i\|_2^2 + \|\mathrm{sg}[\mathbf{h}_i] - \mathbf{e}_{1,i}\|_2^2$.
  • $\mathcal{L}_{\mathrm{vq2}}$ (second-layer VQ, meta-clustering): $\|\mathrm{sg}[\mathbf{e}_{2,i}] - \mathbf{e}_{1,i}\|_2^2 + \|\mathrm{sg}[\mathbf{e}_{1,i}] - \mathbf{e}_{2,i}\|_2^2$.
  • Total loss: $\mathcal{L} = \mathcal{L}_{\mathrm{NodeRec}} + \mathcal{L}_{\mathrm{EdgeRec}} + \alpha\,\mathcal{L}_{\mathrm{vq1}} + \beta\,\mathcal{L}_{\mathrm{vq2}}$.

Hyperparameters $\alpha$ and $\beta$ control the weighting of the VQ losses. Training follows a standard stochastic gradient descent regime with codebook updates and temperature annealing applied at each epoch.
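
A compact sketch of how these terms could be combined in PyTorch is given below; the function signature, the use of binary cross-entropy for the negative-sampled edge term, and the default weights are assumptions rather than the reference implementation.

```python
import torch
import torch.nn.functional as F


def hqa_gae_loss(x, x_hat, edge_logits, edge_labels, h, e1, e2, alpha=1.0, beta=1.0):
    """Total objective L = L_NodeRec + L_EdgeRec + alpha * L_vq1 + beta * L_vq2 (sketch).
    edge_labels: float tensor with 1 for observed edges and 0 for negative-sampled pairs."""
    # Node reconstruction: cosine error between original and reconstructed features.
    l_node = (1 - F.cosine_similarity(x, x_hat, dim=-1)).mean()
    # Edge reconstruction: binary cross-entropy over inner-product scores.
    l_edge = F.binary_cross_entropy_with_logits(edge_logits, edge_labels)
    # First-layer VQ: codebook + commitment terms; detach() plays the role of sg[.].
    l_vq1 = F.mse_loss(e1.detach(), h) + F.mse_loss(h.detach(), e1)
    # Second-layer VQ: pull first-layer codes and their meta-codes together.
    l_vq2 = F.mse_loss(e2.detach(), e1) + F.mse_loss(e1.detach(), e2)
    return l_node + l_edge + alpha * l_vq1 + beta * l_vq2
```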

5. Benchmarking, Empirical Results, and Ablation Analyses

HQA-GAE was evaluated on eight datasets, including citation networks (Cora, CiteSeer, PubMed), co-purchase graphs (Computers, Photo), co-author networks (CS, Physics), and the ogbn-arxiv benchmark. The evaluation tasks are link prediction (measured by AUC and AP) and node classification (a downstream SVM with 5-fold cross-validation) on the self-supervised representations. The model outperformed 16 representative baselines, spanning both contrastive (DGI, GIC, GRACE, GCA, MVGRL, BGRL) and autoencoding (GAE, VGAE, ARGA, ARVGA, SeeGera, GraphMAE, GraphMAE2, MaskGAE, S2GAE, Bandana) approaches.
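
A minimal sketch of such a frozen-embedding probe, assuming scikit-learn's LinearSVC as the downstream classifier (the exact SVM variant and settings used in the paper may differ):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score


def evaluate_embeddings(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """5-fold cross-validated accuracy of a linear SVM on frozen node embeddings."""
    clf = LinearSVC(max_iter=5000)   # illustrative probe; hyperparameters are assumptions
    scores = cross_val_score(clf, embeddings, labels, cv=5, scoring="accuracy")
    return scores.mean()
```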

Key findings include:

  • Link Prediction: Achieved best AUC/AP on all datasets, with average rank 1.00 (0.5–20% higher absolute scores than next-best).
  • Node Classification: Superior performance in 6/8 datasets, average rank 1.25.
  • Ablation: Annealing parameter $\gamma$ controls code utilization, with maximum accuracy observed at $\gamma \approx 0.9$. Hierarchical codebooks (two-layer vs. single-layer) showed improved clustering metrics (NMI, ARI, SC) on 7/8 datasets. t-SNE visualizations displayed improved class separation and meta-center formation for node embeddings.

6. Significance and Methodological Implications

HQA-GAE demonstrates that integrating VQ layers with hierarchical organization and annealed code selection effectively overcomes underutilization and sparsity challenges in discrete graph representation learning. The two-layer codebook not only provides compression but also encodes meta-relationships, leading to increased clustering coherence and more informative discrete embeddings. The annealing-based policy for code selection ensures robust exploration of the code space and prevents early specialization that could harm subsequent learning. These innovations suggest broader potential for discrete latent modeling in graph SSL beyond conventional perturbation-based contrastive methods. The approach provides a new paradigm for topological embedding compression in large and heterogeneous graphs (Zeng et al., 17 Apr 2025).

7. Connections to Existing Work and Outlook

Conventional graph self-supervised learning often relies on contrastive objectives with artificially perturbed graphs—a strategy that may corrupt intrinsic information. HQA-GAE, by eliminating reliance on perturbations and encoding structure through hierarchical VQ, enables learning representations that better reflect underlying graph topology. The results establish a precedent for discrete, cluster-aware latent spaces in graph representation learning. A plausible implication is the applicability of similar hierarchical quantization principles in multi-modal or multi-scale graph architectures, and as a foundation for resource-efficient, scalable graph modeling.
