
Hierarchical VQ-GAE: Discrete Graph Autoencoding

Updated 17 December 2025
  • The paper presents a hierarchical vector quantized graph autoencoder that uses a two-layer codebook and annealing-based selection to robustly encode both node features and graph topology.
  • The methodology combines a graph neural encoder with a hierarchical VQ module and dual decoders to simultaneously reconstruct node attributes and predict link probabilities.
  • Empirical results demonstrate superior performance in link prediction and node classification across benchmark datasets, outperforming 16 state-of-the-art self-supervised models.

Hierarchical Vector Quantized Graph Autoencoder (HQA-GAE) is a neural framework that integrates hierarchical vector quantization and annealed code selection into graph autoencoders to address critical limitations in prior self-supervised graph representation learning. It combines a graph neural encoder, hierarchical two-layer vector quantization, and a dual-decoder structure, yielding discrete latent codes that robustly capture both node features and graph topology. HQA-GAE specifically resolves challenges in codebook underutilization and codebook space sparsity, outperforming a broad array of state-of-the-art baselines in link prediction and node classification on benchmark graph datasets (Zeng et al., 17 Apr 2025).

1. Architectural Overview

HQA-GAE extends standard graph autoencoders by inserting a vector quantization module, structured as a two-layer hierarchical codebook, between the encoder and decoder, and by using a temperature-annealed stochastic code-selection strategy. The encoder can be any graph neural network (GNN), such as GCN, GAT, or GraphSAGE, and maps each node's input feature $\mathbf{x}_i\in\mathbb{R}^D$ to a continuous latent vector $\mathbf{h}_i=E(\mathbf{x}_i)\in\mathbb{R}^d$. The vector quantization module contains:

  • First-layer codebook: $\{\mathbf{e}_{1,j}\}_{j=1}^{M}$
  • Second-layer codebook: $\{\mathbf{e}_{2,k}\}_{k=1}^{C}$, with $C < M$

For each node, $\mathbf{h}_i$ is assigned to its nearest first-layer code $\mathbf{e}_{1,i}$; then $\mathbf{e}_{1,i}$ is mapped to its nearest center in the second-layer codebook, $\mathbf{e}_{2,i}$. The node-feature decoder $D_{\rm node}$ (a shallow GAT) reconstructs node features $\hat{\mathbf{x}}_i$ from $\mathbf{e}_{1,i}$, while the edge decoder $D_{\rm edge}(\mathbf{h}_i,\mathbf{h}_j)=\sigma(\mathrm{MLP}(\mathbf{h}_i\circ\mathbf{h}_j))$ predicts link probabilities.
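The composition of these components can be sketched in a few lines of PyTorch Geometric. This is a minimal illustration, not the authors' implementation; the module choices, hidden sizes, and codebook sizes `M` and `C` are assumptions for exposition.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, GATConv

class HQAGAESketch(nn.Module):
    def __init__(self, in_dim, hid_dim, M=1024, C=64):
        super().__init__()
        self.encoder = GCNConv(in_dim, hid_dim)        # any GNN: GCN / GAT / GraphSAGE
        self.codebook1 = nn.Embedding(M, hid_dim)      # first-layer codes e_{1,j}
        self.codebook2 = nn.Embedding(C, hid_dim)      # second-layer centers e_{2,k}, C < M
        self.node_dec = GATConv(hid_dim, in_dim)       # shallow GAT feature decoder
        self.edge_dec = nn.Sequential(                 # MLP link predictor
            nn.Linear(hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, x, edge_index, edge_pairs):
        h = self.encoder(x, edge_index)                              # h_i = E(x_i)
        idx1 = torch.cdist(h, self.codebook1.weight).argmin(dim=1)   # nearest first-layer code
        e1 = self.codebook1(idx1)
        idx2 = torch.cdist(e1, self.codebook2.weight).argmin(dim=1)  # nearest second-layer center
        e2 = self.codebook2(idx2)
        x_hat = self.node_dec(e1, edge_index)                        # reconstruct features from e_{1,i}
        src, dst = edge_pairs
        link_prob = torch.sigmoid(self.edge_dec(h[src] * h[dst])).squeeze(-1)
        return h, e1, e2, x_hat, link_prob
```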

2. Vector Quantization Formalism

Letting $\mathbf{h}_i=E(\mathbf{x}_i)\in\mathbb{R}^d$, the two-level vector quantization is defined over:

  • First-layer codebook: $\mathcal{C}_1=\{\mathbf{e}_{1,j}\in\mathbb{R}^d\}_{j=1}^{M}$
  • Second-layer codebook: $\mathcal{C}_2=\{\mathbf{e}_{2,k}\in\mathbb{R}^d\}_{k=1}^{C}$

Quantization uses the squared Euclidean distance $d(\mathbf{z},\mathbf{e})=\|\mathbf{z}-\mathbf{e}\|_2^2$:

$$\mathbf{e}_{1,i}=\mathop{\arg\min}_{\mathbf{e}\in\mathcal{C}_1}\|\mathbf{h}_i-\mathbf{e}\|_2^2, \qquad \mathbf{e}_{2,i}=\mathop{\arg\min}_{\mathbf{e}\in\mathcal{C}_2}\|\mathbf{e}_{1,i}-\mathbf{e}\|_2^2$$

The node reconstruction is $\hat{\mathbf{x}}_i = D_{\rm node}(\mathbf{e}_{1,i})$. This quantization enforces discrete partitioning and compression of node embeddings, encouraging structured and interpretable latent clustering.
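A minimal tensor-level sketch of this two-level lookup follows; the helper name and toy sizes are illustrative, not taken from the paper.

```python
import torch

def quantize(z, codebook):
    """Assign each row of z to its nearest code under squared Euclidean distance."""
    d = torch.cdist(z, codebook, p=2) ** 2     # d(z, e) = ||z - e||_2^2 for all pairs
    idx = d.argmin(dim=1)                      # argmin over the codebook
    return idx, codebook[idx]

# Two-level lookup: h_i -> e_{1,i} -> e_{2,i}
h = torch.randn(5, 16)                                 # toy node embeddings
cb1, cb2 = torch.randn(32, 16), torch.randn(8, 16)     # first- and second-layer codebooks
idx1, e1 = quantize(h, cb1)
idx2, e2 = quantize(e1, cb2)
```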

3. Annealing-Based Code-Selection Mechanism

Standard VQ assignment can incur "winner-take-all" pathologies in which only a small subset of codes is ever utilized, impairing codebook diversity. HQA-GAE addresses this with a temperature-controlled softmax selection over $\mathcal{C}_1$, defined by

$$s_{i,j} = -\|\mathbf{h}_i - \mathbf{e}_{1,j}\|_2^2$$

$$p_{i,j}(t) = \frac{\exp\!\left(s_{i,j}/\tau(t)\right)}{\sum_{k=1}^{M} \exp\!\left(s_{i,k}/\tau(t)\right)}$$

with initial temperature $\tau(0)=\tau_0$ and decay schedule $\tau(t)=\max(\gamma\,\tau(t-1),\,\epsilon)$, where $\gamma$ is a decay factor and $\epsilon>0$ prevents premature sharpening.

Initially, $\tau$ is large and code assignment is nearly uniform, promoting codebook exploration. As $\tau\to 0$, assignment sharpens toward the nearest code (the argmin), focusing capacity on the most salient codes. This adaptive annealing, implemented without extra loss functions, encourages broad codebook utilization in early epochs and specialization later.
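A sketch of this schedule is shown below, assuming the soft assignment is realized by sampling one code per node from the categorical distribution $p_{i,j}(t)$; the values of $\tau_0$, $\gamma$, and $\epsilon$ are illustrative, not the paper's settings.

```python
import torch

def anneal(tau, gamma=0.99, eps=0.1):
    # tau(t) = max(gamma * tau(t-1), eps)
    return max(gamma * tau, eps)

def select_codes(h, codebook, tau):
    s = -torch.cdist(h, codebook, p=2) ** 2          # s_{i,j} = -||h_i - e_{1,j}||^2
    p = torch.softmax(s / tau, dim=1)                # p_{i,j}(t)
    return torch.multinomial(p, num_samples=1).squeeze(1)   # one sampled code per node

h, cb1 = torch.randn(5, 16), torch.randn(32, 16)
tau = 1.0                                            # large tau: near-uniform assignment
for epoch in range(100):
    idx = select_codes(h, cb1, tau)                  # stochastic code indices
    tau = anneal(tau)                                # sharpens toward the argmin over time
```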

4. Hierarchical Two-Layer Codebook

To alleviate codebook sparsity and encourage structured organization of the latent space, HQA-GAE introduces a two-layer codebook: the first layer contains a large set of codes, and the second layer clusters these codes into $C$ centers. For each second-layer center $\mathbf{e}_{2,k}$, a subset $S_k$ of the first-layer codes is assigned by maximizing

$$\max_{\mathcal{C}_2,\,S_1,\dots,S_C}\; \sum_{k=1}^{C} \sum_{j\in S_k} -\|\mathbf{e}_{1,j} - \mathbf{e}_{2,k}\|_2^2$$

This enforces that similar first-layer codes share a second-layer ancestor, implicitly via a second-level VQ loss. The approach sharpens the clustering of latent embeddings and encourages structural regularities, so that nodes with shared attributes or topology yield proximate discrete representations.
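Maximizing this objective is equivalent to assigning each first-layer code to its nearest second-layer center and minimizing their squared distances. A rough sketch of that second-level term follows; it illustrates the stated objective, not the authors' exact update rule.

```python
import torch
import torch.nn.functional as F

def second_level_term(cb1, cb2):
    # S_k: each first-layer code e_{1,j} is assigned to its nearest center e_{2,k}
    d = torch.cdist(cb1, cb2, p=2) ** 2
    nearest = cb2[d.argmin(dim=1)]
    # Minimizing this mean squared distance maximizes the clustering objective above
    return F.mse_loss(cb1, nearest)
```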

5. Joint Loss Function and Optimization

The total loss combines reconstruction, edge prediction, and vector-quantization penalties:

$$\mathcal{L} = \mathcal{L}_{\rm NodeRec} + \mathcal{L}_{\rm EdgeRec} + \alpha\,\mathcal{L}_{\rm vq1} + \beta\,\mathcal{L}_{\rm vq2}$$

where

  • Node-feature loss ($\mathcal{L}_{\rm NodeRec}$): scaled cosine error, penalizing angular deviations between original and reconstructed features.
  • Edge loss ($\mathcal{L}_{\rm EdgeRec}$): link-prediction loss over observed edges and negatively sampled non-edges, scored by the MLP link predictor.
  • VQ losses ($\mathcal{L}_{\rm vq1}$, $\mathcal{L}_{\rm vq2}$): enforce commitment of encoder outputs to the selected codes and update the codebook, with the stop-gradient operator ensuring valid optimization despite the discrete lookup.

All parameters, including encoder, decoders, and codebooks, are learned via gradient descent, employing straight-through gradients for non-differentiable assignments.
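A hedged sketch of how these terms might be composed, using standard VQ-VAE-style codebook and commitment terms with stop-gradients; the scaled-cosine exponent, the loss weights, and the exact placement of the straight-through estimator are assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def vq_terms(z, e, commit_weight=0.25):
    codebook_term = F.mse_loss(e, z.detach())          # pull codes toward encoder outputs
    commit_term = F.mse_loss(z, e.detach())            # commit encoder outputs to their codes
    return codebook_term + commit_weight * commit_term

def total_loss(x, x_hat, edge_logits, edge_labels, h, e1, e2, alpha=1.0, beta=1.0):
    node_rec = (1 - F.cosine_similarity(x_hat, x, dim=-1)).pow(2).mean()     # scaled cosine error
    edge_rec = F.binary_cross_entropy_with_logits(edge_logits, edge_labels)  # pos + negative samples
    return node_rec + edge_rec + alpha * vq_terms(h, e1) + beta * vq_terms(e1, e2)

# Straight-through estimator: the decoder sees the code while gradients flow to the encoder.
# e1_st = h + (e1 - h).detach()
```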

6. Experimental Framework

Evaluation considers eight standard undirected, unweighted graphs drawn from citation, co-purchase, and co-author networks, as well as the OGB benchmark. Tasks include:

  • Link prediction: measured by AUC and AP on held-out edges, with dot-product probes.
  • Node classification: using a linear SVM classifier on the learned node embeddings, validated with 5-fold cross-validation.

A total of 16 self-supervised baselines are compared, including contrastive methods (DGI, GIC, GRACE, etc.) and autoencoding/masked models (GAE, VGAE, ARGA, Bandana, etc.). Experiments are conducted using PyTorch Geometric on NVIDIA A800 hardware with CUDA 12.1.

| Dataset Type | Example Datasets       | Task(s)                              |
|--------------|------------------------|--------------------------------------|
| Citation     | Cora, CiteSeer, PubMed | Link prediction, Node classification |
| Co-purchase  | Photo, Computers       | Link prediction, Node classification |
| Co-author    | CS, Physics            | Link prediction, Node classification |
| OGB          | ogbn-arxiv             | Link prediction, Node classification |
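A minimal sketch of the two evaluation probes (linear-SVM node classification with 5-fold cross-validation, and dot-product link prediction scored by AUC/AP); the embedding matrix `Z`, labels `y`, and edge splits are assumed to be given.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, average_precision_score

def node_classification_acc(Z, y):
    # linear SVM probe on frozen embeddings, 5-fold cross-validation accuracy
    return cross_val_score(LinearSVC(max_iter=5000), Z, y, cv=5).mean()

def link_prediction_auc_ap(Z, pos_edges, neg_edges):
    # dot-product probe on held-out positive and sampled negative edges
    def score(edges):
        return np.sum(Z[edges[:, 0]] * Z[edges[:, 1]], axis=1)
    scores = np.concatenate([score(pos_edges), score(neg_edges)])
    labels = np.concatenate([np.ones(len(pos_edges)), np.zeros(len(neg_edges))])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```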

7. Performance and Empirical Insights

HQA-GAE demonstrates leading performance across all major datasets and metrics. In link prediction (AUC ± SD on Cora), HQA-GAE achieves $96.02 \pm 0.11$ with an average rank of 1.00, surpassing Bandana and MaskGAE. On the Photo and Computers graphs, the model exceeds the next-best AP by approximately 20 percentage points. In node classification, HQA-GAE ranks best on six of eight datasets (average rank 1.25), with, for example, 88.78 on Cora and 88.49 on PubMed, compared to Bandana's 88.59 and 88.16.

The architecture’s empirical strengths can be ascribed to:

  • Discrete compression via VQ: Facilitates encoding of salient structural graph signals rather than noise.
  • Annealing-based assignment: Mitigates codebook collapse, ensuring broader representation and improved generalization.
  • Hierarchical codebook clustering: Reduces code sparsity and creates more coherent, clusterable representations.
  • Dual reconstruction targets: By reconstructing both node features and graph links, the model jointly leverages topological and attribute information, in contrast to methods relying solely on perturbation-based contrastive objectives.

These structural elements yield robust, well-clustered embeddings and consistent improvements on self-supervised graph learning tasks (Zeng et al., 17 Apr 2025).
