Graph VQ-Transformer: Hierarchical VQ for Graphs
- The paper introduces a hierarchical vector quantization layer in graph autoencoders to compress and discretize node representations effectively.
- It employs an annealing-based softmax sampling strategy to improve codebook utilization, prevent code collapse, and enhance topology recovery.
- Empirical results across diverse benchmarks demonstrate superior performance in link prediction and node classification compared to established baselines.
Graph VQ-Transformer refers to a class of graph autoencoder models that incorporate vector quantization (VQ) mechanisms to discretize node or subgraph representations. The approach aims to capture complex graph topologies in a compressed, discrete latent space. The "Hierarchical Vector Quantized Graph Autoencoder" (HQA-GAE) is a recently introduced framework that systematically inserts a vector quantization layer with a hierarchical codebook and annealing-based code selection between the encoder and decoders in a standard graph autoencoder pipeline. This architecture, focused on self-supervised learning on graphs, directly addresses limitations in prior work such as codebook underutilization and codebook-space sparsity, and demonstrates superior performance across multiple graph learning benchmarks (Zeng et al., 17 Apr 2025).
1. Model Architecture
The HQA-GAE framework augments the standard Graph Autoencoder (GAE) by integrating a hierarchical vector quantization layer. The architecture is as follows:
- Input: A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with node feature matrix $X$.
- Encoder: Any Graph Neural Network (GNN) (e.g., GCN, GraphSAGE, GAT) maps the features $X$ to continuous node embeddings $z_i$.
- VQ Layer: Implements a two-layer codebook. Each $z_i$ is quantized to its nearest vector in the first codebook; the selected codeword is subsequently mapped to a second-layer meta-code.
- Node Decoder: Reconstructs node features from the quantized embeddings, typically via a GAT decoder.
- Edge Decoder: Reconstructs adjacency values from the quantized embeddings, using an MLP and inner product.
This hierarchical quantization allows the model to exploit both fine and coarse code structures, promoting effective compression and representation of graph topologies.
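The end-to-end pipeline can be summarized in a minimal PyTorch sketch. This is an illustrative reconstruction, not the authors' implementation: the class names, the dense normalized-adjacency message passing, and the MLP node decoder (standing in for the GAT decoder) are assumptions made for self-containedness.

```python
# Minimal sketch of the encoder -> hierarchical VQ -> dual-decoder pipeline.
# Names (GCNEncoder, GraphVQAutoencoder, ...) are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNEncoder(nn.Module):
    """Two dense GCN-style propagation layers: H' = A_hat @ H @ W."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, out_dim)

    def forward(self, x, adj_norm):
        h = F.relu(adj_norm @ self.lin1(x))
        return adj_norm @ self.lin2(h)

class GraphVQAutoencoder(nn.Module):
    def __init__(self, in_dim, hid_dim, code_dim, vq_layer):
        super().__init__()
        self.encoder = GCNEncoder(in_dim, hid_dim, code_dim)
        self.vq = vq_layer                        # hierarchical VQ module
        self.node_decoder = nn.Sequential(        # stands in for the GAT decoder
            nn.Linear(code_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, in_dim))
        self.edge_mlp = nn.Linear(code_dim, code_dim)

    def forward(self, x, adj_norm):
        z = self.encoder(x, adj_norm)             # continuous node embeddings
        z_q, vq_loss = self.vq(z)                 # discretized embeddings + VQ losses
        x_rec = self.node_decoder(z_q)            # node-feature reconstruction
        e = self.edge_mlp(z_q)
        adj_logits = e @ e.t()                    # inner-product edge reconstruction
        return x_rec, adj_logits, vq_loss
```

The `vq_layer` module implementing the two-layer codebook is sketched in the next section.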
2. Vector Quantization and Hierarchical Codebook Design
Vector quantization in HQA-GAE goes beyond traditional VQ-VAE by introducing two levels of discrete encoding:
- First-Layer Codebook: A finite set of codewords compresses the node embeddings down to a small number of discrete codes. Nodes with similar code assignments are forced to differentiate during reconstruction, sharpening topological expressivity.
- Second-Layer (Meta-Code) Codebook: Comprising a smaller set of meta-codes, this layer clusters the first-layer codewords, capturing semantic or structural relationships between them. The clustering objective is analogous to a k-means assignment and is implemented by alternating assignment and update steps with straight-through gradients.
This hierarchical structure addresses codebook sparsity by regularizing code relationships and providing a coarse structural prior on the discrete latent space.
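A minimal sketch of such a two-layer quantizer is given below, assuming a standard VQ-VAE-style codebook/commitment loss and a simple nearest-neighbor meta-code assignment; the codebook sizes and loss weighting are illustrative choices, not the paper's settings.

```python
# Sketch of a two-layer (code + meta-code) quantizer with straight-through gradients.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalVQ(nn.Module):
    def __init__(self, code_dim, num_codes=512, num_meta_codes=32, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)            # first-layer codewords
        self.meta_codebook = nn.Embedding(num_meta_codes, code_dim)  # second-layer meta-codes
        self.beta = beta

    def forward(self, z):
        # First layer: nearest codeword for each node embedding.
        d1 = torch.cdist(z, self.codebook.weight)          # (N, K1) distances
        idx1 = d1.argmin(dim=1)
        e = self.codebook(idx1)

        # Second layer: each selected codeword is assigned to its nearest
        # meta-code, a k-means-style clustering of the first-layer codes.
        d2 = torch.cdist(e.detach(), self.meta_codebook.weight)
        idx2 = d2.argmin(dim=1)
        m = self.meta_codebook(idx2)

        # VQ-VAE-style codebook/commitment terms plus a meta-clustering term.
        loss = (F.mse_loss(e, z.detach())
                + self.beta * F.mse_loss(z, e.detach())
                + F.mse_loss(m, e.detach()))

        # Straight-through estimator: gradients bypass the discrete selection.
        z_q = z + (e - z).detach()
        return z_q, loss
```

In this sketch the meta-codes regularize the first-layer codewords rather than replace the quantized output passed to the decoders.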
3. Annealing-Based Code Selection Strategy
Standard VQ assigns codes via an argmax over code similarities, leading to "winner-take-all" utilization and significant codebook underutilization. HQA-GAE introduces annealing-based softmax sampling:
- Softmax Sampling: For an embedding $z_i$, the probability of selecting codeword $e_k$ is
  $$p(k \mid z_i) = \frac{\exp\!\big(\mathrm{sim}(z_i, e_k)/\tau_t\big)}{\sum_{k'} \exp\!\big(\mathrm{sim}(z_i, e_{k'})/\tau_t\big)},$$
  where $\mathrm{sim}(\cdot,\cdot)$ measures embedding–codeword similarity and $\tau_t$ is the temperature at epoch $t$.
- Annealing Schedule: The temperature $\tau_t$ decays over training from a high initial value toward a small floor. Early high temperature ensures uniform exploration and broad code utilization; later low temperature sharpens the distribution toward near-deterministic selections.
This scheme improves codebook utilization, alleviates code collapse, and optimizes downstream graph representation quality.
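A minimal sketch of this selection rule follows; the similarity measure (negative Euclidean distance) and the schedule constants are assumptions, not the paper's reported values. During training, this stochastic sampling would replace the deterministic nearest-neighbor lookup in the quantizer sketch above; as the temperature approaches its floor, it converges to the same argmax behavior.

```python
# Sketch of temperature-annealed stochastic code selection.
import torch

def select_codes(z, codebook, epoch, tau_max=1.0, tau_min=1e-2, decay=0.95):
    """Sample one codeword index per embedding from a softmax over similarities."""
    tau = max(tau_min, tau_max * decay ** epoch)   # annealed temperature
    sim = -torch.cdist(z, codebook)                # higher = more similar
    probs = torch.softmax(sim / tau, dim=1)        # (N, K) selection probabilities
    return torch.multinomial(probs, num_samples=1).squeeze(1)
```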
4. Mathematical Objectives and Training
HQA-GAE jointly optimizes several loss functions:
| Loss Term | Purpose | Formula (abridged) |
|---|---|---|
| $\mathcal{L}_{\text{node}}$ | Node feature reconstruction | Cosine-based error between input and decoded node features |
| $\mathcal{L}_{\text{edge}}$ | Edge (adjacency) reconstruction | Negative sampling on inner products of decoded embeddings |
| $\mathcal{L}_{\text{VQ}}^{(1)}$ | First-layer VQ commitment/assignment | VQ-VAE-style codebook and commitment terms with stop-gradients |
| $\mathcal{L}_{\text{VQ}}^{(2)}$ | Second-layer VQ (meta-clustering) | k-means-style assignment of first-layer codes to meta-codes |
| $\mathcal{L}_{\text{total}}$ | Total loss | Weighted sum of the above terms |
Weighting hyperparameters control the contribution of the two VQ losses to the total objective. Training follows a standard stochastic gradient descent regime, with codebook updates and temperature annealing applied at each epoch.
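A single training step combining these objectives might look as follows. The plain cosine-based node term and a dense (rather than negatively sampled) edge term are simplifications, and the weight `lam_vq` is an assumed placeholder for the paper's hyperparameters; `model` is the autoencoder sketched in Section 1.

```python
# Sketch of one training step over the joint objective.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x, adj_norm, adj_label, lam_vq=1.0):
    model.train()
    x_rec, adj_logits, vq_loss = model(x, adj_norm)

    # Node-feature reconstruction: cosine-similarity-based error.
    loss_node = (1 - F.cosine_similarity(x_rec, x, dim=-1)).mean()

    # Edge reconstruction: binary cross-entropy on inner-product logits
    # (dense adjacency here; the paper uses negative sampling instead).
    loss_edge = F.binary_cross_entropy_with_logits(adj_logits, adj_label)

    loss = loss_node + loss_edge + lam_vq * vq_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```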
5. Benchmarking, Empirical Results, and Ablation Analyses
HQA-GAE was evaluated on eight datasets, including citation networks (Cora, CiteSeer, PubMed), co-purchase graphs (Computers, Photo), co-author networks (CS, Physics), and the ogbn-arxiv benchmark. The evaluation tasks are self-supervised link prediction (using AUC, AP metrics) and node classification (downstream SVM with 5-fold cross-validation). The model outperformed 16 representative baselines—spanning both contrastive (DGI, GIC, GRACE, GCA, MVGRL, BGRL) and autoencoding (GAE, VGAE, ARGA, ARVGA, SeeGera, GraphMAE, GraphMAE2, MaskGAE, S2GAE, Bandana) approaches.
Key findings include:
- Link Prediction: Achieved the best AUC/AP on all datasets, with average rank 1.00 (0.5–20% higher absolute scores than the next-best baseline).
- Node Classification: Superior performance in 6/8 datasets, average rank 1.25.
- Ablation: The annealing parameter controls code utilization, and accuracy is maximized at a specific setting of the temperature schedule. Hierarchical codebooks (two-layer vs. single-layer) improved clustering metrics (NMI, ARI, SC) on 7/8 datasets. t-SNE visualizations showed improved class separation and meta-center formation for node embeddings.
6. Significance and Methodological Implications
HQA-GAE demonstrates that integrating VQ layers with hierarchical organization and annealed code selection effectively overcomes underutilization and sparsity challenges in discrete graph representation learning. The two-layer codebook not only provides compression but also encodes meta-relationships, leading to increased clustering coherence and more informative discrete embeddings. The annealing-based policy for code selection ensures robust exploration of the code space and prevents early specialization that could harm subsequent learning. These innovations suggest broader potential for discrete latent modeling in graph SSL beyond conventional perturbation-based contrastive methods. The approach provides a new paradigm for topological embedding compression in large and heterogeneous graphs (Zeng et al., 17 Apr 2025).
7. Connections to Existing Work and Outlook
Conventional graph self-supervised learning often relies on contrastive objectives with artificially perturbed graphs—a strategy that may corrupt intrinsic information. HQA-GAE, by eliminating reliance on perturbations and encoding structure through hierarchical VQ, enables learning representations that better reflect underlying graph topology. The results establish a precedent for discrete, cluster-aware latent spaces in graph representation learning. A plausible implication is the applicability of similar hierarchical quantization principles in multi-modal or multi-scale graph architectures, and as a foundation for resource-efficient, scalable graph modeling.