HAT-GAE: Hierarchical Adaptive Masking & Corruption
- The paper introduces a novel self-supervised graph autoencoder that integrates hierarchical adaptive masking and trainable corruption to enhance feature reconstruction.
- HAT-GAE employs an iterative masking mechanism based on node and feature importance, paired with learnable noise injection to create robust representations.
- Extensive evaluations on transductive and inductive benchmarks demonstrate HAT-GAE’s superior performance over existing graph representation learning methods.
HAT-GAE (Hierarchical Adaptive masking and Trainable corruption Graph Auto-Encoder) is a self-supervised generative graph auto-encoder designed to enhance representation learning for graph-structured data. The model advances over prior self-supervised graph auto-encoders by incorporating a hierarchical adaptive masking mechanism, which incrementally increases training difficulty, and a trainable corruption scheme, which enables the model to learn robust representations by undoing adaptively learned noise. HAT-GAE achieves leading performance across multiple transductive and inductive node classification benchmarks, demonstrating the effectiveness of its component innovations (Sun, 2023).
1. Architectural Overview
HAT-GAE consists of five principal modules: Adaptive and Hierarchical Masking, Trainable Corruption, a Graph Neural Network (GNN) Encoder, Masked Hidden Representation, and a Decoder with Feature Reconstruction. The model operates over the original graph $\mathcal{G} = (A, X)$, where $A \in \{0,1\}^{N \times N}$ is the adjacency matrix and $X \in \mathbb{R}^{N \times d}$ is the node feature matrix. The training pipeline for each epoch performs the following steps (a code sketch follows the list):
- Hierarchical adaptive masking is applied to $X$ to produce masked features $\tilde{X}$, iteratively increasing masking at scheduled intervals.
- Trainable corruption introduces learnable noise into a subset of node features, selected by a Bernoulli mask $M$, yielding corrupted features $\hat{X}$.
- The encoder $f_E$ (a multi-head GAT) computes hidden states $H = f_E(A, \hat{X})$.
- Hidden states of "noisy" nodes in $H$ are zeroed out, forming $\tilde{H}$.
- The decoder $f_D$ (GAT-based) reconstructs node features $Z = f_D(A, \tilde{H})$.
- Only the features corresponding to corrupted nodes are reconstructed via a cosine-similarity-based loss.
This configuration allows the model to focus learning capacity on features and nodes most relevant for robust recovery of meaningful representations.
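The following PyTorch sketch assembles these steps into a single training iteration. It is a minimal illustration rather than the authors' code: the `encoder`/`decoder` call signatures, the additive form of noise injection, and the per-dimension noise parameter are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def train_step(A, X, mask_dims, encoder, decoder, noise, rho, optimizer):
    """One illustrative HAT-GAE training iteration (a sketch, not the authors' code)."""
    N, d = X.shape

    # 1. Hierarchical adaptive masking: zero the currently scheduled dimensions.
    X_tilde = X.clone()
    X_tilde[:, mask_dims] = 0.0

    # 2. Trainable corruption: Bernoulli-select entries and inject learnable noise.
    M = torch.bernoulli(torch.full((N, d), rho, device=X.device))
    X_hat = X_tilde + M * noise                     # noise: trainable (d,) parameter

    # 3. Encode the corrupted features with a multi-head GAT encoder.
    H = encoder(A, X_hat)

    # 4. Zero the hidden states of nodes that received noise.
    noisy = M.bool().any(dim=1)
    H_tilde = H * (~noisy).float().unsqueeze(1)

    # 5. Decode to reconstruct node features.
    Z = decoder(A, H_tilde)

    # 6. Cosine reconstruction loss over corrupted nodes only.
    loss = (1 - F.cosine_similarity(X[noisy], Z[noisy], dim=1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```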
2. Hierarchical Adaptive Masking Mechanism
The hierarchical adaptive masking strategy aims to simulate progressive curriculum learning by dynamically increasing feature masking difficulty during training. Masking is performed along feature dimensions, guided by quantifiable importance scores.
2.1 Node and Dimension Importance
Node importance $s_i$ is computed, by default, as the in-degree:

$$s_i = \sum_{j=1}^{N} A_{ji}$$
Alternative importance metrics, such as eigenvector centrality or PageRank, are supported but not the default.
Feature dimension importance $w_j$ is then aggregated over nodes, weighted by node importance:

$$w_j = \sum_{i=1}^{N} s_i \cdot \lvert x_{ij} \rvert$$
These scores are sorted in descending order, with less informative dimensions masked earlier in training.
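A compact sketch of these scores, assuming dense tensors and the degree-weighted aggregation written above (the paper's exact aggregation may differ):

```python
import torch

def importance_scores(A, X):
    """Node importance (in-degree) and degree-weighted per-dimension importance."""
    s = A.sum(dim=0)                            # in-degree s_i, shape (N,)
    w = (s.unsqueeze(1) * X.abs()).sum(dim=0)   # dimension importance w_j, shape (d,)
    order = torch.argsort(w)                    # ascending: mask least informative first
    return s, w, order
```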
2.2 Adaptive and Hierarchical Scheduling
In each adaptive masking step, a fraction of the lowest-scored dimensions of every node's feature vector is zeroed, with the masking rate $p_m$ controlling the masked proportion:

$$\tilde{X} = X \odot (\mathbf{1} - M_h),$$

where the binary matrix $M_h$ selects the chosen dimensions (Eq. 4 in the paper).
The masking schedule is governed by the number of rounds $R$ and total epochs $T$, with re-masking occurring every $T/R$ epochs. The number of dimensions $d_r$ masked at round $r$ is recursively decreased to avoid masking all features at once:

$$d_r = \left\lceil p_m \left( d - \sum_{k=1}^{r-1} d_k \right) \right\rceil$$
Through this procedure, the model incrementally increases task difficulty, first challenging the network with less critical features and progressing towards increasingly difficult signal recovery.
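Under the recursion above, the per-round mask sizes can be computed as follows (a sketch; `p_m` and `R` follow the notation in this section):

```python
import math

def masking_schedule(d, p_m, R):
    """Mask a fraction p_m of the still-unmasked dimensions at each of R rounds."""
    remaining, sizes = d, []
    for _ in range(R):
        d_r = math.ceil(p_m * remaining)
        sizes.append(d_r)
        remaining -= d_r
    return sizes

# e.g. for Cora's 1433 input features: masking_schedule(1433, 0.3, 3) -> [430, 301, 211]
```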
3. Trainable Corruption Scheme
Unlike models relying on fixed or random feature corruption, HAT-GAE introduces a corruption process with a learnable noise component.
3.1 Bernoulli Mask Sampling
For each node-feature pair $(i, j)$, a binary mask entry $M_{ij}$ is sampled as:

$$M_{ij} \sim \mathrm{Bernoulli}(\rho),$$

where $\rho$ is the noisy node rate specifying the expected fraction of entries corrupted per epoch.
3.2 Learnable Noise Injection
A trainable noise parameter $\epsilon$ is combined with the mask to produce corrupted features:

$$\hat{x}_{ij} = \tilde{x}_{ij} + M_{ij}\,\epsilon_{j}$$

$\epsilon$ is optimized jointly with the encoder and decoder through the self-supervised reconstruction loss, with no additional regularization. This challenges the auto-encoder to become robust to adversarially learned, rather than random, perturbations.
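A self-contained module sketch of this scheme; the additive injection and the per-dimension noise vector are assumptions matching the equation above:

```python
import torch
import torch.nn as nn

class TrainableCorruption(nn.Module):
    """Bernoulli-masked injection of a learnable noise vector (illustrative sketch)."""
    def __init__(self, d, rho):
        super().__init__()
        self.noise = nn.Parameter(torch.zeros(d))  # epsilon, trained with encoder/decoder
        self.rho = rho                             # noisy node rate

    def forward(self, X_tilde):
        M = torch.bernoulli(torch.full_like(X_tilde, self.rho))
        X_hat = X_tilde + M * self.noise           # noise broadcast across all nodes
        return X_hat, M
```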
4. Self-Supervised Optimization Objective
After reconstruction, the model's objective is to accurately recover only the corrupted node features. For corrupted nodes $v_i \in \mathcal{V}_c$, the cosine similarity between true and reconstructed features is:

$$\mathrm{sim}(x_i, z_i) = \frac{x_i^{\top} z_i}{\lVert x_i \rVert \, \lVert z_i \rVert}$$

The loss is:

$$\mathcal{L} = \frac{1}{\lvert \mathcal{V}_c \rvert} \sum_{v_i \in \mathcal{V}_c} \bigl(1 - \mathrm{sim}(x_i, z_i)\bigr)$$
No auxiliary contrastive or adversarial losses are introduced. This singular focus facilitates computational efficiency while still providing a strong training signal.
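Expressed directly in PyTorch, the objective is a transcription of the formula above, assuming a boolean index of corrupted nodes:

```python
import torch.nn.functional as F

def reconstruction_loss(X, Z, corrupted):
    """Mean (1 - cosine similarity) over corrupted nodes only."""
    sim = F.cosine_similarity(X[corrupted], Z[corrupted], dim=1)
    return (1.0 - sim).mean()
```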
5. Implementation Specifications
HAT-GAE is implemented with a 2-layer Graph Attention Network for both encoder and decoder, using four attention heads per layer and a PReLU activation. The hidden dimension per node is set between 256 and 1024 depending on the dataset. The Adam optimizer is used with an initial learning rate of 0.001, dataset-dependent weight decay, no warm-up, and a learning-rate decay schedule. Hyperparameter choices include dataset-tuned adaptive mask rates $p_m$ and noise rates $\rho$, and approximately 2–3 hierarchical rounds $R$. Training runs for 500 to 2000 epochs; the model is implemented in PyTorch 1.9.1 with DGL 0.8.2 and trained on Tesla V100 GPUs.
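Gathered into a single configuration object, these settings look roughly as follows; the specific values are illustrative picks within the reported ranges, not the paper's per-dataset tuning:

```python
config = {
    "encoder": {"type": "GAT", "layers": 2, "heads": 4, "activation": "PReLU"},
    "decoder": {"type": "GAT", "layers": 2, "heads": 4, "activation": "PReLU"},
    "hidden_dim": 512,        # 256-1024, dataset-dependent
    "optimizer": "Adam",
    "lr": 1e-3,               # with LR decay, no warm-up
    "weight_decay": 0.0,      # tuned per dataset
    "mask_rate": 0.3,         # p_m, illustrative
    "noise_rate": 0.1,        # rho, illustrative
    "mask_rounds": 3,         # R, roughly 2-3
    "epochs": 1000,           # 500-2000, dataset-dependent
}
```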
6. Experimental Results and Analysis
HAT-GAE was evaluated using linear probing on ten standard benchmarks: eight transductive datasets (Cora, Citeseer, Pubmed, Amazon-Photo, Amazon-Computer, Coauthor-CS, Coauthor-Physics, OGBN-arXiv) and two inductive datasets (Reddit, PPI).
6.1 Transductive Node Classification
The model achieved the highest unsupervised accuracy on 7 of 8 datasets, outperforming both contrastive (DGI, GRACE, MVGRL, BGRL, InfoGCL, CCA-SSG) and generative (GAE, GPT-GNN, GATE, GraphMAE) baselines. Representative scores for selected benchmarks:
| Dataset | HAT-GAE | Best Baseline | Baseline Name |
|---|---|---|---|
| Cora | 84.78 | 84.19 | GraphMAE |
| Citeseer | 74.28 | 73.41 | GraphMAE |
| Pubmed | 81.88 | 81.21 | GraphMAE |
| Amazon-Photo | 93.58 | 93.01 | GraphMAE |
| Amazon-Computer | 88.55 | 88.32 | GraphMAE |
| Coauthor-CS | 93.17 | 92.79 | GraphMAE |
| Coauthor-Physics | 95.57 | 95.30 | GraphMAE |
| OGBN-arXiv | 71.99 | 71.59 | GraphMAE |
6.2 Inductive Node Classification
On Reddit, HAT-GAE attained 96.06 micro-F1 (vs. 95.89 for GraphMAE). On PPI, HAT-GAE scored 74.72 (vs. 74.39 for GraphMAE).
6.3 Ablation and Sensitivity Studies
Ablation experiments demonstrate that degrading each component, whether by replacing hierarchical adaptive masking with random masking, collapsing it to a single adaptive mask, or omitting trainable corruption, results in notable declines of 0.7–2.7% in absolute performance, validating the contributions of hierarchical masking and trainable corruption. Sensitivity to the mask rate $p_m$ and noise rate $\rho$ is moderate at lower values; higher rates cause sharp accuracy degradation due to over-masking and over-noising. On datasets such as Cora, optimal performance is achieved with a small number of masking rounds (on the order of 2–3) over the full training run.
7. Summary and Context
HAT-GAE introduces a curriculum-inspired masking approach that leverages quantifiable feature and node importance to tailor self-supervised graph representation learning, while trainable corruption provides adversarial challenge adapted to the learned data manifold. The resulting architecture is simple, requiring only a single cosine reconstruction loss, and empirically robust, matching or surpassing contemporary generative and contrastive graph neural network pretraining methods on diverse benchmarks. The method exemplifies the effectiveness of integrating adaptive, hierarchy-aware data corruption and progressive self-supervision for non-Euclidean domains (Sun, 2023).