
Multi-View Graph Masked Autoencoder

Updated 2 September 2025
  • The paper introduces a framework that adapts the masked autoencoder paradigm to multi-view graph data by encoding separate modalities and fusing both local and global information.
  • It employs innovative masking, multi-view decoding, and Laplacian regularization techniques to create robust, transferable embeddings for tasks like clustering and node classification.
  • Experimental results demonstrate notable gains in metrics such as NMI and accuracy, confirming the effectiveness of multi-scale, hierarchical architectures and cross-view regularization.

Multi-view graph masked autoencoder frameworks constitute a burgeoning family of self-supervised representation learning techniques that leverage the aggregation and reconstruction of information distributed across multiple graph views, modalities, or granularities. These frameworks employ innovative masking, fusion, and decoding strategies—often integrating local and global graph information, multi-scale architectures, and tailored reconstruction objectives—to produce robust, transferable embeddings for tasks that include clustering, classification, property prediction, and control. This overview synthesizes key developments and operational principles from leading works in the area.

1. Architectural Principles of Multi-View Graph Masked Autoencoders

Multi-view graph masked autoencoders generalize the classic masked autoencoder paradigm to graph-structured data with multiple distinct perspectives or feature sets. In the archetypal setting (Zheng et al., 2020), for $v$ views of a dataset (where each view might correspond to a distinct modality, measurement type, or feature set), separate autoencoder networks are trained per view (a minimal sketch follows the list below):

  • Inputs $X^{(k)}$ for view $k$ are encoded to latent representations $Z^{(k)}$.
  • Reconstruction loss: $\mathcal{L}_1^{(k)} = \frac{1}{2} \| X^{(k)} - \tilde{X}^{(k)} \|_F^2$.
  • Self-expressive property: enforced by a shared subspace matrix $C$, via $\mathcal{L}_2^{(k)} = \frac{1}{2} \| Z^{(k)} - Z^{(k)}C \|_F^2$.
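
The per-view arrangement can be sketched in a few lines of PyTorch. This is a minimal illustration under the losses above, not the authors' implementation: `ViewAutoencoder` and `multiview_losses` are hypothetical names, the architectures are placeholders, and the shared matrix $C$ would be an $n \times n$ trainable parameter in practice.

```python
import torch
import torch.nn as nn

class ViewAutoencoder(nn.Module):
    """Encoder/decoder pair for a single view k (placeholder architecture)."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)        # Z^(k), one row per sample
        return z, self.decoder(z)  # (Z^(k), X~^(k))

def multiview_losses(views, autoencoders, C):
    """L1: per-view reconstruction; L2: self-expression through shared C.
    With samples as rows, Z ~ C Z is the transposed form of the paper's Z = Z C."""
    recon = self_expr = 0.0
    for x, ae in zip(views, autoencoders):
        z, x_hat = ae(x)
        recon += 0.5 * (x - x_hat).pow(2).sum()      # L1^(k), squared Frobenius norm
        self_expr += 0.5 * (z - C @ z).pow(2).sum()  # L2^(k)
    return recon, self_expr
```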

This multi-view arrangement enables the framework to learn representations that are both modality-adapted and prepared for downstream linear subspace clustering. More recently, semi-parametric approaches (Shi et al., 2023) reconstruct collaborative latent targets, integrating topological and attribute embeddings from multiple external teacher models; this combination of latent views extends the paradigm to scenarios with heterogeneous graph data.

2. Fusion of Local and Global Graph Information

Prominent multi-view graph masked autoencoder frameworks distinguish themselves by their explicit treatment of local and global graph signals. For example, MSCNLG (Zheng et al., 2020) computes, per view, both first-order (local) and second-order (global) graph similarity matrices:

  • First-order (local) proximity: $W_{ij}^{(k)} = \exp\left(-\|X_i^{(k)} - X_j^{(k)}\|_2^2 / \sigma^2\right)$, retained only for mutual $k$-nearest neighbors.
  • Second-order (global) proximity: $\hat{W}_{ij}^{(k)} = \exp\left(-\|W_i^{(k)} - W_j^{(k)}\|_2^2 / \sigma^2\right)$.
  • These similarity matrices are fused across views using Hadamard products: $W = \bigodot_{k=1}^{v} W^{(k)} + \bigodot_{k=1}^{v} \hat{W}^{(k)}$.

The fused graph $W$, which reflects both short-range and long-range relational information, is then integrated as a Laplacian-based graph regularizer in the optimization objective: $\mathcal{L}_3 = \frac{1}{2} \sum_{i,j} W_{ij} \|C_i - C_j\|_2^2 = \operatorname{Tr}(C^\top L C)$.
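
A numpy sketch of this construction, directly following the formulas above (function names are illustrative; $\sigma$ and $k$ are assumed hyperparameters, and the same $\sigma$ is reused for both proximity orders for simplicity):

```python
import numpy as np

def rbf_affinity(X, sigma):
    """W_ij = exp(-||X_i - X_j||_2^2 / sigma^2) over all row pairs of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / sigma**2)

def mutual_knn_mask(W, k):
    """Zero out W_ij unless i and j appear in each other's k nearest neighbors."""
    idx = np.argsort(-W, axis=1)[:, 1:k + 1]  # skip self at position 0
    nn = np.zeros_like(W, dtype=bool)
    nn[np.arange(len(W))[:, None], idx] = True
    return W * (nn & nn.T)

def fused_graph(views, sigma, k):
    """W = (Hadamard product of first-order graphs) + (Hadamard product of second-order graphs)."""
    first = [mutual_knn_mask(rbf_affinity(X, sigma), k) for X in views]
    second = [rbf_affinity(W, sigma) for W in first]  # rows of W^(k) act as features
    return np.prod(np.stack(first), axis=0) + np.prod(np.stack(second), axis=0)

def laplacian_reg(C, W):
    """L3 = 1/2 * sum_ij W_ij ||C_i - C_j||^2 = Tr(C^T L C), with L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(C.T @ L @ C)
```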

3. Masking Strategies and Multi-view Decoding

Modern multi-view graph masked autoencoders often employ advanced masking schemes to induce robustness and provide nontrivial reconstruction signals. These include:

  • High-ratio masking of graph edges (Tan et al., 2022) or node features (Hou et al., 2023), sometimes up to 70%, creating challenging self-supervisory tasks that force the encoder to learn highly expressive representations.
  • Multi-view random re-mask decoding (Hou et al., 2023), in which the latent representations are repeatedly corrupted with fresh masks across $K$ decoding streams (views), each providing a separate reconstruction target:

$\mathcal{L}_\text{input} = \frac{1}{|\tilde{\mathcal{V}}|} \sum_{j=1}^{K} \sum_{i \in \tilde{\mathcal{V}}} \left(1 - \frac{x_i^\top \hat{x}_i^{(j)}}{\|x_i\| \cdot \|\hat{x}_i^{(j)}\|}\right)^\gamma$
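
A hedged PyTorch sketch of this loss. Here `decoder` and `mask_token` stand in for GraphMAE2's re-mask decoder and learnable mask embedding, and the re-masking policy (uniform random, matching the original mask's size) is a simplifying assumption rather than the paper's exact scheme:

```python
import torch
import torch.nn.functional as F

def remask_decode_loss(z, x, masked_idx, decoder, mask_token, K=3, gamma=2.0):
    """K decoding views: re-corrupt the latents with a fresh random mask,
    decode, and score the originally masked nodes with a scaled cosine error."""
    total = 0.0
    for _ in range(K):
        z_view = z.clone()
        re_idx = torch.randperm(z.size(0))[: masked_idx.numel()]
        z_view[re_idx] = mask_token                   # fresh re-mask for this view
        x_hat = decoder(z_view)
        cos = F.cosine_similarity(x[masked_idx], x_hat[masked_idx], dim=-1)
        total = total + ((1.0 - cos) ** gamma).sum()  # one term per decoding view j
    return total / masked_idx.numel()                 # 1 / |V~| normalization
```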

Other frameworks utilize adaptively sampled masks (Tian et al., 12 Feb 2024), semantic-guided masking based on external segmentation models (Dakic et al., 7 Oct 2024), or discrepancy-aware edge selection via attention reversal (Zheng et al., 24 Jun 2025) to focus reconstruction on informative or unique regions.

4. Cross-view Regularization and Spectral Clustering

After multi-view encoding and reconstruction, several frameworks produce a cross-view or fused affinity matrix representing inter-sample similarities. The shared subspace representation $C$ (Zheng et al., 2020), enforced via the self-expressive and Laplacian losses, serves as the affinity matrix for spectral clustering:

  • Final cluster assignments are obtained by applying a spectral clustering algorithm to either $C$ or its symmetrized version $A = \frac{|C| + |C^\top|}{2}$.
  • Eigenvalue decomposition of the graph Laplacian extracts the principal subspace, on which standard k-means clustering is run (a minimal sketch follows this list).
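
Using scikit-learn, the clustering step can be sketched as follows (a minimal illustration, assuming $C$ is already learned; `cluster_from_subspace` is a hypothetical helper name):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_subspace(C, n_clusters):
    A = 0.5 * (np.abs(C) + np.abs(C.T))   # symmetrized affinity A = (|C| + |C^T|) / 2
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            assign_labels="kmeans", random_state=0)
    return sc.fit_predict(A)              # Laplacian eigenvectors + k-means internally
```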

This approach robustly unifies information across multiple views, as the fusion procedure ensures both within-view and across-view consistency.

5. Theoretical Justification: Mutual Information and Contrastive Perspectives

A central theoretical insight (Li et al., 2022) is that masked graph autoencoding is equivalent to mutual information maximization between paired subgraph views:

  • The autoencoder objective is $\mathcal{L}_\text{GAE} = -(\mathcal{L}^+ + \mathcal{L}^-)$, where $\mathcal{L}^+$ and $\mathcal{L}^-$ are log-likelihoods of positive (connected) and negative (unconnected) node pairs.
  • For optimal decoders and rich embedding spaces, minimizing the reconstruction loss maximizes the mutual information $I(U; V)$ between $k$-hop subgraph representations $U$ and $V$.

Masking has the effect of reducing redundancy and overlap between subgraph views, yielding richer, more generalizable representations for transfer and semi-supervised applications.
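
To make the objective concrete, the two likelihood terms can be written out for the standard inner-product link decoder (the decoder form here is an illustrative assumption; Li et al., 2022 state the equivalence for optimal decoders):

```latex
% GAE objective split into positive/negative pair log-likelihoods,
% shown with the inner-product decoder sigma(z_u^T z_v) for concreteness.
\mathcal{L}^{+} = \sum_{(u,v) \in \mathcal{E}} \log \sigma\!\left(z_u^{\top} z_v\right), \qquad
\mathcal{L}^{-} = \sum_{(u,v) \notin \mathcal{E}} \log\!\left(1 - \sigma\!\left(z_u^{\top} z_v\right)\right), \qquad
\mathcal{L}_{\text{GAE}} = -\left(\mathcal{L}^{+} + \mathcal{L}^{-}\right)
```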

6. Experimental Evaluation and Empirical Findings

Empirical studies report consistent superiority of multi-view graph masked autoencoder frameworks across diverse benchmarks:

  • On multi-view clustering datasets (e.g., Yale Face, ORL, Caltech101-20), MSCNLG (Zheng et al., 2020) achieves NMI, ACC, F-measure, and RI scores surpassing state-of-the-art baselines, illustrating the value of fusing both local and global graph information.
  • GraphMAE2 (Hou et al., 2023) yields at least 2.45% improvement in accuracy over leading methods on ultra-large graphs (e.g., ogbn-Papers100M).
  • Discrepancy-aware DGMAE (Zheng et al., 24 Jun 2025) achieves top rankings on heterophilic node classification tasks by reconstructing both common features and inter-node discrepancies.
  • Resource-efficient multi-view perception applications (Dakic et al., 7 Oct 2024) preserve detection and tracking performance at high masking ratios, with up to 13.33× data compression.

A summary of key metrics from selected methods is shown below:

| Method | Task Type | Metric / Dataset | Reported Value / Improvement |
|---|---|---|---|
| MSCNLG | Multi-view clustering | NMI (Yale Face) | 0.9012 |
| GraphMAE2 | Node classification | Accuracy (ogbn-Papers100M) | +2.45% over baselines |
| DGMAE | Node classification (heterophilic) | Accuracy, NMI, ARI | Best on 16 of 17 datasets |
| (Dakic et al., 7 Oct 2024) | Multi-view detection/tracking | MODA/MOTA | Maintains SOTA performance at 70% masking |

7. Extensions: Hierarchical and Multi-modal Masked Autoencoding

Recent progress extends principles of multi-view graph masked autoencoding to hierarchical (Liu et al., 17 May 2024) and multi-modal contexts. Hi-GMAE (Liu et al., 17 May 2024) constructs multi-scale graph hierarchies via pooling, initializes masking at coarsest scales, and gradually recovers missing nodes while unifying GNN encoders (fine scale) with graph transformer encoders (coarse scale). This multi-scale arrangement outperforms 17 state-of-the-art competitors on 15 datasets, demonstrating that hierarchical granularity can further enhance representation quality.
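
The coarse-to-fine masking idea can be illustrated under a simplifying assumption: the hierarchy is given as per-level cluster assignment vectors, and a node is masked exactly when its coarse-level parent is masked. All names below are hypothetical and the gradual-recovery schedule is omitted; this is not the Hi-GMAE codebase:

```python
import numpy as np

def coarse_to_fine_mask(assignments, mask_ratio=0.5, rng=None):
    """assignments[l] maps each node at level l to its cluster at level l+1;
    assignments[0] covers the original (finest) nodes."""
    rng = rng or np.random.default_rng(0)
    n_coarse = assignments[-1].max() + 1
    masked = rng.random(n_coarse) < mask_ratio  # sample mask at the coarsest scale
    for assign in reversed(assignments):        # push the mask down the hierarchy
        masked = masked[assign]                 # node masked iff its parent is masked
    return masked                               # boolean mask over fine-level nodes
```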

A plausible implication is that, as multi-view data become richer—spanning modalities, scales, and semantic domains—selectively masking, decoding, and fusing representations across both views and scales will be increasingly central to robust graph self-supervision.

8. Outlook and Future Directions

The evolution of multi-view graph masked autoencoders highlights several promising directions, including hierarchical and multi-modal extensions, adaptive and semantic-guided masking, and reconstruction objectives grounded in mutual information.

These developments indicate that multi-view graph masked autoencoder frameworks will play a pivotal role in advancing generalization, scalability, and robustness for the next generation of graph representation learning, clustering, and downstream inference.