
Multi-View Graph Masked Autoencoder

Updated 2 September 2025
  • The paper introduces a framework that adapts the masked autoencoder paradigm to multi-view graph data by encoding separate modalities and fusing both local and global information.
  • It employs innovative masking, multi-view decoding, and Laplacian regularization techniques to create robust, transferable embeddings for tasks like clustering and node classification.
  • Experimental results demonstrate notable gains in metrics such as NMI and accuracy, confirming the effectiveness of multi-scale, hierarchical architectures and cross-view regularization.

Multi-view graph masked autoencoder frameworks constitute a burgeoning family of self-supervised representation learning techniques that leverage the aggregation and reconstruction of information distributed across multiple graph views, modalities, or granularities. These frameworks employ innovative masking, fusion, and decoding strategies—often integrating local and global graph information, multi-scale architectures, and tailored reconstruction objectives—to produce robust, transferable embeddings for tasks that include clustering, classification, property prediction, and control. This overview synthesizes key developments and operational principles from leading works in the area.

1. Architectural Principles of Multi-View Graph Masked Autoencoders

Multi-view graph masked autoencoders generalize the classic masked autoencoder paradigm to graph-structured data with multiple distinct perspectives or feature sets. In the archetypal setting (Zheng et al., 2020), for $v$ views of a dataset (where each view might correspond to a distinct modality, measurement type, or feature set), separate autoencoder networks are trained per view (a minimal sketch follows the list below):

  • Inputs $X^{(k)}$ for view $k$ are encoded to latent representations $Z^{(k)}$.
  • Reconstruction loss: $\mathcal{L}_1^{(k)} = \frac{1}{2} \| X^{(k)} - \tilde{X}^{(k)} \|_F^2$.
  • Self-expressive property: enforced by a shared subspace matrix $C$, via $\mathcal{L}_2^{(k)} = \frac{1}{2} \| Z^{(k)} - Z^{(k)}C \|_F^2$.
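
The per-view arrangement can be sketched in a few lines of PyTorch. This is a minimal illustration under the losses above, not the authors' implementation: `ViewAutoencoder` and `multiview_losses` are hypothetical names, the architectures are placeholders, and the shared matrix $C$ would be an $n \times n$ trainable parameter in practice.

```python
import torch
import torch.nn as nn

class ViewAutoencoder(nn.Module):
    """Encoder/decoder pair for a single view k (placeholder architecture)."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)        # Z^(k), one row per sample
        return z, self.decoder(z)  # (Z^(k), X~^(k))

def multiview_losses(views, autoencoders, C):
    """L1: per-view reconstruction; L2: self-expression through shared C.
    With samples as rows, Z ~ C Z is the transposed form of the paper's Z = Z C."""
    recon = self_expr = 0.0
    for x, ae in zip(views, autoencoders):
        z, x_hat = ae(x)
        recon += 0.5 * (x - x_hat).pow(2).sum()      # L1^(k), squared Frobenius norm
        self_expr += 0.5 * (z - C @ z).pow(2).sum()  # L2^(k)
    return recon, self_expr
```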

This multi-view arrangement enables the framework to learn representations that are both modality-adapted and prepared for downstream linear subspace clustering. More recently, semi-parametric approaches (Shi et al., 2023) reconstruct collaborative latent targets, integrating topological and attribute embeddings from multiple external teacher models; this combination of latent views extends the paradigm to scenarios with heterogeneous graph data.

2. Fusion of Local and Global Graph Information

Prominent multi-view graph masked autoencoder frameworks distinguish themselves by their explicit treatment of local and global graph signals. For example, MSCNLG (Zheng et al., 2020) computes, per view, both first-order (local) and second-order (global) graph similarity matrices:

  • First-order (local) proximity: $W_{ij}^{(k)} = \exp\left(-\|X_i^{(k)} - X_j^{(k)}\|_2^2 / \sigma^2\right)$, retained only for mutual $k$-nearest neighbors.
  • Second-order (global) proximity: $\hat{W}_{ij}^{(k)} = \exp\left(-\|W_i^{(k)} - W_j^{(k)}\|_2^2 / \sigma^2\right)$.
  • These similarity matrices are fused across views using Hadamard products: $W = \bigodot_{k=1}^{v} W^{(k)} + \bigodot_{k=1}^{v} \hat{W}^{(k)}$.

The fused graph $W$, which reflects both short-range and long-range relational information, is then integrated as a Laplacian-based graph regularizer in the optimization objective: $\mathcal{L}_3 = \frac{1}{2} \sum_{i,j} W_{ij} \|C_i - C_j\|_2^2 = \operatorname{Tr}(C^\top L C)$.
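
A numpy sketch of this construction, directly following the formulas above (function names are illustrative; $\sigma$ and $k$ are assumed hyperparameters, and the same $\sigma$ is reused for both proximity orders for simplicity):

```python
import numpy as np

def rbf_affinity(X, sigma):
    """W_ij = exp(-||X_i - X_j||_2^2 / sigma^2) over all row pairs of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / sigma**2)

def mutual_knn_mask(W, k):
    """Zero out W_ij unless i and j appear in each other's k nearest neighbors."""
    idx = np.argsort(-W, axis=1)[:, 1:k + 1]  # skip self at position 0
    nn = np.zeros_like(W, dtype=bool)
    nn[np.arange(len(W))[:, None], idx] = True
    return W * (nn & nn.T)

def fused_graph(views, sigma, k):
    """W = (Hadamard product of first-order graphs) + (Hadamard product of second-order graphs)."""
    first = [mutual_knn_mask(rbf_affinity(X, sigma), k) for X in views]
    second = [rbf_affinity(W, sigma) for W in first]  # rows of W^(k) act as features
    return np.prod(np.stack(first), axis=0) + np.prod(np.stack(second), axis=0)

def laplacian_reg(C, W):
    """L3 = 1/2 * sum_ij W_ij ||C_i - C_j||^2 = Tr(C^T L C), with L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(C.T @ L @ C)
```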

3. Masking Strategies and Multi-view Decoding

Modern multi-view graph masked autoencoders often employ advanced masking schemes to induce robustness and provide nontrivial reconstruction signals. These include:

  • High-ratio masking of graph edges (Tan et al., 2022) or node features (Hou et al., 2023), sometimes up to 70%, creating challenging self-supervisory tasks that force the encoder to learn highly expressive representations.
  • Multi-view random re-mask decoding (Hou et al., 2023), in which the latent representations are repeatedly corrupted with fresh masks across $K$ decoding streams (views), each providing a separate reconstruction target:

$\mathcal{L}_\text{input} = \frac{1}{|\tilde{\mathcal{V}}|} \sum_{j=1}^{K} \sum_{i \in \tilde{\mathcal{V}}} \left(1 - \frac{x_i^\top \hat{x}_i^{(j)}}{\|x_i\| \cdot \|\hat{x}_i^{(j)}\|}\right)^\gamma$
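
A hedged PyTorch sketch of this loss. Here `decoder` and `mask_token` stand in for GraphMAE2's re-mask decoder and learnable mask embedding, and the re-masking policy (uniform random, matching the original mask's size) is a simplifying assumption rather than the paper's exact scheme:

```python
import torch
import torch.nn.functional as F

def remask_decode_loss(z, x, masked_idx, decoder, mask_token, K=3, gamma=2.0):
    """K decoding views: re-corrupt the latents with a fresh random mask,
    decode, and score the originally masked nodes with a scaled cosine error."""
    total = 0.0
    for _ in range(K):
        z_view = z.clone()
        re_idx = torch.randperm(z.size(0))[: masked_idx.numel()]
        z_view[re_idx] = mask_token                   # fresh re-mask for this view
        x_hat = decoder(z_view)
        cos = F.cosine_similarity(x[masked_idx], x_hat[masked_idx], dim=-1)
        total = total + ((1.0 - cos) ** gamma).sum()  # one term per decoding view j
    return total / masked_idx.numel()                 # 1 / |V~| normalization
```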

Other frameworks utilize adaptively sampled masks (Tian et al., 12 Feb 2024), semantic-guided masking based on external segmentation models (Dakic et al., 7 Oct 2024), or discrepancy-aware edge selection via attention reversal (Zheng et al., 24 Jun 2025) to focus reconstruction on informative or unique regions.

4. Cross-view Regularization and Spectral Clustering

After multi-view encoding and reconstruction, several frameworks produce a cross-view or fused affinity matrix representing inter-sample similarities. The shared subspace representation $C$ (Zheng et al., 2020), enforced via the self-expressive and Laplacian losses, serves as the affinity matrix for spectral clustering:

  • Final cluster assignments are obtained by applying a spectral clustering algorithm to either $C$ or its symmetrized version $A = \frac{|C| + |C^\top|}{2}$.
  • Eigenvalue decomposition of the graph Laplacian extracts the principal subspace, on which standard k-means clustering is run (a minimal sketch follows this list).
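
Using scikit-learn, the clustering step can be sketched as follows (a minimal illustration, assuming $C$ is already learned; `cluster_from_subspace` is a hypothetical helper name):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_subspace(C, n_clusters):
    A = 0.5 * (np.abs(C) + np.abs(C.T))   # symmetrized affinity A = (|C| + |C^T|) / 2
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            assign_labels="kmeans", random_state=0)
    return sc.fit_predict(A)              # Laplacian eigenvectors + k-means internally
```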

This approach robustly unifies information across multiple views, as the fusion procedure ensures both within-view and across-view consistency.

5. Theoretical Justification: Mutual Information and Contrastive Perspectives

A central theoretical insight (Li et al., 2022) is that masked graph autoencoding is equivalent to mutual information maximization between paired subgraph views:

  • The autoencoder objective is $\mathcal{L}_\text{GAE} = -(\mathcal{L}^+ + \mathcal{L}^-)$, where $\mathcal{L}^+$ and $\mathcal{L}^-$ are log-likelihoods of positive (connected) and negative (unconnected) node pairs.
  • For optimal decoders and rich embedding spaces, minimizing the reconstruction loss maximizes the mutual information $I(U; V)$ between $k$-hop subgraph representations $U$ and $V$.

Masking has the effect of reducing redundancy and overlap between subgraph views, yielding richer, more generalizable representations for transfer and semi-supervised applications.
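
To make the objective concrete, the two likelihood terms can be written out for the standard inner-product link decoder (the decoder form here is an illustrative assumption; Li et al., 2022 state the equivalence for optimal decoders):

```latex
% GAE objective split into positive/negative pair log-likelihoods,
% shown with the inner-product decoder sigma(z_u^T z_v) for concreteness.
\mathcal{L}^{+} = \sum_{(u,v) \in \mathcal{E}} \log \sigma\!\left(z_u^{\top} z_v\right), \qquad
\mathcal{L}^{-} = \sum_{(u,v) \notin \mathcal{E}} \log\!\left(1 - \sigma\!\left(z_u^{\top} z_v\right)\right), \qquad
\mathcal{L}_{\text{GAE}} = -\left(\mathcal{L}^{+} + \mathcal{L}^{-}\right)
```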

6. Experimental Evaluation and Empirical Findings

Empirical studies report consistent superiority of multi-view graph masked autoencoder frameworks across diverse benchmarks:

  • On multi-view clustering datasets (e.g., Yale Face, ORL, Caltech101-20), MSCNLG (Zheng et al., 2020) achieves NMI, ACC, F-measure, and RI scores surpassing state-of-the-art baselines, illustrating the value of fusing both local and global graph information.
  • GraphMAE2 (Hou et al., 2023) yields at least 2.45% improvement in accuracy over leading methods on ultra-large graphs (e.g., ogbn-Papers100M).
  • Discrepancy-aware DGMAE (Zheng et al., 24 Jun 2025) achieves top rankings on heterophilic node classification tasks by reconstructing both common features and inter-node discrepancies.
  • Resource-efficient multi-view perception applications (Dakic et al., 7 Oct 2024) preserve detection and tracking performance at high masking ratios, with up to 13.33× data compression.

A summary of key metrics from selected methods is shown below:

| Method | Task Type | Metric / Dataset | Reported Value / Improvement |
|---|---|---|---|
| MSCNLG | Multi-view clustering | NMI (Yale Face) | 0.9012 |
| GraphMAE2 | Node classification | Accuracy (ogbn-Papers100M) | +2.45% over baselines |
| DGMAE | Node classification (heterophilic) | Accuracy, NMI, ARI | Best on 16 of 17 datasets |
| (Dakic et al., 7 Oct 2024) | Multi-view detection/tracking | MODA/MOTA | Maintains SOTA performance at 70% masking |

7. Extensions: Hierarchical and Multi-modal Masked Autoencoding

Recent progress extends principles of multi-view graph masked autoencoding to hierarchical (Liu et al., 17 May 2024) and multi-modal contexts. Hi-GMAE (Liu et al., 17 May 2024) constructs multi-scale graph hierarchies via pooling, initializes masking at coarsest scales, and gradually recovers missing nodes while unifying GNN encoders (fine scale) with graph transformer encoders (coarse scale). This multi-scale arrangement outperforms 17 state-of-the-art competitors on 15 datasets, demonstrating that hierarchical granularity can further enhance representation quality.
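
The coarse-to-fine masking idea can be illustrated under a simplifying assumption: the hierarchy is given as per-level cluster assignment vectors, and a node is masked exactly when its coarse-level parent is masked. All names below are hypothetical and the gradual-recovery schedule is omitted; this is not the Hi-GMAE codebase:

```python
import numpy as np

def coarse_to_fine_mask(assignments, mask_ratio=0.5, rng=None):
    """assignments[l] maps each node at level l to its cluster at level l+1;
    assignments[0] covers the original (finest) nodes."""
    rng = rng or np.random.default_rng(0)
    n_coarse = assignments[-1].max() + 1
    masked = rng.random(n_coarse) < mask_ratio  # sample mask at the coarsest scale
    for assign in reversed(assignments):        # push the mask down the hierarchy
        masked = masked[assign]                 # node masked iff its parent is masked
    return masked                               # boolean mask over fine-level nodes
```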

A plausible implication is that, as multi-view data become richer—spanning modalities, scales, and semantic domains—selectively masking, decoding, and fusing representations across both views and scales will be increasingly central to robust graph self-supervision.

8. Outlook and Future Directions

The evolution of multi-view graph masked autoencoders highlights several promising directions, including hierarchical and multi-modal extensions, adaptive and semantic-guided masking, and reconstruction objectives grounded in mutual information.

These developments indicate that multi-view graph masked autoencoder frameworks will play a pivotal role in advancing generalization, scalability, and robustness for the next generation of graph representation learning, clustering, and downstream inference.