
Graph-JEPA: Self-Supervised Graph Learning

Updated 18 November 2025
  • The paper presents Graph-JEPA, a joint-embedding predictive model that leverages masked subgraph modeling and hyperbolic encoding to capture hierarchical graph structures.
  • It partitions graphs into patches using METIS, encodes them via GNNs and Transformer blocks, and predicts hyperbola-parameterized targets with a Smooth-L1 loss.
  • Empirical results demonstrate state-of-the-art performance on graph classification, regression, and isomorphism tasks, with notable gains in low-data polymer applications.

Graph-JEPA is a Joint-Embedding Predictive Architecture tailored for self-supervised graph-level representation learning. It extends the joint-embedding framework—originally developed for other domains—by focusing on masked modeling and predictive subgraph objectives, resulting in hierarchical and semantically rich embeddings. In Graph-JEPA, the prediction target is not a raw input reconstruction but a geometric transformation of the latent encoding, frequently parameterized on a unit hyperbola to encode implicit hierarchy in graph concepts. The paradigm has demonstrated state-of-the-art performance in standard graph classification, regression, and graph isomorphism distinction, and has been adapted to domains such as polymer molecular graphs with notable low-data gains (Skenderi et al., 2023, Piccoli et al., 22 Jun 2025).

1. Architectural Overview

Graph-JEPA operates by partitioning graphs into disjoint subgraphs or "patches" and leveraging dual encoder networks to establish a predictive relationship between a context patch and a set of masked target patches. The architectural flow consists of the following main stages (Skenderi et al., 2023):

  • Subgraph Extraction: The input graph $G=(V,E)$ is decomposed into $p$ non-overlapping patches using the METIS partitioner, with each patch expanded by its one-hop neighborhood to maintain locality. Typical settings use $p \in \{32, 128\}$ (a partitioning sketch follows this list).
  • Patch Encoding: Each patch $s_i$ is encoded using a graph neural network (GNN), specifically GINE (a GIN variant supporting edge features). The patch representation $h_i \in \mathbb{R}^d$ is obtained via mean pooling; $d = 512$ is standard.
  • Positional Embeddings: To incorporate structural information, a random-walk structural embedding (RWSE) is computed for each node. The patch-level embedding $P_i$ is the elementwise maximum of its constituent nodes' RWSE vectors, with RWSE depth $k$ in $[15, 40]$ (see the RWSE sketch below).
  • Context and Target Encoders: The context patch is encoded with a stack of Transformer blocks ($E_c$, 4 layers, MLPs only), whereas the targets are encoded via another Transformer stack ($E_t$) with Hadamard self-attention. No parameter sharing is used between $E_c$ and $E_t$.
  • Predictor Network: A two-layer MLP receives the normalized sum of the context encoding and the target positional embedding, mapping them to $\mathbb{R}^2$.
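
A minimal sketch of the patch-extraction step, assuming NetworkX for the graph and the `pymetis` METIS bindings; the function name and preprocessing details are illustrative, not the authors' code:

```python
# Hedged sketch of METIS-based patch extraction with one-hop expansion.
import networkx as nx
import pymetis

def extract_patches(G: nx.Graph, num_patches: int = 32):
    """Split G into `num_patches` disjoint parts, then grow each part by one hop."""
    nodes = list(G.nodes())
    index = {v: i for i, v in enumerate(nodes)}
    adjacency = [[index[u] for u in G.neighbors(v)] for v in nodes]

    # METIS assigns a part id to every node.
    _, membership = pymetis.part_graph(num_patches, adjacency=adjacency)

    parts = [[] for _ in range(num_patches)]
    for v, part_id in zip(nodes, membership):
        parts[part_id].append(v)

    # Expand each patch by its one-hop neighborhood to preserve locality.
    patches = []
    for part in parts:
        halo = set(part)
        for v in part:
            halo.update(G.neighbors(v))
        patches.append(G.subgraph(halo).copy())
    return patches
```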

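The node-level RWSE and its patch-level max pooling can be sketched as follows. This is a dense-matrix sketch with illustrative names (`rwse`, `patch_positional_embedding`); the reference implementation may compute it differently:

```python
# Hedged sketch: random-walk structural embedding (RWSE) per node, followed by
# an elementwise max over a patch's nodes to obtain the patch embedding P_i.
import torch

def rwse(adj: torch.Tensor, depth: int = 20) -> torch.Tensor:
    """Return [num_nodes, depth] return-probabilities of k-step random walks."""
    deg = adj.sum(dim=1).clamp(min=1.0)
    T = adj / deg.unsqueeze(1)              # row-normalised transition matrix D^{-1} A
    P, feats = torch.eye(adj.size(0)), []
    for _ in range(depth):
        P = P @ T
        feats.append(P.diagonal())          # probability of returning after k steps
    return torch.stack(feats, dim=1)

def patch_positional_embedding(node_rwse: torch.Tensor, patch_nodes: list[int]) -> torch.Tensor:
    """Elementwise maximum of the RWSE vectors of the patch's constituent nodes."""
    return node_rwse[patch_nodes].max(dim=0).values
```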
For polymer molecular graphs, the framework is adapted by instantiating the context and target encoders as weighted-directed message-passing neural networks (wD-MPNNs), each pooling node embeddings from its respective subgraph (Piccoli et al., 22 Jun 2025).

2. Joint-Embedding Predictive Objective

The core objective of Graph-JEPA is to predict, given a context subgraph’s encoding, the hyperbolic coordinates of a target subgraph’s encoding derived from the full graph. This is formalized as follows (Skenderi et al., 2023):

  1. Target Representation Parametrization: For target patch $y_l$, the target encoder $E_t$ produces a high-dimensional embedding $Z^{y_l} \in \mathbb{R}^d$. This is collapsed into a scalar hyperbolic angle:

$$\alpha_l = \frac{1}{d} \sum_{n=1}^{d} Z^{y_l}_n$$

The target for prediction is the pair

$$\psi_l^y = (\cosh(\alpha_l), \sinh(\alpha_l)) \in \mathbb{R}^2$$

corresponding to a point on the unit hyperbola $x^2 - y^2 = 1$.

  2. Prediction: For each target, the output of the predictor MLP is:

$$\hat{\psi}_l^y = \mathrm{MLP}(\mathrm{LayerNorm}(z^x + P_l)) \in \mathbb{R}^2$$

where $z^x = E_c(x)$ is the context encoding and $P_l$ is the positional embedding for the target.

  3. Loss Function: The model is trained with a Smooth-L1 (Huber) loss averaged over the $m$ target patches:

$$L_\mathrm{JEPA} = \frac{1}{m} \sum_{l=1}^{m} D(\hat{\psi}_l^y, \psi_l^y)$$

where

$$D(a, b) = \sum_{k=1}^{2} S_{\beta}(a_k - b_k), \qquad S_{\beta}(r) = \begin{cases} 0.5\, r^2 / \beta & |r| < \beta \\ |r| - 0.5\, \beta & \text{otherwise} \end{cases}$$

with $\beta = 1$ in practice.
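
A compact PyTorch sketch of this objective, assuming the context encoding, target-encoder outputs, and target positional embeddings have already been computed; function and variable names are illustrative, not taken from the reference code:

```python
# Hedged sketch of the Graph-JEPA objective: unit-hyperbola targets + Smooth-L1.
import torch
import torch.nn.functional as F

def graph_jepa_loss(z_context, target_embeddings, target_pos, predictor, beta=1.0):
    """
    z_context:          [d]     context encoding z^x = E_c(x)
    target_embeddings:  [m, d]  target-encoder outputs Z^{y_l} (EMA branch)
    target_pos:         [m, d]  positional embeddings P_l of the target patches
    predictor:          MLP mapping R^d -> R^2
    """
    # Targets: collapse each embedding to an angle, then map onto x^2 - y^2 = 1.
    alpha = target_embeddings.mean(dim=1)                                # [m]
    psi = torch.stack([torch.cosh(alpha), torch.sinh(alpha)], dim=1)     # [m, 2]

    # Predictions: MLP(LayerNorm(z^x + P_l)) for every target patch.
    pred_in = F.layer_norm(z_context.unsqueeze(0) + target_pos, z_context.shape)
    psi_hat = predictor(pred_in)                                         # [m, 2]

    # Smooth-L1 summed over the two coordinates, averaged over the m targets.
    return F.smooth_l1_loss(psi_hat, psi.detach(), beta=beta,
                            reduction="sum") / psi.size(0)
```

The targets are detached because the target branch is maintained by EMA rather than by gradient descent, as described in Section 3.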

Polymer-JEPA (Piccoli et al., 22 Jun 2025) uses a similar framework but predicts targets directly in the full embedding space:

$$L_\mathrm{JEPA} = \|\hat{z}_t - z_t\|_2^2$$

In both cases, negative sampling is avoided: subgraphs are dynamically resampled each iteration with minimal context-target overlap enforced, and the objective is asymmetric and purely predictive.

3. Training Protocols and Hyperparameters

Graph-JEPA optimization uses Adam with a learning rate of $\approx 10^{-3}$ and weight decay $\approx 10^{-5}$. Batch size varies by graph size ($64$–$128$). The subgraph count $p$ and the number of context/target patches are set according to the size and complexity of the dataset (small graphs: $p=32$, large: $p=128$; $m=2$–$4$ targets per graph) (Skenderi et al., 2023).

Transformer depth for the encoders is fixed at $4$ layers; the GNN stack for patch encoding is $2$–$3$ layers. No masking curriculum or complex learning-rate schedules are introduced beyond warmup/decay. Synthetic and molecular graphs use similar subgraphing strategies, with Polymer-JEPA relying on random-walk or motif-based patches, context sizes of approximately $60\%$ of nodes, and targets of $10\%$ (Piccoli et al., 22 Jun 2025).
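
An illustrative optimizer setup matching the reported hyperparameters; the module names are placeholders, and exact values and schedules vary across datasets:

```python
# Hedged sketch of the reported optimisation setup.
import torch

def build_optimizer(patch_encoder, context_encoder, predictor):
    # The target encoder is deliberately excluded: it is updated via an EMA
    # of the context encoder rather than by gradient descent (see below).
    params = (list(patch_encoder.parameters())
              + list(context_encoder.parameters())
              + list(predictor.parameters()))
    return torch.optim.Adam(params, lr=1e-3, weight_decay=1e-5)

# Typical settings reported above: p = 32 patches (128 for large graphs),
# m = 2-4 targets per graph, batch size 64-128, RWSE depth k in [15, 40].
```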

To prevent collapse, two measures are critical:

  • Exponential Moving Average (EMA): The target encoder $E_t$ is not updated through backpropagation but via an EMA of the context encoder parameters, in line with BYOL/SimSiam designs (see the sketch after this list).
  • Dynamic Masking and Patch Selection: Masked subgraphs and their context/target assignments are resampled each iteration.
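
A minimal sketch of the EMA update, assuming the target encoder mirrors the context encoder's parameter layout; the momentum value is illustrative:

```python
# Hedged sketch: the target encoder tracks the context encoder via an
# exponential moving average instead of receiving gradients.
import torch

@torch.no_grad()
def ema_update(context_encoder, target_encoder, momentum: float = 0.996):
    # Typically called once after every optimizer step.
    for p_ctx, p_tgt in zip(context_encoder.parameters(),
                            target_encoder.parameters()):
        p_tgt.mul_(momentum).add_(p_ctx, alpha=1.0 - momentum)
```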

4. Empirical Performance and Ablations

Graph-JEPA shows substantial gains in both classical and challenging graph tasks:

  • Graph Classification and Regression (Skenderi et al., 2023):
    • On TUD datasets (e.g., REDDIT-B), achieves $91.99\%$ vs $88.01\%$ for GraphMAE and $85.5\%$ for AD-GCL.
    • On ZINC regression, reaches MAE $\approx 0.434$ vs $0.578$ (AD-GCL-OPT) and $0.627$ (GraphCL).
    • Supervised baselines (F-GIN) underperform pretraining.
  • Non-Isomorphic Graph Distinction: On the EXP dataset (where 1-WL fails), Graph-JEPA attains $98.8\%$, close to supervised SOTA.
  • Ablations:
    • Latent Objective: Hyperbola parametrization outperforms high-dimensional Euclidean or high-dimensional hyperbolic embeddings for hierarchy encoding.
    • Positional Encoding: Node-level RWSE is slightly superior to patch-level RWSE.
    • Attention Variants: Both standard and Hadamard self-attention are effective, but Hadamard is more stable.
    • Patch Extraction: METIS partitioning is more robust than fully random subgraphs.

Polymer-JEPA extends these findings: on electron-affinity regression, pretraining elevates $R^2$ from $\sim 0.46$ to $\sim 0.65$ in the $0.4\%$-label regime, with substantial gains persisting up to a $4\%$ label ratio. Cross-domain transfer is demonstrated on diblock copolymer classification, with AUPRC improvements of $\sim 0.02$–$0.10$ at all data fractions (Piccoli et al., 22 Jun 2025).

5. Qualitative and Interpretative Analyses

Qualitative probes into learned spaces reveal:

  • Semantic Clustering: t-SNE applied to the hyperbolically-parameterized embeddings of graph datasets (e.g., DD) shows class-driven clustering, in contrast to the collapse observed with naive Euclidean objectives (a minimal probe is sketched after this list).
  • Hyperbolic Visualization: On molecular data (e.g., ZINC), patches align on the 2D unit hyperbola, with the context near $(1, 0)$ and functionally related targets dispersed along both branches, preserving interpretable structure.
  • Intuitive Design Benefits: The predictive objective in embedding space (as opposed to masked input reconstruction) focuses representational capacity on semantically salient graph features, while the entire-graph context for the target encoding injects global structure (Skenderi et al., 2023, Piccoli et al., 22 Jun 2025).
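
A minimal sketch of the clustering probe described above, assuming scikit-learn and matplotlib; the random arrays are placeholders only, standing in for the pretrained graph-level embeddings and their class labels:

```python
# Hedged sketch of the qualitative probe: t-SNE over graph-level embeddings,
# coloured by class label.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder stand-ins so the snippet runs; replace with real embeddings/labels.
graph_embeddings = np.random.randn(200, 512)
labels = np.random.randint(0, 2, size=200)

coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(graph_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8)
plt.show()
```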

6. Transfer to Low-Label and Scientific Domains

Graph-JEPA’s pretraining under the JEPA paradigm affords significant improvements in limited-label regimes, particularly evidenced in polymer molecular applications:

  • Subgraph dynamic masking, pooled context/target embeddings, and a predictive-only, non-contrastive loss enable robust transfer to practical scientific property prediction, including regression and classification.
  • The architecture, when applied to conjugated copolymer property prediction, scales well to unseen tasks, with measurable improvement in downstream performance even at minimal data availability (Piccoli et al., 22 Jun 2025).

7. Connections, Limitations, and Future Directions

Graph-JEPA departs from contrastive and generative SSL by employing a purely predictive, energy-based paradigm that circumvents the need for negative sampling while preventing trivial solution collapse through architectural asymmetry and predictive subgraph re-sampling. The unit-hyperbola objective provides a stable, low-dimensional hierarchical code that is robust across varied datasets.

A plausible implication is that JEPA-based self-supervision may generalize to other modalities where implicit global structure and semantic hierarchy are paramount. Potential future directions include refining subgraph partitioning heuristics, extending positional encodings, and deepening the study of the geometry induced by the hyperbolic loss, particularly as the architecture is adapted to increasingly heterogeneous or scientific graph domains (Skenderi et al., 2023, Piccoli et al., 22 Jun 2025).
