
Polymer-JEPA: Self-Supervised Polymer Graphs

Updated 26 November 2025
  • The paper adapts the JEPA architecture for polymer molecular graphs to improve downstream property prediction under low-data conditions.
  • It employs dual-encoder models with dynamic subgraph sampling methods like random-walk and METIS to capture semantically rich polymer structures.
  • Experiments show Polymer-JEPA boosts regression and classification metrics, demonstrating robust transferability across diverse polymer datasets.

Polymer-JEPA is a machine learning (ML) framework that adapts the Joint Embedding Predictive Architecture (JEPA) for self-supervised pretraining on polymer molecular graphs. Developed in response to the scarcity of high-quality labeled polymer datasets, Polymer-JEPA aims to enhance downstream task performance, particularly under data-limited regimes, by learning semantically rich structural representations of stochastic polymer graphs. The approach leverages context-target prediction at the graph embedding level and incorporates dynamic subgraph sampling based on polymer-specific graph topology (Piccoli et al., 22 Jun 2025).

1. Graph Representation and JEPA Architecture

Polymer-JEPA models each polymer as a stochastic graph $G = (V, E)$, as proposed by Aldeghi & Coley (2022), where nodes $V$ denote monomer subgraphs with atom- and bond-level features, and edges $E$ are weighted by connection probabilities $p_{uv}$ indicating the likelihood that monomer $u$ is linked to monomer $v$ in the resulting copolymer chain. The graph neural network (GNN) backbone is a weighted, directed message-passing neural network (wD-MPNN) with node-centered message passing. Initial node features include atom types and learned monomer identity embeddings, while edge features encode bond types and stochastic weights.
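As an illustration only, a stochastic polymer graph of this kind could be held in a simple container such as the following PyTorch sketch; the field names and toy values are assumptions, not the authors' data format.

```python
import torch
from dataclasses import dataclass

@dataclass
class StochasticPolymerGraph:
    """Hypothetical container for a weighted, directed polymer graph G = (V, E)."""
    x: torch.Tensor            # [num_atoms, d_node]  atom-level features
    edge_index: torch.Tensor   # [2, num_edges]       directed source/target atom indices
    edge_attr: torch.Tensor    # [num_edges, d_edge]  bond-type features
    edge_weight: torch.Tensor  # [num_edges]          stochastic connection probabilities p_uv
    monomer_id: torch.Tensor   # [num_atoms]          which monomer subgraph each atom belongs to

# Toy example: two monomers (atoms 0-1 and 2-3) joined with connection probability 0.7.
g = StochasticPolymerGraph(
    x=torch.randn(4, 16),
    edge_index=torch.tensor([[0, 1, 2, 3, 1, 2], [1, 0, 3, 2, 2, 1]]),
    edge_attr=torch.randn(6, 4),
    edge_weight=torch.tensor([1.0, 1.0, 1.0, 1.0, 0.7, 0.7]),
    monomer_id=torch.tensor([0, 0, 1, 1]),
)
```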

A dual-encoder scheme governs data processing:

  • Context encoder $E_\mathrm{ctx}$ ingests a “context subgraph” $x \subset G$ (typically 50–75% of nodes).
  • Target encoder $E_\mathrm{tgt}$ runs on the full graph, producing node embeddings pooled over one or more “target subgraphs” $y_1, \ldots, y_m$ (each ~10–20% of nodes).

The JEPA predictor $h_\phi$ (an MLP) takes the context embedding $s_x$:

$s_x = \mathrm{Pool}_\mathrm{nodes}(E_\mathrm{ctx}(x))$

augments it with a learned positional token $\tilde{p}_i \tilde{T}$ (with $\tilde{T} \in \mathbb{R}^{k \times d}$), and outputs a prediction:

$\hat{s}_y(i) = h_\phi(s_x + \tilde{p}_i \tilde{T})$

A pseudolabel predictor $g_\psi$ (a second MLP) uses the full-graph embedding to predict the polymer molecular weight $M_w$ as an auxiliary self-supervised task.
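The PyTorch sketch below illustrates this context-to-target prediction step, with the wD-MPNN encoders stubbed out as plain linear layers; the dimensions ($d$, $k$), module names, and mean-pooling readout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

d = 64   # embedding width (assumed)
k = 20   # length of the positional encoding of a target subgraph (assumed)

class JEPAPredictor(nn.Module):
    """MLP h_phi mapping a context embedding plus a positional token to a predicted target embedding."""
    def __init__(self, d, k, hidden=128):
        super().__init__()
        self.T = nn.Parameter(torch.randn(k, d) * 0.02)  # learned token table, T in R^{k x d}
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, s_x, p_i):
        # p_i: [batch, k] positional encoding of the i-th target subgraph (e.g. an RWSE vector)
        pos_token = p_i @ self.T           # [batch, d]
        return self.mlp(s_x + pos_token)   # \hat{s}_y(i)

# Stand-in encoders: any node-level GNN followed by pooling would play these roles.
context_encoder = nn.Linear(16, d)   # placeholder for the wD-MPNN context encoder
target_encoder = nn.Linear(16, d)    # placeholder for the wD-MPNN target encoder
predictor = JEPAPredictor(d, k)
pseudolabel_head = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))  # g_psi -> M_w

x_ctx = torch.randn(10, 16)          # node features of the context subgraph
x_full = torch.randn(24, 16)         # node features of the full graph
target_idx = torch.arange(3)         # nodes belonging to one target subgraph (toy)

s_x = context_encoder(x_ctx).mean(0, keepdim=True)        # pooled context embedding
node_emb = target_encoder(x_full)                         # full-graph node embeddings
s_y = node_emb[target_idx].mean(0, keepdim=True)          # pooled target embedding
s_y_hat = predictor(s_x, torch.randn(1, k))               # predicted target embedding
mw_hat = pseudolabel_head(node_emb.mean(0, keepdim=True)) # auxiliary M_w prediction
```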

2. Subgraph Sampling and Self-Supervised Augmentation

Polymer-JEPA avoids standard node/edge masking. Instead, it generates two “views” of the graph by partitioning GG into context and target subgraphs, leveraging one of three algorithms:

  • Random-walk sampling
  • Motif-based partitioning via r-BRICS
  • METIS graph partitioning

At each training epoch, subgraphs are dynamically resampled, producing varied context–target prediction tasks and providing a form of data augmentation distinct from atom-level masking approaches.
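A minimal sketch of dynamic random-walk subgraph sampling follows; the walk length, restart policy, and subgraph sizes here are placeholder choices rather than the paper's exact settings.

```python
import random

def random_walk_subgraph(adj, start, max_nodes, max_steps=100):
    """Sample a connected node subset via a simple random walk (placeholder policy).

    adj: dict mapping each node to a list of neighbour nodes.
    """
    visited = {start}
    current = start
    for _ in range(max_steps):
        if len(visited) >= max_nodes:
            break
        neighbours = adj.get(current, [])
        if not neighbours:
            break
        current = random.choice(neighbours)
        visited.add(current)
    return visited

# Toy 6-node ring graph; context (~60% of nodes) and target (~10-20%) are resampled each epoch.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
context_nodes = random_walk_subgraph(adj, start=0, max_nodes=4)
target_nodes = random_walk_subgraph(adj, start=3, max_nodes=2)
```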

3. Training Objectives and Optimization

The principal self-supervised objective for Polymer-JEPA is a mean-squared error in embedding space over the $m$ targets:

$L_\mathrm{JEPA} = \frac{1}{m} \sum_{i=1}^{m} \left\| \hat{s}_y(i) - s_y(i) \right\|_2^2$

where $s_y(i) = \mathrm{Pool}_\mathrm{nodes}\left(E_\mathrm{tgt}(y_i)\right)$.

An auxiliary pseudolabel loss encourages the target encoder to predict the polymer molecular weight:

$L_\mathrm{PL} = \left\| g_\psi\big(\mathrm{Pool}_\mathrm{nodes}(E_\mathrm{tgt}(G))\big) - M_w \right\|_2^2$

The total pretraining loss combines both objectives, with the weighting hyperparameter $\lambda$ set to 1 in practice:

$L = L_\mathrm{JEPA} + \lambda\, L_\mathrm{PL}$
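In code, the combined objective can be written directly from the formulas above (a sketch; the tensor shapes are assumptions):

```python
import torch

def pretraining_loss(s_y_hat, s_y, mw_hat, mw_true, lam=1.0):
    """Total loss L = L_JEPA + lambda * L_PL, matching the formulas above.

    s_y_hat, s_y: [m, d] predicted and encoded target-subgraph embeddings.
    mw_hat, mw_true: predicted and true molecular weight (pseudolabel), shape [1].
    """
    l_jepa = (s_y_hat - s_y).pow(2).sum(dim=-1).mean()  # (1/m) * sum_i ||s_hat_y(i) - s_y(i)||_2^2
    l_pl = (mw_hat - mw_true).pow(2).sum()              # ||g_psi(...) - M_w||_2^2
    return l_jepa + lam * l_pl

# Toy call with random embeddings and a scalar pseudolabel.
loss = pretraining_loss(torch.randn(2, 64), torch.randn(2, 64),
                        torch.tensor([1.3]), torch.tensor([1.1]))
```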

4. Pretraining Dataset and Featurization

Pretraining is performed on the conjugated-copolymer photocatalyst dataset from Aldeghi & Coley (2022), itself constructed atop Bai et al. (2019). This corpus contains 42,966 polymers; 40% (17,186 polymers) are used for JEPA pretraining. Polymers are represented as 2D stochastic graphs:

  • Node features: atom types and learned monomer-identity embeddings
  • Edge features: bond types plus stochastic connection probabilities

No 3D coordinates or geometric features are employed; the approach relies solely on 2D graph structure.
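A hypothetical featurization sketch consistent with this description; the atom/bond vocabularies and embedding width are illustrative choices, not the paper's exact ones.

```python
import torch
import torch.nn as nn

ATOM_TYPES = ["C", "N", "O", "S", "F"]            # illustrative vocabulary only
BOND_TYPES = ["single", "double", "aromatic"]     # illustrative vocabulary only

atom_onehot = torch.eye(len(ATOM_TYPES))
bond_onehot = torch.eye(len(BOND_TYPES))
monomer_embedding = nn.Embedding(num_embeddings=64, embedding_dim=8)  # learned monomer identity

def node_features(atom_type_idx, monomer_idx):
    # Atom-type one-hot concatenated with a learned monomer-identity embedding.
    return torch.cat([atom_onehot[atom_type_idx],
                      monomer_embedding(torch.tensor(monomer_idx))])

def edge_features(bond_type_idx, p_uv):
    # Bond-type one-hot plus the stochastic connection probability as an extra channel.
    return torch.cat([bond_onehot[bond_type_idx], torch.tensor([p_uv])])

nf = node_features(0, 3)     # a carbon atom belonging to monomer 3
ef = edge_features(0, 0.7)   # a single bond with connection probability 0.7
```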

5. Downstream Application: Fine-Tuning and Benchmark Tasks

Electron-Affinity Regression

Fine-tuning utilizes the residual 40% of copolymers (~17,000), with stratified labeled-data splits for evaluation (ranging from 0.4% [192 points] to 24% [10,311 points]). The pretrained $E_\mathrm{tgt}$ weights are transferred, and an appended MLP regressor is trained end-to-end; final results are reported using five-fold cross-validation and repeated splits.
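A hedged sketch of this transfer step, with the pretrained encoder stubbed as a linear layer and the checkpoint path purely hypothetical:

```python
import torch
import torch.nn as nn

d = 64  # embedding width, matching pretraining (assumed)

# Placeholder for the pretrained wD-MPNN target encoder; in the real pipeline this would be
# the same module class used during JEPA pretraining, restored from a checkpoint, e.g.:
# target_encoder.load_state_dict(torch.load("polymer_jepa_target_encoder.pt"))  # hypothetical path
target_encoder = nn.Linear(16, d)

regressor = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 1))  # appended MLP head

# Encoder and head are fine-tuned end-to-end on the labeled electron-affinity subset.
model = nn.Sequential(target_encoder, regressor)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(32, 16)   # toy graph-level features standing in for pooled wD-MPNN embeddings
y = torch.randn(32, 1)    # toy electron-affinity labels
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```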

Diblock Phase Classification (Cross-Space Transfer)

A separate diblock copolymer phase-behavior dataset (Arora et al., 2021) comprising 4,780 labeled samples over five phase classes is used to test cross-space transfer. Fine-tuning regimes range from 4% (191 samples) to 80% (3,824 samples), and evaluation uses the area under the precision–recall curve (AUPRC) on the held-out test partition.
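One plausible way to compute a multi-class AUPRC with scikit-learn; macro averaging over the five phase classes is an assumption, as the paper's exact averaging choice is not restated here.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

classes = np.arange(5)                                 # five phase classes
y_true = np.random.randint(0, 5, size=200)             # toy held-out labels
y_score = np.random.dirichlet(np.ones(5), size=200)    # toy predicted class probabilities

# One-vs-rest AUPRC, macro-averaged over the five classes.
y_true_bin = label_binarize(y_true, classes=classes)
auprc = average_precision_score(y_true_bin, y_score, average="macro")
print(f"macro AUPRC: {auprc:.3f}")
```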

6. Performance Analysis and Ablation Studies

Quantitative Results

A summary of results for key tasks:

| Task | No pretraining | Polymer-JEPA | Notes |
|---|---|---|---|
| EA regression (R², 0.4% data) | ~0.10 ± 0.15 | ~0.45 ± 0.07 | Gains plateau at ~4% data; both converge at ~0.90 R² |
| EA regression (R², 0.8% data) | ~0.25 | ~0.50 | |
| Diblock classification (AUPRC) | Varies (lower) | +0.02–0.10 higher | Transfer gains up to +0.05 at 80% data |
| RF baseline (EA, 0.4–0.8% data) | Outperforms JEPA | – | JEPA exceeds RF at ≥4% data |
| RF baseline (diblock) | Outperforms JEPA | – | Leverages mole-fraction correlations |

Subgraphing ablations (EA, 0.4% regime) show an optimal context size of 60% of nodes ($R^2 = 0.65 \pm 0.03$; RMSE $0.35 \pm 0.02$), an optimal target size of 10% of nodes ($R^2 = 0.66 \pm 0.02$), and that a single target patch per graph yields the best performance ($R^2 = 0.67 \pm 0.01$). Random-walk and METIS sampling provide superior subgraph partitioning.

7. Interpretation, Limitations, and Future Directions

Polymer-JEPA’s embedding-space prediction compels the GNN to encode semantically rich features, focusing on global chemical dependencies, monomer sequences, and polymer topology rather than local reconstruction. The context-to-target mapping, augmented by positional random-walk structural encoding (RWSE) tokens, facilitates learning of the structural geometry inherent in polymer graphs. Cross-space transfer, evidenced by consistent AUPRC gains on diblock copolymer data, suggests the model captures generalized polymer chemistry motifs.

Performance limitations include cases where random forest baselines leveraging simple molecular statistics (e.g., mole fraction for phase prediction) can exceed JEPA, especially in ultra-low-data contexts. The absence of 3D geometric information and processing-condition metadata, together with the limited size of the pretraining corpus, defines the current limitations.

Proposed future directions include scaling pretraining to larger and more chemically diverse databases (e.g., PI1M), incorporation of 3D conformer and process data for richer embedding, exploration of graph-transformer context/target encoders, and the integration of contrastive learning (e.g., VICReg-JEPA) within the embedding space to further improve representation quality and transferability (Piccoli et al., 22 Jun 2025).

Polymer-JEPA thus exemplifies the utility of joint embedding predictive architectures in polymer informatics, providing measurable benefits in property prediction, especially under data-scarce regimes, and suggesting promising generalization across polymer chemical spaces.
