Polymer-JEPA: Self-Supervised Polymer Graphs
- The paper adapts the JEPA architecture for polymer molecular graphs to improve downstream property prediction under low-data conditions.
- It employs a dual-encoder model with dynamic subgraph sampling methods such as random-walk and METIS partitioning to capture semantically rich polymer structures.
- Experiments show Polymer-JEPA boosts regression and classification metrics, demonstrating robust transferability across diverse polymer datasets.
Polymer-JEPA is a machine learning (ML) framework that adapts the Joint Embedding Predictive Architecture (JEPA) for self-supervised pretraining on polymer molecular graphs. Developed in response to the scarcity of high-quality labeled polymer datasets, Polymer-JEPA aims to enhance downstream task performance, particularly under data-limited regimes, by learning semantically rich structural representations of stochastic polymer graphs. The approach leverages context-target prediction at the graph-embedding level and incorporates dynamic subgraph sampling based on polymer-specific graph topology (Piccoli et al., 22 Jun 2025).
1. Graph Representation and JEPA Architecture
Polymer-JEPA models each polymer as a stochastic graph $G = (V, E)$, as proposed by Aldeghi & Coley (2022), where nodes denote monomer subgraphs with atom- and bond-level features, and each edge is weighted by a connection probability $w_{ij}$ indicating the likelihood that monomer $i$ is linked to monomer $j$ in the resulting copolymer chain. The graph neural network (GNN) backbone is a weighted, directed message-passing neural network (wD-MPNN) with node-centered message passing. Initial node features include atom types and learned monomer-identity embeddings, while edge features encode bond types and stochastic weights.
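The following is a minimal sketch of edge-weighted, node-centered message passing in plain PyTorch. The layer sizes, the GRU-style update, and all names are illustrative assumptions rather than the paper's exact wD-MPNN:

```python
import torch
import torch.nn as nn

class WeightedMessagePassing(nn.Module):
    """One round of node-centered message passing where each incoming
    message is scaled by the stochastic edge weight w_ij (a sketch,
    not the paper's exact wD-MPNN)."""
    def __init__(self, node_dim: int, edge_dim: int, hidden: int = 128):
        super().__init__()
        self.msg = nn.Linear(node_dim + edge_dim, hidden)
        self.upd = nn.GRUCell(hidden, node_dim)

    def forward(self, x, edge_index, edge_attr, edge_weight):
        # x: [N, node_dim], edge_index: [2, E] (src -> dst),
        # edge_attr: [E, edge_dim], edge_weight: [E] connection probabilities
        src, dst = edge_index
        m = torch.relu(self.msg(torch.cat([x[src], edge_attr], dim=-1)))
        m = m * edge_weight.unsqueeze(-1)          # scale message by w_ij
        agg = torch.zeros(x.size(0), m.size(-1), device=x.device)
        agg.index_add_(0, dst, m)                  # sum messages per target node
        return self.upd(agg, x)                    # gated node update
```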
A dual-encoder scheme governs data processing:
- The context encoder ingests a “context subgraph” (typically 50–75% of nodes).
- The target encoder (E_tgt) runs on the full graph, producing node embeddings pooled over one or more “target subgraphs” (each ~10–20% of nodes).
The JEPA predictor (an MLP) takes the context embedding $z_{\text{ctx}}$, augments it with a learned positional token $p$ (a random-walk structural encoding, RWSE), and outputs a prediction of the target-subgraph embedding:

$$\hat{z}_{\text{tgt}} = \mathrm{MLP}\big([\,z_{\text{ctx}} \,\|\, p\,]\big)$$
A pseudolabel predictor (a second MLP) takes the full-graph embedding $z_G$ and predicts the polymer molecular weight $M_w$ as an auxiliary SSL task.
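A compact sketch of the predictor and pseudolabel heads, assuming simple two-layer MLPs; the dimensions and module names (`JEPAHeads`, `mw_head`) are hypothetical:

```python
import torch
import torch.nn as nn

class JEPAHeads(nn.Module):
    """Sketch of the JEPA predictor and pseudolabel head (dimensions and
    module names are assumptions, not the paper's exact architecture)."""
    def __init__(self, emb_dim: int = 128, pos_dim: int = 20):
        super().__init__()
        # predictor: context embedding + positional token -> target embedding
        self.predictor = nn.Sequential(
            nn.Linear(emb_dim + pos_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )
        # pseudolabel head: full-graph embedding -> molecular weight
        self.mw_head = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, 1),
        )

    def forward(self, z_ctx, pos_token, z_graph):
        z_pred = self.predictor(torch.cat([z_ctx, pos_token], dim=-1))
        mw_pred = self.mw_head(z_graph).squeeze(-1)
        return z_pred, mw_pred
```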
2. Subgraph Sampling and Self-Supervised Augmentation
Polymer-JEPA avoids standard node/edge masking. Instead, it generates two “views” of each graph by partitioning it into context and target subgraphs, using one of three algorithms:
- Random-walk sampling
- Motif-based partitioning via r-BRICS
- METIS graph partitioning
At each training epoch, subgraphs are dynamically resampled, producing varied context-target prediction tasks; this dynamic partitioning acts as an augmentation strategy distinct from atom-level masking approaches (a sampler sketch follows below).
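As referenced above, a minimal random-walk sampler over a NetworkX graph; the step cap and dead-end restart policy are illustrative choices, not the paper's exact procedure:

```python
import random
import networkx as nx

def random_walk_subgraph(g: nx.Graph, frac: float, seed=None) -> set:
    """Sample a connected node subset covering roughly `frac` of the graph
    via a random walk (one of the three partitioning options; r-BRICS and
    METIS would replace this function)."""
    rng = random.Random(seed)
    target = max(1, int(frac * g.number_of_nodes()))
    node = rng.choice(list(g.nodes))
    visited = {node}
    for _ in range(50 * target):            # step cap: avoids looping on tiny components
        if len(visited) >= target:
            break
        nbrs = list(g.neighbors(node))
        if not nbrs:
            node = rng.choice(list(visited))  # dead end: restart from visited set
            continue
        node = rng.choice(nbrs)
        visited.add(node)
    return visited

# Resampled each epoch: context ~50-75% of nodes, targets ~10-20% each.
# ctx_nodes = random_walk_subgraph(g, frac=0.6)
# tgt_nodes = random_walk_subgraph(g, frac=0.1)
```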
3. Training Objectives and Optimization
The principal self-supervised objective for Polymer-JEPA is a mean-squared error in embedding space, averaged over the $M$ target subgraphs:

$$\mathcal{L}_{\text{JEPA}} = \frac{1}{M}\sum_{m=1}^{M} \big\lVert \hat{z}_{\text{tgt}}^{(m)} - z_{\text{tgt}}^{(m)} \big\rVert_2^2,$$

where $\hat{z}_{\text{tgt}}^{(m)}$ is the predictor's output for the $m$-th target subgraph and $z_{\text{tgt}}^{(m)}$ is the corresponding target-encoder embedding.
An auxiliary pseudolabel loss encourages the target encoder to predict the polymer molecular weight $M_w$:

$$\mathcal{L}_{\text{MW}} = \big(\hat{M}_w - M_w\big)^2$$
The total pretraining loss combines both objectives, with the weighting hyperparameter $\lambda$ set to 1 in practice:

$$\mathcal{L} = \mathcal{L}_{\text{JEPA}} + \lambda\,\mathcal{L}_{\text{MW}}$$
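A minimal sketch of this combined objective in PyTorch; the tensor shapes and the name `pretraining_loss` are illustrative:

```python
import torch
import torch.nn.functional as F

def pretraining_loss(z_pred, z_tgt, mw_pred, mw_true, lam: float = 1.0):
    """Total SSL objective: embedding-space MSE over target subgraphs plus
    the molecular-weight pseudolabel MSE, weighted by lambda (= 1 here,
    matching the paper's setting)."""
    l_jepa = F.mse_loss(z_pred, z_tgt)      # predict target embeddings from context
    l_mw = F.mse_loss(mw_pred, mw_true)     # auxiliary pseudolabel regression
    return l_jepa + lam * l_mw
```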
4. Pretraining Dataset and Featurization
Pretraining is performed on the conjugated-copolymer photocatalyst dataset from Aldeghi & Coley (2022), itself constructed atop Bai et al. (2019). This corpus contains 42,966 polymers; 40% (17,186 polymers) are used for JEPA pretraining. Polymers are represented as 2D stochastic graphs:
- Node features: atom types and learned monomer-identity embeddings
- Edge features: bond types plus stochastic connection probabilities

No 3D coordinates or geometric features are employed; the approach relies solely on 2D graph structure.
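A sketch of the 2D featurization using RDKit, under the assumption that intra-monomer bonds carry weight 1.0 and stochastic inter-monomer edges (weight $w_{ij}$) are added separately; the specific atom features are illustrative, not the paper's exact scheme:

```python
import torch
from rdkit import Chem

def featurize_monomer(smiles: str, monomer_emb: torch.Tensor):
    """2D featurization sketch: simple atomic descriptors plus a learned
    monomer-identity embedding per node; bond type plus edge weight per edge."""
    mol = Chem.MolFromSmiles(smiles)
    node_feats = torch.tensor(
        [[a.GetAtomicNum(), a.GetTotalNumHs(), a.GetFormalCharge()]
         for a in mol.GetAtoms()], dtype=torch.float)
    # broadcast the monomer-identity embedding to every atom of this monomer
    node_feats = torch.cat(
        [node_feats, monomer_emb.expand(node_feats.size(0), -1)], dim=-1)
    edges, edge_feats = [], []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        for src, dst in ((i, j), (j, i)):             # directed graph: both ways
            edges.append((src, dst))
            edge_feats.append([b.GetBondTypeAsDouble(), 1.0])  # 1.0 = intra-monomer weight
    edge_index = torch.tensor(edges, dtype=torch.long).t()
    edge_attr = torch.tensor(edge_feats, dtype=torch.float)
    return node_feats, edge_index, edge_attr

# e.g. x, ei, ea = featurize_monomer("c1ccccc1", torch.zeros(8))
```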
5. Downstream Application: Fine-Tuning and Benchmark Tasks
Electron-Affinity Regression
Fine-tuning uses a further 40% of the corpus (~17,000 copolymers), disjoint from the pretraining split, with stratified labeled-data splits for evaluation (ranging from 0.4% [192 labeled points] to 24% [10,311]). The pretrained target-encoder (E_tgt) weights are transferred, an MLP regressor is appended, and the model is trained end-to-end; final results are reported using five-fold cross-validation over repeated splits.
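A schematic of the transfer step, with a toy encoder standing in for the pretrained wD-MPNN and a hypothetical checkpoint file name:

```python
import torch
import torch.nn as nn

# Placeholder encoder standing in for the pretrained target encoder E_tgt;
# the real model is the wD-MPNN described in Section 1.
class TinyEncoder(nn.Module):
    def __init__(self, in_dim: int = 32, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))
    def forward(self, x):
        return self.net(x)

encoder = TinyEncoder()
# transfer the pretrained weights (checkpoint file name is hypothetical)
encoder.load_state_dict(torch.load("e_tgt_pretrained.pt"))

# append a fresh MLP regressor for electron affinity; train end-to-end
model = nn.Sequential(encoder, nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```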
Diblock Phase Classification (Cross-Space Transfer)
A separate diblock copolymer phase-behavior dataset (Arora et al., 2021) comprising 4,780 labeled samples over five phase classes is used to test cross-space transfer. Fine-tuning regimes range from 4% (191 samples) to 80% (3,824 samples), and evaluation employs area-under-PR-curve (AUPRC) on the held-out test partition.
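For concreteness, one common way to compute a multi-class AUPRC with scikit-learn (macro averaging over the five phase classes; the paper's exact averaging scheme is an assumption):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def macro_auprc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Macro-averaged AUPRC over phase classes.
    y_true: [n] integer labels in {0..4}; y_score: [n, 5] class probabilities."""
    n_classes = y_score.shape[1]
    aps = [average_precision_score((y_true == c).astype(int), y_score[:, c])
           for c in range(n_classes)]
    return float(np.mean(aps))
```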
6. Performance Analysis and Ablation Studies
Quantitative Results
A summary of results for key tasks:
| Task | No pretraining | Polymer-JEPA | Notes |
|---|---|---|---|
| EA regression (R², 0.4% labels) | ~0.10 ± 0.15 | ~0.45 ± 0.07 | Gains plateau near 4% labels; both converge to ~0.90 R² |
| EA regression (R², 0.8% labels) | ~0.25 | ~0.50 | |
| Diblock classification (AUPRC) | lower | +0.02–0.10 higher | Cross-space transfer gain up to +0.05 at 80% labels |
| RF baseline (EA, 0.4–0.8% labels) | – | – | RF outperforms JEPA in this regime; JEPA exceeds RF from ~4% labels onward |
| RF baseline (diblock) | – | – | RF outperforms JEPA by leveraging mole-fraction correlations |
Subgraphing ablations (EA regression, 0.4% regime) show an optimal context size of 60% of nodes, an optimal target size of 10% of nodes, and best performance with a single target patch per graph. Random-walk and METIS methods provide superior subgraph partitioning.
7. Interpretation, Limitations, and Future Directions
Polymer-JEPA’s embedding-space prediction compels the GNN to encode semantically rich features, focusing on global chemical dependencies, monomer sequences, and polymer topology rather than local reconstruction. The context-to-target mapping, augmented by positional RWSE tokens, facilitates learning of structural geometry inherent in polymer graphs. Cross-space transfer, evidenced by consistent AUPRC gains on diblock copolymer data, suggests the model captures generalized polymer chemistry motifs.
Performance limitations include cases where random-forest baselines leveraging simple molecular statistics (e.g., mole fraction for phase prediction) exceed JEPA, especially in ultra-low-data contexts. The absence of 3D geometric information and processing-condition metadata, together with the limited size of the pretraining corpus, defines the current limitations.
Proposed future directions include scaling pretraining to larger and more chemically diverse databases (e.g., PI1M), incorporation of 3D conformer and process data for richer embedding, exploration of graph-transformer context/target encoders, and the integration of contrastive learning (e.g., VICReg-JEPA) within the embedding space to further improve representation quality and transferability (Piccoli et al., 22 Jun 2025).
Polymer-JEPA thus exemplifies the utility of joint embedding predictive architectures in polymer informatics, providing measurable benefits in property prediction, especially under data-scarce regimes, and suggesting promising generalization across polymer chemical spaces.