
Open Polymers 2026 (OPoly26) Dataset

Updated 5 January 2026
  • The Open Polymers 2026 dataset is a comprehensive resource capturing monomer diversity, chain architectures, polymerization degrees, and solvation effects in polymers.
  • It employs single-point DFT calculations at the ωB97M-V/def2-TZVPD level with MD-based sampling to generate reliable energy and force data.
  • Benchmarking shows that training on OPoly26 lowers polymer energy MAE for ML models to below 1 kcal/mol (~32 meV) while keeping out-of-distribution performance robust.

The Open Polymers 2026 (OPoly26) dataset is a large-scale, high-fidelity resource composed of over 6.57 million single-point density functional theory (DFT) calculations on capped substructures of diverse polymer chains. Designed to address the historic paucity of quantum chemical reference data for macromolecular systems, OPoly26 systematically captures the monomer diversity, chain architectures, degree of polymerization, and solvation effects intrinsic to both biological and synthetic polymers. All quantum chemical datapoints are computed at the ωB97M-V/def2-TZVPD level and are directly accessible for training, validation, and benchmarking of machine learning interatomic potentials (MLIPs) and related atomistic models (Levine et al., 28 Dec 2025).

1. Dataset Composition and Scope

OPoly26 consists of 6,573,734 single-point DFT calculations representing clusters of up to 360 atoms, with most structures containing fewer than 250 atoms. Collectively, the dataset encompasses approximately 1.2 billion atoms. Substructures are hydrogen-capped fragments of polymer chains with degrees of polymerization ranging from ~20 to ~500 repeat-unit atoms per cluster, sampled from full-chain systems of 300–5000 atoms.
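As a rough consistency check on the figures above, the mean structure size can be estimated from the two totals. This is only a back-of-the-envelope sketch: the 1.2 billion atom count is approximate in the source, so the result is approximate too.

```python
# Figures quoted in the text; the atom total is approximate.
N_CALCULATIONS = 6_573_734
TOTAL_ATOMS_APPROX = 1.2e9


def mean_atoms_per_structure(total_atoms: float, n_structures: int) -> float:
    """Average number of atoms per DFT snapshot."""
    return total_atoms / n_structures


avg_atoms = mean_atoms_per_structure(TOTAL_ATOMS_APPROX, N_CALCULATIONS)
print(f"about {avg_atoms:.0f} atoms per structure on average")
```

The estimate falls well below the 250-atom mark, in line with the statement that most structures contain fewer than 250 atoms.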

Monomer diversity is a core design feature. The dataset spans 2444 unique repeat units from six principal classes:

  • Traditional synthetic homopolymers, such as styrenes, acrylates, and polyurethanes
  • Fluoropolymers (PFAS-like perfluoroalkyl substances, built from OMG templates)
  • Conjugated “optical” polymers
  • Polymer electrolytes (~300 backbone chemistries with explicit Li⁺, PF₆⁻, etc.)
  • Peptoids (N-substituted glycine polymers)
  • Lipid-like amphiphiles

Chain architectures in OPoly26 include linear homopolymers, alternating and random copolymers (notably high-entropy architectures with 4–10 monomers per sequence), solvated single-chain systems in 17 explicit solvents, and ion-inserted systems (20 ions per simulation cell). Notably, the dataset does not include branched, crosslinked, graft, block, or gradient architectures.

Solvation environments are represented via explicit solvent boxes, each consisting of a single ~500-atom chain and ~4500 solvent atoms, across 17 canonical solvents (including water, acetone, toluene, and THF). GAFF2 force fields (with modified fluorocarbon parameters) are used for all solvents except water, which employs TIP3P-Ewald. Boxes are packed with Packmol, atoms carry RESP charges, and the single-chain setup corresponds to the infinite dilution approximation.
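To give a sense of scale for these boxes, the sketch below converts the ~4500 solvent atoms into an approximate molecule count for the four solvents named above. The atoms-per-molecule values follow from the molecular formulas (H2O, C3H6O, C7H8, C4H8O); the actual box compositions used in the dataset are not given in the text, so these numbers are illustrative only.

```python
# Atoms per molecule from the molecular formulas of the named solvents.
ATOMS_PER_MOLECULE = {
    "water": 3,     # H2O
    "acetone": 10,  # C3H6O
    "toluene": 15,  # C7H8
    "THF": 13,      # C4H8O
}


def approx_solvent_molecules(solvent: str, solvent_atoms: int = 4500) -> int:
    """Approximate number of solvent molecules for a given atom budget."""
    return round(solvent_atoms / ATOMS_PER_MOLECULE[solvent])


for name in ATOMS_PER_MOLECULE:
    print(name, approx_solvent_molecules(name))
```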

2. Quantum Chemical Methodology

All DFT calculations utilize the range-separated hybrid meta-GGA functional ωB97M-V (incorporating nonlocal VV10 dispersion) and the def2-TZVPD basis set (triple-zeta valence with polarization and diffuse functions). Two-electron integrals are accelerated with the RI-J and COSX approximations in ORCA 6.0.0, using DEFGRID3 integration grid settings.

Structural diversity is achieved through molecular dynamics (MD) and sampling protocols, including classical (GAFF2) MD, MLIP-based MD, DFTB-based MD, and AFIR-driven reactive searches. Importantly, no DFT-based geometry optimization is performed; all computations are single-point energy and force evaluations on capped fragments.
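For orientation, a single-point energy-and-gradient job at this level of theory could be written as an ORCA input roughly as follows. This is a minimal sketch, not the authors' actual input: the keyword line is assembled from the settings described above, the `def2/J` auxiliary basis is an assumption, and `orca_single_point_input` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def orca_single_point_input(
    symbols: Sequence[str],
    coords: Sequence[Tuple[float, float, float]],
    charge: int = 0,
    multiplicity: int = 1,
) -> str:
    """Build an ORCA input for one energy + gradient evaluation (no optimization)."""
    # Assumed keyword line: functional, basis, RI-J/COSX, grid, gradient request.
    header = "! wB97M-V def2-TZVPD RIJCOSX def2/J DEFGRID3 EnGrad"
    lines = [header, f"* xyz {charge} {multiplicity}"]
    for symbol, (x, y, z) in zip(symbols, coords):
        lines.append(f"{symbol:<2s} {x:14.8f} {y:14.8f} {z:14.8f}")
    lines.append("*")  # closes the coordinate block
    return "\n".join(lines) + "\n"


# Usage: a water molecule as a stand-in for a capped polymer fragment.
example_input = orca_single_point_input(
    ["O", "H", "H"],
    [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)],
)
print(example_input)
```

`EnGrad` requests the energy and nuclear gradient at the supplied geometry, which matches the single-point (no geometry optimization) protocol described above.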

The dataset provides, for each substructure: total DFT energy $E$ (in eV), atomic forces $F_i$ (in eV/Å), HOMO and LUMO orbital energies $(\varepsilon_{\mathrm{HOMO}}, \varepsilon_{\mathrm{LUMO}})$, orbital gap $\Delta \varepsilon \equiv \varepsilon_{\mathrm{LUMO}} - \varepsilon_{\mathrm{HOMO}}$, Mulliken and Löwdin charge/spin populations, NBO charges/spins for systems of $\leq 70$ atoms, spin expectation value $\langle S^2 \rangle$, number of SCF steps, electron count, and basis function count. Dipole moments $\mu$ for each snapshot are available via ORCA output.
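One way to hold a few of these per-snapshot quantities in memory is sketched below. The class and field names are hypothetical and do not reflect the dataset's actual schema; only the gap definition (LUMO minus HOMO) is taken from the text.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SnapshotRecord:
    """Hypothetical container for a subset of the per-structure labels."""

    energy_ev: float                                  # total DFT energy E, in eV
    forces_ev_per_ang: List[Tuple[float, float, float]]  # F_i, in eV/Å
    homo_ev: float                                    # HOMO orbital energy, in eV
    lumo_ev: float                                    # LUMO orbital energy, in eV

    @property
    def gap_ev(self) -> float:
        """Orbital gap: LUMO energy minus HOMO energy."""
        return self.lumo_ev - self.homo_ev

    def max_force_norm(self) -> float:
        """Largest per-atom force magnitude, handy for spotting strained snapshots."""
        return max(math.sqrt(fx * fx + fy * fy + fz * fz)
                   for fx, fy, fz in self.forces_ev_per_ang)


# Usage with made-up numbers.
record = SnapshotRecord(
    energy_ev=-100.0,
    forces_ev_per_ang=[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    homo_ev=-7.5,
    lumo_ev=-1.0,
)
print(record.gap_ev, record.max_force_norm())
```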

The central Kohn–Sham DFT energy expression is:

$$E_{\rm DFT}[\rho] = T_s[\rho] + V_{\rm ne}[\rho] + J[\rho] + E_{\rm xc}[\rho]$$

with $\rho$ denoting the electron density, $T_s$ the noninteracting kinetic energy, $V_{\rm ne}$ the nuclear–electron attraction, $J$ the classical Coulomb repulsion, and $E_{\rm xc}$ the exchange–correlation functional.
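For reference, the individual terms take the following standard textbook forms in atomic units. This is background rather than anything specific to OPoly26: $\phi_i$ are the occupied Kohn–Sham spin orbitals, and $Z_A$, $\mathbf{R}_A$ are the nuclear charges and positions.

```latex
\begin{align}
\rho(\mathbf{r}) &= \sum_{i}^{\rm occ} |\phi_i(\mathbf{r})|^2 \\
T_s[\rho] &= -\frac{1}{2} \sum_{i}^{\rm occ} \langle \phi_i | \nabla^2 | \phi_i \rangle \\
V_{\rm ne}[\rho] &= -\sum_{A} Z_A \int \frac{\rho(\mathbf{r})}{|\mathbf{r} - \mathbf{R}_A|}\, d\mathbf{r} \\
J[\rho] &= \frac{1}{2} \iint \frac{\rho(\mathbf{r})\,\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}\, d\mathbf{r}' \\
\mathbf{F}_A &= -\frac{\partial E_{\rm tot}}{\partial \mathbf{R}_A},
\qquad E_{\rm tot} = E_{\rm DFT}[\rho] + \sum_{A<B} \frac{Z_A Z_B}{|\mathbf{R}_A - \mathbf{R}_B|}
\end{align}
```

The last line shows how atomic forces follow from the total energy, which adds the nuclear–nuclear repulsion for the fixed geometry to the electronic expression. For a hybrid functional such as ωB97M-V, $E_{\rm xc}$ additionally includes a fraction of orbital-dependent exact exchange.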

3. Data Organization and Accessibility

Each DFT reference is stored as a linked pair of files, a JSON file and an XYZ (or PDB) file, which together contain all electronic structure data, metadata, atom coordinates, a unique identifier, polymer/solvent/ion tags, and structural annotations. Complete ORCA log and .gbw wavefunction files are slated for future release.
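Reading one such JSON/XYZ pair might look like the sketch below. The XYZ parsing follows the common plain-text convention (atom count, comment line, then one `symbol x y z` line per atom); the JSON keys `id` and `forces` are hypothetical placeholders, since the actual schema is not given here.

```python
import json
from typing import Dict, List, Tuple

Coords = List[Tuple[float, float, float]]


def parse_xyz(text: str) -> Tuple[List[str], Coords, str]:
    """Parse a single-frame XYZ string into symbols, coordinates (Å), and comment."""
    lines = text.strip("\n").splitlines()
    n_atoms = int(lines[0].split()[0])
    comment = lines[1] if len(lines) > 1 else ""
    atom_lines = lines[2:2 + n_atoms]
    if len(atom_lines) != n_atoms:
        raise ValueError(f"expected {n_atoms} atom lines, found {len(atom_lines)}")
    symbols: List[str] = []
    coords: Coords = []
    for line in atom_lines:
        parts = line.split()
        symbols.append(parts[0])
        coords.append((float(parts[1]), float(parts[2]), float(parts[3])))
    return symbols, coords, comment


def load_record(json_text: str, xyz_text: str) -> Dict:
    """Join metadata and geometry, checking that atom counts agree."""
    meta = json.loads(json_text)
    symbols, coords, _ = parse_xyz(xyz_text)
    forces = meta.get("forces")  # hypothetical key name
    if forces is not None and len(forces) != len(symbols):
        raise ValueError("force array length does not match atom count")
    return {"meta": meta, "symbols": symbols, "coords": coords}
```

Checking the force array length against the atom count is a cheap guard against pairing a JSON file with the wrong geometry file.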

Recommended train/validation/test splits are:

  • Training: 6,099,878 substructures
  • Validation: 210,924 substructures
  • Test: 259,740 substructures (composition-based held-out split)
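The split sizes can be checked against the overall calculation count with a few lines of arithmetic. The three listed splits sum to 6,570,542, which is 3,192 fewer than the 6,573,734 total quoted earlier; the text does not say where the remainder belongs, so the sketch below only reports it.

```python
# Split sizes and total as quoted in the text.
SPLITS = {"train": 6_099_878, "validation": 210_924, "test": 259_740}
TOTAL_CALCULATIONS = 6_573_734

split_total = sum(SPLITS.values())
fractions = {name: count / split_total for name, count in SPLITS.items()}
unassigned = TOTAL_CALCULATIONS - split_total

for name, frac in fractions.items():
    print(f"{name:>10s}: {SPLITS[name]:>9,d} ({frac:.1%})")
print(f"not covered by the three splits: {unassigned:,d}")
```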

Out-of-distribution (OOD) assessments are facilitated via: (a) DFTB-MD homopolymer test sets (300-atom cells from DFTB MD trajectories), and (b) Si-polymer radiation degradation tests (with no Si-containing examples present during training).

The dataset is licensed under CC-BY-4.0, and the full dataset, sample code, and model training scripts are available through the Hugging Face OMol25 portal (https://huggingface.co/facebook/OMol25). Molecular simulation and extraction codebases, including MD workflows and AFIR pipelines, are maintained under the FAIR-CHEM repository (https://github.com/facebookresearch/fairchem).

4. Benchmarking and Machine Learning Utility

Benchmarking leverages the eSEN equivariant message-passing neural network (Wood et al., 2025), trained (1) on the OMol25 dataset only, (2) on OPoly26 only, and (3) jointly on OMol25 and OPoly26 with step-matched data counts. Energy and force mean absolute errors (MAEs) are reported on polymer composition and OOD tasks:

| Train Dataset | Test Comp. (E, F) | DFTB Test (E, F) | Si-Polymer (E, F) |
|---|---|---|---|
| OMol25 only | 78.3 meV, 6.5 meV/Å | 30.0 meV, 3.8 meV/Å | 184.2 meV, 5.3 meV/Å |
| OPoly26 only | 29.7 meV, 5.7 meV/Å | 31.2 meV, 4.52 meV/Å | --- |
| OMol25 + OPoly26 | 32.7 meV, 5.20 meV/Å | 30.8 meV, 3.97 meV/Å | 160.4 meV, 6.0 meV/Å |
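The energy MAEs in the table above are given in meV. Converting them to kcal/mol makes it easier to compare against the 1 kcal/mol mark, using 1 eV ≈ 23.0605 kcal/mol. The sketch below applies that conversion to the composition-test energies.

```python
KCAL_PER_MOL_PER_EV = 23.0605  # 1 eV expressed in kcal/mol


def mev_to_kcal_per_mol(value_mev: float) -> float:
    """Convert an energy from meV to kcal/mol."""
    return value_mev / 1000.0 * KCAL_PER_MOL_PER_EV


# Composition-test energy MAEs (meV) from the table above.
composition_energy_mae_mev = {
    "OMol25 only": 78.3,
    "OPoly26 only": 29.7,
    "OMol25 + OPoly26": 32.7,
}

for name, mae in composition_energy_mae_mev.items():
    print(f"{name}: {mev_to_kcal_per_mol(mae):.2f} kcal/mol")
```

On this scale 1 kcal/mol corresponds to about 43.4 meV, so the two models trained with OPoly26 fall below it while the OMol25-only model does not.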

Inclusion of OPoly26 reduces polymer energy MAE from ~78 meV to ~32 meV (roughly 0.75 kcal/mol, i.e. sub-kcal/mol accuracy); force MAEs decrease from 6.5 to 5.2 meV/Å. Joint training with OMol25 and OPoly26 also keeps errors low on diverse small-molecule and protein–ligand benchmarks, with only modest increases relative to OMol25-only training, as shown below.

| Train Dataset | Ligand Strain ΔE/RMSD | Conformer Ensemble | Protein–Ligand IxE/F |
|---|---|---|---|
| OMol25 only | 4.18 meV / 0.21 Å | 0.03 Å / 4.55 meV | 166.3 meV / 4.41 |
| OPoly26 only | 17.6 meV / 0.32 Å | --- | --- |
| OMol25 + OPoly26 | 5.09 meV / 0.20 Å | 0.04 Å / 5.26 meV | 191.9 meV / 4.85 |

A key observation is that combined training yields comparable or improved OOD energy errors. For Si-polymers, energy MAE falls by ~13% (from 184.2 to 160.4 meV), although force MAE rises from 5.3 to 6.0 meV/Å. Performance on DFTB-derived homopolymer test sets remains essentially unchanged (30.0 vs. 30.8 meV).
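The relative changes on the Si-polymer test can be recomputed directly from the first benchmark table. The sketch below compares OMol25-only training with joint training; negative values mean a lower error.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative = reduction)."""
    return (after - before) / before


# Si-polymer MAEs: OMol25 only vs. OMol25 + OPoly26.
si_energy_change = relative_change(184.2, 160.4)  # energies in meV
si_force_change = relative_change(5.3, 6.0)       # forces in meV/Å

print(f"Si-polymer energy MAE change: {si_energy_change:+.1%}")
print(f"Si-polymer force MAE change:  {si_force_change:+.1%}")
```

The energy error drops by roughly 13% while the force error grows by a similar fraction, so the two metrics are worth reporting side by side.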

5. Research Applications and Integration in ML Workflows

OPoly26 is positioned as a universal backbone for ML model development in atomistic polymer science. Primary use cases include:

  • Transferable MLIP training: Simultaneous training with OPoly26 and small-molecule datasets (OMol25) for broad chemical coverage.
  • Polymer fine-tuning: Pretraining on broadly diverse data and then fine-tuning on OPoly26 is recommended where high accuracy on polymer targets is required.
  • High-throughput MD simulation: Training MLIPs on OPoly26 facilitates molecular dynamics on 10³–10⁴ atom cells, approaching DFT accuracy at substantially increased throughput.
  • Reactivity and solvation effects: Datasets derived from AFIR-driven trajectories and explicit-solvent systems enable ML models to handle reactive, solvated, and electrolyte environments.

For consistent benchmarking, the recommended practice is to use the prescribed data splits and to report both in-distribution and OOD performance, especially on the specialized held-out test sets (e.g., Si-polymers, DFTB-MD).
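A minimal sketch of the two metrics is given below. The conventions are assumptions, since the text does not spell them out: energy MAE is taken over per-structure total energies, and force MAE is averaged over every atom and Cartesian component across all structures.

```python
from typing import Sequence

Vec3 = Sequence[float]


def energy_mae(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute error over per-structure energies."""
    if len(predicted) != len(reference):
        raise ValueError("energy lists must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def force_mae(
    predicted: Sequence[Sequence[Vec3]],
    reference: Sequence[Sequence[Vec3]],
) -> float:
    """Mean absolute error over all force components of all atoms."""
    total, count = 0.0, 0
    for pred_structure, ref_structure in zip(predicted, reference):
        for pred_atom, ref_atom in zip(pred_structure, ref_structure):
            for p, r in zip(pred_atom, ref_atom):
                total += abs(p - r)
                count += 1
    return total / count


# Usage: report each test set separately rather than pooling them.
print(energy_mae([1.0, 2.0], [1.5, 1.0]))
print(force_mae([[(1.0, 0.0, 0.0)]], [[(0.0, 0.0, 0.0)]]))
```

Keeping in-distribution and OOD sets in separate calls avoids a large in-distribution test set masking a regression on a small held-out one.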

6. Limitations and Considerations

OPoly26, while extensive, excludes certain chain architectures (branched, crosslinked, graft, block, and gradient polymers), which constrains its direct applicability to those systems. The explicit solvent set is restricted to 17 canonical solvents under infinite-dilution conditions, employing standard force fields. DFT snapshot geometries are derived from various MD protocols without quantum-level optimization, potentially introducing discrepancies for high-energy or out-of-equilibrium conformations.

A plausible implication is that researchers interested in these excluded chain types or unsampled solvent systems may need to extend the OPoly26 workflows or generate analogous DFT datasets for their use cases. Furthermore, while the dataset substantially improves ML model accuracy for polymeric materials, the performance of models trained on it should be carefully monitored on non-polymer molecular benchmarks in mixed-dataset training regimes to ensure robustness across chemical space (Levine et al., 28 Dec 2025).
