PMechDB: Curated Mechanistic Reaction Database
- PMechDB is a curated database that catalogs polar elementary-step reactions with explicit electron flow and atom mapping for precise mechanistic analysis.
- It provides comprehensive annotations including mechanistic classes, SMILES representations, and detailed reaction conditions to support accurate ML model training.
- The dataset’s expert curation and standardized JSON format enable reproducible research and effective integration into machine-learning pipelines for reaction prediction.
PMechDB is a publicly available, manually curated database of polar elementary-step reactions, expressly constructed for mechanistic ("arrow-pushing") machine-learning applications in organic chemistry. Unlike datasets that encode only overall reactant-to-product transformations (e.g., the USPTO dataset), PMechDB represents each reaction as a discrete mechanistic step, capturing the explicit electron flow through curly-arrow notation, atom mapping, and detailed mechanistic annotations. It underpins recent advances in interpretable reaction prediction and benchmarking of both deep learning and interatomic potential methodologies for small-molecule reactivity (Miller et al., 22 Apr 2025, Varga-Umbrich et al., 13 May 2026).
1. Definition, Scope, and Design Principles
PMechDB catalogs polar elementary-step reactions limited to processes formally describable as paired-electron (two-electron) events. Reactions covered include proton transfers, nucleophilic additions (e.g., 1,2-addition to carbonyls), eliminations (E1, E2), nucleophilic substitutions (SN1, SN2), heteroatom/group transfers, and rearrangements via polar intermediates. Each entry is explicitly classified by mechanistic class (e.g., "proton_transfer", "nucleophile_attack") and roles of molecular participants (acid, base, nucleophile, electrophile, leaving group).
Electron movement is precisely mapped via pairs of atom map indices, directly corresponding to curly arrows in conventional arrow-pushing schemes. For example, in a protonation step where a lone pair on O (atom 12) attacks H (atom 5), and the O–H bond forms back to O, the "arrow_codes" field is [[12,5],[5,12]], encoding full electron-flow topology at the atomistic level.
Key features:
- Strict mechanistic granularity: Each entry = one mechanistic (polar) step, not an overall multistep transformation.
- Comprehensive annotation: Mechanistic class, SMILES with standardized atom mapping, reagents, conditions (solvent, temperature, pH), intermediate states, charge balance, and explicit curly-arrow electron-flow.
- Expert curation: All entries are annotated by expert organic chemists to ensure chemical plausibility and atom/electron conservation.
- Data format: Distributed in JSON for integration into Python-based workflows.
2. Data Schema and Representation
Each PMechDB entry is a JSON object with the following core fields:
- reaction_id: Unique string identifier (e.g., "PMechDB_002134")
- reactants: Array of SMILES strings, atom-mapped
- reagents: Array of SMILES strings (if applicable)
- conditions: Object specifying experimental context (optional)
- mechanism_class: Mechanistic label ("sn2_substitution", "proton_transfer", etc.)
- atom_mapping: Mapping of SMILES indices to global atom IDs
- arrow_codes: List of [source, sink] atom index pairs representing electron move events
- intermediate_smiles: SMILES of any discrete intermediate produced (optional)
- products: Array of SMILES strings, atom-mapped
- charge_balance: Object reporting net charge on reactants/products (must match)
- split: "train", "val", or "test" for reproducible ML evaluation
Entries are balanced and partially atom-mapped, and their SMILES strings follow the Weininger convention. Example (SN2 displacement of chloride by bromide):
```json
{
  "reaction_id": "PMechDB_002134",
  "reactants": ["C[C@H](Cl)O", "[Br-]"],
  "reagents": [],
  "conditions": {"solvent": "DMF", "temp": "298K"},
  "mechanism_class": "sn2_substitution",
  "atom_mapping": {"C@H": 1, "Cl": 2, "Br": 3, "O": 4},
  "arrow_codes": [[3, 1], [1, 2]],
  "intermediate_smiles": null,
  "products": ["C[C@H](Br)O", "[Cl-]"],
  "charge_balance": {"reactants": -1, "products": -1},
  "split": "train"
}
```
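The consistency requirements stated in the schema (matching net charge, arrow codes as [source, sink] pairs, a split label) can be checked mechanically before training. The following is a minimal sketch, not code from the PMechDB authors. It assumes every field not marked optional in the list above is required, and `toy_entry` is a made-up hydroxide/HCl proton transfer used only for illustration.

```python
# Fields treated as required here (assumption: everything the schema does not mark optional).
REQUIRED_FIELDS = ("reaction_id", "reactants", "products", "mechanism_class",
                   "arrow_codes", "charge_balance", "split")
VALID_SPLITS = ("train", "val", "test")


def validate_entry(entry):
    """Return a list of problems found in one entry; an empty list means it passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        return problems
    # Net charge must be identical on both sides of the step.
    charge = entry["charge_balance"]
    if charge.get("reactants") != charge.get("products"):
        problems.append("net charge differs between reactants and products")
    # Each arrow must be a pair of integer atom indices.
    for arrow in entry["arrow_codes"]:
        is_pair = isinstance(arrow, (list, tuple)) and len(arrow) == 2
        if not (is_pair and all(isinstance(i, int) for i in arrow)):
            problems.append(f"malformed arrow code: {arrow!r}")
    if entry["split"] not in VALID_SPLITS:
        problems.append(f"unknown split: {entry['split']!r}")
    return problems


# Hypothetical entry (not taken from the database): HO- deprotonates HCl.
toy_entry = {
    "reaction_id": "example_0001",
    "reactants": ["[OH-:1]", "[H:2][Cl:3]"],
    "products": ["[OH:1][H:2]", "[Cl-:3]"],
    "mechanism_class": "proton_transfer",
    "arrow_codes": [[1, 2], [2, 3]],
    "charge_balance": {"reactants": -1, "products": -1},
    "split": "train",
}
broken_entry = dict(toy_entry, charge_balance={"reactants": -1, "products": 0})

print(validate_entry(toy_entry))
print(validate_entry(broken_entry))
```

The check deliberately stops at structural consistency; verifying that the arrows are chemically sensible would need a cheminformatics toolkit and the atom mapping.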
3. Dataset Statistics, Splits, and Augmentation
PMechDB comprises approximately 13,000 manually curated polar elementary steps. The set is fully balanced and atom-mapped, with an illustrative mechanistic composition:
| Mechanistic Class | Proportion |
|---|---|
| Proton transfer | ~34% |
| Nucleophilic attack on carbonyl | ~22% |
| SN2 substitution | ~14% |
| E2 elimination | ~8% |
| Other heteroatom transfers/rearrangements | ~22% |
Standard splits are random 80/10/10 for training (~10,400), validation (~1,300), and test (~1,300) on the curated subset.
To expand coverage, notably for proton transfer, a combinatorial strategy generated ~48 million plausible proton-transfer reactions based on comprehensive acid/base collections and estimated kinetics (Eigen/Bernasconi relationships). From these, 10,000 reactions meeting specific rate thresholds were randomly sampled and incorporated only into the training set, yielding a "mixed" split for machine-learning applications (Miller et al., 22 Apr 2025). This controlled augmentation allows improved generalization for ML models, especially in proton-transfer-rich subspaces.
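The essential property of the "mixed" split is that augmented reactions enter the training set only, so validation and test results remain comparable to the curated-only setting. A minimal sketch of that bookkeeping follows; the `augmented` flag and the toy records are illustrative assumptions, not part of the published schema.

```python
def build_mixed_split(curated, augmented):
    """Group curated entries by split, then add augmented entries to 'train' only."""
    splits = {"train": [], "val": [], "test": []}
    for entry in curated:
        splits[entry["split"]].append(entry)
    # Tag augmented entries (hypothetical 'augmented' key) so they can be filtered later.
    for entry in augmented:
        splits["train"].append({**entry, "split": "train", "augmented": True})
    return splits


# Toy stand-ins: 8/1/1 curated entries and 5 augmented proton transfers.
curated = (
    [{"reaction_id": f"cur_{i}", "split": "train"} for i in range(8)]
    + [{"reaction_id": "cur_8", "split": "val"}, {"reaction_id": "cur_9", "split": "test"}]
)
augmented = [{"reaction_id": f"aug_{i}"} for i in range(5)]

mixed = build_mixed_split(curated, augmented)
print({name: len(items) for name, items in mixed.items()})
```

With the figures quoted above, the same logic would give roughly 10,400 + 10,000 training reactions while leaving the ~1,300 validation and ~1,300 test reactions untouched.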
4. Usage in Machine Learning and Benchmarking
PMechDB is designed for seamless integration into Python-based ML pipelines. Its explicit mapping and annotation support workflows for feature engineering, graph construction, and deep learning input generation:
- Feature extraction: Parsing atom-mapped SMILES, encoding electron movement, mechanistic features.
- PyTorch dataloaders: Custom dataset objects return reactant/reagent/product SMILES and arrow codes for each sample.
- Mechanism-based querying: Filtering by mechanism_class enables targeted training and analysis.
- Evaluation: Models can be assessed on standard splits or full-pathway benchmarks.
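The dataloader and querying items above can be sketched as a small map-style dataset. This is an illustrative class, not the loader shipped with the database. It implements only `__len__` and `__getitem__`, the protocol that map-style datasets for PyTorch's `DataLoader` follow, and it joins multi-component SMILES with "." as is conventional. The class name and the toy records are assumptions.

```python
class PMechDataset:
    """Map-style dataset over PMechDB-like entries, filtered by split and mechanism class."""

    def __init__(self, entries, split="train", mechanism_class=None):
        self.entries = [
            e for e in entries
            if e["split"] == split
            and (mechanism_class is None or e["mechanism_class"] == mechanism_class)
        ]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        e = self.entries[idx]
        return {
            "reactants": ".".join(e["reactants"]),
            "reagents": ".".join(e.get("reagents") or []),
            "products": ".".join(e["products"]),
            "arrows": [tuple(a) for a in e["arrow_codes"]],
        }


# Toy records using the field names from the schema above (contents are made up).
toy_entries = [
    {"reactants": ["[OH-]", "Cl"], "products": ["O", "[Cl-]"], "arrow_codes": [[1, 2], [2, 3]],
     "mechanism_class": "proton_transfer", "split": "train"},
    {"reactants": ["CCl", "[Br-]"], "reagents": [], "products": ["CBr", "[Cl-]"],
     "arrow_codes": [[3, 1], [1, 2]], "mechanism_class": "sn2_substitution", "split": "train"},
    {"reactants": ["[NH2-]", "O"], "products": ["N", "[OH-]"], "arrow_codes": [[1, 3], [3, 2]],
     "mechanism_class": "proton_transfer", "split": "test"},
]

train_pt = PMechDataset(toy_entries, split="train", mechanism_class="proton_transfer")
print(len(train_pt), train_pt[0])
```

Because the number of arrows varies per reaction, batching such samples would need a custom collate function rather than default tensor stacking.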
Benchmarks include transformer-based architectures (Molecular Transformer, Chemformer, T5Chem), graph-based models (Graph2SMILES), and two-step Siamese networks for sequential prediction (reactive atom identification followed by arrow enumeration). The hybrid model, a Chemformer ensemble filtered by two-step Siamese ranking, achieves the highest reported top-10 product SMILES prediction accuracy (94.9%) and recovers 84.9% of textbook multi-step mechanistic pathways (Miller et al., 22 Apr 2025).
Performance metrics:
| Metric | Best result on mixed split (model) |
|---|---|
| Top-1 accuracy | 39.5% (Siamese) |
| Top-3 accuracy | 59.6% (Siamese) |
| Top-5 accuracy | 68.2% (Siamese) |
| Top-10 accuracy | 94.9% (Hybrid) |
| Multi-step pathway recovery | 84.9% (Hybrid) |
| Reactive atom top-1/3/5 hit rate | 55.4% / 86.2% / 91.1% |
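Top-k accuracy, as used in the table above, counts a reaction as correct when the true product appears among the model's k highest-ranked candidates. A minimal sketch follows. It compares strings exactly, which assumes predictions and targets have already been canonicalized the same way; how the source matches products is not specified here. The labels "A", "B", "C" are placeholders for SMILES.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of samples whose target appears in the first k ranked predictions."""
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets) if target in preds[:k]
    )
    return hits / len(targets)


# Three toy samples; each inner list is ranked best-first.
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "A"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truth, k))
```

By construction the value can only grow with k, which is why a model with a modest top-1 score can still reach a high top-10 score.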
5. Role in Energy/Force Benchmarks and Active Learning
PMechDB is used as a reactivity benchmark for machine-learning interatomic potentials (MLIPs), particularly with DFT-computed energies and per-atom forces for structures sampled along reaction trajectories (Varga-Umbrich et al., 13 May 2026). A representative HCNO-subset (10,000 structures) is employed to evaluate active learning strategies leveraging energy and force supervision.
Key innovations include force-aware Neural Tangent Kernel (NTK) representations for scalable, efficient acquisition in large candidate pools. For PMechDB, NTK-E (energy-based) marginally outperforms NTK-F (force-based) and NTK-EF (joint energy-force), which is attributed to strong energy–force covariance along reaction paths. The pool's homogeneity reduces the benefit of force-aware selection: relative to random selection, NTK-E changes the area under the energy-RMSE learning curve by −5.9%, and the final energy RMSE improves to ~16.0 meV/atom.
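An area-under-curve comparison of this kind can be computed by integrating each learning curve (energy RMSE against training-set size) and expressing the difference relative to the random baseline. The sketch below uses the trapezoidal rule and entirely made-up RMSE values. The exact integration scheme and normalization used in the cited benchmark are not given here, so this is one plausible reading rather than the authors' procedure.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve through the points (xs[i], ys[i])."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0 for i in range(len(xs) - 1)
    )


def relative_auc_change(sizes, rmse_method, rmse_random):
    """Percent change in learning-curve area versus random; negative means lower error."""
    auc_method = trapezoid_area(sizes, rmse_method)
    auc_random = trapezoid_area(sizes, rmse_random)
    return 100.0 * (auc_method - auc_random) / auc_random


# Illustrative numbers only (not results from the benchmark).
sizes = [50, 200, 350]
rmse_random = [30.0, 20.0, 10.0]
rmse_method = [30.0, 18.0, 10.0]

change = relative_auc_change(sizes, rmse_method, rmse_random)
print(f"{change:.1f}%")
```

A negative value indicates that the acquisition method reaches lower error than random selection on average over the labeling budget, even when the final RMSE values coincide.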
Methodological parameters specific to the PMechDB active learning benchmark:
| Parameter | Value |
|---|---|
| Candidate pool size | 10,000 |
| Initial (seed) training set | 50 |
| Acquisition rounds | 20 × 150 selections |
| Embedding dimension (d) | ~1,920 (MACE model) |
| Batch selection method | Chunked PV + LCMD |
| NTK regularization | λ = 1e−6 |
This suggests that, for datasets of chemically homogeneous trajectories, energy-focused active learning is largely sufficient; force information adds little until chemical diversity or geometric heterogeneity increases.
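The parameters in the table imply a simple pool-based loop. The skeleton below is a sketch under stated assumptions: "Training seed" is read as an initial labeled set of 50 structures, model retraining is omitted, and the acquisition score is a pluggable placeholder rather than the NTK, chunked PV, or LCMD machinery of the cited work.

```python
import random


def run_active_learning(pool_size=10_000, seed_size=50, rounds=20, batch_size=150,
                        score_fn=None, rng=None):
    """Return the labeled indices and the labeled-set size after each round."""
    rng = rng or random.Random(0)
    unlabeled = list(range(pool_size))
    rng.shuffle(unlabeled)
    # Initial labeled set drawn at random from the pool.
    labeled = [unlabeled.pop() for _ in range(seed_size)]
    history = [len(labeled)]
    for _ in range(rounds):
        if score_fn is None:
            # Pool is already shuffled, so taking the head is random acquisition.
            batch = unlabeled[:batch_size]
        else:
            # Placeholder for an informed strategy: highest score first.
            batch = sorted(unlabeled, key=score_fn, reverse=True)[:batch_size]
        chosen = set(batch)
        unlabeled = [i for i in unlabeled if i not in chosen]
        labeled.extend(batch)
        # A real loop would retrain the model on `labeled` here.
        history.append(len(labeled))
    return labeled, history


labeled, history = run_active_learning()
print(history[0], history[-1])
```

Under this reading the total labeling budget is 50 + 20 × 150 = 3,050 structures, a little under a third of the 10,000-structure pool.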
6. Limitations, Recommendations, and Future Directions
Documented limitations include:
- Homogeneity: The curated scope is mainly small-molecule HCNO chemistry and polar elementary steps, limiting diversity—especially compared to more heterogeneous datasets (e.g. transition-metal-catalyzed or solid-state systems).
- Augmentation coverage: Combinatorial augmentation targets proton-transfer steps; coverage of other mechanism classes is not exhaustive.
- Force-awareness: Methods incorporating per-atom force information (e.g. NTK-F, NTK-EF) do not outperform energy-only methods for PMechDB-type pools, unless structural/geometric diversity is present (Varga-Umbrich et al., 13 May 2026).
- Energy–force weighting: The w_E:w_F mixing parameter in joint NTKs is heuristic and not rigorously determined; adaptive or signal-normalized techniques remain an open area for improvement.
Recommendations for practitioners:
- Employ energy-NTK acquisition as default for reactivity pools resembling PMechDB.
- For larger, more diverse or strongly non-equilibrium datasets, consider increasing the role of force-based or joint NTK acquisition.
- Pretrained activation embeddings offer a cost-effective alternative to NTK when computational resources are limited.
- For MLIP fine-tuning, chunked posterior-variance (PV) shortlisting plus feature-space diversity sampling (LCMD) delivers linear scaling up to ~200k candidates and is well-suited for standard GPUs.
Limitations are primarily in coverage and diversity. A plausible implication is that expanding PMechDB to incorporate broader reaction types and heavier elements will amplify its utility for both mechanistic prediction and interatomic potential training (Miller et al., 22 Apr 2025, Varga-Umbrich et al., 13 May 2026).
7. Data Access and Integration
PMechDB is downloadable in JSON format at https://deeprxn.ics.uci.edu/pmechdb/download with usage guidelines for integration into ML workflows. Python code for dataset ingestion, SMILES parsing, and querying by mechanistic class is provided in (Miller et al., 22 Apr 2025). Example usage:
```python
import json

from rdkit import Chem

# Load the full dataset (a list of reaction entries).
with open('PMechDB.json') as f:
    data = json.load(f)

# Select entries by split or by mechanistic class.
train = [r for r in data if r['split'] == 'train']
nucleophiles = [r for r in data if r['mechanism_class'] == 'nucleophilic_attack']

for entry in train:
    # Chem.MolFromSmiles returns None for SMILES that RDKit cannot parse.
    reactant_mols = [Chem.MolFromSmiles(s) for s in entry['reactants']]
    product_mols = [Chem.MolFromSmiles(s) for s in entry['products']]
    arrows = entry['arrow_codes']  # list of [source, sink] atom-index pairs
```
The dataset’s granularity, mechanistic detail, and open accessibility have established PMechDB as a central resource for reaction prediction, mechanistic AI, and molecular simulation research (Miller et al., 22 Apr 2025, Varga-Umbrich et al., 13 May 2026).