PMechDB: Curated Mechanistic Reaction Database

Updated 25 May 2026

PMechDB is a curated database that catalogs polar elementary-step reactions with explicit electron flow and atom mapping for precise mechanistic analysis.
It provides comprehensive annotations including mechanistic classes, SMILES representations, and detailed reaction conditions to support accurate ML model training.
The dataset’s expert curation and standardized JSON format enable reproducible research and effective integration into machine-learning pipelines for reaction prediction.

PMechDB is a publicly available, manually curated database of polar elementary-step reactions, expressly constructed for mechanistic ("arrow-pushing") machine-learning applications in organic chemistry. Unlike datasets that encode only overall reactant-to-product transformations (e.g., the USPTO dataset), PMechDB represents each reaction as a discrete mechanistic step, capturing the explicit electron flow through curly-arrow notation, atom mapping, and detailed mechanistic annotations. It underpins recent advances in interpretable reaction prediction and benchmarking of both deep learning and interatomic potential methodologies for small-molecule reactivity (Miller et al., 22 Apr 2025, Varga-Umbrich et al., 13 May 2026).

1. Definition, Scope, and Design Principles

PMechDB catalogs polar elementary-step reactions limited to processes formally describable as paired-electron (two-electron) events. Reactions covered include proton transfers, nucleophilic additions (e.g., 1,2-addition to carbonyls), eliminations (E1, E2), nucleophilic substitutions (SN1, SN2), heteroatom/group transfers, and rearrangements via polar intermediates. Each entry is explicitly classified by mechanistic class (e.g., "proton_transfer", "nucleophile_attack") and roles of molecular participants (acid, base, nucleophile, electrophile, leaving group).

Electron movement is precisely mapped via pairs of atom map indices, directly corresponding to curly arrows in conventional arrow-pushing schemes. For example, in a protonation step where a lone pair on O (atom 12) attacks H (atom 5), and the O–H bond forms back to O, the "arrow_codes" field is [[12,5],[5,12]], encoding full electron-flow topology at the atomistic level.
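To make the encoding concrete, the following minimal sketch renders an arrow_codes list as readable source-to-sink steps and checks that every index refers to a known mapped atom. The helper name describe_arrows and the atom_labels lookup are illustrative assumptions, not part of any tooling distributed with PMechDB.

```python
def describe_arrows(arrow_codes, atom_labels):
    """Render [source, sink] pairs as human-readable electron-flow steps.

    atom_labels is a hypothetical {atom_map_index: element_symbol} lookup.
    """
    steps = []
    for source, sink in arrow_codes:
        for idx in (source, sink):
            if idx not in atom_labels:
                raise KeyError(f"arrow references unmapped atom {idx}")
        steps.append(f"{atom_labels[source]}{source} -> {atom_labels[sink]}{sink}")
    return steps


# The protonation example from the text: O is atom 12, H is atom 5.
protonation_steps = describe_arrows([[12, 5], [5, 12]], {12: "O", 5: "H"})
print(protonation_steps)
```

Raising on an unmapped index is a design choice for this sketch: an arrow that points at an atom absent from the mapping cannot correspond to a valid curly arrow.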

Key features:

Strict mechanistic granularity: Each entry = one mechanistic (polar) step, not an overall multistep transformation.
Comprehensive annotation: Mechanistic class, SMILES with standardized atom mapping, reagents, conditions (solvent, temperature, pH), intermediate states, charge balance, and explicit curly-arrow electron-flow.
Expert curation: All entries are annotated by expert organic chemists to ensure chemical plausibility and atom/electron conservation.
Data format: Distributed in JSON for integration into Python-based workflows.

2. Data Schema and Representation

Each PMechDB entry is a JSON object with the following core fields:

reaction_id: Unique string identifier (e.g., "PMechDB_002134")
reactants: Array of SMILES strings, atom-mapped
reagents: Array of SMILES strings (if applicable)
conditions: Object specifying experimental context (optional)
mechanism_class: Mechanistic label ("sn2_substitution", "proton_transfer", etc.)
atom_mapping: Mapping of SMILES indices to global atom IDs
arrow_codes: List of [source, sink] atom index pairs representing electron move events
intermediate_smiles: SMILES of any discrete intermediate produced (optional)
products: Array of SMILES strings, atom-mapped
charge_balance: Object reporting net charge on reactants/products (must match)
split: "train", "val", or "test" for reproducible ML evaluation

Entries are balanced and atom-mapped, and SMILES representations adhere to the Weininger convention. Example (SN2 displacement of chloride by bromide):

{
  "reaction_id": "PMechDB_002134",
  "reactants": ["C[C@H](Cl)O", "[Br-]"],
  "reagents": [],
  "conditions": {"solvent":"DMF","temp":"298K"},
  "mechanism_class":"sn2_substitution",
  "atom_mapping":{"C@H":1,"Cl":2,"Br":3,"O":4},
  "arrow_codes":[[3,1],[1,2]],
  "intermediate_smiles":null,
  "products":["C[C@@H](Br)O","[Cl-]"],
  "charge_balance":{"reactants":-1,"products":-1},
  "split":"train"
}
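A quick consistency check over the fields described above might look like the sketch below. It is a sketch under stated assumptions: the choice of which fields count as required, the function name validate_entry, and the example entry (a hypothetical hydroxide/HCl proton transfer with an invented identifier) are illustrative and are not taken from the dataset.

```python
# Assumed set of mandatory fields; optional ones (conditions,
# intermediate_smiles, reagents) are deliberately left out.
REQUIRED_FIELDS = {
    "reaction_id", "reactants", "products", "mechanism_class",
    "arrow_codes", "charge_balance", "split",
}
VALID_SPLITS = {"train", "val", "test"}


def validate_entry(entry):
    """Return a list of problems found in one entry (empty list means it passes)."""
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    if entry["split"] not in VALID_SPLITS:
        problems.append(f"unknown split: {entry['split']!r}")
    balance = entry["charge_balance"]
    if balance.get("reactants") != balance.get("products"):
        problems.append("net charge differs between reactants and products")
    for arrow in entry["arrow_codes"]:
        if len(arrow) != 2 or not all(isinstance(i, int) for i in arrow):
            problems.append(f"malformed arrow code: {arrow!r}")
    return problems


# Hypothetical entry, for illustration only (not a real PMechDB record).
example_entry = {
    "reaction_id": "EXAMPLE_0001",
    "reactants": ["[OH-:1]", "[H:2][Cl:3]"],
    "products": ["[H:2][OH:1]", "[Cl-:3]"],
    "mechanism_class": "proton_transfer",
    "arrow_codes": [[1, 2], [2, 3]],
    "charge_balance": {"reactants": -1, "products": -1},
    "split": "train",
}
print(validate_entry(example_entry))
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file in one pass.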

3. Dataset Statistics, Splits, and Augmentation

PMechDB comprises approximately 13,000 manually curated polar elementary steps. The set is fully balanced and atom-mapped, with an illustrative mechanistic composition:

Mechanistic Class	Proportion
Proton transfer	~34%
Nucleophilic attack on carbonyl	~22%
SN2 substitution	~14%
E2 elimination	~8%
Other heteroatom transfers/rearrangements	~22%

Standard splits are random 80/10/10 for training (~10,400), validation (~1,300), and test (~1,300) on the curated subset.
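The split sizes and class proportions quoted above can be recomputed from any loaded copy of the data. The sketch below uses a ten-entry toy list in place of the real file; the function name summarize is an assumption, and only the split and mechanism_class fields from the schema are used.

```python
from collections import Counter


def summarize(entries):
    """Count entries per split and compute the fraction per mechanism class."""
    split_counts = Counter(e["split"] for e in entries)
    class_counts = Counter(e["mechanism_class"] for e in entries)
    total = len(entries)
    class_fractions = {name: n / total for name, n in class_counts.items()}
    return split_counts, class_fractions


# Toy stand-in for the loaded dataset: 10 entries split 8/1/1.
toy_entries = (
    [{"split": "train", "mechanism_class": "proton_transfer"}] * 5
    + [{"split": "train", "mechanism_class": "sn2_substitution"}] * 3
    + [{"split": "val", "mechanism_class": "proton_transfer"}]
    + [{"split": "test", "mechanism_class": "sn2_substitution"}]
)

split_counts, class_fractions = summarize(toy_entries)
print(dict(split_counts))
print(class_fractions)
```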

To expand coverage, notably for proton transfer, a combinatorial strategy generated ~48 million plausible proton-transfer reactions based on comprehensive acid/base collections and estimated kinetics (Eigen/Bernasconi relationships). From these, 10,000 reactions meeting specific rate thresholds were randomly sampled and incorporated only into the training set, yielding a "mixed" split for machine-learning applications (Miller et al., 22 Apr 2025). This controlled augmentation allows improved generalization for ML models, especially in proton-transfer-rich subspaces.
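The augmentation logic described here (filter by a rate threshold, sample at random, add to training only) can be sketched as follows. The field name log_rate, the threshold value, and the function name build_mixed_training_set are hypothetical; the source does not specify how rates are stored or which thresholds were used.

```python
import random


def build_mixed_training_set(curated, candidates, n_samples, min_log_rate, seed=0):
    """Append randomly sampled, rate-filtered candidates to the data as train-only entries.

    Validation and test entries in `curated` are passed through unchanged.
    """
    eligible = [c for c in candidates if c["log_rate"] >= min_log_rate]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # rng.sample raises ValueError if fewer than n_samples candidates are eligible.
    sampled = rng.sample(eligible, n_samples)
    augmented = [dict(c, split="train") for c in sampled]
    return curated + augmented


# Toy data: three curated entries and twenty generated candidates.
toy_curated = [{"id": "c0", "split": "train"},
               {"id": "c1", "split": "val"},
               {"id": "c2", "split": "test"}]
toy_candidates = [{"id": f"g{i}", "log_rate": float(i)} for i in range(20)]

mixed = build_mixed_training_set(toy_curated, toy_candidates,
                                 n_samples=5, min_log_rate=10.0)
print(len(mixed), sum(e["split"] == "train" for e in mixed))
```

Keeping the generated reactions out of validation and test means that held-out scores remain comparable between the curated-only and mixed training settings.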

4. Usage in Machine Learning and Benchmarking

PMechDB is designed for seamless integration into Python-based ML pipelines. Its explicit mapping and annotation support workflows for feature engineering, graph construction, and deep learning input generation:

Feature extraction: Parsing atom-mapped SMILES, encoding electron movement, mechanistic features.
PyTorch dataloaders: Custom dataset objects return reactant/reagent/product SMILES and arrow codes for each sample.
Mechanism-based querying: Filtering by mechanism_class enables targeted training and analysis.
Evaluation: Models can be assessed on standard splits or full-pathway benchmarks.
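As an illustration of the dataloader and querying points above, the sketch below defines a plain-Python class that implements __len__ and __getitem__, which is the map-style protocol that torch.utils.data.DataLoader accepts. The class name PMechDBSplit and the returned dictionary layout are assumptions; PyTorch itself is not imported, so the block runs without it.

```python
class PMechDBSplit:
    """Map-style dataset over one split, optionally restricted to one mechanism class."""

    def __init__(self, entries, split, mechanism_class=None):
        self.entries = [
            e for e in entries
            if e["split"] == split
            and (mechanism_class is None or e["mechanism_class"] == mechanism_class)
        ]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        e = self.entries[idx]
        return {
            # "." joins disconnected species into a single SMILES string.
            "reactants": ".".join(e["reactants"]),
            "reagents": ".".join(e.get("reagents", [])),
            "products": ".".join(e["products"]),
            "arrow_codes": e["arrow_codes"],
        }


# Toy entries standing in for the loaded JSON.
toy = [
    {"split": "train", "mechanism_class": "proton_transfer",
     "reactants": ["[OH-:1]", "[H:2][Cl:3]"], "reagents": [],
     "products": ["[H:2][OH:1]", "[Cl-:3]"], "arrow_codes": [[1, 2], [2, 3]]},
    {"split": "test", "mechanism_class": "proton_transfer",
     "reactants": ["[OH-:1]", "[H:2][Br:3]"], "reagents": [],
     "products": ["[H:2][OH:1]", "[Br-:3]"], "arrow_codes": [[1, 2], [2, 3]]},
]

train_set = PMechDBSplit(toy, split="train", mechanism_class="proton_transfer")
print(len(train_set), train_set[0]["reactants"])
```

Because arrow_codes lists vary in length between reactions, batching such samples with a DataLoader would likely need a custom collate_fn.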

Benchmarks include transformer-based architectures (Molecular Transformer, Chemformer, T5Chem), graph-based models (Graph2SMILES), and two-step Siamese networks for sequential prediction (reactive atom identification followed by arrow enumeration). The hybrid model—Chemformer ensemble filtered by two-step Siamese ranking—achieves state-of-the-art top-10 product SMILES prediction accuracy (94.9%) and recovers 84.9% of textbook multi-step mechanistic pathways (Miller et al., 22 Apr 2025).

Performance metrics:

Metric	Value (best-performing model, mixed split)
Top-1 accuracy	39.5% (Siamese)
Top-3 accuracy	59.6% (Siamese)
Top-5 accuracy	68.2% (Siamese)
Top-10 accuracy	94.9% (Hybrid)
Multi-step pathway recovery	84.9% (Hybrid)
Reactive atom top-1/3/5 hit rate	55.4% / 86.2% / 91.1%
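For reference, top-k accuracy as reported in the table is the fraction of test reactions whose true product appears among the model's k highest-ranked predictions. A minimal sketch follows; it assumes predictions and targets are already canonicalized to the same string form, and the placeholder strings are not real SMILES.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets found among the first k predictions for each sample."""
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


# Placeholder strings stand in for canonical product SMILES.
toy_predictions = [["A", "B", "C"], ["X", "Y", "Z"], ["P", "Q", "R"]]
toy_targets = ["A", "Y", "S"]

top1 = top_k_accuracy(toy_predictions, toy_targets, k=1)
top3 = top_k_accuracy(toy_predictions, toy_targets, k=3)
print(top1, top3)
```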

5. Role in Energy/Force Benchmarks and Active Learning

PMechDB is used as a reactivity benchmark for machine-learning interatomic potentials (MLIPs), particularly with DFT-computed energies and per-atom forces for structures sampled along reaction trajectories (Varga-Umbrich et al., 13 May 2026). A representative HCNO-subset (10,000 structures) is employed to evaluate active learning strategies leveraging energy and force supervision.

Key innovations include force-aware Neural Tangent Kernel (NTK) representations for scalable, efficient acquisition in large candidate pools. For PMechDB, NTK-E (energy-based) methods marginally outperform NTK-F (force) and NTK-EF (joint energy-force), a direct result of strong energy–force covariance along reaction paths. The pool's homogeneity reduces the benefit of force-aware selection; the area-under-curve reduction in energy RMSE relative to random is −5.9% (NTK-E), and final energy RMSE improves to ~16.0 meV/atom.
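One way to read the area-under-curve figure is as the relative difference between the integrated learning curve (energy RMSE versus number of labeled structures) of an acquisition method and that of random selection. The sketch below assumes this definition and uses trapezoidal integration; the exact definition used in the benchmark is not given here, and the curve values are invented for illustration.

```python
def trapezoid_auc(x, y):
    """Area under a piecewise-linear curve through the points (x[i], y[i])."""
    return sum(
        (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2
        for i in range(len(x) - 1)
    )


def relative_auc_change(sizes, rmse_method, rmse_random):
    """Percent change in learning-curve AUC versus random (negative = better)."""
    auc_method = trapezoid_auc(sizes, rmse_method)
    auc_random = trapezoid_auc(sizes, rmse_random)
    return 100.0 * (auc_method - auc_random) / auc_random


# Invented learning curves (meV/atom) at three training-set sizes.
sizes = [50, 200, 350]
rmse_random = [40.0, 30.0, 20.0]
rmse_method = [40.0, 27.0, 19.0]

change = relative_auc_change(sizes, rmse_method, rmse_random)
print(f"{change:.2f}%")
```

Integrating over the whole curve rewards methods that reach low error early, not only those with the best final RMSE.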

Methodological parameters specific to the PMechDB active learning benchmark:

Parameter	Value
Candidate pool size	10,000
Training seed	50
Acquisition rounds	20 × 150 selections
Embedding dimension (d)	~1,920 (MACE model)
Batch selection method	Chunked PV + LCMD
NTK regularization	λ = 1e−6

This suggests that, for datasets of chemically homogeneous trajectories, energy-focused active learning is largely sufficient; force information adds little until chemical diversity or geometric heterogeneity increases.

6. Limitations, Recommendations, and Future Directions

Documented limitations include:

Homogeneity: The curated scope is mainly small-molecule HCNO chemistry and polar elementary steps, limiting diversity—especially compared to more heterogeneous datasets (e.g. transition-metal-catalyzed or solid-state systems).
Augmentation bandwidth: While augmented with combinatorial proton transfer steps, augmentation is not exhaustive for other mechanisms.
Force-awareness: Methods incorporating per-atom force information (e.g. NTK-F, NTK-EF) do not outperform energy-only methods for PMechDB-type pools, unless structural/geometric diversity is present (Varga-Umbrich et al., 13 May 2026).
Energy–force weighting: The w_E:w_F mixing parameter in joint NTKs is heuristic and not rigorously determined; adaptive or signal-normalized techniques remain an open area for improvement.

Recommendations for practitioners:

Employ energy-NTK acquisition as default for reactivity pools resembling PMechDB.
For larger, more diverse or strongly non-equilibrium datasets, consider increasing the role of force-based or joint NTK acquisition.
Pretrained activation embeddings offer a cost-effective alternative to NTK when computational resources are limited.
For MLIP fine-tuning, chunked posterior-variance (PV) shortlisting plus feature-space diversity sampling (LCMD) delivers linear scaling up to ~200k candidates and is well-suited for standard GPUs.

The limitations are primarily in coverage and diversity. A plausible implication is that expanding PMechDB to broader reaction types and heavier elements would increase its utility for both mechanistic prediction and interatomic potential training (Miller et al., 22 Apr 2025, Varga-Umbrich et al., 13 May 2026).

7. Data Access and Integration

PMechDB is downloadable in JSON format at https://deeprxn.ics.uci.edu/pmechdb/download with usage guidelines for integration into ML workflows. Python code for dataset ingestion, SMILES parsing, and querying by mechanistic class is provided in (Miller et al., 22 Apr 2025). Example usage:

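import json
from rdkit import Chem

# Load the full dataset (a JSON array of reaction entries).
with open('PMechDB.json') as f:
    data = json.load(f)

# Select the training split.
train = [r for r in data if r['split'] == 'train']

# Filter by mechanistic class; the label must match the dataset's
# mechanism_class vocabulary (e.g., "nucleophile_attack").
nucleophiles = [r for r in data if r['mechanism_class'] == 'nucleophile_attack']

for entry in train:
    # Chem.MolFromSmiles returns None for SMILES it cannot parse.
    reactant_mols = [Chem.MolFromSmiles(s) for s in entry['reactants']]
    product_mols = [Chem.MolFromSmiles(s) for s in entry['products']]
    if any(m is None for m in reactant_mols + product_mols):
        continue  # skip entries with unparsable SMILES
    arrows = entry['arrow_codes']  # list of [source, sink] atom-map index pairs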
The dataset’s granularity, mechanistic detail, and open accessibility have established PMechDB as a central resource for reaction prediction, mechanistic AI, and molecular simulation research (Miller et al., 22 Apr 2025, Varga-Umbrich et al., 13 May 2026).


References

Miller et al. (22 Apr 2025). Interpretable Deep Learning for Polar Mechanistic Reaction Prediction.

Varga-Umbrich et al. (13 May 2026). Force-Aware Neural Tangent Kernels for Scalable and Robust Active Learning of MLIPs.
