RMechDB: Radical Mechanism Database
- RMechDB is a publicly available, expertly curated database of roughly 5,500 fully balanced, atom-mapped radical elementary steps that encode explicit electron flow and orbital details.
- It employs the OrbChain formalism to represent one-electron movements with fish-hook arrows and ensures mechanistic consistency through rigorous atom mapping.
- The database underpins interpretable ML predictors like RMechRP and supports pathway enumeration in applications ranging from atmospheric chemistry to synthetic planning.
RMechDB is a publicly available, expertly curated database of elementary radical reaction steps, purpose-built to encode mechanistic detail (including atom mapping and explicit radical arrow-pushing) that is absent from broad, patent-derived chemical reaction corpora. It supports both mechanistic modeling and benchmarking of deep-learning predictors specific to radical chemistry, with annotation of molecular orbitals, mechanistic steps, and electron flow for roughly 5,500 reactions spanning textbook paradigms and modern atmospheric chemistry (Tavakoli et al., 2023).
1. Composition, Scope, and Curation
RMechDB (Radical Mechanism Database) v1.0 comprises ~5,500 fully balanced, atom-mapped radical elementary steps. Reactions are sourced from both canonical textbook radical mechanisms—such as homolytic cleavages, radical additions, and hydrogen abstractions—and from primary literature detailing atmospheric oxidation processes (notably those involving hydroxyl, peroxy, and alkoxy radicals). The database catalogs over 1,000 unique radical species, including organic, inorganic, and mixed-phase radicals.
Every entry is labeled by:
- Defined reaction class (e.g., H-abstraction, addition, β-scission, recombination)
- Number of fish-hook arrows (always two per elementary step)
- Reactive molecular orbital (MO) pair, with explicit topological and electronic center annotations
Reactions are represented using the OrbChain formalism, in which each elementary step comprises:
- Reactant and product molecular graphs (nodes: atom labels; edges: bond orders)
- A set of directed half-arrows (one-electron fish-hook arrows encoding electron flow)
- Atom mappings from reactant atoms to product atoms, with unique arrow codes specifying MO identities
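The components listed above can be pictured as a small data structure. The following is a minimal sketch, not the database's actual schema: the class names `HalfArrow` and `ElementaryStep`, the field names, and the encoding of an arrow endpoint as a tuple of atom-map numbers (one number for an atom-centred electron, two for a bond) are all assumptions made for illustration. MO descriptors are omitted.

```python
from dataclasses import dataclass

# Hypothetical encoding: a site is a tuple of atom-map numbers.
# One number = an atom-centred electron; two numbers = a bond.
Site = tuple[int, ...]


@dataclass(frozen=True)
class HalfArrow:
    """One-electron (fish-hook) arrow from a source site to a sink site."""
    source: Site
    sink: Site


@dataclass
class ElementaryStep:
    """Illustrative container for one atom-mapped radical elementary step."""
    reactant_atoms: dict[int, str]             # atom-map number -> element
    product_atoms: dict[int, str]
    reactant_bonds: dict[frozenset[int], int]  # {i, j} -> bond order
    product_bonds: dict[frozenset[int], int]
    arrows: list[HalfArrow]

    def is_atom_balanced(self) -> bool:
        # Every mapped atom must appear on both sides with the same element.
        return self.reactant_atoms == self.product_atoms


# Homolytic cleavage of Cl2: each bonding electron moves to one chlorine atom.
step = ElementaryStep(
    reactant_atoms={1: "Cl", 2: "Cl"},
    product_atoms={1: "Cl", 2: "Cl"},
    reactant_bonds={frozenset({1, 2}): 1},
    product_bonds={},
    arrows=[HalfArrow((1, 2), (1,)), HalfArrow((1, 2), (2,))],
)
```

Here `step.is_atom_balanced()` checks only atom bookkeeping; electron balance and orbital annotations would need additional fields.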
Curation by expert chemists included manual verification of atom mapping, arrow-pushing mechanisms, adherence to mass/electron balance, and consistency checks (e.g., avoidance of topologies violating Bredt’s rule) (Tavakoli et al., 2023).
2. Data Structure, Representation, and Accessibility
RMechDB is distributed in both machine- and human-readable forms:
- JSON: for each entry, provides atom-mapped SMILES for reactants/products (with isotopic radical labels), arrow-code strings (e.g., "2 ⟨– 1 ⟨–"), and OrbChain MO descriptors.
- SDF (RDKit-compatible): enables substructure search and molecular fingerprinting.
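As an illustration of working with the JSON form, the sketch below reads one entry and extracts the atom map from its SMILES strings using only the standard library. The field names `reactants` and `products` and the example entry are hypothetical, not taken from the RMechDB files. The regular expression handles only bracketed, upper-case element symbols, which is a simplification.

```python
import json
import re

# Hypothetical entry; the real RMechDB field names may differ.
raw_entry = """
{
  "reactants": "[CH3:1][H:2].[OH:3]",
  "products": "[CH3:1].[OH:3][H:2]"
}
"""

# Bracket atom with an atom-map number, e.g. [CH3:1] -> ("C", "1").
# Assumption: element symbols start with an upper-case letter (no aromatic atoms).
_MAPPED_ATOM = re.compile(r"\[\d*([A-Z][a-z]?)[^\]:]*:(\d+)\]")


def atom_map(smiles: str) -> dict[int, str]:
    """Return {atom-map number: element symbol} for all mapped atoms."""
    return {int(num): elem for elem, num in _MAPPED_ATOM.findall(smiles)}


entry = json.loads(raw_entry)
reactant_map = atom_map(entry["reactants"])
product_map = atom_map(entry["products"])

# A consistent mapping assigns each number to the same element on both sides.
is_consistent = reactant_map == product_map
```

For substructure search or fingerprinting, a cheminformatics toolkit would be the natural choice; this sketch only shows the bookkeeping that atom-mapped SMILES make possible.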
The database is partitioned into standardized training and test splits, used by Tavakoli et al. and subsequent benchmarking efforts:
| Subset | Train | Test |
|---|---|---|
| Core (textbook) | 1,512 | 150 |
| Atmospheric (specific) | 3,397 | 367 |
| Combined | 4,909 | 517 |
No separate validation set is provided; five-fold cross-validation is performed during hyperparameter tuning. Preprocessing involves canonicalization of SMILES, removal of atom mapping and arrow codes for text-based models, and generation of negative samples for contrastive learning (Tavakoli et al., 2023).
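Two of these preprocessing steps can be sketched with the standard library alone: stripping atom-map numbers for text-based models and producing five-fold cross-validation index splits. The helper names `strip_atom_maps` and `k_fold_indices` and the seed are assumptions for illustration. Full SMILES canonicalization requires a cheminformatics toolkit and is not attempted here.

```python
import random
import re

# Matches the ":<number>" part of a bracket atom such as [CH3:1].
_MAP_NUMBER = re.compile(r":\d+(?=\])")


def strip_atom_maps(smiles: str) -> str:
    """Remove atom-map numbers, leaving the rest of the SMILES untouched."""
    return _MAP_NUMBER.sub("", smiles)


def k_fold_indices(n_items: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal, disjoint folds
    for held_out in range(k):
        validation = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, validation


unmapped = strip_atom_maps("[CH3:1][H:2].[OH:3]")
splits = list(k_fold_indices(4909))  # size of the combined training split
```

Each training example lands in exactly one validation fold, so every fold's train and validation indices together cover the whole training split.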
3. Mechanistic Encoding: OrbChain Formalism
Each mechanistic step is fully specified at the orbital level:
- Fish-hook arrows represent individual one-electron movements.
- Atom mapping is explicit for every atom, ensuring mechanistic and mass/electron balance.
- Reactive MO pairs: For each step, the involved MOs (specified by atom, environment, electron count, and connectivity) are annotated, ensuring a bijective mapping between reactants, products, electron movement, and orbital transformations.
This explicit mechanistic mapping enables unambiguous translation between graph-level reaction representations and quantum-chemical mechanistic models.
4. Benchmarking Radical Reaction Prediction: RMechRP
RMechDB underpins RMechRP, a radical mechanistic reaction predictor designed for high interpretability and mechanistic accuracy. RMechRP incorporates the following model types:
- Two-step predictor: Identifies reactive sites via node classification (using atom descriptors or GNN) followed by mechanism ranking using a Siamese network.
- Contrastive mechanistic learner: Scores pairs of atoms via two-tower MLPs and interaction scoring, trained with a contrastive loss against sampled negative pairs.
- Rxn-Hypergraph attention model: Learns atomic embeddings directly from molecular hypergraphs, integrating with the same contrastive loss scheme.
- Text-based (seq2seq) molecular transformer: Pretrained on USPTO data, fine-tuned using RMechDB entries, with SMILES-level tokenization; arrow codes are omitted in text-only models.
For model evaluation, top-N "mechanistic-step accuracy" is the principal metric: the fraction of test steps for which the correct reaction mechanism is ranked among the top N predictions.
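The metric can be computed as follows. This is a minimal sketch with toy labels; the function name `top_n_accuracy` and the string labels standing in for mechanisms are assumptions for illustration.

```python
def top_n_accuracy(ranked_predictions, truths, n):
    """Fraction of examples whose true mechanism is among the first n predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, truths) if truth in ranked[:n]
    )
    return hits / len(truths)


# Toy example: three test steps, each with a ranked list of candidate mechanisms.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]

top1 = top_n_accuracy(ranked, truths, 1)
top2 = top_n_accuracy(ranked, truths, 2)
top3 = top_n_accuracy(ranked, truths, 3)
```

In this toy case the correct label is ranked first, second, and third respectively, so the accuracy rises from one third to two thirds to one as N grows.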
The performance metrics for the main models are:
| Model | Core Top-1 | Core Top-5 | Atmospheric Top-1 | Atmospheric Top-5 | Inference time (s) |
|---|---|---|---|---|---|
| Two-step (best) | 62.4% | 93.2% | 60.4% | 91.6% | 1.38 |
| Contrastive Rxn-HG | 64.3% | 95.1% | 62.1% | 94.1% | 1.45 |
| Contrastive Atom-desc | 62.9% | 94.2% | 61.0% | 93.0% | 0.08 |
| Seq2Seq fine-tuned | 57.7% | 83.9% | 57.1% | 82.2% | 1.30 |
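Contrastive Rxn-Hypergraph models achieve the highest top-1 and top-5 accuracy on both subsets. The two-step pipeline is about 2 percentage points behind at comparable inference time (1.38 s vs. 1.45 s), while the contrastive atom-descriptor model is close in accuracy and far faster (0.08 s). Text-only models lag the best model by roughly 5–7 percentage points at top-1 and 11–12 points at top-5, and do not improve with fine-tuning solely on RMechDB data (Tavakoli et al., 2023).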
5. Interpretability and Mechanistic Applications
RMechDB’s elementary-step granularity confers several interpretability features to trained predictors:
- Orbital-level predictions: Each predicted mechanistic step directly specifies the orbital pair and fish-hook arrow, suitable for mapping to SMIRKS templates.
- Pathway enumeration: Iterating single-step predictions enables construction of fully mass-/atom-balanced radical pathway trees, capturing side products and mechanistic branching not accessible to black-box overall transformation models.
- Traceable atom mapping: Atom-level accuracy ensures all intermediates are chemically valid and balanced.
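Pathway enumeration by iterating single-step predictions can be sketched as a breadth- and depth-limited tree expansion. The code below is an illustrative sketch, not RMechRP's search procedure: the `predict` callable, the representation of a state as a tuple of species labels, and the toy lookup table are all hypothetical stand-ins for a trained single-step predictor.

```python
from typing import Callable

State = tuple[str, ...]  # species present after a step (hypothetical encoding)
Predictor = Callable[[State], list[tuple[State, float]]]


def enumerate_pathways(
    start: State, predict: Predictor, breadth: int, depth: int
) -> list[list[State]]:
    """Expand the top `breadth` predicted steps per state, up to `depth` steps."""
    finished: list[list[State]] = []
    frontier: list[list[State]] = [[start]]
    for _ in range(depth):
        next_frontier = []
        for path in frontier:
            ranked = sorted(predict(path[-1]), key=lambda c: c[1], reverse=True)
            candidates = ranked[:breadth]
            if not candidates:
                finished.append(path)  # no further step predicted: dead end
                continue
            for state, _score in candidates:
                next_frontier.append(path + [state])
        frontier = next_frontier
    return finished + frontier


# Toy predictor: abstract labels stand in for real species.
TOY_STEPS = {
    ("A",): [(("B",), 0.9), (("C",), 0.5), (("D",), 0.1)],
    ("B",): [(("E",), 0.8)],
}


def toy_predict(state: State) -> list[tuple[State, float]]:
    return TOY_STEPS.get(state, [])


pathways = enumerate_pathways(("A",), toy_predict, breadth=2, depth=2)
```

With breadth 2 the lowest-scoring step from `A` is pruned, and the branch through `C` ends early because the toy predictor proposes no further step for it.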
In applied settings, RMechDB and RMechRP facilitate:
- Targeted pathway search in atmospheric chemistry (e.g., VOC oxidation, where RMechRP achieved 60% retrieval of known Master Chemical Mechanism products within 2 s per pathway under specified search breadth/depth)
- Radical-mediated polymerization mechanistic design
- Prediction of radical intermediates in enzymatic and bioinorganic systems
- Exploratory synthesis planning where traditional polar-based template approaches are inadequate
The database and predictors are accessible via online interfaces for both single-step and pathway-level tasks (Tavakoli et al., 2023).
6. Broader Impact and Positioning within Mechanistic Databases
RMechDB fills a critical gap in publicly available, mechanistically annotated radical reaction data. Unlike USPTO-based datasets, it encodes explicit electron flow, supports orbital-level mechanistic learning, and covers both textbook and state-of-the-art atmospheric radical chemistry. This enables rigorous benchmarking for next-generation radical reaction models, supports the development of interpretable ML-based predictors such as RMechRP, and provides infrastructure for mechanistic cross-comparisons with other classes of databases (e.g., polar or pericyclic mechanisms).
A plausible implication is that continued expansion of RMechDB could drive advances in interpretable, orbital-level reaction prediction for heterogeneous molecular domains overlooked by traditional retrosynthetic datasets (Tavakoli et al., 2023).