MechSMILES: Reaction Mechanism Notation
- MechSMILES is a rigorous representation for reaction mechanisms that encodes molecular connectivity and explicit electron flow using an arrow-pushing formalism.
- It enforces strict mass and charge conservation, ensuring each mechanistic step remains chemically valid.
- MechSMILES enhances computer-assisted synthesis planning by providing a compact, human- and machine-readable format ideal for training language models.
MechSMILES is a rigorously defined textual representation for chemical reaction mechanisms that encodes both molecular connectivity and explicit electron flow using a minimal, human- and machine-readable syntax. Developed to bridge the mechanistic explainability gap in computer-assisted synthesis planning (CASP) systems, MechSMILES leverages arrow-pushing formalism to represent elementary mechanistic steps with strict enforcement of mass and charge conservation. The format provides a compact codification of mechanistic moves that directly supports training and evaluation of LLMs for mechanism prediction and analysis (Neukomm et al., 5 Dec 2025).
1. Formal Syntax and Grammatical Specification
The fundamental MechSMILES unit encodes a single elementary mechanistic step as a pairing of a fully atom-mapped SMILES string with a semicolon-separated list of explicit electron flow "arrows." The formal grammar, using LaTeX BNF notation, is:
$\begin{array}{rcl}
\langle\text{MechSMILES}\rangle &\Coloneqq& \langle\text{MappedSMILES}\rangle\;\texttt{|}\;\langle\text{ArrowSeq}\rangle \\[6pt]
\langle\text{ArrowSeq}\rangle &\Coloneqq& \langle\text{Arrow}\rangle \bigl(\,\texttt{;}\;\langle\text{Arrow}\rangle\bigr)^{*} \\[4pt]
\langle\text{Arrow}\rangle &\Coloneqq& \langle\text{Attack}\rangle\;\mid\;\langle\text{Ionization}\rangle\;\mid\;\langle\text{BondAttack}\rangle \\[4pt]
\langle\text{Attack}\rangle &\Coloneqq& \texttt{(}\,\langle\text{AtomIdx}\rangle\,\texttt{,}\,\langle\text{AtomIdx}\rangle\,\texttt{)} \\[3pt]
\langle\text{Ionization}\rangle &\Coloneqq& \texttt{((}\langle\text{AtomIdx}\rangle\texttt{,}\langle\text{AtomIdx}\rangle\texttt{),} \langle\text{AtomIdx}\rangle\texttt{)} \\[3pt]
\langle\text{BondAttack}\rangle &\Coloneqq& \texttt{((}\langle\text{AtomIdx}\rangle\texttt{,}\langle\text{AtomIdx}\rangle\texttt{),} \langle\text{AtomIdx}\rangle\texttt{)} \\[3pt]
\langle\text{AtomIdx}\rangle &\Coloneqq& \texttt{1}\mid\texttt{2}\mid\cdots\mid\texttt{9}\mid\texttt{10}\mid\cdots
\end{array}$
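Each SMILES is atom-mapped (e.g., `[O-:2].[C:1](=[O:3])`), and arrows are expressed as tuples referencing these indices. The vertical bar between the mapped SMILES and the arrow sequence is a literal separator character, not a grammar alternative. Ionization and BondAttack share the same concrete syntax; they are told apart by context: in an Ionization the final index repeats one of the two bonded atoms, whereas in a BondAttack it names a third atom.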
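The grammar above is small enough to parse with two regular expressions. The following is a minimal sketch, not the reference parser: the function names (`parse_arrow`, `parse_mechsmiles`, `classify`) and the example string are hypothetical, and the sketch assumes the SMILES part contains no `|` character, which holds for standard SMILES.

```python
import re
from typing import List, Tuple, Union

# An arrow is either (a, b) or ((a, b), c), mirroring the grammar above.
Attack = Tuple[int, int]
BondArrow = Tuple[Tuple[int, int], int]
Arrow = Union[Attack, BondArrow]

_IDX = r"\s*([1-9]\d*)\s*"  # AtomIdx: positive integer, optional whitespace
_ATTACK = re.compile(rf"\({_IDX},{_IDX}\)")
_BOND = re.compile(rf"\(\s*\({_IDX},{_IDX}\)\s*,{_IDX}\)")


def parse_arrow(text: str) -> Arrow:
    """Parse one arrow such as '(2,1)' or '((1,2),3)'."""
    text = text.strip()
    m = _BOND.fullmatch(text)
    if m:
        a, b, c = map(int, m.groups())
        return ((a, b), c)
    m = _ATTACK.fullmatch(text)
    if m:
        a, b = map(int, m.groups())
        return (a, b)
    raise ValueError(f"not a valid arrow: {text!r}")


def parse_mechsmiles(mech: str) -> Tuple[str, List[Arrow]]:
    """Split 'SMILES|arrow;arrow;...' into the SMILES and a list of arrows."""
    smiles, sep, arrow_seq = mech.rpartition("|")
    if not sep:
        raise ValueError("missing '|' separator")
    return smiles, [parse_arrow(t) for t in arrow_seq.split(";")]


def classify(arrow: Arrow) -> str:
    """Name the arrow type; Ionization repeats a bonded atom as the target."""
    if isinstance(arrow[0], tuple):
        (a, b), c = arrow
        return "Ionization" if c in (a, b) else "BondAttack"
    return "Attack"


# Illustrative input (hypothetical, not taken from the paper).
smiles, arrows = parse_mechsmiles("[OH-:2].[CH3+:1]|(2,1)")
```

Splitting on the last `|` keeps the SMILES opaque to the parser, so any SMILES toolkit can be used for the molecular part.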
2. Arrow-Pushing Semantics
Three fundamental mechanistic arrow types are supported, all corresponding to a two-electron curved arrow in the classical notation. Lone pairs are tracked implicitly via atom valence and formal charges.
- Attack `(a, b)`: Removes one lone pair from atom $a$ and increments the bond order between $a$ and $b$ by one.
- Ionization `((a, b), b)`: Removes one bond order between $a$ and $b$ (or deletes the bond if single) and places formal charge $+1$ on $a$ and $-1$ on $b$; no other changes.
- BondAttack `((a, b), c)`: Removes one bond order between $a$ and $b$ and increments the bond order of the bond formed to $c$ by one; no changes to charges.
Explicit products are not included; the result is computed by applying the arrow(s) to the input graph.
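Applying arrows to a graph can be sketched with a dictionary of bond orders and a dictionary of formal charges. This is an illustration under stated assumptions, not the authors' implementation: the `MolGraph` class and `apply_arrow` function are hypothetical, the Attack step assumes standard formal-charge bookkeeping (donor $+1$, acceptor $-1$), and the BondAttack step assumes the new bond forms between $b$ and $c$, which the description above does not pin down.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class MolGraph:
    charges: Dict[int, int]  # atom map index -> formal charge
    bonds: Dict[FrozenSet[int], int] = field(default_factory=dict)  # {a, b} -> order

    def order(self, a: int, b: int) -> int:
        return self.bonds.get(frozenset((a, b)), 0)

    def change_order(self, a: int, b: int, delta: int) -> None:
        key = frozenset((a, b))
        new = self.bonds.get(key, 0) + delta
        if new < 0:
            raise ValueError(f"no bond between {a} and {b} to break")
        if new == 0:
            self.bonds.pop(key, None)  # a single bond is deleted outright
        else:
            self.bonds[key] = new


def apply_arrow(g: MolGraph, arrow) -> None:
    """Mutate g in place according to one two-electron arrow."""
    if isinstance(arrow[0], tuple):
        (a, b), c = arrow
        g.change_order(a, b, -1)
        if c in (a, b):
            # Ionization: the bonding pair ends up on c, the other atom loses it.
            other = a if c == b else b
            g.charges[other] += 1
            g.charges[c] -= 1
        else:
            # BondAttack. ASSUMPTION: the new bond is b-c; charges are left
            # untouched, as in the description above.
            g.change_order(b, c, +1)
    else:
        a, b = arrow
        g.change_order(a, b, +1)
        # ASSUMPTION: standard bookkeeping for a lone pair turned into a bond.
        g.charges[a] += 1
        g.charges[b] -= 1


# Hydroxide oxygen (2) attacks a carbocation carbon (1); hydrogens omitted.
g = MolGraph(charges={1: +1, 2: -1})
apply_arrow(g, (2, 1))        # g.bonds == {frozenset({1, 2}): 1}, charges both 0
apply_arrow(g, ((1, 2), 2))   # ionization undoes the attack
```

Because each branch adds $+1$ to one atom and $-1$ to another (or leaves charges alone), the total charge of the graph is unchanged by construction.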
3. Illustrative Canonical Examples
Each MechSMILES string concisely renders canonical arrow-pushing steps:
| Mechanistic Step | Mechanistic Cartoon | MechSMILES String |
|---|---|---|
| Nucleophilic Attack (HO⁻ on Carbonyl) | HO⁻:2 ⟶ C=O:1 | `[O-:2].C:1\|…` |
| Proton Transfer to Alkoxide | H⁺:4 ⟶ O⁻:1 | `[H+:4].[O-:1]\|…` |
| Heterolytic Cleavage of C–C Bond | R–CH₂–CH₂–R ⟶ R–CH₂⁺ + ⁻CH₂–R | `[C:1]-[C:2]\|…` |
| σ Bond Attack in Elimination | σ(C1–C2) ⟶ Base:3 | `((1,2),3)` |
In all cases, atom indices correspond to the integer map tags in SMILES.
4. Conservation Laws and Physical Validity
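MechSMILES enforces strict conservation of both mass and charge at every step. For each transformation $G \to G'$:
- Atomic counts: $N_E(G) = N_E(G')$ for every element $E$
- Formal charge: $\sum_{i \in G} q_i = \sum_{i \in G'} q_i$
Or, equivalently: $\sum_{i \in G} (Z_i - q_i) = \sum_{i \in G'} (Z_i - q_i)$, where $Z_i$ is the atomic number and $q_i$ the formal charge for atom $i$.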
This construction ensures that each elementary step redistributes, but never creates or annihilates, electrons and nuclei. All reaction moves thus remain chemically valid under standard valence and charge-consistency rules.
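A conservation check of this kind reduces to comparing element counts and summed formal charges before and after a step. The sketch below is illustrative only: the `(element, charge)` atom encoding, the function names, and the small atomic-number table are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Tuple

Atom = Tuple[str, int]  # (element symbol, formal charge)

# Hypothetical lookup covering only the elements used in the example.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8}


def conserved(before: Iterable[Atom], after: Iterable[Atom]) -> bool:
    """True if element counts and total formal charge both match."""
    before, after = list(before), list(after)
    same_atoms = Counter(e for e, _ in before) == Counter(e for e, _ in after)
    same_charge = sum(q for _, q in before) == sum(q for _, q in after)
    return same_atoms and same_charge


def electron_count(atoms: Iterable[Atom]) -> int:
    """Total electrons: atomic number minus formal charge, summed over atoms."""
    return sum(ATOMIC_NUMBER[e] - q for e, q in atoms)


# Hydroxide + methyl cation -> methanol, with every hydrogen listed explicitly.
reactants = [("O", -1), ("H", 0), ("C", +1), ("H", 0), ("H", 0), ("H", 0)]
product = [("O", 0), ("H", 0), ("C", 0), ("H", 0), ("H", 0), ("H", 0)]

print(conserved(reactants, product))        # True
print(conserved(reactants, product[:-1]))   # False: one hydrogen went missing
print(electron_count(reactants), electron_count(product))  # 18 18
```

Listing hydrogens explicitly matters here: a check on heavy atoms alone would not notice a dropped proton.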
5. Encode, Decode, and Arrow-Inference Algorithms
The MechSMILES paradigm is fundamentally bidirectional: graph–string–graph. Key encode/decode operations are provided in rigorous pseudocode form.
- Encoding: Given a mapped graph and an arrow, produce a MechSMILES string by concatenating the mapped SMILES, a vertical bar, and the arrow spec.

  ```
  function encode_step(G, arrow):
      smiles = canonicalMappedSmiles(G)
      return smiles + "|" + arrow.toString()
  ```

- Decoding: Given a MechSMILES string, parse it into a SMILES and an arrow list, then apply each arrow sequentially.

  ```
  function decode_step(mech):
      [smiles, arrowList] = split_on_character(mech, '|')
      G = parseSmilesToGraph(smiles)
      for arrowText in split(arrowList, ';'):
          arrow = parseArrow(arrowText)
          applyArrow(G, arrow)
      return G
  ```

- Arrow Inference: Given a pair of graphs $(G, G')$, diff bond orders and formal charges to infer the arrow(s) that map $G$ to $G'$.
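The arrow-inference idea can be sketched for the simplest case, a step that consists of exactly one arrow. This is a hedged illustration rather than the published algorithm: `infer_arrow` and its graph encoding are hypothetical, the Attack and Ionization branches assume standard formal-charge bookkeeping (the atom that gives up electrons goes $+1$, the one that receives them goes $-1$), and the BondAttack branch assumes the atom shared by the broken and formed bonds is written second.

```python
from typing import Dict, FrozenSet

Bonds = Dict[FrozenSet[int], int]   # {a, b} -> bond order
Charges = Dict[int, int]            # atom map index -> formal charge


def _donor_acceptor(pair, dq):
    """Return (atom that became more positive, atom that became more negative)."""
    x, y = sorted(pair)
    if dq.get(x, 0) == 1 and dq.get(y, 0) == -1:
        return x, y
    if dq.get(y, 0) == 1 and dq.get(x, 0) == -1:
        return y, x
    raise ValueError("charge change does not match a single arrow")


def infer_arrow(bonds0: Bonds, charges0: Charges, bonds1: Bonds, charges1: Charges):
    """Infer the one arrow that turns (bonds0, charges0) into (bonds1, charges1)."""
    keys = set(bonds0) | set(bonds1)
    delta = {k: bonds1.get(k, 0) - bonds0.get(k, 0) for k in keys}
    gained = [k for k, d in delta.items() if d == 1]
    lost = [k for k, d in delta.items() if d == -1]
    if any(abs(d) > 1 for d in delta.values()) or len(gained) > 1 or len(lost) > 1:
        raise ValueError("not a single-arrow step")
    dq = {i: charges1[i] - charges0[i] for i in charges0}

    if gained and not lost:                     # Attack: (donor, acceptor)
        return _donor_acceptor(gained[0], dq)
    if lost and not gained:                     # Ionization: ((a, b), b)
        a, b = _donor_acceptor(lost[0], dq)
        return ((a, b), b)
    if lost and gained:                         # BondAttack: ((a, b), c)
        shared = lost[0] & gained[0]
        if len(shared) != 1:
            raise ValueError("broken and formed bonds must share one atom")
        (b,) = shared
        (a,) = lost[0] - shared
        (c,) = gained[0] - shared
        return ((a, b), c)
    raise ValueError("graphs are identical")


bond = frozenset((1, 2))
# Oxygen anion (2) forms a bond to a cationic carbon (1).
arrow = infer_arrow({}, {1: +1, 2: -1}, {bond: 1}, {1: 0, 2: 0})   # (2, 1)
```

Multi-arrow steps would need the diff to be decomposed into an ordered sequence of such moves, which this sketch does not attempt.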
6. Tokenization, Training, and Application in LLMs
MechSMILES expands the standard SMILES tokenizer to include arrow-notation punctuation (specifically `(`, `)`, `,`, `;`, and `|`) and an extended atom index vocabulary, yielding a total of approximately 260 tokens. For sequence-to-sequence machine learning, input strings are annotated with `[reac]`, `[prod]`, and `[mech]` to delimit reactants, products, and mechanism. Data augmentation presents reactions in both forward and retro formats; by-products are sometimes omitted to enhance stoichiometric learning.
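A tokenizer along these lines can be approximated with two regular expressions, one for the SMILES part and one for the arrow part. The sketch below is an assumption-laden illustration, not the published tokenizer: the `tokenize` function is hypothetical, bracket atoms are kept as single tokens, and each atom index in the arrow sequence is treated as one token regardless of its digit count. The actual vocabulary may differ.

```python
import re
from typing import List

# SMILES side: bracket atoms, two-letter halogens, %nn ring closures,
# then any remaining single character (atoms, bonds, ring digits, branches).
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
# Arrow side: whole atom indices plus the arrow punctuation.
_ARROW_TOKEN = re.compile(r"\d+|[(),;]")


def tokenize(mech: str) -> List[str]:
    """Split 'SMILES|arrows' into model tokens; '|' becomes its own token."""
    smiles, sep, arrows = mech.rpartition("|")
    if not sep:
        raise ValueError("missing '|' separator")
    arrow_tokens = _ARROW_TOKEN.findall(arrows)
    if "".join(arrow_tokens) != arrows:
        raise ValueError("unexpected character in arrow sequence")
    return _SMILES_TOKEN.findall(smiles) + ["|"] + arrow_tokens


# Illustrative input (hypothetical, not taken from the paper).
tokens = tokenize("[OH-:2].[CH3+:1]|(2,1)")
# ['[OH-:2]', '.', '[CH3+:1]', '|', '(', '2', ',', '1', ')']
```

Tokenizing the two halves separately avoids a subtle clash: in SMILES, adjacent digits such as `12` are two ring-closure labels, whereas in an arrow `12` is a single atom index.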
The character efficiency of MechSMILES is notable: it averages approximately 77 characters per step, 44% smaller than competing formats, which enables efficient transformer-based sequence modeling. Models trained on MechSMILES, such as T5 variants (4 layers, 6 heads), demonstrate >95% top-3 accuracy on elementary step prediction and high end-to-end mechanism accuracy (>90% on key benchmarks). Small beam widths (3–5) suffice for nearly exhaustive mechanism recovery across complex datasets, and transfer learning to new mechanistic classes is possible with only a few dozen expert-annotated samples.
7. Applications and Integration in Explainable CASP
By grounding mechanism predictions in explicit, conservation-respecting electron moves, MechSMILES enables mechanistically explainable CASP. Main application areas include:
- Post-hoc CASP Validation: Filtering out transformations lacking plausible electron flow.
- Holistic Atom Mapping: Tracking all atoms, including hydrogens, throughout reaction steps.
- Catalyst-aware Template Extraction: Automatically distinguishing recycled catalysts from spectator species.
Being agnostic to neural architecture, MechSMILES provides a benchmarking standard for mechanism prediction, facilitating mechanistic transparency and physically meaningful predictions within the broader workflow of computational synthesis planning (Neukomm et al., 5 Dec 2025).