SynCraft: Reasoning-Driven Molecular Optimization
- SynCraft is a molecular optimization framework that reformulates synthesizability challenges as discrete graph-editing problems.
- It integrates LLM-driven chain-of-thought reasoning with RDKit-based deterministic edits to ensure valid chemical modifications.
- Evaluations on synthesis cliff datasets demonstrate improved synthetic feasibility and structural preservation compared to template-based methods.
SynCraft is a reasoning-driven molecular optimization framework designed to address challenges in molecular synthesizability encountered by generative models in computational chemistry. Unlike prior approaches based on post-hoc filtering or projection into pre-defined templates, SynCraft formulates synthesizability optimization as a discrete graph-editing problem, leveraging LLMs for edit planning and deterministic chemoinformatics toolkits for execution. Central to its design is the navigation of “synthesis cliffs,” regions where minimal atom- or bond-level modifications produce qualitatively enhanced synthetic feasibility while maintaining structural and pharmacophoric integrity (Li et al., 23 Dec 2025).
1. Framework Design and Workflow
SynCraft uses a two-component workflow. First, an LLM (Gemini-2.5-Pro) is prompted with Chain-of-Thought reasoning to identify synthetic liabilities within a molecule and propose an explicit, executable edit sequence. The edit commands are JSON-encoded and specify discrete atom-, bond-, and stereochemistry-level operations. Second, a deterministic chemoinformatics toolkit (RDKit) applies these edit operations to the molecular graph, ensuring chemical validity.
Traditional generative approaches issue the top-level prompt “generate a synthesizable analog of this SMILES.” In contrast, SynCraft issues the prompt “given identified synthetic liabilities (and optional biochemical constraints), propose a minimal sequence of atom- and bond-level edit commands.” This paradigm decouples molecular reasoning (LLM) from deterministic graph transformation (RDKit).
A synthesis cliff is operationally defined as a pair $(G_{\text{src}}, G_{\text{tgt}})$ in which $G_{\text{tgt}}$ is both highly similar to $G_{\text{src}}$ (in ECFP4-Tanimoto space) and synthesizable, whereas $G_{\text{src}}$ is unsynthesizable. Figure 1 of the reference provides qualitative illustrations of synthesis cliff navigation via edit sequences.
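The cliff criterion can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the reference implementation: fingerprints are represented as plain sets of on-bit indices (in practice they would come from an ECFP4 fingerprint), the function names are hypothetical, and the 0.6 cutoff is a placeholder because the actual similarity cutoff is not stated here.

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    # Convention chosen for this sketch: two empty fingerprints score 0.
    return shared / union if union else 0.0


def is_synthesis_cliff(
    fp_src: Set[int],
    fp_tgt: Set[int],
    src_synthesizable: bool,
    tgt_synthesizable: bool,
    min_similarity: float = 0.6,  # placeholder cutoff, not taken from the source
) -> bool:
    """True when an unsynthesizable source has a similar, synthesizable partner."""
    return (
        not src_synthesizable
        and tgt_synthesizable
        and tanimoto(fp_src, fp_tgt) >= min_similarity
    )


# Toy fingerprints: 3 shared bits out of 4 distinct bits gives similarity 0.75.
print(tanimoto({1, 2, 3}, {1, 2, 3, 4}))
print(is_synthesis_cliff({1, 2, 3}, {1, 2, 3, 4}, False, True))
```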
2. Formal Problem Specification
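Molecules are modeled as labeled graphs $G = (V, E)$, with atoms $v \in V$ carrying element labels and bonds $e \in E$ annotated with bond orders. An edit sequence $a = (a_1, \dots, a_T)$, with each $a_t \in \mathcal{A}$, is drawn from a defined action space $\mathcal{A}$.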
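The synthesizability optimization objective is formalized as:

$$a^{*} = \arg\max_{a} \; \mathrm{Sim}\big(G_{\text{src}},\, a(G_{\text{src}})\big) \quad \text{s.t.} \quad S\big(a(G_{\text{src}})\big) = 1,$$

where $a(G_{\text{src}})$ denotes sequential application of the edit sequence $a$ to the source graph $G_{\text{src}}$, $\mathrm{Sim}$ is the ECFP4-Tanimoto similarity, and $S$ is a binary indicator denoting whether SimpRetro can find a valid retrosynthetic route within 30 minutes. The approach is extensible to continuous synthetic accessibility scores.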
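One reading of this objective, assumed here, is a constrained selection: among candidate edited molecules, keep only those the feasibility indicator accepts and return the one most similar to the source. The sketch below shows that selection over a finite candidate set. The function name and the callables passed in are hypothetical stand-ins; a real feasibility check would call a retrosynthesis planner such as SimpRetro rather than a lookup table.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

M = TypeVar("M")  # any molecule representation


def select_best_analog(
    source: M,
    candidates: Iterable[M],
    similarity: Callable[[M, M], float],
    is_synthesizable: Callable[[M], bool],
) -> Optional[Tuple[M, float]]:
    """Return the feasible candidate most similar to the source, or None."""
    best: Optional[Tuple[M, float]] = None
    for cand in candidates:
        if not is_synthesizable(cand):  # feasibility acts as a hard constraint
            continue
        score = similarity(source, cand)
        if best is None or score > best[1]:
            best = (cand, score)
    return best


# Toy data: lookup tables stand in for the similarity and feasibility oracles.
feasible = {"A": False, "B": True, "C": True}
sims = {"A": 0.95, "B": 0.70, "C": 0.55}
best = select_best_analog(
    "src", ["A", "B", "C"],
    similarity=lambda _src, cand: sims[cand],
    is_synthesizable=lambda cand: feasible[cand],
)
print(best)
```

In the toy data, candidate A has the highest similarity but is rejected because it is infeasible, so B is returned.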
3. Edit Space and LLM-Driven Editing
3.1 Edit Operations
The edit action set includes:
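- Atom deletion: remove an atom and its incident bonds.
- Atom mutation (MUTATE_ATOM): change an atom's element.
- Atom addition (ADD_ATOM): introduce a new atom with a specified element.
- Bond addition: add a bond between two atoms.
- Bond deletion (DEL_BOND): remove a specified bond.
- Bond change (CHANGE_BOND): adjust bond order or aromaticity.
- Two stereochemistry operations: assign stereochemistry.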
These operations are parameterized by atom-map identifiers and, where applicable, bond orders and stereochemical descriptors.
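To illustrate how a JSON-encoded edit sequence might be executed deterministically, here is a minimal sketch on a toy graph structure. It is a stand-in written for this example and not SynCraft's RDKit-based executor: the `MolGraph` class, the JSON field names (`op`, `atom`, `element`, `begin`, `end`, `order`), and the command names `DEL_ATOM` and `ADD_BOND` are assumptions. Stereochemistry operations are omitted, and no valence or sanitization checks are performed.

```python
import copy
import json
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class MolGraph:
    atoms: Dict[int, str] = field(default_factory=dict)               # atom-map id -> element
    bonds: Dict[FrozenSet[int], float] = field(default_factory=dict)  # {i, j} -> bond order


def _bond_key(g: MolGraph, edit: dict) -> FrozenSet[int]:
    """Validate both endpoints and return the unordered bond key."""
    i, j = edit["begin"], edit["end"]
    if i == j or i not in g.atoms or j not in g.atoms:
        raise ValueError(f"invalid bond endpoints: {i}, {j}")
    return frozenset((i, j))


def apply_edit(g: MolGraph, edit: dict) -> None:
    """Apply one edit command to the graph in place."""
    op = edit["op"]
    if op == "DEL_ATOM":  # hypothetical name for atom deletion
        i = edit["atom"]
        del g.atoms[i]
        g.bonds = {k: v for k, v in g.bonds.items() if i not in k}
    elif op == "MUTATE_ATOM":
        if edit["atom"] not in g.atoms:
            raise KeyError(edit["atom"])
        g.atoms[edit["atom"]] = edit["element"]
    elif op == "ADD_ATOM":
        if edit["atom"] in g.atoms:
            raise ValueError("atom-map id already in use")
        g.atoms[edit["atom"]] = edit["element"]
    elif op == "ADD_BOND":  # hypothetical name for bond addition
        key = _bond_key(g, edit)
        if key in g.bonds:
            raise ValueError("bond already exists")
        g.bonds[key] = float(edit.get("order", 1.0))
    elif op == "CHANGE_BOND":
        key = _bond_key(g, edit)
        if key not in g.bonds:
            raise KeyError("no such bond")
        g.bonds[key] = float(edit["order"])
    elif op == "DEL_BOND":
        del g.bonds[_bond_key(g, edit)]
    else:
        raise ValueError(f"unsupported operation: {op}")


def apply_edits(g: MolGraph, edits_json: str) -> MolGraph:
    """Apply a JSON array of edits to a copy, leaving the input untouched."""
    out = copy.deepcopy(g)
    for edit in json.loads(edits_json):
        apply_edit(out, edit)
    return out


# Toy example: replace a C-C bond with a C-O-C linkage.
toy = MolGraph(atoms={1: "C", 2: "C"}, bonds={frozenset((1, 2)): 1.0})
edits_json = """[
  {"op": "DEL_BOND", "begin": 1, "end": 2},
  {"op": "ADD_ATOM", "atom": 3, "element": "O"},
  {"op": "ADD_BOND", "begin": 1, "end": 3, "order": 1.0},
  {"op": "ADD_BOND", "begin": 3, "end": 2, "order": 1.0}
]"""
edited = apply_edits(toy, edits_json)
print(edited.atoms, sorted(sorted(k) for k in edited.bonds))
```

Rejecting malformed commands with an exception, rather than silently skipping them, keeps the executor deterministic: an edit sequence either applies in full or fails visibly.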
3.2 Prompt Engineering and Inference Pipeline
SynCraft uses retrieval-augmented few-shot prompting. For a given input, it retrieves the top-$k$ ($k = 5$) most relevant synthesis cliff pairs and formats each example as:
- Source SMILES string
- Chain-of-Thought reasoning statement describing liabilities
- JSON array with the edit sequence
The LLM is instructed to output first an analysis, then the sequence of edits. The core inference pipeline is captured in the following pseudocode:
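```
function Optimize(G_src):
    D_knn   = RetrieveTopK(G_src, k=5)    # k most relevant synthesis cliff examples
    prompt  = BuildPrompt(G_src, D_knn)   # few-shot prompt with reasoning examples
    LLM_out = LLM_Generate(prompt)        # analysis followed by JSON edits
    a       = ParseJSON(LLM_out)          # extract the edit sequence
    G_new   = ApplyEdits(G_src, a)        # deterministic execution of the edits
    return G_new
```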
Prompt instructions enforce concise natural-language reasoning followed by edit command output, minimizing syntactic deviations.
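The retrieval, prompt-building, and parsing steps can be sketched in a few lines. This is an illustrative sketch under several assumptions, not the published prompt template: relevance is assumed to be Tanimoto similarity over fingerprint bit sets, the prompt wording and the `Edits:` marker are invented for the example, and the LLM reply is a hard-coded string rather than a real model call. All names (`CliffExample`, `retrieve_top_k`, `build_prompt`, `parse_llm_output`) are hypothetical.

```python
import heapq
import json
from typing import List, NamedTuple, Set, Tuple


class CliffExample(NamedTuple):
    source_smiles: str
    fingerprint: Set[int]  # on-bit indices of a precomputed fingerprint
    reasoning: str         # chain-of-thought statement describing liabilities
    edits: list            # edit sequence as JSON-serializable dicts


def tanimoto(a: Set[int], b: Set[int]) -> float:
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def retrieve_top_k(query_fp: Set[int], library: List[CliffExample], k: int = 5) -> List[CliffExample]:
    """Return the k library examples most similar to the query, best first."""
    return heapq.nlargest(k, library, key=lambda ex: tanimoto(query_fp, ex.fingerprint))


def build_prompt(query_smiles: str, examples: List[CliffExample]) -> str:
    """Format each example as source, reasoning, edits, then append the query."""
    parts = ["Analyze synthetic liabilities, then output a JSON array of edits."]
    for n, ex in enumerate(examples, 1):
        parts.append(
            f"Example {n}\nSource: {ex.source_smiles}\n"
            f"Analysis: {ex.reasoning}\nEdits: {json.dumps(ex.edits)}"
        )
    parts.append(f"Query\nSource: {query_smiles}\nAnalysis:")
    return "\n\n".join(parts)


def parse_llm_output(text: str) -> Tuple[str, list]:
    """Split a reply into its analysis text and its JSON edit array."""
    analysis, marker, payload = text.partition("Edits:")
    if not marker:
        raise ValueError("no 'Edits:' section found")
    edits = json.loads(payload.strip())
    if not isinstance(edits, list):
        raise ValueError("edit sequence must be a JSON array")
    return analysis.strip(), edits


library = [
    CliffExample("CC1CNC(C)CN1", {1, 2, 3, 4},
                 "Remove methyl groups to avoid stereochemical mixtures.",
                 [{"op": "DEL_BOND", "begin": 1, "end": 2}]),
    CliffExample("c1ccccc1", {9, 10}, "No liabilities identified.", []),
]
top = retrieve_top_k({1, 2, 3}, library, k=1)
prompt = build_prompt("CC1CNC(C)CN1C", top)
fake_reply = 'Methyl stereocenters complicate synthesis.\nEdits: [{"op": "DEL_BOND", "begin": 1, "end": 2}]'
analysis, edits = parse_llm_output(fake_reply)
print(prompt)
print(analysis, edits)
```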
4. Optimization, Constraints, and Loss Formulation
Interaction-aware prompting extends SynCraft to incorporate biological constraints derived from structure-based data. A PLIP-derived interaction profile, after docking the ligand into a target protein, is mapped to atom-map indices and described per atom, e.g.,
```
Atom [i] (Element E, connected to [N]): interaction_data
```
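A per-atom description in this template could be rendered as in the sketch below. The input dictionaries are hypothetical: real interaction data would be derived from a PLIP analysis of the docked pose, whereas the interaction strings here are made-up placeholders. Listing atoms without interactions as `none` is also an assumption of this example.

```python
from typing import Dict, List


def format_interaction_profile(
    atoms: Dict[int, str],             # atom-map id -> element
    neighbors: Dict[int, List[int]],   # atom-map id -> bonded atom-map ids
    interactions: Dict[int, List[str]],  # atom-map id -> interaction descriptions
) -> str:
    """Render one line per atom following the template shown above."""
    lines = []
    for i in sorted(atoms):
        nbrs = ", ".join(str(n) for n in sorted(neighbors.get(i, [])))
        data = "; ".join(interactions.get(i, [])) or "none"
        lines.append(f"Atom [{i}] (Element {atoms[i]}, connected to [{nbrs}]): {data}")
    return "\n".join(lines)


profile = format_interaction_profile(
    atoms={1: "N", 2: "C", 3: "O"},
    neighbors={1: [2], 2: [1, 3], 3: [2]},
    interactions={1: ["hydrogen bond donor"], 3: ["hydrogen bond acceptor"]},
)
print(profile)
```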
No model is trained against a loss function; instead, explicit prompting guides the LLM to optimize for:
- Structural fidelity (ECFP4-Tanimoto similarity)
- Synthetically feasible output
- Optional bio-constraints
Ablation results indicate that generating edit sequences avoids the “SMILES hallucinations” common with direct SMILES generation, yielding higher output validity and higher similarity at each threshold.
5. Datasets, Benchmarks, and Comparative Evaluation
SynCraft was evaluated using:
- The Synthesis Cliff dataset, containing 3,332 pairs from GenBench3D outputs (five generative models), matched to eMolecules neighbors.
- Test sets: 1,025 unsynthesizable Pocket2Mol (P2M) compounds and 1,725 ResGen outputs, both labeled by SimpRetro feasibility.
Baselines included the projection-based methods ChemProjector, SynFormer, and ReaSyn. The primary metric is the fraction of inputs for which the method returns an analog that is synthesizable ($S = 1$) and whose ECFP4-Tanimoto similarity to the input meets a threshold ($\mathrm{Sim} \geq \tau$), reported for several values of $\tau$.
| Dataset | Similarity threshold | SynCraft (%) | Best Baseline (%) |
|---|---|---|---|
| Pocket2Mol | | 42.7 | 30 |
| Pocket2Mol | | 28.4 | 15.4 |
| ResGen | | 44.7 | 37.5 |
| ResGen | | 29.1 | 22.0 |
Across thresholds, SynCraft consistently outperforms baselines, especially in the medium similarity regime.
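The success-rate metric is simple to compute once each input has a feasibility label and a similarity score for its returned analog. The sketch below assumes one `(is_synthesizable, similarity)` pair per input. The function name, the toy values, and the thresholds are illustrative and are not the benchmark's data.

```python
from typing import List, Tuple


def success_rate(results: List[Tuple[bool, float]], threshold: float) -> float:
    """Percentage of inputs whose analog is synthesizable and meets the threshold."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for ok, sim in results if ok and sim >= threshold)
    return 100.0 * hits / len(results)


# Toy results for four inputs: (analog is synthesizable, similarity to input).
results = [(True, 0.8), (True, 0.5), (False, 0.9), (True, 0.65)]
for tau in (0.5, 0.6):
    print(tau, success_rate(results, tau))
```

Because the denominator is the full input set, an analog that is highly similar but infeasible (the third entry) counts as a failure at every threshold.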
6. Medicinal Chemistry Case Studies
6.1 PLK1 Inhibitor Editing (Retrospective)
For input “lig-886” (bearing a 2,5-dimethylpiperazine core that introduces two chiral centers), SynCraft’s chain-of-thought output was: “Remove methyl groups to avoid stereochemical mixtures.” The executed DEL_BOND and MUTATE_ATOM edits yielded a simple piperazine, matching the scaffold of the human-synthesized lead IIP0944.
6.2 RIPK1 Candidate Rescue (Prospective)
Case I: For a biaryl linkage between electron-deficient rings, SynCraft suggested inserting an ether via ADD_ATOM (O) and CHANGE_BOND. This enabled a C–O bond formation pathway in retrosynthesis, bypassing problematic cross-coupling, and preserved H-bonding (Vina docking: kcal/mol).
Case II: In a scaffold hop from a fused diazepino-purine to a 2,4-disubstituted pyrimidine, SynCraft’s edits preserved the substituent vectors and the H-bond acceptor at N1. Docking showed maintained binding efficiency ( kcal/mol).
7. Limitations and Future Research
SynCraft reframes synthesizability as a targeted, minimal graph-editing problem. This approach outperforms template-projection methods, which may introduce large topological distortions. Its modular architecture—separating LLM-driven reasoning from deterministic chemoinformatics execution—guarantees valid chemical outputs while enabling explicit pharmacophore preservation via interaction-aware prompting.
Reported limitations include operational costs (approximately \$3–5 per 100 optimizations), limiting suitability for ultra-large library enumeration. Proposed future directions include:
- Integration of a learned, continuous feasibility–fidelity loss
- Automated retrosynthetic feedback loops
- Extension to multi-objective optimization (e.g., ADMET properties)
- Adaptation to broader graph-structured domains
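To put the quoted operating cost in perspective, the following back-of-the-envelope sketch scales the reported range of roughly \$3 to \$5 per 100 optimizations to larger libraries. It assumes cost grows linearly with the number of optimizations, which is a simplification. The function name is hypothetical.

```python
from typing import Tuple


def cost_range_usd(
    n_optimizations: int,
    low_per_100: float = 3.0,   # lower end of the reported range
    high_per_100: float = 5.0,  # upper end of the reported range
) -> Tuple[float, float]:
    """Linear extrapolation of the per-100 cost range to n optimizations."""
    batches = n_optimizations / 100
    return batches * low_per_100, batches * high_per_100


# A one-million-compound library under the linear-scaling assumption.
print(cost_range_usd(1_000_000))  # (30000.0, 50000.0)
```

At that scale the estimate reaches tens of thousands of dollars, which is consistent with the stated limitation for ultra-large library enumeration.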
Source code, datasets, and full prompt templates are publicly available at github.com/catalystforyou/SynCraft-Core (Li et al., 23 Dec 2025).