SynCraft: Reasoning-Driven Molecular Optimization

Updated 5 February 2026
  • SynCraft is a molecular optimization framework that reformulates synthesizability challenges as discrete graph-editing problems.
  • It integrates LLM-driven chain-of-thought reasoning with RDKit-based deterministic edits to ensure valid chemical modifications.
  • Evaluations on synthesis cliff datasets demonstrate improved synthetic feasibility and structural preservation compared to template-based methods.

SynCraft is a reasoning-driven molecular optimization framework designed to address the synthesizability problems that generative models in computational chemistry often run into. Unlike prior approaches based on post-hoc filtering or projection into pre-defined templates, SynCraft formulates synthesizability optimization as a discrete graph-editing problem, using large language models (LLMs) for edit planning and deterministic chemoinformatics toolkits for execution. Central to its design is the navigation of “synthesis cliffs,” regions where minimal atom- or bond-level modifications produce qualitatively enhanced synthetic feasibility while maintaining structural and pharmacophoric integrity (Li et al., 23 Dec 2025).

1. Framework Design and Workflow

SynCraft uses a two-component workflow. First, an LLM (Gemini-2.5-Pro) is prompted with Chain-of-Thought reasoning to identify synthetic liabilities within a molecule and propose an explicit, executable edit sequence. The edit commands are JSON-encoded and specify discrete atom-, bond-, and stereochemistry-level operations. Second, a deterministic chemoinformatics toolkit (RDKit) applies these edit operations to the molecular graph, ensuring chemical validity.

Traditional generative approaches issue the top-level prompt “generate a synthesizable analog of this SMILES.” In contrast, SynCraft issues the prompt “given identified synthetic liabilities (and optional biochemical constraints), propose a minimal sequence of atom- and bond-level edit commands.” This paradigm decouples molecular reasoning (LLM) from deterministic graph transformation (RDKit).

A synthesis cliff is operationally defined as a pair $(M_\text{src}, M_\text{tgt})$ where $M_\text{tgt}$ is both highly similar (in ECFP4-Tanimoto space) and synthesizable, whereas $M_\text{src}$ is unsynthesizable. Figure 1 of the reference provides qualitative illustrations of synthesis cliff navigation via edit sequences.
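
The definition can be made concrete with a small sketch. It assumes fingerprints are given as sets of on-bit indices (in practice these would come from an ECFP4 implementation), and that synthesizability labels are supplied by the caller. The function names and the default cutoff of 0.5 are illustrative assumptions; the cutoff mirrors the Tanimoto filter used when the dataset was built, not a value fixed by the definition itself.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(fp_a & fp_b) / len(union)


def is_synthesis_cliff(fp_src, fp_tgt, src_synthesizable, tgt_synthesizable, min_sim=0.5):
    """True when the target is similar to the source and synthesizable, and the source is not."""
    return (
        not src_synthesizable
        and tgt_synthesizable
        and tanimoto(fp_src, fp_tgt) > min_sim
    )


# Toy fingerprints (made-up bit indices): 3 shared bits out of 5 distinct bits.
fp_src = {1, 2, 3, 4}
fp_tgt = {1, 2, 3, 5}
print(tanimoto(fp_src, fp_tgt))                         # 0.6
print(is_synthesis_cliff(fp_src, fp_tgt, False, True))  # True
```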

2. Formal Problem Specification

Molecules are modeled as labeled graphs $G = (V, E)$, with atoms $V=\{v_1, \ldots, v_n\}$ carrying element labels $\ell(v_i)\in\{\text{C},\text{N},\ldots\}$ and bonds $E \subset V \times V$ annotated with orders $o(e) \in \{1,2,3,\text{aromatic}\}$. An edit sequence $\mathbf{a} = (a_1, a_2, \ldots, a_T)$, $a_t \in \mathcal{A}$, is drawn from a defined action space $\mathcal{A}$.

The synthesizability optimization objective is formalized as:

$$\mathbf{a}^* = \arg\max_{\mathbf{a}}\, \mathrm{Sim}(G_\text{src}, G_\text{src} \oplus \mathbf{a}) \quad \text{s.t.}\quad \mathrm{Retro}(G_\text{src} \oplus \mathbf{a}) = 1,$$

where $\oplus\, \mathbf{a}$ denotes sequential application of $\mathbf{a}$ to $G_\text{src}$, $\mathrm{Sim}(\cdot, \cdot)$ is the ECFP4-Tanimoto similarity, and $\mathrm{Retro}(\cdot)$ is a binary indicator denoting whether SimpRetro can find a valid retrosynthetic route within 30 minutes. The approach is extensible to continuous synthetic accessibility scores $S(G)$.
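
As a sketch of how this objective could be evaluated over a finite pool of candidate edit sequences, one can discard infeasible candidates and keep the most similar survivor. The callables `apply_edits`, `similarity`, and `retro_ok` are hypothetical stand-ins for graph editing, ECFP4-Tanimoto similarity, and a SimpRetro feasibility check. The toy usage replaces molecules with integer sets purely so the example runs without a chemistry toolkit.

```python
def best_feasible_edit(g_src, candidates, apply_edits, similarity, retro_ok):
    """Return (edit_sequence, similarity) maximizing Sim subject to Retro == 1, or None."""
    best = None
    for a in candidates:
        g_new = apply_edits(g_src, a)
        if not retro_ok(g_new):          # constraint: Retro(G_src (+) a) = 1
            continue
        score = similarity(g_src, g_new)  # objective: Sim(G_src, G_src (+) a)
        if best is None or score > best[1]:
            best = (a, score)
    return best


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 1.0


# Toy stand-in: a "molecule" is a set of atom ids, an "edit" removes atoms,
# and atom 4 plays the role of the synthetic liability.
g_src = frozenset({1, 2, 3, 4})
candidates = [frozenset(), frozenset({4}), frozenset({3, 4})]
best = best_feasible_edit(
    g_src,
    candidates,
    apply_edits=lambda g, a: g - a,
    similarity=jaccard,
    retro_ok=lambda g: 4 not in g,
)
print(best)  # (frozenset({4}), 0.75)
```

The smallest edit that removes the liability wins, which is the behavior the constrained maximization is meant to encode.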

3. Edit Space and LLM-Driven Editing

3.1 Edit Operations

The edit action set $\mathcal{A}$ includes:

  • $\text{DEL\_ATOM}(i)$: Remove atom $i$ and its bonds.
  • $\text{MUTATE\_ATOM}(i, X \rightarrow Y)$: Change atom $i$’s element.
  • $\text{ADD\_ATOM}(j, Y)$: Introduce new atom $j \geq 500$ with element $Y$.
  • $\text{ADD\_BOND}(i,j, \text{order})$: Add a bond.
  • $\text{DEL\_BOND}(i,j)$: Remove specified bond.
  • $\text{CHANGE\_BOND}(i,j, \text{new\_order})$: Adjust bond parameters/aromaticity.
  • $\text{SET\_CHIRAL}(i, R/S)$ and $\text{SET\_BOND\_STEREO}(i,j, E/Z)$: Assign stereochemistry.

These operations are parameterized by atom-map identifiers and, where applicable, bond orders and stereochemical descriptors.
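
The following is a minimal sketch of how such commands might be applied to a graph. It is not SynCraft's executor: it uses a plain dictionary instead of an RDKit molecule, performs no valence or aromaticity checks, omits the stereochemistry operations, and the JSON field names (`op`, `atom`, `element`, `i`, `j`, `order`) are assumptions made for illustration.

```python
import copy


def _bond_key(i, j):
    return frozenset((i, j))  # undirected bond between atom-map ids i and j


def apply_edits(graph, edits):
    """Apply edit commands in order to a copy of
    graph = {"atoms": {map_id: element}, "bonds": {frozenset({i, j}): order}}."""
    g = copy.deepcopy(graph)
    atoms, bonds = g["atoms"], g["bonds"]
    for e in edits:
        op = e["op"]
        if op == "DEL_ATOM":
            del atoms[e["atom"]]
            for key in [k for k in bonds if e["atom"] in k]:
                del bonds[key]  # an atom's bonds are removed with it
        elif op == "MUTATE_ATOM":
            if e["atom"] not in atoms:
                raise KeyError(e["atom"])
            atoms[e["atom"]] = e["element"]
        elif op == "ADD_ATOM":
            if e["atom"] in atoms:
                raise ValueError(f"atom {e['atom']} already exists")
            atoms[e["atom"]] = e["element"]
        elif op == "ADD_BOND":
            if e["i"] not in atoms or e["j"] not in atoms:
                raise KeyError("both bond endpoints must exist")
            bonds[_bond_key(e["i"], e["j"])] = e["order"]
        elif op == "DEL_BOND":
            del bonds[_bond_key(e["i"], e["j"])]
        elif op == "CHANGE_BOND":
            key = _bond_key(e["i"], e["j"])
            if key not in bonds:
                raise KeyError("cannot change a bond that does not exist")
            bonds[key] = e["order"]
        else:
            raise ValueError(f"unsupported operation: {op}")
    return g


# Illustrative edit sequence: replace a direct C-C link with a C-O-C bridge.
src = {"atoms": {1: "C", 2: "C"}, "bonds": {frozenset((1, 2)): 1}}
edits = [
    {"op": "DEL_BOND", "i": 1, "j": 2},
    {"op": "ADD_ATOM", "atom": 500, "element": "O"},
    {"op": "ADD_BOND", "i": 1, "j": 500, "order": 1},
    {"op": "ADD_BOND", "i": 500, "j": 2, "order": 1},
]
new = apply_edits(src, edits)
```

Working on a copy keeps the source graph intact, so the edited result can still be compared against it afterwards.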

3.2 Prompt Engineering and Inference Pipeline

SynCraft uses retrieval-augmented few-shot prompting. For a given input, it retrieves the top-$k$ ($k=5$) most relevant synthesis cliff pairs $(M_\text{src}, M_\text{tgt})$ and formats each example as:

  • Source SMILES string
  • Chain-of-Thought reasoning statement describing liabilities
  • JSON array with the edit sequence
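
One way the retrieval and formatting steps could fit together is sketched below. Ranking by Tanimoto similarity over precomputed fingerprint sets is an assumption (the description above says only "most relevant"), and the record fields (`fp`, `src_smiles`, `reasoning`, `edits`) and the prompt wording are hypothetical.

```python
import json


def retrieve_top_k(query_fp, library, k=5):
    """Return the k library examples whose fingerprints are most similar to the query."""
    def sim(fp):
        union = query_fp | fp
        return len(query_fp & fp) / len(union) if union else 0.0
    return sorted(library, key=lambda ex: sim(ex["fp"]), reverse=True)[:k]


def build_prompt(src_smiles, examples):
    """Format each example as SMILES, reasoning, and JSON edits, then append the query."""
    blocks = [
        f"Source SMILES: {ex['src_smiles']}\n"
        f"Analysis: {ex['reasoning']}\n"
        f"Edits: {json.dumps(ex['edits'])}"
        for ex in examples
    ]
    blocks.append(f"Source SMILES: {src_smiles}\nAnalysis:")
    return "\n\n".join(blocks)


# Made-up library entries, for illustration only.
library = [
    {"fp": {1, 2, 3}, "src_smiles": "CCO", "reasoning": "example A",
     "edits": [{"op": "DEL_ATOM", "atom": 3}]},
    {"fp": {7, 8, 9}, "src_smiles": "CCN", "reasoning": "example B", "edits": []},
    {"fp": {1, 2, 9}, "src_smiles": "CCC", "reasoning": "example C", "edits": []},
]
top = retrieve_top_k({1, 2, 3}, library, k=2)
prompt = build_prompt("CCOC", top)
```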

The LLM is instructed to output first an analysis, then the sequence of edits. The core inference pipeline is captured in the following pseudocode:

function Optimize(G_src):
    D_knn = RetrieveTopK(G_src, k=5)
    prompt = BuildPrompt(G_src, D_knn)
    LLM_out = LLM_Generate(prompt)
    a = ParseJSON(LLM_out)
    G_new = ApplyEdits(G_src, a)
    return G_new

Prompt instructions enforce concise natural-language reasoning followed by edit command output, minimizing syntactic deviations.
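
The pseudocode can be rendered as runnable Python along the following lines. This is a sketch under stated assumptions: the retrieval, prompt-building, LLM, and graph-editing components are passed in as callables because their real implementations are outside the scope of this summary, and `parse_edits` assumes the edit sequence is the first JSON array of objects appearing in the model output.

```python
import json


def parse_edits(llm_out):
    """Extract the first JSON array of objects from free-form LLM output."""
    decoder = json.JSONDecoder()
    for pos, ch in enumerate(llm_out):
        if ch != "[":
            continue
        try:
            value, _ = decoder.raw_decode(llm_out, pos)
        except json.JSONDecodeError:
            continue  # this '[' did not start valid JSON; keep scanning
        if isinstance(value, list) and all(isinstance(x, dict) for x in value):
            return value
    raise ValueError("no JSON edit array found in LLM output")


def optimize(g_src, retrieve_top_k, build_prompt, llm_generate, apply_edits, k=5):
    """Python rendering of Optimize(G_src): retrieve, prompt, generate, parse, apply."""
    examples = retrieve_top_k(g_src, k)
    prompt = build_prompt(g_src, examples)
    llm_out = llm_generate(prompt)
    edits = parse_edits(llm_out)
    return apply_edits(g_src, edits)


# Stubbed run: the "LLM" returns an analysis followed by an edit array.
fake_output = 'Analysis: atom [3] is a liability.\n[{"op": "DEL_ATOM", "atom": 3}]'
result = optimize(
    "CCO",
    retrieve_top_k=lambda g, k: [],
    build_prompt=lambda g, examples: "prompt",
    llm_generate=lambda prompt: fake_output,
    apply_edits=lambda g, edits: (g, edits),
)
```

Requiring a list of objects lets the parser skip bracketed atom references such as `[3]` that may appear in the natural-language analysis.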

4. Optimization, Constraints, and Loss Formulation

Interaction-aware prompting extends SynCraft to incorporate biological constraints derived from structure-based data. After the ligand is docked into a target protein, a PLIP-derived interaction profile is mapped to atom-map indices and described per atom, e.g.,

Atom [i] (Element E, connected to [N]): interaction_data
This is prepended under “[CRITICAL BIOLOGICAL CONSTRAINTS],” explicitly instructing the LLM to preserve or only bioisosterically modify these atoms.
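
A small sketch of how such a constraint block might be rendered is shown below. Only the per-atom line template and the section header come from the description above. The input record layout is hypothetical, `[N]` is interpreted here as the list of neighboring atom-map indices (an assumption), and the interaction text in the example is made up.

```python
def format_constraints(records):
    """Render per-atom interaction records as a constraint block for the prompt."""
    lines = ["[CRITICAL BIOLOGICAL CONSTRAINTS]"]
    for r in sorted(records, key=lambda r: r["atom"]):
        neighbors = ", ".join(str(n) for n in r["neighbors"])
        lines.append(
            f"Atom [{r['atom']}] (Element {r['element']}, "
            f"connected to [{neighbors}]): {r['interaction']}"
        )
    return "\n".join(lines)


# Hypothetical record for one ligand atom.
records = [
    {"atom": 7, "element": "N", "neighbors": [6, 8],
     "interaction": "H-bond donor to a backbone carbonyl"},
]
block = format_constraints(records)
print(block)
```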

No neural loss is trained; instead, explicit prompting guides the LLM to optimize for

  • Structural fidelity (ECFP4-Tanimoto similarity)
  • Synthetically feasible output ($\mathrm{Retro}(G_\text{new})=1$)
  • Optional bio-constraints

Ablation results indicate edit-sequence generation avoids “SMILES hallucinations” common with direct sequence generation, yielding higher output validity and similarity at each threshold.

5. Datasets, Benchmarks, and Comparative Evaluation

SynCraft was evaluated using:

  • The Synthesis Cliff dataset, containing 3,332 $(M_\text{src}, M_\text{tgt})$ pairs from GenBench3D outputs (five generative models), matched to eMolecules neighbors ($\text{Tanimoto} > 0.5$, $\text{Pharm2D} > 0.5$).
  • Test sets: 1,025 unsynthesizable Pocket2Mol (P2M) compounds and 1,725 ResGen outputs, both labeled by SimpRetro feasibility.

Baselines included ChemProjector, SynFormer, and ReaSyn (projection-based). The primary metric is the fraction of cases for which a synthesizable analog $G_\text{new}$ satisfies $\mathrm{Sim}(G_\text{src}, G_\text{new}) > \tau$ and $\mathrm{Retro}(G_\text{new}) = 1$ for $\tau \in \{0.5, 0.6, 0.7, 0.8\}$.
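
The metric can be computed as in the sketch below, which assumes each source molecule contributes one `(similarity, retro_ok)` pair, with `None` as the similarity when no valid output was produced. The input data are invented to show the calculation and are not results from the paper.

```python
def success_rate(results, tau):
    """Percentage of sources whose output has Sim > tau and Retro == 1."""
    if not results:
        return 0.0
    hits = sum(
        1 for sim, retro_ok in results
        if retro_ok and sim is not None and sim > tau
    )
    return 100.0 * hits / len(results)


# Made-up outcomes for four source molecules.
results = [(0.82, True), (0.65, True), (0.55, True), (0.72, False)]
rates = {tau: success_rate(results, tau) for tau in (0.5, 0.6, 0.7, 0.8)}
print(rates)  # {0.5: 75.0, 0.6: 50.0, 0.7: 25.0, 0.8: 25.0}
```

Note that the fourth molecule never counts as a success despite its high similarity, because its output fails the retrosynthesis check.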

| Dataset | Similarity threshold $\tau$ | SynCraft (%) | Best Baseline (%) |
|---|---|---|---|
| Pocket2Mol | $>0.5$ | 42.7 | $\approx 30$ |
| Pocket2Mol | $>0.6$ | 28.4 | 15.4 |
| ResGen | $>0.5$ | 44.7 | 37.5 |
| ResGen | $>0.6$ | 29.1 | 22.0 |

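Across thresholds, SynCraft consistently outperforms the baselines, especially in the medium similarity regime; for example, on Pocket2Mol at $\tau > 0.6$ it reaches 28.4% versus 15.4% for the best baseline.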

6. Medicinal Chemistry Case Studies

6.1 PLK1 Inhibitor Editing (Retrospective)

For input “lig-886” (bearing a 2,5-dimethylpiperazine core that introduces two chiral centers), SynCraft’s chain-of-thought output was: “Remove methyl groups to avoid stereochemical mixtures.” The executed DEL_BOND and MUTATE_ATOM edits yielded a simple piperazine, matching the scaffold of the human-synthesized lead IIP0944.

6.2 RIPK1 Candidate Rescue (Prospective)

Case I: For a biaryl linkage between electron-deficient rings, SynCraft suggested inserting an ether via ADD_ATOM (O) and CHANGE_BOND. This enabled a C–O bond formation pathway in retrosynthesis, bypassing problematic cross-coupling, and preserved H-bonding (Vina docking: $-10.7 \rightarrow -10.2$ kcal/mol).

Case II: In converting a fused diazepino-purine to a 2,4-disubstituted pyrimidine (a scaffold hop), SynCraft’s edits preserved the substituent vectors and the H-bond acceptor at N1. Docking showed maintained binding efficiency ($-10.2 \rightarrow -9.9$ kcal/mol).

7. Limitations and Future Research

SynCraft reframes synthesizability as a targeted, minimal graph-editing problem. In the reported benchmarks, this approach outperforms template-projection methods, which may introduce large topological distortions. Its modular architecture, which separates LLM-driven reasoning from deterministic chemoinformatics execution, guarantees valid chemical outputs while enabling explicit pharmacophore preservation via interaction-aware prompting.

Reported limitations include operational costs (approximately \$3–5 per 100 optimizations), limiting suitability for ultra-large library enumeration. Proposed future directions include:

  • Integration of a learned, continuous feasibility–fidelity loss
  • Automated retrosynthetic feedback loops
  • Extension to multi-objective optimization (e.g., ADMET properties)
  • Adaptation to broader graph-structured domains
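
To put the reported operational cost in perspective, the arithmetic below scales the quoted per-100 range to a larger library. The library size of one million molecules is an illustrative assumption, and the calculation assumes cost grows linearly with the number of optimizations.

```python
def cost_range_usd(n_molecules, low_per_100=3.0, high_per_100=5.0):
    """Scale the reported cost per 100 optimizations linearly to n_molecules."""
    return (n_molecules * low_per_100 / 100, n_molecules * high_per_100 / 100)


low, high = cost_range_usd(1_000_000)
print(f"${low:,.0f} to ${high:,.0f}")  # $30,000 to $50,000
```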

Source code, datasets, and full prompt templates are publicly available at github.com/catalystforyou/SynCraft-Core (Li et al., 23 Dec 2025).
