QFANG: Transformer-Based Chemical Synthesis Reasoning
- QFANG is a domain-specialized transformer model that automates the generation of machine-readable organic synthesis procedures from abstract chemical equations.
- It interleaves chain-of-thought reasoning with an action-sequence output to mimic chemist decision-making and deliver executable lab protocols.
- Trained on nearly one million high-fidelity reaction–procedure pairs using supervised and reinforcement learning, QFANG demonstrates high procedural accuracy and robust generalization.
QFANG (“Qianfang,” 千方) is a domain-specialized, transformer-based scientific reasoning model that automates the generation of machine-readable organic synthesis procedures from abstract chemical reaction equations. Functionally, it aims to bridge the methodological gap between in silico retrosynthetic design and practical laboratory synthesis, producing both a strategy-level “chain-of-thought” (CoT) mimicking chemist reasoning and a detailed, executable action sequence suitable for robotic workflow integration. QFANG’s architecture and training regime are explicitly tailored to learn not only procedural steps but also chemically meaningful justifications, enabling high-fidelity generalization across reaction classes, laboratory constraints, and novel transformation types (Liu et al., 15 Dec 2025).
1. Model Architecture and Inference Workflow
QFANG is implemented atop the Qwen-3 decoder-only transformer suite, with both 8-billion and 32-billion parameter instantiations. The inference process is a single left-to-right autoregressive pass that interleaves two outputs:
- The CoT Reasoning Head, which produces a sequential natural-language rationale for each experimental decision, explicitly reflecting logical choices (e.g., selectivity, reagent selection).
- The Action-Sequence Head, which consumes both the reaction equation and the generated CoT to output an ordered list of structured actions $a = (a_1, \dots, a_T)$, where each $a_t$ is one of 24 predefined laboratory actions with typed argument signatures (mixtures, reagents, times, conditions).
This design ensures that QFANG “thinks like a chemist” before writing machine-readable protocols, tightly coupling high-level reasoning and concrete execution, and producing output directly compatible with robotic platforms (Liu et al., 15 Dec 2025).
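To make the output format concrete, here is a minimal Python sketch of how such a typed action list might be represented; the action names and argument fields are hypothetical stand-ins, since the paper's exact 24-action schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Action:
    """One step in a generated protocol (illustrative, not the paper's API)."""
    name: str                    # one of the 24 predefined laboratory actions
    args: dict[str, Any] = field(default_factory=dict)  # typed arguments

# A decoded output pairs a natural-language rationale with an ordered
# action list; both are produced in a single autoregressive pass.
cot = ("The aryl bromide undergoes Pd-catalyzed coupling; degassed solvent "
       "and an inert atmosphere are therefore required ...")

actions = [
    Action("Add",    {"reagent": "aryl bromide", "amount": "1.0 equiv"}),
    Action("Add",    {"reagent": "Pd(PPh3)4",    "amount": "0.05 equiv"}),
    Action("Stir",   {"temperature": "80 C",     "duration": "12 h"}),
    Action("Filter", {"phase": "solid"}),
]
```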
2. Dataset Curation and Preprocessing
QFANG’s training corpus consists of 905,990 high-fidelity reaction–procedure pairs, derived from an initial subset of 4.4 million entries from the Pistachio chemical patent database. A three-stage annotation pipeline is employed:
- Coreference Resolution: GPT-4o unifies chemical entity references and discards low-fidelity texts.
- Action Code Generation: A 24-action Python API is implemented, and GPT-4o is prompted to translate patent procedures into executable scripts; only syntactically and semantically valid samples are retained.
- Verification and Filtering: GPT-4o is further prompted to compare original paragraphs and generated action lists, producing a reasoning trace and a consistency/confidence label. Only samples with “Yes” labels and confidence ≥ 4 are retained.
This pipeline, coupled with further deduplication (removing invalid SMILES, atom-mapping errors, etc.), distills 4.4M candidates to under 1M gold-standard pairs. The dataset spans all major transformation classes; median action-sequence length is 12 (Liu et al., 15 Dec 2025).
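A minimal sketch of the stage-3 verification filter, assuming each candidate record carries the judge's label and confidence under hypothetical field names:

```python
def keep_sample(candidate: dict) -> bool:
    """Retain samples the verifier marks consistent with confidence >= 4
    and that survive SMILES/atom-mapping validity checks."""
    return (
        candidate.get("consistency_label") == "Yes"
        and candidate.get("confidence", 0) >= 4
        and candidate.get("valid_smiles", False)
        and not candidate.get("atom_mapping_error", True)
    )

raw_candidates = [
    {"consistency_label": "Yes", "confidence": 5,
     "valid_smiles": True, "atom_mapping_error": False},
    {"consistency_label": "No", "confidence": 2,
     "valid_smiles": True, "atom_mapping_error": False},
]
curated = [c for c in raw_candidates if keep_sample(c)]  # keeps only the first
```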
3. Chemistry-Guided Reasoning (CGR) Framework
General LLMs often suffer from domain-specific hallucinations and lack robust chemical logic. QFANG addresses this by incorporating a two-stage Chemistry-Guided Reasoning (CGR) mechanism:
- Programmatic Factual Skeleton Generation: Automated atom-mapping identifies bond changes, functional group transformations, and challenges such as selectivity and solubility. Each reaction is decomposed into a skeleton of chemical facts and operational constraints.
- LLM-Based Narrative Enhancement: A large-scale LLM (Qwen3-235B-Thinking) is prompted with the reaction equation $x$, the ground-truth action sequence $a$, and the skeleton $s$ to generate a fluent CoT justification $c$, yielding a paired dataset $(x, c, a)$ whose reasoning is both chemically factual and stylistically consistent.
This approach instills robust, fact-grounded reasoning into the training targets, making them suitable for direct supervision (Liu et al., 15 Dec 2025).
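The following sketch outlines the two CGR stages; the skeleton fields, function names, and prompt wording are assumptions, and `call_llm` is a stub standing in for an actual LLM client (the paper uses Qwen3-235B-Thinking).

```python
def build_factual_skeleton(reaction_smiles: str) -> dict:
    """Stage 1: programmatic extraction of chemical facts. A real
    implementation would derive these from automated atom mapping."""
    return {
        "bond_changes": ["C-Br broken", "C-C formed"],      # placeholder facts
        "functional_groups": ["aryl halide", "boronic acid"],
        "challenges": ["chemoselectivity", "reagent solubility"],
    }

def call_llm(prompt: str) -> str:
    """Stub standing in for a hosted LLM call."""
    return "Because the substrate is an aryl halide, ..."

def enhance_with_llm(reaction: str, actions: list, skeleton: dict) -> str:
    """Stage 2: prompt a large reasoning LLM to weave the factual skeleton
    and the ground-truth actions into a fluent, chemist-style CoT."""
    prompt = (
        f"Reaction: {reaction}\nFacts: {skeleton}\nActions: {actions}\n"
        "Write a rationale consistent with these facts."
    )
    return call_llm(prompt)
```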
4. Supervised Training and RL Fine-Tuning
Supervised fine-tuning (SFT) is applied by pairing each reaction $x$, its CoT $c$, and its structured action list $a$, optimizing the standard cross-entropy loss over the concatenated token sequence:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t},\, x\right), \quad \text{where } y = (c, a).$$
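A minimal PyTorch sketch of this objective, assuming a loss mask that zeroes out the prompt (reaction-equation) tokens; the tensor layout is an assumption rather than the paper's code:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, targets: torch.Tensor,
             loss_mask: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V) next-token predictions; targets: (B, T) token ids;
    loss_mask: (B, T), 1 on CoT/action tokens y = (c, a), 0 on the prompt x."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (B*T, V)
        targets.reshape(-1),                   # (B*T,)
        reduction="none",
    ).reshape_as(targets)
    # Negative log-likelihood averaged over supervised positions only.
    return (per_token * loss_mask).sum() / loss_mask.sum()
```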
To further maximize procedural accuracy, QFANG undergoes Reinforcement Learning from Verifiable Rewards (RLVR), employing a per-step reward structure entirely defined by observable procedural correctness:
- Per-Step Accuracy Reward, including penalties for formatting errors, action-type mismatches, and missing/optional parameters.
- Exceeding-Step Penalty for superfluous actions beyond the reference sequence.
- Action-Type Distribution Modifier, which adjusts the reward to discourage overproduction of trivial actions.
PPO (Proximal Policy Optimization) is used for the RL phase, and Group Relative Policy Optimization (GRPO) is also tested. The RLVR procedure is fully verifiable and does not rely on proxy scoring (Liu et al., 15 Dec 2025).
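A hedged sketch of such a verifiable reward, with illustrative penalty weights and matching rules (the paper's exact values are not reproduced); actions are represented as simple dicts:

```python
from collections import Counter

def rlvr_reward(pred: list[dict], gold: list[dict]) -> float:
    """pred/gold: ordered action lists, each item {"name": str, "args": dict}."""
    reward = 0.0
    for p, g in zip(pred, gold):                   # per-step comparison
        if p["name"] != g["name"]:
            reward -= 1.0                          # action-type mismatch
        else:
            missing = set(g["args"]) - set(p["args"])
            reward += 1.0 - 0.2 * len(missing)     # per-step accuracy reward
    reward -= 0.5 * max(0, len(pred) - len(gold))  # exceeding-step penalty
    # Distribution modifier: discount overproduced trivial action types.
    counts = Counter(p["name"] for p in pred)
    reward -= 0.1 * sum(c - 1 for name, c in counts.items()
                        if name in {"Wait", "Stir"} and c > 1)
    return reward
```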
5. Evaluation Metrics and Comparative Performance
QFANG’s evaluation is multidimensional. On standard NLP similarity metrics:
- QFANG-8B (RL) achieves BLEU-4 = 61.3 compared to GPT-5 (3-shot) at 54.4, ROUGE-L = 61.1 vs. 55.9, Seq-O = 70.9 vs. 61.1.
For domain correctness, an LLM-as-a-judge (GPT-5 high) is used, with a detailed chemical rubric. Key results:
- QFANG-8B (RL) achieves a mean composite score of 78.2/100, nearing the oracle (90.5) and substantially surpassing the best GPT-5 baseline (67.8).
- QFANG consistently outperforms retrieval-based and in-context learning baselines on both in-distribution and procedurally/chemically dissimilar (out-of-domain) reactions.
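For concreteness, surface metrics such as BLEU-4 can be computed with standard tooling; the snippet below uses `sacrebleu` on illustrative strings and does not reproduce QFANG's tokenization or the Seq-O metric:

```python
import sacrebleu  # pip install sacrebleu

pred = "Add Pd(PPh3)4 (0.05 equiv); stir at 80 C for 12 h; filter."
ref = "Add Pd(PPh3)4 (0.05 equiv); stir the mixture at 80 C for 12 h; filter."

bleu = sacrebleu.sentence_bleu(pred, [ref])  # 4-gram BLEU by default
print(f"BLEU-4: {bleu.score:.1f}")
```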
Table: Representative QFANG Case Studies and Outcomes
| Application Scenario | QFANG Outcome | Baseline/Comparison |
|---|---|---|
| Novel organometallic cycloaromatization | Identifies solubility constraints, proposes THF/DCB | GPT-5 fails to propose solvent |
| Selective imine reduction | Selects NaBH₄, matches literature | GPT-5 suggests harsh hydrogenation |
| User-constrained amide coupling | Cost-efficient or high-purity protocol | N/A |
| Process-scale Suzuki coupling (50 kg) | Forgoes chromatography in favor of filtration | N/A |
| Correction of flawed patent stoichiometry | Corrects 4× excess to 1.2 equiv + DMAP | N/A |
6. Adaptivity, Generalization, and Limitations
QFANG demonstrates:
- Generalization to out-of-domain reaction classes, maintaining procedural integrity where retrieval templates collapse.
- Adaptivity to explicit user constraints (e.g., cost, equipment, purity), dynamically producing alternative protocols upon request.
- The capability to identify and correct errors in noisy patent data.
Limitations include the current reward structure's dependence on the exact ground-truth action ordering and the absence of explicit stoichiometric, kinetic, or thermodynamic checking. Extensions to mass-balance/thermodynamic consistency, multimodal input (e.g., spectra, equipment), and closed-loop feedback with robotic systems are targeted as future directions (Liu et al., 15 Dec 2025).
7. Implications for Automated Chemistry and Outlook
QFANG establishes a paradigm in which an LLM with explicit scientific reasoning and verifiable procedural competence can act as an intermediary between computational route planning and real-world synthesis automation. By learning both strategic and operational priors on a large, systematically filtered, and chemically validated corpus, QFANG sets a benchmark for accuracy and generalizability. A plausible implication is that frameworks akin to QFANG may become the backbone of practical, autonomous laboratory execution, enabling real-time adaptation to chemical novelty and laboratory constraints (Liu et al., 15 Dec 2025).