
ChemBART: Unified Organic Synthesis LLM

Updated 13 January 2026
  • ChemBART is a large language model for organic chemistry that uses reaction-level masked sequence reconstruction on SMILES to support diverse synthesis tasks.
  • It integrates multi-task applications such as precursor/reagent generation, reaction condition regression, and molecular property classification with experimentally validated outcomes.
  • Built on a BART-large Transformer with ~0.4 billion parameters, ChemBART achieves high retrosynthesis accuracy and has been validated in wet-lab syntheses, enabling efficient synthesis planning.

ChemBART is an LLM tailored for organic chemistry applications, based on the BART-large Transformer architecture and pre-trained with reaction-level masked sequence reconstruction on SMILES notation. It enables a unified “one model, one pre-training, multiple tasks” paradigm, supporting precursor/reagent generation, reaction condition regression, molecular property classification, and reinforcement-learning-driven synthesis planning. ChemBART distinguishes itself from previous approaches through its ability to integrate multiple downstream chemical tasks and efficiently facilitate end-to-end computer-aided synthesis planning, with experimental validation highlighting its practical impact (Li et al., 6 Jan 2026).

1. Model Architecture and Tokenization

ChemBART uses the BART-large backbone, comprising a 12-layer encoder and a 12-layer decoder, a hidden dimension $d = 1024$, 16 attention heads per layer, and approximately 0.4 billion parameters in total. Reaction expressions are represented as “chemical sentences” in the following canonicalized SMILES format:

$$\text{reactant} > \text{reagent} > \text{product}$$

Tokenization is performed at the atom and mapping-index level, assigning one token to each atom, mapping index, or SMILES punctuation mark. The vocabulary contains ∼201 tokens, including atom symbols (e.g., C, O, Cl, Br), aromatic indicators, mapping/ring indices (0–83), and punctuation (e.g., [, ], @, +, -, =, #), plus the special tokens <cls>, <end>, <msk>, <pad>. Each token $x_i$ is mapped to a learned embedding $e_i \in \mathbb{R}^{1024}$, which is summed with a positional embedding.
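A minimal sketch of this atom-level tokenization in Python; the regular expression follows the style of widely used reaction-transformer SMILES tokenizers and is an assumption, since ChemBART's exact pattern and vocabulary construction are not given here:

```python
import re

# Atom-level SMILES tokenization pattern (assumed; in the style of common
# reaction-transformer tokenizers, not ChemBART's published regex).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_reaction(reaction_smiles: str) -> list[str]:
    """Split a 'reactant>reagent>product' string into atom-level tokens."""
    return SMILES_TOKEN.findall(reaction_smiles)

# Toy esterification written in the reactant>reagent>product format
print(tokenize_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC"))
# ['C', 'C', '(', '=', 'O', ')', 'O', '.', 'O', 'C', 'C', '>', '[H+]', '>',
#  'C', 'C', '(', '=', 'O', ')', 'O', 'C', 'C']
```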

The self-attention mechanism follows standard Transformer mapping:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

Attention for one head:

$$A = \text{softmax}\left(QK^\top/\sqrt{d_k}\right), \quad X' = AV$$

Outputs from the 16 heads per layer are concatenated and linearly projected back to $d = 1024$.
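For concreteness, a small NumPy sketch of one attention head as defined above, with toy dimensions mirroring the text ($d = 1024$, 16 heads, so $d_k = 64$); the weight matrices here are random stand-ins:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """One head: A = softmax(Q K^T / sqrt(d_k)); X' = A V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

# Toy dimensions: d = 1024, 16 heads => d_k = 1024 / 16 = 64.
rng = np.random.default_rng(0)
d, d_k, seq_len = 1024, 64, 12
X = rng.normal(size=(seq_len, d))                 # token + positional embeddings
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = attention_head(X, W_Q, W_K, W_V)            # shape (12, 64)
# The 16 per-layer head outputs are concatenated and projected back to d = 1024.
```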

2. Pre-training Objective and Optimization

ChemBART is pre-trained with a masked sequence prediction objective at the reaction level. For each reaction SMILES input, either the reactant, reagent, or product section is randomly masked (tokens replaced by <msk>). The decoder autoregressively generates the entire “reactant > reagent > product” string, minimizing standard cross-entropy loss:

$$L(\theta) = - \sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_{<t}, M(x)\right)$$

where $M(x)$ is the masked input and $\theta$ are the Transformer parameters.

Pre-training utilizes the USPTO-full reaction dataset (∼1.5M reactions) and the higher-quality USPTO-MIT subset (∼480K), optimized with AdamW (learning rate $1 \times 10^{-5}$, weight decay $1 \times 10^{-4}$), batch size 256, converging in ∼7 epochs.
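A small sketch of the reaction-level masking step, assuming the masked section (reactant, reagent, or product) is chosen uniformly at random, since the exact masking schedule is not stated here; the decoder target is always the full unmasked reaction string:

```python
import random

MSK = "<msk>"

def mask_reaction(reaction_smiles: str, rng: random.Random) -> tuple[str, str]:
    """Reaction-level masking: replace one of {reactant, reagent, product}
    with <msk>. Returns (encoder input, decoder target); the target is the
    full unmasked reaction, reconstructed autoregressively under cross-entropy."""
    parts = reaction_smiles.split(">")       # [reactant, reagent, product]
    masked = list(parts)
    masked[rng.randrange(3)] = MSK           # section chosen uniformly (assumed schedule)
    return ">".join(masked), reaction_smiles

rng = random.Random(7)
src, tgt = mask_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC", rng)
print(src)   # e.g. 'CC(=O)O.OCC><msk>>CC(=O)OCC'
print(tgt)   # 'CC(=O)O.OCC>[H+]>CC(=O)OCC'
```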

3. Downstream Tasks

A single ChemBART checkpoint is fine-tuned with task-specific tokens and heads for various applications.

3.1 Precursor and Reagent Generation (Single-step Retrosynthesis)

Inputs are structured as the product SMILES followed by > <msk> > to elicit reactant or reagent reconstruction, decoded via beam search (beam size = 10); a minimal sketch of the prompt layout and the top-$k$ metric is given after the results below.

  • Metrics: Top-$k$ accuracy (% correct among the top-$k$ generations), syntax error rate.
  • Results (USPTO-full):
    • Precursor prediction: Top-1 = 59.2%, Top-5 = 78.2%, Top-1 syntax errors = 2.43%
    • Reagent prediction: Top-1 = 54.5%, Top-5 = 74.4%, syntax errors ≈1.5%
    • These results match or exceed previous template-free models (PMSR, Molecular Transformer, Chemformer).
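A toy sketch of the prompt construction and the top-$k$ accuracy metric; the helper names are illustrative, and the exact prompt token ordering is taken from this summary and may differ in the released model:

```python
def retro_prompt(product_smiles: str) -> str:
    """Prompt layout described above: the product SMILES followed by
    '> <msk> >' to elicit the masked precursor/reagent section."""
    return f"{product_smiles}> <msk> >"

def top_k_accuracy(beam_outputs: list[list[str]], targets: list[str], k: int) -> float:
    """Fraction of test reactions whose ground-truth answer appears among the
    top-k beam candidates (all SMILES assumed canonicalized beforehand)."""
    hits = sum(t in cands[:k] for cands, t in zip(beam_outputs, targets))
    return hits / len(targets)

# Toy example with beam size 3 for two target products
beam_outputs = [["CCO.CC(=O)O", "CCO.CC(=O)Cl", "CC(=O)OCC"],
                ["c1ccccc1Br.OB(O)c1ccccc1", "c1ccccc1I", "c1ccccc1Cl"]]
targets = ["CCO.CC(=O)Cl", "c1ccccc1I"]
print(top_k_accuracy(beam_outputs, targets, k=1))   # 0.0
print(top_k_accuracy(beam_outputs, targets, k=2))   # 1.0
```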

3.2 Reaction Temperature and Yield Regression

Inputs are complete reaction SMILES with <end/temp> and <end/yield> task tokens. Decoder outputs at these tokens are processed by linear heads trained with mean-squared error (see the regression-head sketch after the results below).

  • Metrics: RMSE, MAE, $R^2$.
  • ORD dataset (∼700K):
    • Temperature: $R^2 = 0.55$, MAE $= \pm 10\,^\circ$C (range $-150$ to $250\,^\circ$C).
    • Yield: $R^2 = 0.16$, MAE $= \pm 23\%$ (range 0–110%).
  • Suzuki–Miyaura regression (5K):
    • Ten-fold $R^2 = 0.804 \pm 0.017$, MAE $= 7.9 \pm 0.4\%$, comparable to rxnfp.
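A minimal PyTorch sketch of a scalar regression head over the decoder hidden state at a task token, trained with mean-squared error; the module and variable names (and the use of a single linear layer) are assumptions:

```python
import torch
import torch.nn as nn

class ScalarRegressionHead(nn.Module):
    """Linear head applied to the decoder hidden state at a task token
    (<end/temp> or <end/yield>); names and head shape are illustrative."""
    def __init__(self, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, task_token_state: torch.Tensor) -> torch.Tensor:
        return self.proj(task_token_state).squeeze(-1)

head = ScalarRegressionHead()
hidden = torch.randn(8, 1024)   # decoder states at <end/temp> for a batch of 8 reactions
target = torch.randn(8)         # normalized reaction temperatures
loss = nn.functional.mse_loss(head(hidden), target)
loss.backward()                 # gradients flow into the head (and the backbone, when unfrozen)
```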

3.3 Molecular Property Classification

Datasets include BBBP, HIV, BACE, Tox21, ClinTox. A <cls/task> token is prepended or appended, and the corresponding encoder embedding is processed by a sigmoid-activated linear head; optionally, LoRA low-rank adaptation is applied (a sketch follows the results below).

  • Metric: ROC-AUC.
  • Performance (ChemBART-M, Table 2):
    • BBBP: 0.910; HIV: 0.809; BACE: 0.881; Tox21: 0.844; ClinTox: 0.866
    • LoRA improves some tasks, e.g., ClinTox up to 0.920.
    • ChemBART matches or outperforms previous SMILES-based LLMs and most graph-based approaches.
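A compact sketch of this setup: a sigmoid linear head over the <cls/task> embedding plus a generic LoRA-style low-rank adapter. Which layers ChemBART actually adapts, and the rank and scaling used, are not specified here; the values below are placeholders:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA adapter: frozen base weight plus a trainable low-rank
    update, y = W x + (alpha / r) * B A x. Rank and alpha are placeholders."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

d_model = 1024
adapted = LoRALinear(nn.Linear(d_model, d_model))   # e.g. wrapping one projection layer
cls_head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
cls_embedding = torch.randn(4, d_model)             # <cls/task> encoder embeddings
prob = cls_head(adapted(cls_embedding))             # per-molecule class probability
```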

3.4 Reinforcement-Learning Policy and Value Optimization

A set of ∼12K retrosynthetic nodes, generated using ReSynZ's MCTS pipeline, is labeled with:

  • Value $v$: discounted completion probability.
  • Policy $p$: normalized child visit counts. ChemBART’s two regression heads are fine-tuned to predict $v$ (value head: SMILES $\to$ scalar) and $p$ (policy head: reaction SMILES $\to$ score, softmaxed across siblings); see the sketch after this list.
  • Test RMSE (ChemBART-F/M): Value = 0.11, Policy = 0.16; comparable to template-based networks and superior to randomly initialized baselines.
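A minimal sketch of the two heads, assuming each consumes a pooled 1024-dimensional embedding of the corresponding SMILES; the pooling details and head shapes are assumptions:

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps a molecule's pooled embedding to a completion-probability
    estimate v in [0, 1]."""
    def __init__(self, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, mol_embedding: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(mol_embedding)).squeeze(-1)

class PolicyHead(nn.Module):
    """Scores candidate single-step reactions; scores are softmaxed across
    the sibling candidates of one search-tree node to give the prior p."""
    def __init__(self, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, sibling_rxn_embeddings: torch.Tensor) -> torch.Tensor:
        scores = self.proj(sibling_rxn_embeddings).squeeze(-1)   # (num_candidates,)
        return torch.softmax(scores, dim=-1)

v = ValueHead()(torch.randn(1024))       # value estimate for one intermediate
p = PolicyHead()(torch.randn(5, 1024))   # prior over 5 candidate disconnections
```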

4. Multi-Step Synthesis Planning with MCTS

ChemBART is integrated into Monte Carlo Tree Search (MCTS) for end-to-end retrosynthetic route design.

Node Expansion and Scoring

At each tree node:

  1. Beam-search generates $K$ candidate single-step retrosyntheses with probabilities $g_i$.
  2. Each candidate reaction $r_i$ is checked for validity and scored using the policy head ($p_{2,i}$).
  3. Candidates are normalized:

$$p_i^* = \frac{g_i \cdot p_{2,i}}{\sum_j g_j \cdot p_{2,j}}$$

  4. Node value:

$$v(n) = \text{ValueHead}(\text{SMILES}(n))$$

MCTS selection employs a UCT-like rule (parameterization not explicitly provided) and backpropagates estimated values. Final root policies:

$$\pi_{\text{root}}(i) = \frac{N_i^\tau}{\sum_j N_j^\tau}$$

where $N_i$ is the visit count and $\tau$ is the temperature.
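A short NumPy sketch of the two normalizations above, combining beam-search probabilities with policy-head scores at expansion time and converting visit counts into the root policy:

```python
import numpy as np

def expansion_priors(g: np.ndarray, p2: np.ndarray) -> np.ndarray:
    """p*_i = g_i * p_{2,i} / sum_j (g_j * p_{2,j}): beam-search probability
    times policy-head score, normalized over a node's candidate expansions."""
    w = g * p2
    return w / w.sum()

def root_policy(visit_counts: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """pi_root(i) = N_i^tau / sum_j N_j^tau: temperature-scaled visit counts."""
    w = visit_counts.astype(float) ** tau
    return w / w.sum()

g = np.array([0.5, 0.3, 0.2])        # beam-search probabilities g_i
p2 = np.array([0.9, 0.2, 0.7])       # policy-head scores p_{2,i}
print(expansion_priors(g, p2))       # approx. [0.692, 0.092, 0.215]
print(root_policy(np.array([30, 10, 5]), tau=1.0))   # approx. [0.667, 0.222, 0.111]
```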

Planning Performance

  • Retro*-190 (190 targets): ChemBART-F/M achieve 64.9% / 70.1% full synthesis success rates, approaching template-based planners.
  • JMC2025 (53 recent pharmaceutical targets):
    • Beam-search: 88.7% success in $\leq 6$ steps (mean route length 4.3)
    • Top-$k$ sampling ($k = 10$): 84.9% success
    • Top-$p$ sampling ($p = 0.9$): 83.0% success

5. Experimental and Wet-Lab Validation

ChemBART-generated multistep routes have been empirically validated. For the PD-L1/VISTA dual inhibitor P1:

  • Literature route: 6 steps, overall yield 6.5%
  • ChemBART proposal: 4 steps, overall isolated yield 35% (5× improvement, +28.5% absolute), all conditions and intermediates confirmed with standard lab techniques (full NMR/HRMS provided).

The new pathway featured a convergent Suzuki coupling, efficient reductions, and optimized SNAr coupling conditions.

6. Interpretation, Limitations, and Future Prospects

ChemBART’s reaction-level masked pre-training implicitly instills fundamental knowledge of chemical syntax, valency, and reaction mechanisms. A single parameter set supports generative, regression, classification, and reinforcement-learning-based policy/value computations, reducing computational and maintenance overhead.

ChemBART demonstrates interpretable chemistry by recovering periodic and electronegativity trends in token-embedding spaces and highlighting reactive motifs in attention maps (e.g., C→Br in Grignard reactions).
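A sketch of the kind of embedding probe this refers to: comparing learned atom-token embeddings by cosine similarity. The vectors below are random stand-ins; with ChemBART's trained embeddings, chemically related atoms (e.g., the halogens) would be expected to score higher, which is the basis of the reported periodic and electronegativity trends:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for learned 1024-dimensional atom-token embeddings.
rng = np.random.default_rng(1)
embedding = {tok: rng.normal(size=1024) for tok in ["C", "N", "O", "F", "Cl", "Br", "I"]}
for a, b in [("Cl", "Br"), ("Cl", "C")]:
    print(a, b, round(cosine(embedding[a], embedding[b]), 3))
```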

Limitations: Pure sequence-based SMILES input constrains the physicochemical scope (e.g., QM9-type quantum descriptors). The potential for “hallucination” (invalid or spurious novel outputs) increases under sampling-based decoding.

Future directions include:

  • Integration of 3D structure through graph or coordinate-based models (e.g., Uni-Mol, KFLM2)
  • Hybrid sequence-to-graph/contact-map attention modules
  • Further RL-based improvement via policy-gradient or offline MCTS retraining
  • Conditional precursor generation modulated by reaction class or functional groups

In conclusion, reaction-centric pre-training on SMILES data enables ChemBART to function as a versatile foundation model for organic synthesis, providing unified, experimentally validated solutions for retrosynthesis, property prediction, and AI-driven planning (Li et al., 6 Jan 2026).
