ChemBART: Unified Organic Synthesis LLM
- ChemBART is a large language model for organic chemistry that uses reaction-level masked sequence reconstruction on SMILES to support diverse synthesis tasks.
- It integrates multi-task applications such as precursor/reagent generation, reaction condition regression, and molecular property classification with experimentally validated outcomes.
- Built on a BART-large Transformer with ~0.4 billion parameters, ChemBART achieves high accuracy in retrosynthesis and wet-lab validations for efficient synthesis planning.
ChemBART is an LLM tailored for organic chemistry, built on the BART-large Transformer architecture and pre-trained with reaction-level masked sequence reconstruction on SMILES notation. It enables a unified “one model, one pre-training, multiple tasks” paradigm, supporting precursor/reagent generation, reaction condition regression, molecular property classification, and reinforcement-learning-driven synthesis planning. ChemBART distinguishes itself from previous approaches by integrating multiple downstream chemical tasks in a single checkpoint and efficiently facilitating end-to-end computer-aided synthesis planning, with experimental validation highlighting its practical impact (Li et al., 6 Jan 2026).
1. Model Architecture and Tokenization
ChemBART uses the BART-large backbone, comprising a 12-layer encoder and a 12-layer decoder, a hidden dimension of 1024, 16 attention heads per layer, and approximately 0.4 billion parameters in total. Reaction expressions are represented as “chemical sentences”: canonicalized, atom-mapped SMILES concatenated into a single “reactant > reagent > product” string.
Tokenization is performed at the atom and mapping-index level, matching one token to each atom, mapping index, or SMILES punctuation symbol. The vocabulary contains ∼201 tokens, including atom symbols (e.g., C, O, Cl, Br), aromatic indicators, mapping/ring indices (0–83), and punctuation (e.g., [, ], @, +, -, =, #), together with the special tokens <cls>, <end>, <msk>, <pad>. Each token is mapped to a learned embedding vector, which is summed with a positional embedding.
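As a rough illustration of this scheme, the sketch below tokenizes a reaction string with a generic SMILES regex; the pattern, function name, and example reaction are assumptions rather than ChemBART's actual tokenizer.

```python
import re

# Generic atom-level SMILES tokenization pattern (an assumption; ChemBART's exact
# regex and ~201-token vocabulary are not reproduced in the text).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|"
    r"[BCNOSPFIbcnosp]|[0-9]|[=#\-\+\(\)\.\/\\@>~\*\$:])"
)

SPECIAL_TOKENS = ["<cls>", "<end>", "<msk>", "<pad>"]

def tokenize_reaction(reaction_smiles):
    """Split a 'reactant > reagent > product' string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(reaction_smiles.replace(" ", ""))
    return ["<cls>"] + tokens + ["<end>"]

print(tokenize_reaction("CCBr.[Mg]>C1CCOC1>CC[Mg]Br"))
```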
The self-attention mechanism follows the standard Transformer formulation. For one head,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, $V$ are the query, key, and value projections of the token representations and $d_k$ is the per-head dimension. Outputs from the 16 heads per layer are concatenated and linearly projected back to the 1024-dimensional hidden space.
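For reference, a single attention head can be written directly from this formula (plain NumPy, not ChemBART-specific code):

```python
import numpy as np

def attention_head(Q, K, V):
    """One head: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (seq, d_k) context vectors

# 16 such heads per layer are concatenated and projected back to 1024 dimensions.
```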
2. Pre-training Objective and Optimization
ChemBART is pre-trained with a masked sequence prediction objective at the reaction level. For each reaction SMILES input, either the reactant, reagent, or product section is randomly masked (its tokens replaced by <msk>). The decoder autoregressively generates the entire “reactant > reagent > product” string, minimizing the standard cross-entropy loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}, \tilde{x}\right),$$

where $\tilde{x}$ is the masked input, $x_{1:T}$ is the target reaction string, and $\theta$ are the Transformer parameters.
Pre-training utilizes the USPTO-full reaction dataset (∼1.5M reactions) and the higher-quality USPTO-MIT subset (∼480K reactions), optimized with AdamW at batch size 256 and converging in ∼7 epochs.
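The section-level masking can be sketched as follows; the helper and the split-on-'>' logic are assumptions consistent with the description above, not the released pre-processing code.

```python
import random

MSK = "<msk>"

def mask_reaction_section(reaction_smiles, rng=random):
    """Mask the reactant, reagent, or product section of a reaction SMILES at random.

    Returns (masked_input, full_target); the decoder is trained to reconstruct the
    full 'reactant > reagent > product' string from the masked input.
    """
    sections = reaction_smiles.split(">")
    assert len(sections) == 3, "expected 'reactant > reagent > product'"
    masked = list(sections)
    masked[rng.randrange(3)] = MSK          # mask exactly one of the three sections
    return ">".join(masked), reaction_smiles

masked, target = mask_reaction_section("CCBr.[Mg]>C1CCOC1>CC[Mg]Br")
print(masked)    # e.g. 'CCBr.[Mg]><msk>>CC[Mg]Br'
print(target)
```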
3. Downstream Tasks
A single ChemBART checkpoint is fine-tuned with task-specific tokens and heads for various applications.
3.1 Precursor and Reagent Generation (Single-step Retrosynthesis)
Inputs are structured as product SMILES followed by > <msk> > to elicit reactant or reagent reconstruction, decoded via beam search (beam size=10).
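A hypothetical decoding call for this setup, assuming the checkpoint were exposed through a Hugging Face-style seq2seq interface (the model identifier and tokenizer class are placeholders, not a released artifact):

```python
from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("chembart")   # placeholder path
model = BartForConditionalGeneration.from_pretrained("chembart")  # placeholder path

product = "CC(=O)Nc1ccc(O)cc1"              # target product SMILES (example)
masked_input = f"{product}><msk>>"          # product followed by '> <msk> >'

inputs = tokenizer(masked_input, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=10,               # beam size 10, as used for the Top-k evaluation
    num_return_sequences=10,
    max_length=512,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(candidates[:3])           # ranked precursor/reagent proposals
```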
- Metrics: Top-k accuracy (% of cases where the correct answer appears among the top k generations), syntax error rate.
- Results (USPTO-full):
- Precursor prediction: Top-1 = 59.2%, Top-5 = 78.2%, Top-1 syntax errors = 2.43%
- Reagent prediction: Top-1 = 54.5%, Top-5 = 74.4%, syntax errors ≈1.5%
- These results match or exceed previous template-free models (PMSR, Molecular Transformer, Chemformer).
3.2 Reaction Temperature and Yield Regression
Inputs are complete reaction SMILES with <end/temp> and <end/yield> task tokens. Decoder outputs at these tokens are processed by linear heads, trained using mean-squared error.
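A minimal sketch of such a task-token regression head, assuming the BART-large hidden size of 1024 (the exact head design and pooling position are not detailed in the text):

```python
import torch
import torch.nn as nn

class TaskTokenRegressionHead(nn.Module):
    """Linear head reading the decoder state at a task token such as <end/temp>."""

    def __init__(self, hidden_size=1024):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, decoder_states, task_token_positions):
        # decoder_states: (batch, seq_len, hidden); pick the state at the task token
        batch_idx = torch.arange(decoder_states.size(0))
        task_states = decoder_states[batch_idx, task_token_positions]  # (batch, hidden)
        return self.linear(task_states).squeeze(-1)                    # (batch,) prediction

# Trained with mean-squared error against temperature or yield labels, e.g.
# loss = nn.functional.mse_loss(head(decoder_states, positions), targets)
```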
- Metrics: RMSE, MAE, $R^2$.
- ORD dataset (∼700K):
- Temperature: $R^2$ and MAE (in °C) over the dataset's temperature range.
- Yield: $R^2$ and MAE over the 0–110% yield range.
- Suzuki–Miyaura regression (5K):
- Ten-fold cross-validated $R^2$ and MAE comparable to rxnfp.
3.3 Molecular Property Classification
Datasets include BBBP, HIV, BACE, Tox21, ClinTox. A <cls/task> token is prepended or appended, and the corresponding encoder embedding is processed by a sigmoid-activated linear head; optionally, LoRA low-rank adaptation is applied.
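A minimal sketch of this classification setup, assuming a Hugging Face-style encoder and the peft library for LoRA; class names, LoRA rank, and target modules are illustrative assumptions:

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model   # optional LoRA adaptation

class ClsTokenClassifier(nn.Module):
    """Sigmoid-activated linear head on the encoder embedding of the <cls/task> token."""

    def __init__(self, encoder, hidden_size=1024):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_state = hidden[:, 0]                                 # assumes <cls/task> is first
        return torch.sigmoid(self.head(cls_state)).squeeze(-1)   # per-molecule probability

# Optional LoRA on the attention projections (rank/alpha/targets are assumptions):
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
# encoder = get_peft_model(encoder, lora_config)
```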
- Metric: ROC-AUC.
- Performance (ChemBART-M, Table 2):
- BBBP: 0.910; HIV: 0.809; BACE: 0.881; Tox21: 0.844; ClinTox: 0.866
- LoRA improves some tasks, e.g., ClinTox up to 0.920.
- ChemBART matches or outperforms previous SMILES-based LLMs and most graph-based approaches.
3.4 Reinforcement-Learning Policy and Value Optimization
A set of ∼12K retrosynthetic nodes, generated using ReSynZ's MCTS pipeline, is labeled with:
- Value $v$: discounted completion probability.
- Policy $\pi$: normalized child visit counts.
ChemBART's two regression heads are fine-tuned to predict $v$ (value head: a scalar from a molecule SMILES) and $\pi$ (policy head: a score from a reaction SMILES, softmaxed across sibling nodes).
- Test RMSE (ChemBART-F/M): Value = 0.11, Policy = 0.16; comparable to template-based networks and superior to randomly initialized baselines.
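The policy labels follow directly from the definition of normalized visit counts; the value label depends on MCTS details not given here, so it is only indicated in comments.

```python
import numpy as np

def policy_targets(child_visit_counts):
    """pi_i = N_i / sum_j N_j across the children of one retrosynthetic node."""
    counts = np.asarray(child_visit_counts, dtype=float)
    return counts / counts.sum()

print(policy_targets([30, 10, 5, 5]))   # -> [0.6 0.2 0.1 0.1]

# Both heads are then fine-tuned as regressors against the node labels, e.g.:
#   value_loss  = mse(value_head(molecule_smiles),  v)    # discounted completion probability
#   policy_loss = mse(policy_head(reaction_smiles), pi)   # normalized child visit count
```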
4. Multi-Step Synthesis Planning with MCTS
ChemBART is integrated into Monte Carlo Tree Search (MCTS) for end-to-end retrosynthetic route design.
Node Expansion and Scoring
At each tree node:
- Beam search generates candidate single-step retrosyntheses with sequence probabilities $p_i$.
- Each candidate reaction is checked for validity and scored using the policy head, giving $\pi_i$.
- Candidate priors are normalized so that they sum to one across the node's children.
- The node value is estimated by the value head, $v(s)$.
MCTS selection employs a UCT-like rule (parameterization not explicitly provided) and backpropagates estimated values. Final root policies are proportional to temperature-scaled visit counts,

$$\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_{b} N(s_0, b)^{1/\tau}},$$

where $N(s_0, a)$ is the visit count of action $a$ at the root and $\tau$ is a temperature.
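The temperature-scaled root policy can be computed directly from visit counts, as in this sketch (the UCT selection constants remain unspecified):

```python
import numpy as np

def root_policy(visit_counts, tau=1.0):
    """pi(a | s0) proportional to N(s0, a)^(1/tau)."""
    counts = np.asarray(visit_counts, dtype=float)
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

print(root_policy([40, 25, 10, 5], tau=1.0))   # proportional to raw visit counts
print(root_policy([40, 25, 10, 5], tau=0.5))   # sharper; favors the most-visited action
```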
Planning Performance
- Retro*-190 (190 targets): ChemBART-F/M achieve 64.9% / 70.1% full synthesis success rates, approaching template-based planners.
- JMC2025 (53 recent pharmaceutical targets):
- Beam search: 88.7% success within the step limit (mean route length 4.3)
- Top-k sampling: 84.9% success
- Top-p sampling: 83.0% success
5. Experimental and Wet-Lab Validation
ChemBART-generated multistep routes have been empirically validated. For the PD-L1/VISTA dual inhibitor P1:
- Literature route: 6 steps, overall yield 6.5%
- ChemBART proposal: 4 steps, overall isolated yield 35% (5× improvement, +28.5% absolute), all conditions and intermediates confirmed with standard lab techniques (full NMR/HRMS provided).
The new pathway featured a convergent Suzuki coupling, efficient reductions, and optimized SNAr coupling conditions.
6. Interpretation, Limitations, and Future Prospects
ChemBART’s reaction-level masked pre-training automatically instills fundamental chemical syntax, valency, and mechanistic knowledge. A single parameter set supports generative, regression, classification, and reinforcement-learning-based policy/value computations, reducing computational and maintenance overhead.
ChemBART demonstrates interpretable chemistry by recovering periodic and electronegativity trends in token-embedding spaces and highlighting reactive motifs in attention maps (e.g., C→Br in Grignard reactions).
Limitations: Pure sequence-based SMILES input constrains the physical-chemical scope (e.g., QM9-type quantum descriptors). Potential for “hallucination” (invalid/novel outputs) increases under sampling-based decoding.
Future directions include:
- Integration of 3D structure through graph or coordinate-based models (e.g., Uni-Mol, KFLM2)
- Hybrid sequence-to-graph/contact-map attention modules
- Further RL-based improvement via policy-gradient or offline MCTS retraining
- Conditional precursor generation modulated by reaction class or functional groups
In conclusion, reaction-centric pre-training on SMILES data enables ChemBART to function as a versatile foundation model for organic synthesis, providing unified, experimentally validated solutions for retrosynthesis, property prediction, and AI-driven planning (Li et al., 6 Jan 2026).