ChemBART: Unified Organic Synthesis LLM
- ChemBART is a large language model for organic chemistry that uses reaction-level masked sequence reconstruction on SMILES to support diverse synthesis tasks.
- It integrates multi-task applications such as precursor/reagent generation, reaction condition regression, and molecular property classification with experimentally validated outcomes.
- Built on a BART-large Transformer with ~0.4 billion parameters, ChemBART achieves high accuracy in retrosynthesis and wet-lab validations for efficient synthesis planning.
ChemBART is an LLM tailored for organic chemistry, built on the BART-large Transformer architecture and pre-trained with reaction-level masked sequence reconstruction on SMILES notation. It enables a unified “one model, one pre-training, multiple tasks” paradigm, supporting precursor/reagent generation, reaction condition regression, molecular property classification, and reinforcement-learning-driven synthesis planning. ChemBART distinguishes itself from previous approaches by integrating multiple downstream chemical tasks in a single checkpoint and efficiently facilitating end-to-end computer-aided synthesis planning, with experimental validation highlighting its practical impact (Li et al., 6 Jan 2026).
1. Model Architecture and Tokenization
ChemBART uses the BART-large backbone, comprising a 12-layer encoder and a 12-layer decoder, a hidden dimension of 1024, 16 attention heads per layer, and approximately 0.4 billion parameters in total. Reaction expressions are represented as “chemical sentences”: canonicalized, atom-mapped SMILES concatenated into a single “reactant > reagent > product” string.
Tokenization is performed at the atom and mapping-index level, matching one token to each atom, mapping index, or SMILES punctuation symbol. The vocabulary contains ∼201 tokens, including atom symbols (e.g., C, O, Cl, Br), aromatic indicators, mapping/ring indices (0–83), and punctuation (e.g., [, ], @, +, -, =, #), together with the special tokens <cls>, <end>, <msk>, <pad>. Each token is mapped to a learned embedding vector, which is summed with a positional embedding.
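As a rough illustration of this scheme, the sketch below tokenizes a reaction string with a generic SMILES regex; the pattern, function name, and example reaction are assumptions rather than ChemBART's actual tokenizer.

```python
import re

# Generic atom-level SMILES tokenization pattern (an assumption; ChemBART's exact
# regex and ~201-token vocabulary are not reproduced in the text).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|"
    r"[BCNOSPFIbcnosp]|[0-9]|[=#\-\+\(\)\.\/\\@>~\*\$:])"
)

SPECIAL_TOKENS = ["<cls>", "<end>", "<msk>", "<pad>"]

def tokenize_reaction(reaction_smiles):
    """Split a 'reactant > reagent > product' string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(reaction_smiles.replace(" ", ""))
    return ["<cls>"] + tokens + ["<end>"]

print(tokenize_reaction("CCBr.[Mg]>C1CCOC1>CC[Mg]Br"))
```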
The self-attention mechanism follows the standard Transformer formulation. For one head,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, $V$ are the query, key, and value projections of the token representations and $d_k$ is the per-head dimension. Outputs from the 16 heads per layer are concatenated and linearly projected back to the 1024-dimensional hidden space.
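For reference, a single attention head can be written directly from this formula (plain NumPy, not ChemBART-specific code):

```python
import numpy as np

def attention_head(Q, K, V):
    """One head: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (seq, d_k) context vectors

# 16 such heads per layer are concatenated and projected back to 1024 dimensions.
```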
2. Pre-training Objective and Optimization
ChemBART is pre-trained with a masked sequence prediction objective at the reaction level. For each reaction SMILES input, either the reactant, reagent, or product section is randomly masked (its tokens replaced by <msk>). The decoder autoregressively generates the entire “reactant > reagent > product” string, minimizing the standard cross-entropy loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}, \tilde{x}\right),$$

where $\tilde{x}$ is the masked input, $x_{1:T}$ is the target reaction string, and $\theta$ are the Transformer parameters.
Pre-training utilizes the USPTO-full reaction dataset (∼1.5M reactions) and the higher-quality USPTO-MIT subset (∼480K reactions), optimized with AdamW at batch size 256 and converging in ∼7 epochs.
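The section-level masking can be sketched as follows; the helper and the split-on-'>' logic are assumptions consistent with the description above, not the released pre-processing code.

```python
import random

MSK = "<msk>"

def mask_reaction_section(reaction_smiles, rng=random):
    """Mask the reactant, reagent, or product section of a reaction SMILES at random.

    Returns (masked_input, full_target); the decoder is trained to reconstruct the
    full 'reactant > reagent > product' string from the masked input.
    """
    sections = reaction_smiles.split(">")
    assert len(sections) == 3, "expected 'reactant > reagent > product'"
    masked = list(sections)
    masked[rng.randrange(3)] = MSK          # mask exactly one of the three sections
    return ">".join(masked), reaction_smiles

masked, target = mask_reaction_section("CCBr.[Mg]>C1CCOC1>CC[Mg]Br")
print(masked)    # e.g. 'CCBr.[Mg]><msk>>CC[Mg]Br'
print(target)
```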
3. Downstream Tasks
A single ChemBART checkpoint is fine-tuned with task-specific tokens and heads for various applications.
3.1 Precursor and Reagent Generation (Single-step Retrosynthesis)
Inputs are structured as product SMILES followed by > <msk> > to elicit reactant or reagent reconstruction, decoded via beam search (beam size=10).
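A hypothetical decoding call for this setup, assuming the checkpoint were exposed through a Hugging Face-style seq2seq interface (the model identifier and tokenizer class are placeholders, not a released artifact):

```python
from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("chembart")   # placeholder path
model = BartForConditionalGeneration.from_pretrained("chembart")  # placeholder path

product = "CC(=O)Nc1ccc(O)cc1"              # target product SMILES (example)
masked_input = f"{product}><msk>>"          # product followed by '> <msk> >'

inputs = tokenizer(masked_input, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=10,               # beam size 10, as used for the Top-k evaluation
    num_return_sequences=10,
    max_length=512,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(candidates[:3])           # ranked precursor/reagent proposals
```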
- Metrics: Top-k accuracy (% of cases where the correct answer appears among the top k generations), syntax error rate.
- Results (USPTO-full):
- Precursor prediction: Top-1 = 59.2%, Top-5 = 78.2%, Top-1 syntax errors = 2.43%
- Reagent prediction: Top-1 = 54.5%, Top-5 = 74.4%, syntax errors ≈1.5%
- These results match or exceed previous template-free models (PMSR, Molecular Transformer, Chemformer).
3.2 Reaction Temperature and Yield Regression
Inputs are complete reaction SMILES with <end/temp> and <end/yield> task tokens. Decoder outputs at these tokens are processed by linear heads, trained using mean-squared error.
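A minimal sketch of such a task-token regression head, assuming the BART-large hidden size of 1024 (the exact head design and pooling position are not detailed in the text):

```python
import torch
import torch.nn as nn

class TaskTokenRegressionHead(nn.Module):
    """Linear head reading the decoder state at a task token such as <end/temp>."""

    def __init__(self, hidden_size=1024):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, decoder_states, task_token_positions):
        # decoder_states: (batch, seq_len, hidden); pick the state at the task token
        batch_idx = torch.arange(decoder_states.size(0))
        task_states = decoder_states[batch_idx, task_token_positions]  # (batch, hidden)
        return self.linear(task_states).squeeze(-1)                    # (batch,) prediction

# Trained with mean-squared error against temperature or yield labels, e.g.
# loss = nn.functional.mse_loss(head(decoder_states, positions), targets)
```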
- Metrics: RMSE, MAE, $R^2$.
- ORD dataset (∼700K):
- Temperature: $R^2$ and MAE (in °C) over the dataset's temperature range.
- Yield: $R^2$ and MAE over the 0–110% yield range.
- Suzuki–Miyaura regression (5K):
- Ten-fold cross-validated $R^2$ and MAE comparable to rxnfp.
3.3 Molecular Property Classification
Datasets include BBBP, HIV, BACE, Tox21, ClinTox. A <cls/task> token is prepended or appended, and the corresponding encoder embedding is processed by a sigmoid-activated linear head; optionally, LoRA low-rank adaptation is applied.
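A minimal sketch of this classification setup, assuming a Hugging Face-style encoder and the peft library for LoRA; class names, LoRA rank, and target modules are illustrative assumptions:

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model   # optional LoRA adaptation

class ClsTokenClassifier(nn.Module):
    """Sigmoid-activated linear head on the encoder embedding of the <cls/task> token."""

    def __init__(self, encoder, hidden_size=1024):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_state = hidden[:, 0]                                 # assumes <cls/task> is first
        return torch.sigmoid(self.head(cls_state)).squeeze(-1)   # per-molecule probability

# Optional LoRA on the attention projections (rank/alpha/targets are assumptions):
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
# encoder = get_peft_model(encoder, lora_config)
```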
- Metric: ROC-AUC.
- Performance (ChemBART-M, Table 2):
- BBBP: 0.910; HIV: 0.809; BACE: 0.881; Tox21: 0.844; ClinTox: 0.866
- LoRA improves some tasks, e.g., ClinTox up to 0.920.
- ChemBART matches or outperforms previous SMILES-based LLMs and most graph-based approaches.
3.4 Reinforcement-Learning Policy and Value Optimization
A set of ∼12K retrosynthetic nodes, generated using ReSynZ's MCTS pipeline, is labeled with:
- Value $v$: discounted completion probability.
- Policy $\pi$: normalized child visit counts.
ChemBART's two regression heads are fine-tuned to predict $v$ (value head: a scalar from a molecule SMILES) and $\pi$ (policy head: a score from a reaction SMILES, softmaxed across sibling nodes).
- Test RMSE (ChemBART-F/M): Value = 0.11, Policy = 0.16; comparable to template-based networks and superior to randomly initialized baselines.
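The policy labels follow directly from the definition of normalized visit counts; the value label depends on MCTS details not given here, so it is only indicated in comments.

```python
import numpy as np

def policy_targets(child_visit_counts):
    """pi_i = N_i / sum_j N_j across the children of one retrosynthetic node."""
    counts = np.asarray(child_visit_counts, dtype=float)
    return counts / counts.sum()

print(policy_targets([30, 10, 5, 5]))   # -> [0.6 0.2 0.1 0.1]

# Both heads are then fine-tuned as regressors against the node labels, e.g.:
#   value_loss  = mse(value_head(molecule_smiles),  v)    # discounted completion probability
#   policy_loss = mse(policy_head(reaction_smiles), pi)   # normalized child visit count
```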
4. Multi-Step Synthesis Planning with MCTS
ChemBART is integrated into Monte Carlo Tree Search (MCTS) for end-to-end retrosynthetic route design.
Node Expansion and Scoring
At each tree node:
- Beam search generates candidate single-step retrosyntheses with sequence probabilities $p_i$.
- Each candidate reaction is checked for validity and scored using the policy head, giving $\pi_i$.
- Candidate priors are normalized so that they sum to one across the node's children.
- The node value is estimated by the value head, $v(s)$.
MCTS selection employs a UCT-like rule (parameterization not explicitly provided) and backpropagates estimated values. Final root policies are proportional to temperature-scaled visit counts,

$$\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_{b} N(s_0, b)^{1/\tau}},$$

where $N(s_0, a)$ is the visit count of action $a$ at the root and $\tau$ is a temperature.
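The temperature-scaled root policy can be computed directly from visit counts, as in this sketch (the UCT selection constants remain unspecified):

```python
import numpy as np

def root_policy(visit_counts, tau=1.0):
    """pi(a | s0) proportional to N(s0, a)^(1/tau)."""
    counts = np.asarray(visit_counts, dtype=float)
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

print(root_policy([40, 25, 10, 5], tau=1.0))   # proportional to raw visit counts
print(root_policy([40, 25, 10, 5], tau=0.5))   # sharper; favors the most-visited action
```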
Planning Performance
- Retro*-190 (190 targets): ChemBART-F/M achieve 64.9% / 70.1% full synthesis success rates, approaching template-based planners.
- JMC2025 (53 recent pharmaceutical targets):
- Beam search: 88.7% success within the step limit (mean route length 4.3)
- Top-k sampling: 84.9% success
- Top-p sampling: 83.0% success
5. Experimental and Wet-Lab Validation
ChemBART-generated multistep routes have been empirically validated. For the PD-L1/VISTA dual inhibitor P1:
- Literature route: 6 steps, overall yield 6.5%
- ChemBART proposal: 4 steps, overall isolated yield 35% (5× improvement, +28.5% absolute), all conditions and intermediates confirmed with standard lab techniques (full NMR/HRMS provided).
The new pathway featured a convergent Suzuki coupling, efficient reductions, and optimized SNAr coupling conditions.
6. Interpretation, Limitations, and Future Prospects
ChemBART’s reaction-level masked pre-training automatically instills fundamental chemical syntax, valency, and mechanistic knowledge. A single parameter set supports generative, regression, classification, and reinforcement-learning-based policy/value computations, reducing computational and maintenance overhead.
ChemBART demonstrates interpretable chemistry by recovering periodic and electronegativity trends in token-embedding spaces and highlighting reactive motifs in attention maps (e.g., C→Br in Grignard reactions).
Limitations: Pure sequence-based SMILES input constrains the physical-chemical scope (e.g., QM9-type quantum descriptors). Potential for “hallucination” (invalid/novel outputs) increases under sampling-based decoding.
Future directions include:
- Integration of 3D structure through graph or coordinate-based models (e.g., Uni-Mol, KFLM2)
- Hybrid sequence-to-graph/contact-map attention modules
- Further RL-based improvement via policy-gradient or offline MCTS retraining
- Conditional precursor generation modulated by reaction class or functional groups
In conclusion, reaction-centric pre-training on SMILES data enables ChemBART to function as a versatile foundation model for organic synthesis, providing unified, experimentally validated solutions for retrosynthesis, property prediction, and AI-driven planning (Li et al., 6 Jan 2026).