FragmentGPT: Unified Model for FBDD

Updated 22 September 2025

FragmentGPT is a unified Transformer-based generative model for molecular design that addresses fragment-based drug discovery by enabling fragment growing, linking, and merging.
It employs a chemically aware, energy-based bond cleavage protocol during pre-training to generate realistic fragment boundaries and ensure chemical validity.
The model integrates a multi-stage Reward Ranked Alignment algorithm with expert exploration to optimize multi-objective pharmaceutical goals and resolve structural redundancies.

FragmentGPT is a unified Transformer-based generative model for molecular design, specifically targeting the challenges of fragment-based drug discovery (FBDD) such as efficient fragment linking, fragment growing, and the resolution of structural redundancies like duplicate rings in merged fragments. The model integrates a chemically aware, energy-based bond cleavage strategy during pre-training and introduces a multi-stage Reward Ranked Alignment with Expert Exploration (RAE) algorithm, allowing it to assemble optimized molecules from diverse molecular subunits conditioned on multi-objective pharmaceutical goals.

1. Model Architecture and Input Formatting

FragmentGPT is built on a GPT-2–style Transformer backbone (124M parameters), trained to autoregressively decode SMILES representations of molecules. The model ingests inputs structured for specific tasks:

Fragment Growing: Expands a seed fragment to elaborate molecular scaffolds.
Fragment Linking: Receives a pair of disconnected fragments (A, C) and produces a chemically valid linker fragment to connect them.
Fragment Merging: Takes pairs of overlapping fragments (e.g., A+B and B+C), requiring the model to resolve structural overlaps intelligently.

Fragment prompts are encoded with special tokens (e.g., <p1>, <p2>, <L>) that demarcate the substructures in the input. The model is trained so that its language-modeling objective predicts the completion of a chemically valid molecule from the partial fragments provided.
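As an illustration of the prompt layout, the minimal Python sketch below assembles input strings for the three tasks. The token names <p1>, <p2>, and <L> come from the text above; the helper name build_prompt, the space separators, and the placement of <L> at the end of the prompt (as the point where generation would start) are assumptions made for this example, not a confirmed specification of the model's input format.

```python
# Special tokens named in the text; how they are ordered is an assumption.
P1, P2, LINK = "<p1>", "<p2>", "<L>"


def build_prompt(task, fragments):
    """Assemble a fragment prompt for 'grow', 'link', or 'merge' (hypothetical helper)."""
    if task == "grow":
        if len(fragments) != 1:
            raise ValueError("growing expects exactly one seed fragment")
        return f"{P1}{fragments[0]} {LINK}"
    if task in ("link", "merge"):
        if len(fragments) != 2:
            raise ValueError(f"{task} expects exactly two fragments")
        first, second = fragments
        # For merging, each entry is an overlapping fragment (A+B and B+C),
        # passed here as one SMILES string per fragment.
        return f"{P1}{first} {P2}{second} {LINK}"
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    print(build_prompt("grow", ["c1ccccc1"]))
    print(build_prompt("link", ["c1ccccc1", "CCO"]))
```

Linking and merging share one layout in this sketch; only the content of the two fragment slots differs, which matches the unified-prompt idea described above.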

2. Chemically Aware, Energy-Based Bond Cleavage Pre-training

To ensure the model masters fragment assembly operations, the training corpus construction uses a computational chemistry-driven bond cleavage protocol:

All molecules in the pre-training set (e.g., from ZINC) are converted to molecular graphs. Each bond is annotated with its dissociation energy.
Only bonds with dissociation energy $E_\text{bond} \leq 90$ kcal/mol (approx. 377 kJ/mol) are cleaved, ensuring fragment boundaries fall at chemically realistic break points. Aromatic bonds are protected from cleavage.
Each molecule is fragmented into up to three components, ensuring representations for growing, linking, and merging within a unified SMILES context.
The training loss follows a standard negative log-likelihood across the tokenized output, conditioned on the input fragment prompt:

$\operatorname{NLL} = -\sum_{t=1}^{T} \log P(y_t \mid y_{1}, \ldots, y_{t-1}, F)$

where $F$ denotes the fragment prompt.
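To make the loss concrete, here is a small sketch in plain Python that evaluates the negative log-likelihood of one target sequence from per-step logits. It assumes the logits at each step were already computed by a model conditioned on the fragment prompt $F$ and the previous tokens; the function names and the toy vocabulary size are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one step's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_nll(step_logits, target_ids):
    """NLL = -sum_t log P(y_t | y_1..y_{t-1}, F) for one sequence.

    step_logits[t] holds the vocabulary logits at step t, assumed to be
    already conditioned on the prompt F and the earlier target tokens.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    return -sum(
        log_softmax(logits)[y] for logits, y in zip(step_logits, target_ids)
    )


if __name__ == "__main__":
    # Toy case: uniform logits over a 4-token vocabulary for 3 steps,
    # so each step contributes log(4) to the loss.
    uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(sequence_nll(uniform, [0, 2, 1]))
```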

This decomposition enriches the training set with diverse, chemically meaningful assembly tasks crucial for multi-fragment molecular design.
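The cleavage rule can be sketched on a toy molecular graph as follows. This is a simplified illustration, not the authors' pipeline: the bond energies are made-up numbers rather than real dissociation energies, the order in which eligible bonds are cut (weakest first) is an assumption, and the function names are hypothetical. Only the 90 kcal/mol threshold, the protection of aromatic bonds, and the cap of three fragments are taken from the text.

```python
THRESHOLD_KCAL = 90.0  # cleavage threshold stated in the text


def components(n_atoms, bonds):
    """Connected components of a graph given as (i, j) atom-index pairs."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in bonds:
        parent[find(i)] = find(j)
    groups = {}
    for atom in range(n_atoms):
        groups.setdefault(find(atom), []).append(atom)
    return sorted(groups.values())


def fragment(n_atoms, bonds, max_fragments=3):
    """Cut eligible bonds until at most max_fragments components exist.

    bonds: list of (i, j, energy_kcal, is_aromatic) tuples.
    Eligible means energy <= threshold and not aromatic. Cutting weakest
    bonds first is an assumption made for this sketch.
    """
    kept = [(i, j) for i, j, _, _ in bonds]
    eligible = sorted(
        (b for b in bonds if b[2] <= THRESHOLD_KCAL and not b[3]),
        key=lambda b: b[2],
    )
    for i, j, _, _ in eligible:
        current = len(components(n_atoms, kept))
        if current >= max_fragments:
            break
        trial = [b for b in kept if b != (i, j)]
        # Keep the cut only if it separates the graph; a single cut inside
        # a ring does not, so such bonds are left in place.
        if len(components(n_atoms, trial)) > current:
            kept = trial
    return components(n_atoms, kept)


if __name__ == "__main__":
    # Five-atom chain with illustrative (not measured) bond energies.
    chain = [(0, 1, 85.0, False), (1, 2, 110.0, False),
             (2, 3, 70.0, False), (3, 4, 88.0, False)]
    print(fragment(5, chain))
```

In a real setting the graph and aromaticity flags would come from a cheminformatics toolkit and the energies from a lookup table or calculation; those parts are deliberately left out here.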

3. Reward Ranked Alignment with Expert Exploration (RAE)

The RAE learning regime systematically aligns the generative model with multi-objective design criteria and increases output diversity:

Supervised Fine-Tuning (SFT): Starting from the pre-trained model, fine-tuning is performed on fragment assembly tasks extracted from a hand-crafted corpus, ensuring learning of generic growing/linking/merging policies.
Expert Exploration: The model iteratively samples new molecules using its current policy and leverages external “expert” models (ScaffoldGPT, ControllableGPT) to generate further candidates. This mixture enhances exploration, driving the policy beyond local optima and introducing structurally diverse solutions, especially effective in uncharted molecular spaces.
Data Selection and Augmentation: Candidate molecules are scored using both Pareto fronts (multi-objective non-dominated sorting) and composite reward metrics (standardized sum of property predictors: druglikeness, solubility, synthesizability, docking, and similarity). Sampling for subsequent training proceeds 80% from top Pareto/composite candidates and 20% randomly, supporting both goal alignment and compound novelty.

Under this scheme, the training data behind each update are chosen for reward optimality and expert-driven diversity rather than drawn from the policy's own samples alone, which steers generation toward molecules that meet the targeted pharmaceutical goals.
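The selection step can be sketched in plain Python as below. The text specifies Pareto (non-dominated) filtering, a standardized-sum composite reward, and an 80/20 split between top and random candidates; how the Pareto and composite criteria are combined into one ranking is not stated, so the ordering used here (Pareto members first, then composite reward) is an assumption, as are all function names. Every objective is assumed to be oriented so that higher is better.

```python
import random
from statistics import mean, pstdev


def pareto_front(scores):
    """Indices of non-dominated candidates (higher is better everywhere)."""
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


def composite_rewards(scores):
    """Sum of per-objective z-scores for each candidate."""
    columns = list(zip(*scores))
    # Fall back to 1.0 when an objective is constant (zero deviation).
    stats = [(mean(c), pstdev(c) or 1.0) for c in columns]
    return [sum((x - m) / sd for x, (m, sd) in zip(row, stats))
            for row in scores]


def select_for_training(scores, k, top_fraction=0.8, seed=0):
    """Pick about k candidates: top_fraction by rank, the rest at random."""
    rng = random.Random(seed)
    n_top = round(k * top_fraction)
    front = set(pareto_front(scores))
    reward = composite_rewards(scores)
    # Assumed ranking: Pareto members first, then descending composite reward.
    ranked = sorted(range(len(scores)),
                    key=lambda i: (i not in front, -reward[i]))
    top, rest = ranked[:n_top], ranked[n_top:]
    random_part = rng.sample(rest, min(k - n_top, len(rest)))
    return top, random_part


if __name__ == "__main__":
    # Two toy objectives per candidate, e.g. druglikeness and negated docking.
    toy = [(1, 1), (2, 2), (3, 1), (1, 3), (0, 0)]
    print(pareto_front(toy))
    print(select_for_training(toy, k=5))
```

The random 20% is drawn only from candidates that did not make the top set, so the two parts never overlap.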

4. Linker Generation and Structural Redundancy Resolution

FragmentGPT’s linker generation operates as follows:

Given input as <p1>SMILES_A <p2>SMILES_C, the model learns to insert a chemically viable “linker” structure between fragments A and C.
Practical applications (e.g., designing bifunctional ligands such as PROTACs) show that FragmentGPT can propose linkers with specific scaffolds (e.g., imidazolinone) known for experimental stability and compatibility.
For redundant structures (e.g., duplicate aromatic rings in A+B / B+C merges), the model is trained using overlap-aware prompts (e.g., <p1>SMILES_A+SMILES_B <p2>SMILES_B+SMILES_C) and maximum common substructure (MCS)-based corpus construction. The model learns to merge overlapping structures into a chemically unified molecule.

This automated redundancy resolution reflects a significant advance over atom- or naive fragment-based approaches, which cannot easily address structural duplication.

5. Experimental Validation and Performance Metrics

The model was evaluated in extensive experiments on real-world datasets (e.g., 1 million ZINC15 compounds docked to RTCB, a cancer protein):

Metrics: Validity, druglikeness, solubility, synthesizability (retrosynthesis-based), docking score (protein-ligand affinity simulation), and similarity (Tanimoto coefficient).
Empirical Results: Fine-tuned FragmentGPT achieves validity rates of 0.991–0.993 (linking/merging) and high composite rewards, and it outperforms baseline methods such as Link-INVENT in both real and ablation experiments. Notably, the approach achieves strong docking scores while retaining other desired pharmaceutical properties, which is evidence of successful multi-objective optimization.
Case Studies: In bifunctional molecule design (e.g., dBET6), FragmentGPT recovers known high-quality linkers and even infers correct functionalization sites (such as acyl couplings) not supplied during training. This suggests robust chemical pattern induction.
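For reference, the Tanimoto coefficient used as the similarity metric is the size of the intersection of two fingerprint feature sets divided by the size of their union. The sketch below computes it over plain Python sets of feature indices. Real pipelines typically derive these sets from molecular fingerprints produced by a cheminformatics toolkit, which is outside the scope of this example, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A & B| / |A | B| over sets of 'on' feature indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention for two empty fingerprints (an assumption)
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Two shared features out of four distinct ones.
    print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```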

6. Applications in Drug Discovery

FragmentGPT unifies multiple FBDD workflows:

Direct application for fragment growing from screening “hits” toward potent leads.
Automated and chemically informed fragment linking between diverse subunits, relevant for PROTACs, bifunctional ligands, or dual-site inhibitors.
Redundancy-aware fragment merging, enabling the generation of optimized molecules without duplicated rings or superfluous groups, an operation not feasible with most generative models.
Rapid, policy-driven exploration of diverse chemical series that meet complex pharmaceutical constraints (docking, ADME, synthetic accessibility) through multi-objective aligned training.

The system is validated not only by simulated metrics, but also through expert-reviewed case studies and output visualizations, such as the identification of precise scaffolds critical for real-world drug design campaigns (e.g., imidazolinone).

7. Significance and Implications

FragmentGPT advances the state of generative molecular design by integrating:

A chemically grounded, energy-constrained fragment assembly protocol, ensuring all candidate molecules are constructed via realistic, synthesizable operations.
A scalable, goal-driven generative framework capable of optimizing for and balancing multiple, often conflicting, drug discovery objectives.
Automated redundancy resolution, extending generative capability beyond atom-wise or naive fragment recombination paradigms.

The model and training strategies directly address core challenges in FBDD, providing a single end-to-end framework for fragment growing, linking, and merging. This positions FragmentGPT as a significant contribution to computational medicinal chemistry and early-phase drug R&D, with demonstrated efficacy on challenging biological and therapeutic targets (Liu et al., 14 Sep 2025).

References

Liu et al. (2025). FragmentGPT: A Unified GPT Model for Fragment Growing, Linking, and Merging in Molecular Design.
