Fragment-Based Molecular Language Modeling
- Fragment-based molecular language modeling is a framework that decomposes molecules into chemically meaningful fragments for generative and representation learning.
- It employs advanced tokenization along with transformer, VAE, and hybrid architectures to boost validity, diversity, efficiency, and property control in chemical design.
- The approach integrates 3D geometry and protein context to facilitate structure-based drug discovery and scalable, high-novelty molecular generation.
Fragment-Based Molecular Language Modeling is a framework for generative and representation learning in chemistry that decomposes molecules into chemically meaningful fragments, encodes them into discrete sequences, and models molecular assembly, transformation, or exploration via language-model architectures. This approach departs from atom-level or character-level schemes such as SMILES string modeling by exploiting the chemical inductive bias carried by substructural motifs, using fragment-level representations to improve validity, diversity, efficiency, and property control. Recent developments cover transformer-based, VAE-based, and hybrid systems, with explicit integration of property conditioning, 3D geometry, protein pocket context, and fragment-based editing and visualization.
1. Fragment Identification, Tokenization, and Vocabulary Construction
Fragment definition and extraction underpin the paradigm. Rule-based fragmentation via BRICS (Podda et al., 2020), RECAP (Lv et al., 30 Dec 2024), or MMPA (Wu et al., 2023) provides atom-labeled substructures; learned tokenization (FragmentNet) uses graph merging with Weisfeiler–Lehman hashing (Samanta et al., 3 Feb 2025). Corpus-wide enumeration typically yields vocabularies of 10³–10⁵ unique fragments (e.g., 168,537 raw fragments reduced to 21,085 after masking (Podda et al., 2020)), and can scale far beyond that (>62 M unique SMILES fragments in FragAtlas-62M (Ho et al., 23 Sep 2025)). Advanced grammars support group tokens (Group SELFIES) with formally guaranteed chemical validity (Cheng et al., 2022).
Fragment-token assignment is either direct—each unique fragment is mapped to a token—or surrogate—high-frequency SMILES substrings or learned merges are mapped to substring/group tokens (as in SPE (Hu et al., 15 Jan 2024) and FragmentNet (Samanta et al., 3 Feb 2025)). Low-frequency masking (LFM) reduces active vocabulary and facilitates rare fragment sampling (Podda et al., 2020).
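As a concrete illustration, a minimal vocabulary-construction pass over a corpus might look as follows, using RDKit's BRICS implementation; the toy corpus, the MIN_COUNT threshold, and the [MASK] token are illustrative assumptions of this sketch, not the exact pipeline of any cited system.

```python
from collections import Counter
from rdkit import Chem
from rdkit.Chem import BRICS

# Toy corpus; a real pipeline would stream millions of SMILES.
corpus = ["CCOC(=O)c1ccccc1", "CC(=O)Nc1ccc(O)cc1", "CCOC(=O)c1ccc(O)cc1"]

# Enumerate BRICS fragments corpus-wide; the dummy atoms ([n*]) in the
# returned fragment SMILES mark attachment points for later reassembly.
fragment_counts = Counter()
for smi in corpus:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        fragment_counts.update(BRICS.BRICSDecompose(mol))

# Low-frequency masking (LFM): collapse rare fragments into one mask token,
# shrinking the active vocabulary. MIN_COUNT is an illustrative threshold.
MIN_COUNT = 2
kept = sorted(f for f, c in fragment_counts.items() if c >= MIN_COUNT)
vocab = {frag: idx for idx, frag in enumerate(kept)}
vocab["[MASK]"] = len(vocab)

def tokenize(smiles: str) -> list[int]:
    """Map a molecule to fragment-token ids, masking rare fragments."""
    mol = Chem.MolFromSmiles(smiles)
    return [vocab.get(f, vocab["[MASK]"]) for f in BRICS.BRICSDecompose(mol)]
```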
Table: Fragment Tokenization Strategies
| Method | Fragment Identification | Token Representation | Typical Vocabulary Size |
|---|---|---|---|
| BRICS | Retrosynthetic bond cleavage | SMILES fragments | 10⁴–10⁵ |
| FragmentNet | Graph merge via edge frequency | Graph hash IDs | 10⁴–10⁵ (learned tokens) |
| Group SELFIES | Subgraph isomorphism of functional groups | Group tokens, atomic tokens | 10²–10³ (customizable) |
| SPE | SMILES BPE of frequent substrings | Substring tokens | 10²–10³ |
Chemically valid attachment points are tracked via dummy atoms or atom labels so that reassembly only yields feasible molecules (Podda et al., 2020, Samanta et al., 3 Feb 2025, Cheng et al., 2022). Fragment-based grammars (t-SMILES, Group SELFIES) support unambiguous encoding and decoding (Wu et al., 2023, Cheng et al., 2022).
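To make the attachment-point constraint concrete, the sketch below decomposes a molecule and recombines the resulting labeled fragments with RDKit's BRICSBuild; the seed molecule is arbitrary, and this illustrates only the general mechanism, not a specific cited model.

```python
from itertools import islice
from rdkit import Chem
from rdkit.Chem import BRICS

# Decompose a molecule; each fragment carries numbered dummy atoms
# ([1*], [3*], [16*], ...) recording which BRICS bond classes it may reform.
seed = Chem.MolFromSmiles("CCOC(=O)c1ccc(NC(C)=O)cc1")
fragments = [Chem.MolFromSmiles(s) for s in BRICS.BRICSDecompose(seed)]

# BRICSBuild recombines fragments only at compatible attachment labels,
# so every emitted molecule is feasible by construction of the grammar.
for mol in islice(BRICS.BRICSBuild(fragments), 5):
    mol.UpdatePropertyCache(strict=False)
    print(Chem.MolToSmiles(mol))
```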
2. Language-Model Architectures and Learning Objectives
Fragment-based molecular language models employ both sequence and graph architectures. Canonical approaches implement decoder-only transformers (GPT-2 backbone (Ho et al., 23 Sep 2025, Liu et al., 14 Sep 2025)), encoder–decoder VAEs (Podda et al., 2020, Cheng et al., 2022), and hybrid multi-modal systems with parallel representations (HME: fragment sequence, 2D topology, 3D conformation (Lv et al., 30 Dec 2024); FragmentNet: VQ-VAE+GCN fusion (Samanta et al., 3 Feb 2025)).
Autoregressive generation is ubiquitous: the likelihood of a fragment-token sequence factorizes as $p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$. VAEs employ latent regularization and maximize likelihood over fragment sequences (Podda et al., 2020, Cheng et al., 2022). Masked Fragment Modeling (MFM (Samanta et al., 3 Feb 2025)) and edit-based recovery with fragment-level supervision (SMI-Editor (Zheng et al., 7 Dec 2024)) provide alternatives for training robust fragment understanding.
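Concretely, the autoregressive objective is the standard shifted next-token cross-entropy over fragment ids; the minimal PyTorch sketch below (layer counts, dimensions, and the FragmentLM class are placeholders of our own) shows the factorization in code.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL = 21_086, 256   # illustrative sizes, e.g. a masked
                                    # fragment vocabulary plus special tokens

class FragmentLM(nn.Module):
    """Tiny decoder-only transformer over fragment-token sequences."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                                   # logits: (B, T, V)

def ar_loss(model: FragmentLM, tokens: torch.Tensor) -> torch.Tensor:
    """NLL of the factorization p(x_t | x_<t): shift logits against targets."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
```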
Fragment conditioning—supporting property control and fragment constraints—is achieved by property vectors integrated into input features (Seo et al., 2021, Ortega-Ochoa et al., 10 Nov 2024), fragment-prompt padding (Liu et al., 14 Sep 2025), or Q-learning-based compression (Lv et al., 30 Dec 2024). Edit-based models target fragment recovery as an editing policy over disrupted SMILES (Zheng et al., 7 Dec 2024).
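A common realization of property conditioning, prepending a learned projection of the property vector as one or more pseudo-tokens, can be sketched as follows; the PropertyConditioner module and its dimensions are illustrative assumptions, not a reproduction of any cited architecture.

```python
import torch
import torch.nn as nn

class PropertyConditioner(nn.Module):
    """Project continuous properties (e.g., logP, QED, MW) into conditioning
    embeddings prepended to the fragment-token embeddings."""
    def __init__(self, n_props: int, d_model: int, n_cond_tokens: int = 1):
        super().__init__()
        self.n_cond_tokens = n_cond_tokens
        self.proj = nn.Linear(n_props, n_cond_tokens * d_model)

    def forward(self, token_embs: torch.Tensor, props: torch.Tensor):
        # token_embs: (B, T, d_model); props: (B, n_props)
        cond = self.proj(props).view(props.size(0), self.n_cond_tokens, -1)
        return torch.cat([cond, token_embs], dim=1)  # (B, n_cond + T, d_model)

# usage: embs = conditioner(embed(tokens), torch.tensor([[2.1, 0.78, 312.4]]))
```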
Hybrid systems fuse learned fragment, graph, and geometry representations. MolMiner leverages an order-agnostic factorization over fragment-assembly orderings ("stories"), symmetry-aware attachments, and geometry-biased attention (Ortega-Ochoa et al., 10 Nov 2024). 3D-aware approaches predict local and global coordinates per fragment (Lingo3DMol (Feng et al., 2023), Frag2Seq (Fu et al., 19 Aug 2024)) via Transformer models with SE(3)-equivariant frames, attention biases, and cross-attention to protein pockets.
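Geometry-biased attention of the kind referenced above reduces, in its simplest form, to adding a learned function of pairwise distances to the attention logits; the radial-basis parameterization below is one plausible choice under that assumption, not the exact form used in any cited paper.

```python
import torch
import torch.nn as nn

class GeometryBiasedAttention(nn.Module):
    """Single-head attention whose logits receive a learned bias computed
    from pairwise distances between fragment (or anchor-atom) positions."""
    def __init__(self, d_model: int, n_rbf: int = 16, cutoff: float = 10.0):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.register_buffer("centers", torch.linspace(0.0, cutoff, n_rbf))
        self.bias = nn.Linear(n_rbf, 1)   # distance features -> logit bias
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model); coords: (B, T, 3)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dist = torch.cdist(coords, coords)                        # (B, T, T)
        rbf = torch.exp(-(dist.unsqueeze(-1) - self.centers) ** 2)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        logits = logits + self.bias(rbf).squeeze(-1)              # geometry bias
        return torch.softmax(logits, dim=-1) @ v
```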
3. Generative Pipelines, Editing, and Assembly Tasks
Fragment-based generative models synthesize molecules by sequentially assembling fragments according to attachment rules. In FragmentGPT (Liu et al., 14 Sep 2025), molecule construction proceeds via fragment-growing, linking, and merging steps, with fragment-prompted GPT-2 autoregressive prediction and explicit energy-based bond-cleavage initialization. Trio (Ji et al., 10 Dec 2025) integrates property-aligned fragment-sequence generation with Monte Carlo Tree Search (MCTS), docking-based scoring, and direct preference optimization.
FragmentNet (Samanta et al., 3 Feb 2025) enables adaptive graph-to-sequence conversion, systematic fragment swapping, and visualization of fragment-level learned embeddings. SMI-Editor (Zheng et al., 7 Dec 2024) reconstructs entire molecules from fragment-dropped inputs via edit operations, guided by dynamic programming over the Levenshtein path.
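The Levenshtein-path supervision can be reproduced with a standard token-level dynamic program; in this sketch the traceback convention and the keep/sub/insert/delete op names are our own illustrative choices.

```python
def levenshtein_path(src: list[str], tgt: list[str]) -> list[str]:
    """DP over edit distance; traceback yields one optimal edit script."""
    m, n = len(src), len(tgt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    ops, i, j = [], m, n
    while i > 0 or j > 0:   # trace one optimal path back from the corner
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            ops.append("keep" if src[i - 1] == tgt[j - 1] else f"sub->{tgt[j - 1]}")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("delete"); i -= 1
        else:
            ops.append(f"insert->{tgt[j - 1]}"); j -= 1
    return ops[::-1]

# e.g., supervising recovery of a dropped methyl fragment:
# levenshtein_path(list("c1ccccc1"), list("Cc1ccccc1"))
# -> ['insert->C', 'keep', 'keep', ...]
```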
In 3D-generation pipelines, fragments are placed in geometric context: Lingo3DMol serializes fragment-annotated SMILES with local/global coordinate prediction, leveraging pairwise attention biases from spatial arrangements (Feng et al., 2023). Frag2Seq encodes each fragment as an SE(3)-invariant 7-tuple (a fragment SMILES token together with a distance, two orientation angles, and a three-component rotation vector) for pocket-aware generation via layer-wise cross-attention (Fu et al., 19 Aug 2024).
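Under this 7-tuple reading (one fragment token plus six pose scalars), a pose-encoding helper might look like the following sketch; the exact tuple layout in Frag2Seq may differ, and fragment_pose_tuple is a hypothetical name.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def fragment_pose_tuple(center: np.ndarray, frame: np.ndarray) -> tuple:
    """Encode a fragment pose, expressed in a fixed ligand-local frame, as
    (d, theta, phi, rx, ry, rz): spherical position + rotation vector."""
    d = float(np.linalg.norm(center))
    theta = float(np.arccos(center[2] / d)) if d > 0 else 0.0  # polar angle
    phi = float(np.arctan2(center[1], center[0]))              # azimuth
    rotvec = Rotation.from_matrix(frame).as_rotvec()           # 3-vector
    return (d, theta, phi, *rotvec)

# usage: pose = fragment_pose_tuple(np.array([1.2, -0.3, 2.5]), np.eye(3))
# prepending the fragment's SMILES token gives the full 7-component record
```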
4. Property Control, Feasibility, and Diversity
Fragment-level modeling enhances property control relative to atom-level modeling by aligning chemical properties with substructure choices. BRICS or MMPA fragmentation encodes synthetically accessible units (Seo et al., 2021, Podda et al., 2020), while property vectors are used to condition fragment choices during the generative trajectory. Graph-convolutional embedding enables direct learning of fragment-property contributions (Seo et al., 2021).
Direct and soft property alignment mechanisms are notable:
- Property-aligned selection: embedding and scoring of partial molecules and candidate fragments against user-specified property targets, requiring no auxiliary regression model (Seo et al., 2021); a minimal scoring sketch follows this list.
- Preference alignment: offline ranking of generated molecules, followed by optimization of the generator to prefer higher QED or lower SAS (Ji et al., 10 Dec 2025).
- Edit-based fragment recovery models exhibit clear sensitivity to fragment deletion, correlating strongly with property changes (Zheng et al., 7 Dec 2024).
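A minimal version of the property-aligned selection in the first bullet scores each candidate fragment by the squared error between the predicted property of the grown molecule and the user target; the FragmentScorer module below is an illustrative sketch, not the architecture of Seo et al. (2021).

```python
import torch
import torch.nn as nn

class FragmentScorer(nn.Module):
    """Score candidate fragments by how close the predicted properties of
    the grown molecule land to a user-specified target (illustrative)."""
    def __init__(self, d_emb: int, n_props: int):
        super().__init__()
        self.predict = nn.Sequential(
            nn.Linear(2 * d_emb, d_emb), nn.ReLU(), nn.Linear(d_emb, n_props))

    def forward(self, partial_emb, frag_embs, target_props):
        # partial_emb: (d,); frag_embs: (K, d); target_props: (n_props,)
        pairs = torch.cat(
            [partial_emb.expand(frag_embs.size(0), -1), frag_embs], dim=-1)
        pred = self.predict(pairs)                       # (K, n_props)
        scores = -(pred - target_props).pow(2).sum(-1)   # closer -> higher
        return torch.softmax(scores, dim=0)              # sampling distribution
```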
Frequency-based masking and sampling strategies promote diversity by flattening the fragment frequency distribution; routing rare fragments through masked tokens makes infrequent substructures easier to sample (Podda et al., 2020).
Novelty and diversity metrics demonstrate substantial advances in maintaining uniqueness and avoiding training-set collapse. For FragAtlas-62M, 22 % of fragments are novel (not in ZINC), while validity approaches 99.9 % (Ho et al., 23 Sep 2025). t-SMILES and Group SELFIES maintain high novelty under extended training, circumventing the "striking similarity" pitfall (Wu et al., 2023, Cheng et al., 2022).
5. Integration of 3D Geometry, Protein Context, and Multi-Modal Information
Recent fragment-based models explicitly incorporate geometry and protein context for structure-based drug design (SBDD) or property-targeted design:
- 3D geometry: MolMiner updates fragment coordinates via force-field optimization at every generative step, using geometry-biased softmax attention and symmetry-standardized attachments (Ortega-Ochoa et al., 10 Nov 2024). Fragment-based SMILES representations carry both local (bond, angle, dihedral) and global (Cartesian) coordinates for each token (Feng et al., 2023).
- Protein context: Frag2Seq uses cross-attention between ligand fragment tokens and ESM-IF1-derived protein pocket embeddings for pocket-aware ligand assembly (Fu et al., 19 Aug 2024), as sketched after this list; Lingo3DMol employs an NCI/anchor classifier to guide the placement of fragment connection points within pockets (Feng et al., 2023).
- Heterogeneous encoding: HME fuses fragment-sequence, 2D graph, and 3D conformer embeddings into a unified representation, compressed via Q-learning, supporting molecular description generation and property-constrained design (Lv et al., 30 Dec 2024).
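As noted in the protein-context bullet, a minimal pocket cross-attention layer can be written directly with PyTorch's multi-head attention; the PocketCrossAttention module and its dimensions are illustrative assumptions rather than the Frag2Seq implementation.

```python
import torch
import torch.nn as nn

class PocketCrossAttention(nn.Module):
    """Ligand fragment tokens attend to fixed protein-pocket embeddings
    (e.g., per-residue vectors from an inverse-folding encoder)."""
    def __init__(self, d_lig: int, d_pocket: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            d_lig, n_heads, kdim=d_pocket, vdim=d_pocket, batch_first=True)
        self.norm = nn.LayerNorm(d_lig)

    def forward(self, lig_tokens, pocket_embs, pocket_mask=None):
        # lig_tokens: (B, T, d_lig); pocket_embs: (B, R, d_pocket)
        ctx, _ = self.attn(lig_tokens, pocket_embs, pocket_embs,
                           key_padding_mask=pocket_mask)
        return self.norm(lig_tokens + ctx)   # residual + norm
```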
Evaluation in SBDD contexts demonstrates large improvements in binding affinity, drug-likeness, and synthetic accessibility over atom-level and graph baselines, together with substantial gains in sampling speed (up to 300x) and property calibration (Fu et al., 19 Aug 2024, Feng et al., 2023, Ji et al., 10 Dec 2025).
6. Evaluation Metrics, Comparative Benchmarking, and Scalability
Quantitative assessment employs validity, novelty, uniqueness (Podda et al., 2020), property calibration plots (Ortega-Ochoa et al., 10 Nov 2024), distributional similarity metrics (KLD, FCD, effect size d (Ho et al., 23 Sep 2025, Wu et al., 2023)), and property-control errors (Seo et al., 2021).
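The set-based metrics are simple to compute with RDKit; the sketch below uses the conventional definitions (valid = parseable, unique among valid, novel relative to canonical training SMILES) and is not tied to any specific benchmark harness.

```python
from rdkit import Chem

def generation_metrics(generated: list[str], train_canonical: set[str]) -> dict:
    """Validity, uniqueness, and novelty for a batch of generated SMILES."""
    canonical = []
    for smi in generated:
        mol = Chem.MolFromSmiles(smi)      # None => not chemically valid
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))
    unique = set(canonical)
    return {
        "validity": len(canonical) / max(len(generated), 1),
        "uniqueness": len(unique) / max(len(canonical), 1),
        "novelty": len(unique - train_canonical) / max(len(unique), 1),
    }

# usage: generation_metrics(["CCO", "CCO", "not-a-smiles"], {"CCO"})
# -> {'validity': 0.667, 'uniqueness': 0.5, 'novelty': 0.0}
```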
Benchmarking on large datasets (ZINC-22, ChEMBL, MoleculeNet, CrossDocked, DUD-E) consistently shows fragment-based models matching or exceeding validity and property similarity of graph-based and atom-level models while preserving computational tractability:
- Fragment-based VAE with LFM: Validity = 1.000, Novelty = 0.995, Uniqueness = 0.998 (Podda et al., 2020).
- FragAtlas-62M: valid generation rate 99.9 %, retention of known fragments 53.6 %, novelty 22 %, effect size |d| < 0.4 for all descriptors (Ho et al., 23 Sep 2025).
- FragmentNet: ROC-AUC 94.0 (BBBP), RMSE 0.722 (ESOL), competitive with 100M-parameter graph models (Samanta et al., 3 Feb 2025).
- t-SMILES: novelty up to 0.941, FCD up to 0.909, robust under long training (Wu et al., 2023).
- Lingo3DMol, Frag2Seq: state-of-the-art DUD-E/CrossDocked performance in Vina, QED, SAS, speedup factors 50–300x (Feng et al., 2023, Fu et al., 19 Aug 2024).
- SMI-Editor: ROC-AUC = 77.8 (MoleculeNet), outperforming both SMILES and 3D-GNN baselines (Zheng et al., 7 Dec 2024).
Interoperability with graph, SMILES, SELFIES, and multiscale (hybrid) codes is supported (Wu et al., 2023, Cheng et al., 2022). Scalable training with fragment-based tokens enables efficient pretraining over datasets of tens of millions of molecules on commodity hardware (Ho et al., 23 Sep 2025).
7. Implications, Applications, and Outlook
Fragment-based molecular language modeling unifies chemical validity, property control, and scalable generative design. It is the method of choice for fragment-based drug discovery, structure-based ligand generation, and general-purpose chemical language modeling. Foundation models such as FragAtlas-62M provide open resources for large-scale fragment-centric sampling and fine-tuning (Ho et al., 23 Sep 2025), while integration of geometry and protein context positions the paradigm at the forefront of SBDD (Feng et al., 2023, Fu et al., 19 Aug 2024).
Challenges remain in the selection of optimal fragmentation schemes, vocabulary design, and further scaling to larger model sizes and more diverse chemical domains. Ongoing research explores integration of stereochemistry, dynamic fragment grammars, enhanced property conditioning, and hybrid multimodal architectures. The area is poised for continued rapid advancement in both generative chemistry and molecular representation learning.