Chemical Language Models (CLMs)

Updated 23 June 2026

Chemical Language Models (CLMs) are neural sequence models that view molecular representations like SMILES and SELFIES as linguistic tokens for tasks such as automated molecular generation and property prediction.
They typically employ Transformer-based architectures, and tokenization choices such as SELFIES or fragment-level vocabularies help achieve high validity and improved interpretability in chemical informatics.
Recent developments in CLMs include multimodal integrations and reinforcement learning approaches, enhancing de novo molecular design and predictive accuracy.

Chemical Language Models (CLMs) are neural sequence models that treat molecular string representations, such as SMILES, SELFIES, or InChI, as linguistic objects, learning probability distributions over token sequences to enable automated molecular generation, property prediction, and reaction understanding. Inspired by advances in natural language processing, CLMs employ deep architectures, most commonly Transformer-based, to capture the grammar and semantics of chemistry at a scale matching or exceeding classical graph-based methods. Their introduction has led to rapid developments in molecular design, virtual screening, chemical informatics, and the convergence of chemistry with generative AI paradigms.

1. Molecular Representations and Tokenization Strategies

CLMs operate on 1D line notations encoding molecular graphs, the most prevalent being:

SMILES (Simplified Molecular Input Line Entry System): Encodes connectivity, atomic symbols, bonds, stereochemistry, and ring closures as a text string. Widely used for both training and inference, but exhibits sensitivity to syntax and variations due to non-uniqueness and canonicalization differences (Janakarajan et al., 2023, Kikuchi et al., 11 May 2025). Character-level or substructure-level tokenization (e.g., regex-based atom tokens, SMILES-PE via byte-pair encoding) is common.
SELFIES (Self-Referencing Embedded Strings): A grammar-based notation in which every sequence of tokens decodes to a valid molecule, making it fully robust to string mutation. Its derivation rules enforce valence and structure constraints, avoiding the invalidity problem of SMILES. This enables near-perfect validity in generative tasks (Salij et al., 30 Mar 2026, Flam-Shepherd et al., 2021).
GroupSELFIES and Fragment Encodings: Higher-level tokens representing functional groups, rings, or larger substructures extracted by BPE or fragment libraries (e.g., BRICS). Fragment-based tokens enable more efficient and chemically meaningful grammar induction, accelerate inference, and support interpretability (Salij et al., 30 Mar 2026, Zhu et al., 14 Jul 2025).
Other Modalities: For multimodal or cross-domain CLMs, chemical images (e.g., Chemistry OCR, reaction schemes), protein sequences, or molecular graphs/3D geometries are integrated through image encoders, graph neural networks, or geometric transformers (Li et al., 2024, Lv et al., 2024, Singh et al., 20 Mar 2025).

Token vocabularies thus span atom types, bond types, stereochemical markers, special control tokens ([START], [END], [PAD]), and possibly natural language or protein context tokens when handling cross-modal inputs (Janakarajan et al., 2023, Singh et al., 20 Mar 2025).
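The regex-based atom-level tokenization mentioned above can be sketched as follows. This is a minimal illustration, not the tokenizer of any cited model: the pattern covers only a common subset of SMILES syntax, and the names `SMILES_TOKEN_PATTERN`, `tokenize_smiles`, `build_vocab`, and `encode` are hypothetical helpers introduced here.

```python
import re

# Simplified atom-level pattern (an illustrative assumption, not a full SMILES grammar):
# bracket atoms, two-letter halogens, two-digit ring closures, organic-subset atoms,
# aromatic atoms, bond/branch/stereo symbols, and single-digit ring closures.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[()=#\-+\\/.:~@?>*$]|\d)"
)

# Control tokens named in the text; they receive the first vocabulary ids.
SPECIAL_TOKENS = ["[PAD]", "[START]", "[END]"]


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail loudly if any character is skipped."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(corpus):
    """Assign integer ids to special tokens first, then to tokens in order of appearance."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for smiles in corpus:
        for tok in tokenize_smiles(smiles):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(smiles, vocab):
    """Wrap the token sequence in [START]/[END] and map it to ids."""
    tokens = ["[START]"] + tokenize_smiles(smiles) + ["[END]"]
    return [vocab[t] for t in tokens]
```

Keeping bracket atoms such as `[C@H]` as single tokens preserves stereochemical markers as vocabulary entries, which is one reason atom-level regex tokenization is often preferred over splitting on individual characters.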

2. Model Architectures and Learning Objectives

CLMs predominantly employ Transformer-based sequence models, leveraging the following variants:

Decoder-only (Autoregressive/GPT style): Stack of causal self-attention layers, generating molecular tokens sequentially (left-to-right). Used for de novo molecular generation and property-conditioned design (Salij et al., 30 Mar 2026, Singh et al., 20 Mar 2025, Cavanagh et al., 2024). Fragment-based vocabularies (e.g., GroupSELFIES) have been shown to enhance syntactic correctness and synthetic accessibility.
Encoder-only (BERT style): Learned bidirectional representations via masked language modeling (MLM), yielding fixed-length embeddings for property inference or similarity search (Janakarajan et al., 2023, Mostafanejad et al., 13 Mar 2026, Sagawa et al., 12 Feb 2026). Rotational relative positional encodings, layer normalization, and multi-head self-attention enable the capture of both local SMILES grammar and global chemical semantics (Kenneth et al., 22 Jun 2026).
Encoder–Decoder (Seq2Seq/Transformer): Bidirectional encoder for input (e.g., SMILES or text), unidirectional decoder for output (e.g., reaction product, caption, or cross-domain target), sharing parameter weights across chemistry and text tasks (e.g., multitask T5, SMI-TED) (Christofidellis et al., 2023, Soares et al., 2024).
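The difference between causal (decoder-only) and bidirectional (encoder-only) attention comes down to which positions each token may attend to. The sketch below builds both masks as plain nested lists; the function names are hypothetical, and real implementations would express the same pattern as tensors inside an attention layer.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position, as in BERT-style encoders."""
    return [[True] * n for _ in range(n)]


def visible_tokens(tokens, mask, position):
    """List the tokens that the given position is allowed to see under a mask."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]
```

For the token sequence `["C", "C", "O"]`, position 1 sees `["C", "C"]` under the causal mask and all three tokens under the bidirectional mask. An encoder-decoder model combines the two: a bidirectional mask over the input and a causal mask over the output.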

Key training objectives:

Objective Type	Mathematical Formulation	Task Domain
Next-token prediction	$L_{CE} = -\sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})$	Autoregressive generation
Masked LM (MLM)	$L_{MLM} = -\sum_{i \in M} \log P(x_i \mid x_{<i}, x_{>i}; \theta)$	Representation learning
Conditional LM	$L(\theta) = -\sum_{(c,s) \in D} \log P_\theta(s \mid c)$	Conditional generation
Cross-modal contrastive	$L_{NCE} = -\frac{1}{N}\sum \left[ y \log k(f(m), g(a)) + (1-y) \log\bigl(1 - k(f(m), g(a))\bigr) \right]$	Zero-shot property prediction
DPO/reinforcement	$L_{DPO} = -\sum \log \sigma\bigl(\alpha [s_\theta(y^+ \mid x) - s_\theta(y^- \mid x)]\bigr)$	Preference optimization
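As a concrete reading of three rows of the table, the sketch below evaluates the next-token, masked-LM, and DPO-style losses for a single example. It assumes the model's probabilities for the correct tokens and its preference scores are already available as plain numbers; the function names are hypothetical and no particular framework is implied.

```python
import math


def next_token_loss(target_probs):
    """L_CE: negative log-probability summed over all positions of one sequence.

    target_probs[t] is the probability the model assigns to the true token x_t
    given the prefix x_<t.
    """
    return -sum(math.log(p) for p in target_probs)


def masked_lm_loss(target_probs, masked_positions):
    """L_MLM: same sum, but only over the masked index set M."""
    return -sum(math.log(target_probs[i]) for i in masked_positions)


def dpo_loss(score_preferred, score_rejected, alpha=1.0):
    """-log(sigmoid(alpha * (s+ - s-))) for one preference pair, computed stably."""
    margin = alpha * (score_preferred - score_rejected)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With probabilities 0.5 and 0.25 for two target tokens, `next_token_loss` returns log 8 (about 2.08). When the preferred and rejected scores are equal, `dpo_loss` returns log 2, and it shrinks as the preferred sample's score pulls ahead.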

Data curation emphasizes large molecular corpora (PubChem, ZINC, ChEMBL, UniChem), with pretraining sets spanning 8–1,000M molecules and 4×10⁹–40×10⁹ tokens (Salij et al., 30 Mar 2026, Singh et al., 20 Mar 2025, Soares et al., 2024). Fine-tuning on small, specialist, or property-annotated sets is standard for domain adaptation, with LoRA adapters or preference-based objectives enabling efficient transfer (Salij et al., 30 Mar 2026, Cavanagh et al., 2024).

3. Functional Capabilities: Generation, Prediction, and Interpretation

CLMs are applied to both generative and predictive tasks:

De Novo Molecular Generation: Autoregressive sampling conditioned on property or biological context; fragment-level or atom-level generation with explicit validity constraints (SELFIES, group tokens) (Salij et al., 30 Mar 2026, Zhu et al., 14 Jul 2025, Singh et al., 20 Mar 2025). RL or DPO can be used to bias toward desired properties (QED, binding affinity) (Cavanagh et al., 2024).
Property Prediction: Fixed-size molecular embeddings (e.g., [CLS] token) passed to feed-forward heads for regression (MSE loss), classification (cross-entropy), or ranking (Mostafanejad et al., 13 Mar 2026, Soares et al., 2024, Janakarajan et al., 2023). Benchmarks include MoleculeNet tasks (ROC-AUC, RMSE/MAE), GuacaMol, and MOSES (Sagawa et al., 12 Feb 2026, Soares et al., 2024).
Reaction and Retrosynthetic Analysis: Sequence-to-sequence models for translating input reactants/products to outputs, fine-tuned on reaction databases, evaluated by top-k accuracy or route recovery (Janakarajan et al., 2023, Christofidellis et al., 2023).
Cross-Modal and Multimodal Reasoning: Integration of textual, structural, and visual modalities via pretrained vision encoders, fragment/topology/conformer fusion (HME), or cross-attention with protein representations (Lv et al., 2024, Li et al., 2024, Singh et al., 20 Mar 2025).
Interpretability: Sparse autoencoder dissection of CLM residual streams reveals monosemantic features for substructures, functional groups, and property correlates. Ablation or activation can modulate generation, attributing structure–activity relationships at the fragment level (Cohen et al., 8 Dec 2025, Kenneth et al., 22 Jun 2026, Zhu et al., 14 Jul 2025).
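The autoregressive sampling loop behind de novo generation can be sketched as follows. The model is replaced by a hypothetical callable `next_token_probs(prefix)` that returns a token-to-probability mapping; the toy bigram table stands in for a trained CLM and carries no chemical meaning. Temperature scaling is one common way to trade diversity against fidelity and is an assumption here, not something the text prescribes.

```python
import random


def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a distribution, then renormalize."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: p / total for tok, p in scaled.items()}


def sample_sequence(next_token_probs, max_len=50, temperature=1.0, rng=None):
    """Sample tokens left to right until [END] is drawn or max_len is reached."""
    rng = rng or random.Random()
    prefix = ["[START]"]
    for _ in range(max_len):
        probs = apply_temperature(next_token_probs(prefix), temperature)
        candidates = list(probs)
        token = rng.choices(candidates, weights=[probs[t] for t in candidates], k=1)[0]
        if token == "[END]":
            break
        prefix.append(token)
    return prefix[1:]  # drop the [START] control token


# Toy stand-in for a trained model: conditions only on the last token.
TOY_BIGRAM = {
    "[START]": {"C": 1.0},
    "C": {"C": 0.5, "O": 0.3, "[END]": 0.2},
    "O": {"[END]": 1.0},
}


def toy_model(prefix):
    return TOY_BIGRAM[prefix[-1]]
```

Calling `sample_sequence(toy_model, rng=random.Random(0))` yields a short token list drawn from `{"C", "O"}`. Property or protein conditioning would enter through extra context passed to the model, and RL or DPO would reshape the probabilities the model returns rather than the loop itself.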

4. Evaluation Metrics, Scaling, and Benchmarks

Standardized evaluation protocols for CLMs encompass:

Metric	Definition / Role
Validity	Fraction of generated strings parsed as valid molecules (SELFIES: ≈100%)
Uniqueness	Fraction of non-duplicate valid molecules in a sample batch
Novelty	Share not present in the training set
Diversity	$1 - \text{mean(Tanimoto similarity across pairs)}$
SA (Synthetic Accessibility) Score	Ertl’s score: lower = easier to synthesize
KL-score, FCD	Distributional similarity to reference libraries (GuacaMol, Fréchet ChemNet Distance)
ROC-AUC, RMSE/MAE	Property prediction/regression evaluation
Enrichment factor	Hit rate among the top-ranked fraction (e.g., top 1%) divided by the overall hit rate in virtual screening
Fragment-level attribution	Contribution of each fragment to model log-likelihood or property prediction
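Several of the metrics in the table reduce to simple set and ratio computations, sketched below. The validity check is passed in as a hypothetical `is_valid` callable (in practice a cheminformatics toolkit parser), fingerprints are represented as plain sets of bit indices, and novelty is computed over unique valid molecules, which is one common convention rather than the only one. Strings are compared as given, so canonicalizing them beforehand is assumed.

```python
from itertools import combinations


def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty for a batch of generated strings."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def internal_diversity(fingerprints):
    """1 - mean pairwise Tanimoto similarity over all distinct pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def enrichment_factor(scores, labels, top_fraction=0.01):
    """Hit rate in the top-scored fraction divided by the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    top_hit_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_hit_rate = sum(labels) / len(labels)
    return top_hit_rate / overall_hit_rate if overall_hit_rate else 0.0
```

For a batch of four strings in which one fails to parse, one is a duplicate, and one of the two unique survivors already appears in the training set, this gives validity 0.75, uniqueness 2/3, and novelty 0.5.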

Scaling studies show pretraining loss and pseudo-perplexity decrease with increasing data and model size following power laws, but downstream property prediction performance often plateaus or exhibits task-dependent negative transfer beyond moderate scale (≈5–10B tokens, 25–50M parameters) (Mostafanejad et al., 13 Mar 2026, Sagawa et al., 12 Feb 2026). Saturation and even degradation on some tasks highlight the necessity of validation beyond MLM loss and careful model/data selection.
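A power-law trend of the kind described above is usually checked by fitting a straight line in log-log space. The sketch below assumes the simplest form, loss ≈ a · N^(−b) with no irreducible-loss term, and uses ordinary least squares; `fit_power_law` is a hypothetical helper, and any numbers fed to it here are synthetic rather than taken from the cited studies.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by linear regression on (log size, log loss).

    Returns (a, b). Assumes all sizes and losses are strictly positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, size):
    """Extrapolate the fitted curve to a new data or model size."""
    return a * size ** (-b)
```

Because downstream metrics can plateau or degrade while this curve keeps falling, a fit of this kind describes pretraining loss only and should not be used to forecast property prediction accuracy.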

5. Advancements: Conditional, Multimodal, and Chem–Language Foundation Models

Recent developments include:

Conditional and Biological Context-Aware Models: Models such as SAFE-T and Chem42 extend CLMs with property, fragment, or protein-condition inputs via prompt concatenation or cross-attention, supporting near-instant screening, goal-directed generation, and interpretable fragment-level SAR attribution (Zhu et al., 14 Jul 2025, Singh et al., 20 Mar 2025).
Multimodal CLMs: ChemVLM and HME represent models incorporating chemical images, reaction schemes, text descriptions, 2D/3D molecular graphs, and fragment sequences. Fused representations enable robust cross-domain retrieval, generation, and captioning with state-of-the-art distributional coverage and chemical reasoning (Li et al., 2024, Lv et al., 2024).
Text–Chemistry Unified Transformers: Models such as Text+Chem T5 employ shared encoder–decoder architectures for joint learning across chemical and natural language inputs, demonstrating that multi-task/multi-modal pretraining may yield foundation models for chemistry capable of solving property prediction, molecule captioning, textual molecular synthesis, and experimental protocol extraction within a common parameterization (Christofidellis et al., 2023).
RL and Scaling Laws for Exploration: CLMs amplified via reinforcement learning—often at test time, using many independent agents—exhibit log-linear improvements in chemical-space exploration for de novo design, with parallel short RL episodes preferred over long single-agent trajectories. Benchmarks such as MolExp measure true multimodal rediscovery capability and exploration efficiency (Thomas et al., 31 Jan 2025).
Interpretability Tools and Mechanistic Analysis: Sparse autoencoders unveil layerwise latent organization (from syntax/grammar in early layers to semantic/functional features in late layers), providing a mechanistic understanding of how CLMs parse and abstract molecular information, and guiding model improvement toward semantic invariance and active steerability (Cohen et al., 8 Dec 2025, Kenneth et al., 22 Jun 2026).

6. Practical Considerations, Limitations, and Standardization

Reproducibility: SMILES canonicalization protocols (toolkit and version) and complete stereochemical annotation must be declared and standardized throughout pretraining and evaluation (Kikuchi et al., 11 May 2025, Mostafanejad et al., 13 Mar 2026). Nonstandardized or inconsistent representations degrade translation accuracy and latent space comparability, though downstream property tasks are more robust due to label-driven compression.
Model and Data Selection: Larger models and datasets improve representation learning but often yield diminishing or negative returns on specialized property tasks. In practice, scaling beyond moderate size (≈25–50M parameters, ≈5–10B tokens) requires careful task-dependent tuning, early stopping on a small validation set, and possibly ensembling checkpoints rather than relying on pretraining loss as a proxy (Sagawa et al., 12 Feb 2026, Mostafanejad et al., 13 Mar 2026).
Interpretability, Safety, and Deployment: Sparse interpretable representations support mechanistic steering (feature ablation, prototype activation), bias detection, and removal of undesirable property correlates, which is critical for deployment in high-stakes design scenarios (Cohen et al., 8 Dec 2025, Kenneth et al., 22 Jun 2026).
Current Gaps and Future Trends: Reliance on 1D string representations limits unambiguous encoding of 3D structures and stereochemistry, motivating increased adoption of multimodal and graph-based modules (Li et al., 2024, Lv et al., 2024). Emerging directions include integration of protein/target context, multimodal foundation models, active learning cycles for low-data domains, and human-in-the-loop optimization empowered by interactive interfaces (Janakarajan et al., 2023, Christofidellis et al., 2023, Salij et al., 30 Mar 2026).
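The model-selection advice above, choosing checkpoints by a downstream validation metric rather than by pretraining loss, can be sketched as a patience-based early-stopping rule. The function name, the `patience` parameter, and the example scores are hypothetical and chosen only for illustration.

```python
def select_checkpoint(val_scores, patience=2, higher_is_better=True):
    """Return the index of the best checkpoint seen before patience runs out.

    val_scores[i] is the downstream validation metric of checkpoint i
    (e.g. ROC-AUC when higher_is_better, RMSE otherwise).
    """
    best_idx, best_score = None, None
    evals_since_best = 0
    for idx, score in enumerate(val_scores):
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_idx, best_score, evals_since_best = idx, score, 0
        else:
            evals_since_best += 1
            if evals_since_best >= patience:
                break  # stop scanning; later checkpoints are never evaluated
    return best_idx
```

With ROC-AUC values `[0.70, 0.75, 0.74, 0.73, 0.80]` and `patience=2`, the rule stops after the fourth checkpoint and returns index 1, which illustrates the trade-off: a small patience saves compute but can miss a later recovery.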

7. Impact and Outlook

CLMs have transformed computational chemistry, bringing the flexibility, scale, and creative capabilities of LLMs to molecule-centric domains. By enabling efficient de novo design, property screening, reaction prediction, and integrative reasoning across textual, structural, and biological modalities, these models compress extensive "design–make–test–analyze" cycles into AI-accelerated workflows (Salij et al., 30 Mar 2026, Janakarajan et al., 2023). Limitations due to representation ambiguity, overfitting in large-scale pretraining, and sensitivity to data/notation remain active areas of research. With the advent of cross-modal and foundation CLMs, the next phase promises unified, interactive, and highly generalizable platforms for chemical discovery and molecular engineering.