
MoLFormer: Transformer Models for Molecules

Updated 5 December 2025
  • MoLFormer is a family of transformer-based architectures that leverage both 3D geometry-aware graph models and SMILES-based language models for molecular representation.
  • The models utilize heterogeneous self-attention, multi-scale masking, and functional group masking to achieve state-of-the-art performance across quantum, physiological, and activity prediction tasks.
  • MoLFormer enables scalable de novo molecule generation, scaffold decoration, and property-guided optimization for structure-based drug design and chemical optimization.

MoLFormer refers to a family of transformer-based architectures designed for molecular representation, property prediction, and de novo molecule generation. The MoLFormer lineage spans both 3D geometry-aware graph transformers and high-throughput chemical LLMs operating over SMILES, sharing critical innovations in attention mechanisms, positional encoding, large-scale pre-training, and molecular motif handling. These models underpin state-of-the-art performance across quantum, physiological, and activity prediction tasks, as well as scalable molecular generation suitable for structure-based drug design and chemical optimization.

1. Foundational Architecture and Model Variants

1.1 Motif-based 3D Graph Transformer

The original "Molformer" introduced a transformer variant for 3D heterogeneous molecular graphs (HMGs), where nodes comprise both atom-level nodes x1,,xNx_{1}, \ldots, x_{N}, with geometric and chemical invariants, and motif-level nodes m1,,mMm_{1}, \ldots, m_{M} representing functional substructures. Atom features include embeddings xiRdx_i \in \mathbb{R}^d, 3D coordinates piR3p_i \in \mathbb{R}^3, and auxiliary attributes; motif embeddings xmj=WMone_hot(categoryj)x_{m_j} = W^M\, \mathrm{one\_hot(category_j)} are positioned via atomic-weight centroids. All node pairs are fully connected, with relation types φ(i,j){atom-atom,atom-motif,motif-motif}\varphi(i, j) \in \{\text{atom-atom}, \text{atom-motif}, \text{motif-motif}\} and geometric distances dij=pipj2d_{ij} = \|p_i - p_j\|_2 (Wu et al., 2021).

The model architecture consists of $L$ heterogeneous self-attention (HSA) layers (typically $L=6$), with each attention head applying radius-based, multi-scale geometric masking. The HSA formulation encodes spatial structure via learned kernels on pairwise distances, relation-specific biasing, and multi-scale feature aggregation. The output representation is distilled via attentive farthest point sampling (AFPS).
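The sketch below compresses one HSA head into a few lines, assuming precomputed node features, pairwise distances, and relation types (e.g., from the HMG sketch above). The cutoff radii and the scalar relation-bias table are illustrative; the full model additionally uses learned distance kernels and AFPS readout.

```python
# Sketch: one heterogeneous self-attention head with radius-based multi-scale masking.
import torch
import torch.nn.functional as F

def hsa_head(x, dist, rel, w_q, w_k, w_v, rel_bias, cutoffs=(2.0, 4.0, 8.0)):
    """x: (N, d) node features; dist: (N, N) pairwise distances;
    rel: (N, N) relation-type ids (long); rel_bias: (num_relation_types,) biases."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / q.shape[-1] ** 0.5 + rel_bias[rel]    # relation-specific bias
    outs = []
    for r in cutoffs:                                        # one attention mask per scale
        masked = scores.masked_fill(dist > r, float("-inf"))
        outs.append(F.softmax(masked, dim=-1) @ v)
    return torch.stack(outs).mean(0)                         # multi-scale aggregation
```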

1.2 SMILES-based LLMs

Subsequent MoLFormer variants adopt transformer encoder–decoder architectures over tokenized SMILES molecular strings, leveraging linear self-attention (FAVOR+) and rotary positional embeddings for scalable, high-dimensional sequence modeling (Ross et al., 2021, Ross et al., 4 Apr 2024). These models (e.g., MoLFormer-XL, GP-MoLFormer) are pretrained on over 1.1 billion molecules, using masked language modeling or autoregressive objectives, producing universal molecular embeddings or generative policies, respectively. The standard configuration employs 12 transformer layers, $d=768$, 12 heads per layer, and up to 48M parameters.
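A hedged sketch of extracting frozen molecular embeddings from a pretrained SMILES encoder via Hugging Face transformers follows. The checkpoint identifier and the mean-pooling readout are assumptions for illustration, not necessarily the exact setup reported in the papers.

```python
# Sketch: frozen MoLFormer embeddings via transformers (checkpoint name is assumed).
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "ibm/MoLFormer-XL-both-10pct"   # assumed public checkpoint id
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True).eval()

smiles = ["CCO", "c1ccccc1O"]
batch = tok(smiles, padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Mask-aware mean pooling over token embeddings -> one vector per molecule
mask = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
print(emb.shape)   # (2, 768) for the standard d=768 configuration
```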

Functional group–aware pre-training strategies such as random functional group masking further enhance chemical inductive bias by forcing reconstruction of entire chemically meaningful subgraphs instead of random token spans, thereby improving structure–property generalization (Peng et al., 3 Nov 2024).

2. Geometric and Structural Representation Strategies

2.1 Heterogeneous Molecular Graphs and Motif Extraction

HMGs encode molecules as multi-scale graphs incorporating both atom-level and substructure-level (motif) nodes. Small-molecule motifs derive from functional group categories—hydrocarbons, haloalkanes, oxygen-containing, and nitrogen-containing groups—detected via SMARTS/RDKit matching. Protein graph motifs are generated using reinforcement learning acting on ProtBERT embeddings of amino acid segments, with an explicit diversity reward to maximize motif set coverage (Wu et al., 2021).

2.2 Geometry-aware Attention

In 3D-aware MoLFormer models, the HSA module augments transformer self-attention with explicit geometric kernels. The geometry encoding applies a learned convolution over the pairwise distance matrix, ensuring roto-translation invariance, and supplies per-relation-type biases. Multi-scale masking restricts attention neighborhoods based on three preset distance cutoffs, yielding both local- and global-scale aggregation.

For Moleformer and similar designs, attention biases are computed from rotation- and translation-invariant primitives such as interatomic distances and angles, capturing higher-order interactions by embedding both nodes and edges and directly encoding angles and torsions into the attention mechanism (Yuan et al., 2023).
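The invariant primitives themselves are straightforward to compute from Cartesian coordinates, as in the sketch below; how they are embedded into attention biases is model-specific and omitted here.

```python
# Sketch: rotation/translation-invariant primitives (distances and angle cosines).
import numpy as np

def invariant_primitives(pos):
    """pos: (N, 3) atomic coordinates -> pairwise distances and angle cosines."""
    diff = pos[:, None, :] - pos[None, :, :]           # (N, N, 3) displacement vectors
    dist = np.linalg.norm(diff, axis=-1)               # d_ij, invariant to rigid motions
    unit = diff / np.clip(dist[..., None], 1e-9, None)
    # Cosine of the angle at atom i between directions i->j and i->k: shape (N, N, N)
    cos_ijk = np.einsum("ijd,ikd->ijk", unit, unit)
    return dist, cos_ijk
```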

2.3 Random Functional Group Masking

MoLFormer variants using SMILES as input achieve structural awareness via functional group masking. Instead of default BERT-style token masking, functional group (FG) masking employs RDKit to find substructure spans, masking all associated tokens per pretraining step. The masked-language modeling loss is then focused on reconstructing chemically coherent fragments rather than unstructured token noise:

$L(\theta) = -\sum_{i \in G} \log p_\theta\left(x_i \mid \mathbf{x}'_{\backslash G}\right)$

This forces the transformer to capture intra- and inter-group dependencies and implicitly develop a topology-aware, structure-sensitive representation (Peng et al., 3 Nov 2024).
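A sketch of this objective is shown below, assuming a Hugging Face-style masked language model and a precomputed mapping from a sampled functional group to its SMILES token indices.

```python
# Sketch: masked-LM loss restricted to one functional group's token span.
import torch
import torch.nn.functional as F

def fg_masked_mlm_loss(model, input_ids, fg_token_idx, mask_token_id):
    """input_ids: (1, T) tokenized SMILES; fg_token_idx: token indices of one FG span."""
    labels = input_ids.clone()
    masked = input_ids.clone()
    masked[0, fg_token_idx] = mask_token_id             # mask the whole FG span at once

    logits = model(masked).logits                       # (1, T, vocab_size)
    # Cross-entropy computed only on the masked functional-group positions
    return F.cross_entropy(logits[0, fg_token_idx], labels[0, fg_token_idx])
```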

3. Training Paradigms and Pre-Training Corpora

MoLFormer models are pre-trained on massive unlabeled SMILES corpora from PubChem and ZINC, exceeding $1.1 \times 10^9$ unique chemical structures. Sequences are canonicalized, tokenized (Schwaller tokenizer), and truncated/padded to 202 tokens, achieving $>99.4\%$ coverage of molecule lengths. Pre-training employs masked language modeling for encoders or causal autoregressive objectives for GP-MoLFormer and derivative decoders (Ross et al., 2021, Ross et al., 4 Apr 2024).
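A sketch of this preprocessing path, using RDKit canonicalization and the widely used SMILES regex tokenizer attributed to Schwaller et al.; special-token handling is simplified for illustration.

```python
# Sketch: canonicalize, regex-tokenize, and pad a SMILES string to a fixed length.
import re
from rdkit import Chem

SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
MAX_LEN = 202  # covers >99.4% of molecules in the pre-training corpora

def preprocess(smiles: str, pad_token: str = "<pad>"):
    canonical = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))   # canonical SMILES
    tokens = SMILES_REGEX.findall(canonical)[:MAX_LEN]          # tokenize + truncate
    return tokens + [pad_token] * (MAX_LEN - len(tokens))       # pad to fixed length

print(preprocess("OC1=CC=CC=C1")[:8])   # canonicalized and tokenized phenol
```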

Key efficiency improvements arise from:

  • Linear attention: random feature maps (FAVOR+) reduce the $O(N^2)$ attention cost to $O(N)$, enabling batch sizes of 1,600 molecules per GPU.
  • Rotary positional encodings: replacing absolute position embeddings with rotation-based (oscillatory) biases in the attention computation improves handling of molecular structural variation (see the sketch after this list).
  • Sequence-length bucketing: limits padding overhead during batching.
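The rotary sketch referenced above, applied to query/key vectors before the dot product; the FAVOR+ linear-attention approximation used in the full models is omitted here.

```python
# Sketch: rotary position embeddings (half-split variant) applied to queries and keys.
import torch

def rotary(x, base=10000.0):
    """x: (seq_len, d) with even d -> same shape, positions encoded as rotations."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (d/2,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) feature pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k = torch.randn(5, 8), torch.randn(5, 8)
scores = rotary(q) @ rotary(k).T   # attention logits now depend on relative position
```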

Functional group masking is applied during pre-training, with a set $\mathcal{F}$ of $K$ SMARTS-defined functional groups, selecting random group instances per molecule per batch (Peng et al., 3 Nov 2024).

4. Downstream Task Performance and Benchmarking

MoLFormer exhibits strong performance across classification and regression tasks, including MoleculeNet and QM molecular property benchmarks.

Representative Results

| Task (Dataset) | MoLFormer/Variant | Metric / Value | Notable Baseline(s) | Baseline Value |
|---|---|---|---|---|
| QM7 (quantum regression) | Motif-based | MAE 43.2 | GraphTransformer | 47.8 |
| QM8 | Motif-based | MAE 0.009 | DMPNN | 0.014 |
| QM9 | Motif-based | MAE competitive (.032–.039) | DimeNet++/SphereNet | .032–.032 |
| BBBP (classification) | Encoder / FG-masked | ROC-AUC 0.926 / 0.9055 | AttentiveFP / MolCLR | 0.908 / 0.9307 |
| ESOL (solubility) | FG-masked | RMSE 0.3432 | GEM | 0.7614 |

Additional highlights:

  • Fine-tuned MoLFormer-XL achieves strong Pearson correlation (0.64/–0.60) to classical ECFP/MCS similarity metrics, reflecting its structural encoding power (Ross et al., 2021).
  • Random FG-masked MoLFormer outperforms or is competitive with graph-based and geometry-augmented models in 9/11 core benchmarks (Peng et al., 3 Nov 2024).
  • Geometry-aware MoLFormer variants achieve state-of-the-art accuracy on OC20 energy prediction (MAE 459 meV) and competitive QM9 quantum property prediction (Yuan et al., 2023).

5. Molecular Generation, Optimization, and Memorization

The GP-MoLFormer variant extends MoLFormer to autoregressive molecular string generation and molecular optimization tasks (Ross et al., 4 Apr 2024). The model can be deployed for three central domains without retraining:

  • De novo generation: Samples exhibit high validity ($>99\%$), high uniqueness, and nontrivial novelty (e.g., 16.7%–39% for 10M generations, depending on data deduplication); a sampling-and-scoring sketch follows this list.
  • Scaffold decoration: Scaffold-anchored generation achieves competitive activity retention.
  • Property-guided optimization: Pair-tuning, a parameter-efficient soft-prompt adaptation over molecular pairs ordered by property, delivers superior QED/logP and activity optimization compared to state-of-the-art graph-based methods.
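The sketch referenced in the first bullet, assuming a Hugging Face-style causal LM and tokenizer for GP-MoLFormer; the decoding settings and metric definitions follow common practice rather than the papers' exact evaluation protocol.

```python
# Sketch: de novo SMILES sampling plus validity/uniqueness/novelty bookkeeping.
import torch
from rdkit import Chem

def sample_and_score(model, tok, train_set, n=1000, temperature=1.0):
    bos = torch.full((n, 1), tok.bos_token_id)
    out = model.generate(bos, do_sample=True, temperature=temperature,
                         max_length=202, pad_token_id=tok.pad_token_id)
    smiles = tok.batch_decode(out, skip_special_tokens=True)

    canon = []
    for s in smiles:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:                    # validity check via RDKit parsing
            canon.append(Chem.MolToSmiles(mol))
    unique = set(canon)
    novel = unique - set(train_set)            # not present in the pre-training corpus
    return {"validity": len(canon) / n,
            "uniqueness": len(unique) / max(len(canon), 1),
            "novelty": len(novel) / max(len(unique), 1)}
```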

Memorization and novelty analyses show duplication in training data directly impacts model output diversity. De-duplication raises novelty by 7–8 percentage points across sampling sizes (Ross et al., 4 Apr 2024). No analytic scaling law is reported for novelty versus sampling compute, but empirical relations (novelty $\propto$ generation size$^{-0.1}$) are presented.

6. Guided and Conditional Molecular Generation

Similarity-guided generation is enabled via GP-MoLFormer-Sim (Navratil et al., 5 Jun 2025), which steers the autoregressive decoding policy at test time using contextual similarity to one or more guide molecules. At each step, the cosine similarity between model hidden states for candidate continuations and corresponding prefix states from guide molecules adjusts the sampling logits:

$u'_i = \frac{1}{\tau}\left[(1-\alpha)\,u_i + \alpha\,\bar{S}_i\right]$

where $u_i$ are the original CLM logits, $\bar{S}_i$ the average cosine similarity to the guides, $\alpha$ a mixing hyperparameter, and $\tau$ a temperature. This method, incorporated as a mutation operator in a genetic algorithm, yields sample-efficient black-box molecular optimization, outperforming or matching leading training-free GA optimizers in property optimization, molecular rediscovery, and similarity-constrained design (Navratil et al., 5 Jun 2025).
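A sketch of this logit adjustment, abstracting away how the candidate-continuation hidden states and guide prefix states are obtained from the underlying model.

```python
# Sketch: similarity-guided mixing of CLM logits at one decoding step.
import torch
import torch.nn.functional as F

def guided_logits(u, cand_hidden, guide_hidden, alpha=0.5, tau=1.0):
    """u: (V,) raw CLM logits; cand_hidden: (V, d) hidden state per candidate token;
    guide_hidden: (G, d) prefix hidden states of the guide molecules."""
    # Average cosine similarity of each candidate continuation to the guides: (V,)
    sim = F.cosine_similarity(cand_hidden[:, None, :], guide_hidden[None, :, :], dim=-1)
    s_bar = sim.mean(dim=1)
    # u'_i = [(1 - alpha) * u_i + alpha * S_bar_i] / tau
    return ((1 - alpha) * u + alpha * s_bar) / tau

probs = F.softmax(guided_logits(torch.randn(32), torch.randn(32, 768),
                                torch.randn(2, 768)), dim=-1)
```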

7. Limitations, Outlook, and Comparative Perspectives

Limitations across the MoLFormer family include:

  • Geometry-native GNNs still outperform SMILES-based variants on some specialized quantum energy benchmarks (Ross et al., 2021).
  • Idiosyncratic SMILES grammar–induced biases, suggesting robustness may be improved by alternative line notations (e.g., SELFIES).
  • In motif-based graph variants, multi-scale masking and motif-lexicon choices affect performance variably by dataset scale and molecule type (Wu et al., 2021).
  • Memorization–novelty tradeoffs necessitate careful curation of pre-training data for generative deployments (Ross et al., 4 Apr 2024).

A plausible implication is that the MoLFormer paradigm, combining large-scale self-supervision, geometric and chemical inductive biases, and sequence-efficient transformers, serves as a "foundation model" approach for the computational chemistry community. This enables rapid fine-tuning and controlled sampling for virtual screening, activity cliff navigation, and property-conditional generation, subject to ongoing improvements in structure-aware pre-training and generative control (Ross et al., 2021, Peng et al., 3 Nov 2024, Navratil et al., 5 Jun 2025).
