ChemBERTa: Chemical Transformer Models
- ChemBERTa is a family of transformer models pre-trained on SMILES strings for accurate molecular property prediction.
- It leverages a RoBERTa-style architecture with self-supervised objectives like MLM and MTR on large chemical datasets.
- Applications include property prediction, activity regression, and cross-modal alignment, with performance competitive with or superior to traditional methods on standard benchmarks.
ChemBERTa is a family of large-scale transformer-based models pre-trained on chemical representations, specifically SMILES (Simplified Molecular Input Line Entry System) strings, for molecular property prediction. The family spans several variants and extensions, including ChemBERTa-2; all leverage self-supervised learning objectives adapted from natural language processing to encode molecular structure and activity information into vector representations. ChemBERTa models have been demonstrated as effective chemical foundation models, supporting downstream tasks such as property prediction, activity regression, cross-modal representation alignment, and virtual screening, often at a scale and breadth not attainable with traditional graph-based or fingerprint-based methods (Chithrananda et al., 2020, Ahmad et al., 2022, Zeng, 3 Dec 2025, Wang et al., 23 Jan 2026).
1. Model Architecture and Pretraining
ChemBERTa adopts the RoBERTa-style transformer encoder as its architectural backbone. The typical “Base” configuration employs 12 transformer encoder layers, each with hidden size 768, 12 self-attention heads (64 dimensions per head), and a position-wise feed-forward layer of size 3072, as in RoBERTa-Base. Layer normalization and dropout (rate 0.1) are incorporated throughout. The parameter count ranges up to approximately 77 million (Zeng, 3 Dec 2025, Ahmad et al., 2022). Earlier versions used 6 layers and approximately 40–50 million parameters (Chithrananda et al., 2020).
ChemBERTa-2 builds directly on the HuggingFace RoBERTa framework and applies a SMILES-specific vocabulary (e.g., a 591-token alphabet). Input SMILES strings are padded or truncated to a maximum of 512 tokens (Ahmad et al., 2022, Wang et al., 23 Jan 2026).
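As a rough sanity check on these sizes, the parameter count of such an encoder can be approximated from the configuration alone. The sketch below assumes a RoBERTa-Base-style configuration (hidden size 768, feed-forward size 3072, a 591-token vocabulary, 512 positions) and ignores biases, LayerNorm, and task heads, so it is indicative rather than exact:

```python
def approx_params(n_layers, d_model, d_ffn, vocab_size, max_len=512):
    """Crude transformer-encoder parameter count (biases/LayerNorm omitted)."""
    embeddings = vocab_size * d_model + max_len * d_model  # token + position tables
    attention = 4 * d_model * d_model                      # Q, K, V, output projections
    ffn = 2 * d_model * d_ffn                              # up- and down-projection
    return embeddings + n_layers * (attention + ffn)
```

With 12 layers this gives about 86M parameters and with 6 layers about 43M, consistent in order of magnitude with the reported ~77M and 40–50M figures.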
Pretraining Objectives
Two principal self-supervised pretraining objectives are used:
- Masked Language Modeling (MLM)
Random masking of 15% of SMILES tokens within a sequence is performed, and the model learns to predict each masked token given the unmasked context:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta\left(x_i \mid x_{\setminus i}\right)$$

where $\mathcal{M}$ contains the masked positions, $x_{\setminus i}$ is the input with token $i$ masked, and $p_\theta$ is the softmax over the model’s vocabulary (Chithrananda et al., 2020, Ahmad et al., 2022, Zeng, 3 Dec 2025).
- Masked Token Regression (MTR) / Multi-Task Regression
In MTR, continuous-valued molecular properties or learned token embeddings form the regression targets:

$$\mathcal{L}_{\mathrm{MTR}} = \sum_{i \in \mathcal{M}} \left\lVert h_i - e(x_i) \right\rVert^2$$

with $h_i$ the transformer output at masked position $i$ and $e(x_i)$ the embedding of the correct token (Zeng, 3 Dec 2025). In multi-task settings, up to $K = 200$ properties are regressed per molecule:

$$\mathcal{L}_{\mathrm{multi}} = \sum_{k=1}^{K} \left(\hat{y}_k - y_k\right)^2$$

where $\hat{y}_k$ is the predicted value of property $k$ (Ahmad et al., 2022).
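The two objectives can be sketched in framework-free toy Python (illustrative only, not the actual ChemBERTa training code):

```python
import math
import random

def mlm_loss(logits, targets, masked_positions):
    """MLM: mean cross-entropy over the masked positions only."""
    total = 0.0
    for i in masked_positions:
        z = logits[i]                                   # scores over the vocabulary
        m = max(z)
        log_norm = m + math.log(sum(math.exp(s - m) for s in z))
        total += log_norm - z[targets[i]]               # -log softmax at the true token
    return total / len(masked_positions)

def mask_tokens(token_ids, mask_id, rate=0.15, seed=0):
    """Randomly replace ~15% of token ids with the mask id."""
    rng = random.Random(seed)
    masked = list(token_ids)
    positions = [i for i in range(len(masked)) if rng.random() < rate]
    for i in positions:
        masked[i] = mask_id
    return masked, positions

def mtr_loss(outputs, embeddings, masked_positions):
    """MTR: mean squared error between outputs and true-token embeddings."""
    total = 0.0
    for i in masked_positions:
        total += sum((o - e) ** 2 for o, e in zip(outputs[i], embeddings[i]))
    return total / len(masked_positions)

def multitask_loss(preds, targets):
    """Multi-task regression: summed squared error over K property heads."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))
```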
2. Pretraining Data and Tokenization
ChemBERTa models are pre-trained on large-scale unlabeled chemical datasets. The principal source is PubChem (2019 snapshot), yielding approximately 77 million unique, canonical SMILES strings after deduplication and standardization. Subsets as small as 5 million and as large as the full 77 million are used for scaling experiments (Chithrananda et al., 2020, Ahmad et al., 2022, Zeng, 3 Dec 2025).
Tokenization strategies include:
- Character-level SMILES tokenization (591 tokens) for ChemBERTa-2 (Ahmad et al., 2022)
- Byte-Pair Encoding (BPE) and custom regex-based schemes for earlier variants (Chithrananda et al., 2020)
- Special tokens such as [CLS], [SEP], <mask>, <pad>, and <unk> are included
Sequences exceeding the maximum length are truncated or padded to 512 tokens (Chithrananda et al., 2020, Ahmad et al., 2022).
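For illustration, a regex-based SMILES tokenizer of the kind used for earlier variants can be sketched as follows; the pattern here is a common atom-level one and does not reproduce ChemBERTa's released vocabularies:

```python
import re

# Illustrative atom-level SMILES pattern: bracket atoms, two-letter elements,
# single-letter (possibly aromatic) atoms, then bonds, branches, and ring digits.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]|[=#\-\+\\\/%\(\)\.0-9@])"
)

def tokenize_smiles(smiles, max_len=512):
    """Split a SMILES string into tokens and truncate to the model limit."""
    return SMILES_PATTERN.findall(smiles)[:max_len]
```

For example, `tokenize_smiles("CC(=O)O")` splits acetic acid into atom, branch, and bond tokens rather than raw characters.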
3. Downstream Applications and Fine-Tuning
ChemBERTa models are adapted to a range of molecular property prediction tasks through fine-tuning, with demonstrated applications including:
- MoleculeNet benchmark regression and classification tasks: Properties such as BACE, Clearance, Delaney, Lipophilicity (Lipo), BBBP, ClinTox, HIV, and Tox21 SR-p53, using scaffold splits and Optuna-tuned training. Models are competitive with, or outperform, D-MPNN and other graph-based baselines on 6/8 tasks (Ahmad et al., 2022).
- Activity regression and virtual screening: For TDP1 inhibitory potency, ChemBERTa predicts pIC₅₀ values from SMILES with a single linear regression head, utilizing a weighted MSE loss:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} w_i \left(\hat{y}_i - y_i\right)^2$$

where $w_i$ is a class-imbalance-correcting weight (Zeng, 3 Dec 2025).
- Cross-modal alignment: ChemBERTa embeddings serve as molecular targets for geometric alignment of other modalities, such as tandem mass spectrometry spectra, enabling retrieval and structure identification via cosine similarity in latent space (Wang et al., 23 Jan 2026).
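The weighted-MSE idea from the TDP1 setting above can be sketched as follows; the inverse-class-frequency weighting shown is one hypothetical choice, as the exact scheme is not specified here:

```python
def weighted_mse(preds, targets, weights):
    """Mean weighted squared error; weights up-weight the rare active class."""
    n = len(preds)
    return sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights)) / n

def inverse_frequency_weights(is_active, active_frac):
    """Hypothetical inverse-class-frequency weights, normalized to mean 1."""
    raw = [1.0 / active_frac if a else 1.0 / (1.0 - active_frac) for a in is_active]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]
```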
4. Performance Benchmarks and Empirical Findings
Performance is extensively evaluated with both standard regression/classification metrics and domain-specific virtual screening criteria:
Enrichment and Precision Metrics in Virtual Screening
- Enrichment Factor at 1% (EF@1%) and Precision@1% quantify performance in identifying actives among top-ranked compounds. In TDP1 activity prediction (Zeng, 3 Dec 2025):
| Model | EF@1% | Precision@1% |
|---|---|---|
| ChemBERTa-MTR | 17.4 | 37.4% |
| Random Forest | 21.5 | 46.0% |
| Random Predictor | 1.0 | 2.1% |
ChemBERTa-MTR achieves substantial improvements over random and approaches Random Forest performance.
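These screening metrics are straightforward to compute from ranked predictions; a minimal sketch:

```python
def enrichment_metrics(scores, labels, frac=0.01):
    """EF@frac and Precision@frac for ranked virtual-screening output.

    scores: predicted activity scores (higher = more likely active)
    labels: binary ground-truth activity (1 = active)
    """
    n = len(scores)
    k = max(1, int(n * frac))                       # size of the top-ranked fraction
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    precision = sum(labels[i] for i in top) / k     # Precision@frac
    base_rate = sum(labels) / n                     # overall active fraction
    return precision / base_rate, precision         # EF@frac, Precision@frac
```

A random predictor has EF@1% ≈ 1 by construction, which is why the table's random baseline sits near 1.0.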
MoleculeNet Benchmark Results
ChemBERTa-2, pre-trained on 77M compounds, achieves:
- Competitive or superior RMSE and ROC-AUC compared to D-MPNN and other baselines for several MoleculeNet tasks. For example, ChemBERTa-2 (MTR-77M) achieves Lipo RMSE of 0.798 and HIV ROC-AUC of 0.799 (Ahmad et al., 2022).
- Scaling the pretraining set from 5M to 77M compounds yields a 25–35% reduction in pretraining loss and a 5–10% improvement in RMSE or ROC-AUC on select downstream tasks.
Cross-Modal Alignment
Freezing ChemBERTa as a molecular encoder and projecting other modality embeddings (such as mass spectra) into its latent space enables retrieval by cosine similarity, improving Recall@1 by 20–25 points relative to end-to-end neural approaches (Wang et al., 23 Jan 2026).
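Retrieval in this setup reduces to nearest-neighbor search under cosine similarity in the frozen ChemBERTa space; a minimal sketch (toy vectors standing in for real embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_embedding, library_embeddings):
    """Index of the library molecule most similar to the projected query."""
    return max(range(len(library_embeddings)),
               key=lambda i: cosine(query_embedding, library_embeddings[i]))
```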
5. Pretraining and Transfer Objective Ablations
Evaluations consistently find that Masked Token Regression (MTR) outperforms Masked Language Modeling (MLM) for continuous property or affinity prediction, both in standalone benchmarks (e.g., TDP1 virtual screening) and broad MoleculeNet evaluations (Zeng, 3 Dec 2025, Ahmad et al., 2022). However, MLM remains useful for architecture search due to faster throughput.
Sample weighting to correct for severe activity imbalance shows superior performance relative to naïve oversampling approaches (e.g., EF@1% = 17.4 for weighting versus 11.1 for oversampling; Precision@1% similarly improved) (Zeng, 3 Dec 2025).
Pretraining alignment impacts transfer: certain endpoints such as lipophilicity benefit linearly from improved pretraining, while others exhibit early saturation. MTR’s advantages are especially pronounced in metrics coupled to the “early enrichment” regime (i.e., success among top-ranked predictions).
6. Implementation, Scaling, and Resource Sharing
Implementation relies on the HuggingFace Transformers ecosystem for encoder architectures and tokenization (Chithrananda et al., 2020). Pretraining and fine-tuning optimization use AdamW with linear learning-rate decay and warmup; batch sizes range from 32 (fine-tuning under heavy imbalance) to 256 (pretraining). Early stopping and Optuna Bayesian optimization are applied for hyperparameter search.
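The warmup-then-linear-decay schedule mentioned above can be expressed as a simple function of the step index (a generic sketch, not the exact hyperparameters used):

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```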
Pretraining with 77M compounds requires multi-GPU resources, converging in ≈5 days on 4×T4 nodes for ChemBERTa-2. Comprehensive pretrained checkpoints and tokenizers are made openly available, within the constraints of dual-use risk assessment (Chithrananda et al., 2020, Ahmad et al., 2022).
7. Extensions, Cross-Modal Applications, and Future Directions
The frozen ChemBERTa latent space serves as a “chemical manifold” for downstream alignment by other modalities, as exemplified by SpecBridge, which projects mass spectrometry spectra into ChemBERTa space via a lightweight adapter. This approach enables rapid structure retrieval and demonstrates increased stability and retrieval accuracy versus fully end-to-end contrastive models (Wang et al., 23 Jan 2026).
Suggested future extensions include integration of uncertainty quantification for higher-confidence predictions, application to multi-target or polypharmacology regression, and coupling ChemBERTa representations with generative chemistry frameworks for closed-loop molecular design (Zeng, 3 Dec 2025). The prospect of hybrid string+graph objectives and benchmarking against emerging graph-based chemical foundation models are identified as important lines of investigation (Ahmad et al., 2022).
ChemBERTa’s approach—large-scale, self-supervised pretraining on SMILES, with flexibility in pretraining objectives and demonstrated robustness across property prediction, virtual screening, and cross-modal retrieval—positions it as a central chemical LLM and a practical foundation for both computational drug discovery and molecular informatics.