Bemis–Murcko Scaffold Split

Updated 22 February 2026

The method defines a molecule's core framework by extracting its rings and linkers, and assigns all molecules that share a scaffold to a single fold.
It uses a greedy, scaffold size–based assignment to create balanced train, validation, and test splits, enhancing out-of-distribution evaluation.
Recent studies indicate that while the split simulates realistic OOD conditions, it may overestimate performance due to residual chemical similarities.

The Bemis–Murcko scaffold split is a widely adopted strategy for partitioning molecular datasets based on core scaffold structure, enforcing disjointness between train, validation, and test sets with respect to these scaffolds. Its principal objective is to ensure that predictive models are evaluated on their ability to generalize to unseen molecular frameworks—eschewing the artificial performance inflation typical of random splits that preserve local analogues across splits. The approach is operationalized through graph-theoretic definitions, canonical scaffold extraction (typically via the RDKit MurckoScaffold module), and assignment algorithms designed to balance dataset fractions while maintaining scaffold purity. However, recent investigations reveal both the methodological nuances and substantive limitations of this split, particularly its tendency to overestimate out-of-distribution (OOD) performance compared to chemical similarity–aware alternatives.

1. Definition and Formalization of the Bemis–Murcko Scaffold

The Bemis–Murcko scaffold, as introduced by Bemis and Murcko (1996), defines a molecule's "molecular framework" as the union of (a) all ring atoms and bonds, and (b) all acyclic linker atoms and bonds that connect two ring systems, with all terminal side chains eliminated. In the language of graph theory, a molecule is represented by an undirected attributed graph $G=(V,E)$, where $V$ denotes atoms and $E$ denotes chemical bonds.

Define:

$ring(G) = \{ v \in V \mid v\ \mathrm{lies\ on\ a\ cycle\ in}\ G \}$
$linker(G) = \{ v \in V \mid \exists u,w \in ring(G): v\ \mathrm{is\ on\ a\ shortest\ (acyclic)\ path\ between}\ u\ \mathrm{and}\ w \}$

Then, the Bemis–Murcko scaffold graph is:

$\mathcal{F}_{BM}(G) = (S, E_S), \quad S = ring(G) \cup linker(G),\quad E_S = \{ (u, v) \in E \mid u, v \in S \}$

Typically, the canonical scaffold is extracted using RDKit's MurckoScaffold algorithm, which produces a SMILES string uniquely identifying the scaffold. This core is the basis for partitioning, and molecules are grouped if they share the same canonical scaffold.
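To make the graph definition concrete, the following is a minimal sketch that recovers the atom set $S = ring(G) \cup linker(G)$ by repeatedly pruning terminal atoms. It assumes the molecule is supplied as a plain undirected graph (an atom count plus a list of bonds) rather than as an RDKit molecule, and the function name `murcko_atoms` is hypothetical. Library implementations may differ in details, for example in how exocyclic double bonds are handled.

```python
from collections import defaultdict


def murcko_atoms(num_atoms, bonds):
    """Return the atom indices that survive iterative pruning of terminal atoms.

    Assumption: the molecule is a plain undirected graph given as an atom
    count and a list of (u, v) index pairs; no chemistry is considered.
    """
    neighbors = defaultdict(set)
    for u, v in bonds:
        neighbors[u].add(v)
        neighbors[v].add(u)

    kept = set(range(num_atoms))
    # Atoms with at most one neighbor are side-chain ends (or isolated atoms).
    leaves = [a for a in kept if len(neighbors[a]) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in kept:
            continue
        kept.discard(atom)
        for nb in neighbors[atom]:
            neighbors[nb].discard(atom)
            # Removing this atom may expose a new terminal atom.
            if nb in kept and len(neighbors[nb]) <= 1:
                leaves.append(nb)
        neighbors[atom].clear()
    return kept


# Two six-membered rings (atoms 0-5 and 7-12) joined by linker atom 6,
# with a two-atom side chain (13, 14) attached to ring atom 10.
ring_a = [(i, (i + 1) % 6) for i in range(6)]
ring_b = [(7 + i, 7 + (i + 1) % 6) for i in range(6)]
bonds = ring_a + ring_b + [(0, 6), (6, 7), (10, 13), (13, 14)]
print(sorted(murcko_atoms(15, bonds)))  # rings and linker remain: atoms 0..12
```

An acyclic graph is pruned to the empty set under this sketch, since it contains no ring atoms to anchor a framework.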

2. Algorithmic Approach to the Scaffold Split

The Bemis–Murcko scaffold split proceeds by assigning all molecules sharing an exact scaffold to the same fold, preventing scaffold overlap across train, validation, and test sets. The canonical implementation involves:

Extracting canonical scaffolds for all molecules.
Grouping molecules by their scaffold identity.
Sorting scaffolds in descending order of group size.
Greedily assigning each scaffold group to the train, validation, or test fold so as to balance molecule counts and achieve predetermined ratios (typically 80/10/10).
Keeping each scaffold's molecules together in one fold, and continuing assignment until the fold capacities are filled.
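The steps above can be sketched in a few lines of Python. This is one plausible capacity-aware variant, not necessarily the exact procedure used in the cited work: scaffold identifiers are assumed to be precomputed strings (for instance canonical scaffold SMILES), and the function name `scaffold_split` and its default fractions are illustrative.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that each scaffold group stays in one fold.

    scaffolds: one scaffold identifier per molecule (assumed precomputed).
    Returns three lists of molecule indices: train, valid, test.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first occurrence so the
    # result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = frac_valid * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)  # whatever fits nowhere else goes to test
    return train, valid, test


scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8] [9]
```

Because whole groups are moved at once, the achieved fractions only approximate the targets; a single very large scaffold group can push a fold well away from its nominal share.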

As an illustration, formal pseudocode for $K$-fold stratified assignment is:

Initialize $T_k \leftarrow \varnothing$, $k = 1,\dots,K$.
For each scaffold $s$ (ordered by decreasing $|G[s]|$, where $G[s]$ is the set of molecules with scaffold $s$):
- $k^* \leftarrow \arg\min_{k} |T_k|$
- $T_{k^*} \leftarrow T_{k^*} \cup G[s]$
In cross-validation, the test set of fold $i$ is $T_i = \bigcup_{s:c(s)=i} G[s]$, where $c(s)$ denotes the fold index assigned to scaffold $s$; the training set is the complement.

This mechanism guarantees strict scaffold disjointness between folds but does not ensure chemical similarity separation beyond the core. For medium-scale datasets, descending-size allocation mitigates micro-folds and degenerate splits (Wu et al., 23 Jan 2026).
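A minimal Python rendering of the $K$-fold assignment, together with a check of the disjointness guarantee, might look as follows. The function names are hypothetical and scaffold identifiers are assumed to be precomputed strings; ties in fold size are broken by taking the lowest fold index, which is an arbitrary choice.

```python
from collections import defaultdict


def scaffold_kfold(scaffolds, k):
    """Assign scaffold groups to k folds, always filling the smallest fold."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Decreasing group size, ties broken by first occurrence.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    folds = [[] for _ in range(k)]
    for group in ordered:
        # argmin over current fold sizes (lowest index wins ties)
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(group)
    return folds


def scaffolds_are_disjoint(scaffolds, folds):
    """True if no scaffold identifier occurs in more than one fold."""
    seen = set()
    for fold in folds:
        fold_scaffolds = {scaffolds[i] for i in fold}
        if seen & fold_scaffolds:
            return False
        seen |= fold_scaffolds
    return True


scaffolds = list("AAAABBBCCDE")
folds = scaffold_kfold(scaffolds, k=3)
print([len(f) for f in folds])  # [4, 4, 3]
print(scaffolds_are_disjoint(scaffolds, folds))  # True
```

The check only confirms that exact scaffold identifiers do not repeat across folds; it says nothing about how similar molecules in different folds are to one another.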

3. Statistical Properties and OOD Hardness

Empirical evaluation of scaffold splits—exemplified on datasets comprising $\sim 1,800$ unique compounds—shows that the approach achieves near-target molecule fractions (train/val/test $\approx 80\%/10\%/10\%$ ) and ensures that no molecules with the same scaffold exist in multiple folds.

Statistical analysis includes characterizing OOD "hardness" via maximum ECFP4–Tanimoto similarity (maxSim) between molecules in the validation/test folds and any molecule in training. Reported median maxSim values for the test set range from 0.38 to 0.42, with wider IQRs than random splits, indicating increased (but not maximal) OOD character. In typical cheminformatic datasets, most scaffolds are singletons; a minority have small group sizes (2–10) (Wu et al., 23 Jan 2026).

Split	Train (n)	Val (n)	Test (n)	Test maxSim (median / IQR)
A	1526	162	164	0.41 / 0.27
B	1548	151	153	0.42 / 0.20
C	1488	193	171	0.38 / 0.21
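The maxSim statistic can be illustrated with a small sketch. Here fingerprints are assumed to be represented as Python sets of on-bit indices (in practice they would come from an ECFP4 generator in a cheminformatics toolkit), and the toy fingerprints, function names, and the choice of the "inclusive" quantile method are assumptions made for illustration only.

```python
from statistics import median, quantiles


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    # Convention assumed here: two empty fingerprints have similarity 0.
    return len(a & b) / union if union else 0.0


def max_sim_to_train(test_fps, train_fps):
    """For each test fingerprint, its highest similarity to any training one."""
    return [max(tanimoto(fp, t) for t in train_fps) for fp in test_fps]


# Toy fingerprints, not derived from real molecules.
train_fps = [{1, 2, 3, 4}, {5, 6}]
test_fps = [{1, 2, 3}, {5, 6, 7, 8}, {9}]

max_sims = max_sim_to_train(test_fps, train_fps)
q1, _, q3 = quantiles(max_sims, n=4, method="inclusive")
print(max_sims)          # [0.75, 0.5, 0.0]
print(median(max_sims))  # 0.5
print(q3 - q1)           # interquartile range
```

A lower median maxSim indicates that test molecules sit further from the training set, which is the sense in which the table's values quantify OOD hardness.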

4. Impact on Model Evaluation and Performance Metrics

Scaffold-disjoint splits present a more challenging OOD scenario than random splits, yielding lower chemical similarity between train and test sets and more realistic estimates of model generalizability. Standard performance metrics include:

Hit Rate (HR):

$\mathrm{HR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \times 100\%$

Matthews Correlation Coefficient (MCC):

$\mathrm{MCC} = \frac{\mathrm{TP}\,\mathrm{TN} - \mathrm{FP}\,\mathrm{FN}} {\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$

ROC AUC: Area under the Receiver-Operating-Characteristic curve.
PR AUC: Area under the Precision–Recall curve.
RMSE:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n\bigl(y_i - \hat{y}_i\bigr)^2}$
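The closed-form metrics above translate directly into code. The sketch below uses only the standard library; the function names are hypothetical, and returning 0 when a denominator vanishes is a common convention assumed here rather than something specified in the source.

```python
import math


def hit_rate(tp, fp):
    """Hit rate in percent: the share of predicted actives that are true actives."""
    predicted_pos = tp + fp
    return 100.0 * tp / predicted_pos if predicted_pos else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative counts, not taken from any reported experiment.
print(hit_rate(tp=6, fp=2))           # 75.0
print(mcc(tp=6, tn=8, fp=2, fn=4))    # 1/sqrt(6), roughly 0.408
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

For the example counts, $\mathrm{MCC} = (6 \cdot 8 - 2 \cdot 4)/\sqrt{8 \cdot 10 \cdot 10 \cdot 12} = 40/\sqrt{9600} = 1/\sqrt{6}$.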

Empirical results on NCI-60 datasets with three model families (Linear Regression, Random Forest, and GEM) demonstrate that the scaffold split produces substantially higher hit rates and ROC AUC values (median HR ≈ 68%, ROC AUC ≈ 0.61 for Random Forest) than chemically aware alternatives such as Butina (HR ≈ 62%, ROC AUC ≈ 0.56) or UMAP clustering splits (HR ≈ 2–12%, ROC AUC ≈ 0.52–0.58). Statistical tests (Wilcoxon, Cliff’s delta) confirm large, significant effects—indicating that scaffold splits can substantially overestimate model performance (Guo et al., 2024).

5. Limitations and Comparison with Alternative Splitting Strategies

The critical limitation of the Bemis–Murcko split is the potential for “scaffold leakage” due to core similarities between distinct scaffolds—e.g., benzene and pyridine being treated as disjoint but chemically similar. As a result, molecules in test folds can maintain high similarity to those in training, especially when substituents are ignored by construction. This can yield artificially optimistic generalization estimates relative to real-world prospective virtual screening and property modeling (Guo et al., 2024).

To address this, alternative splitting schemes have been introduced:

Butina Clustering Split: Groups molecules via Tanimoto similarity on Morgan fingerprints, then folds clusters to balance split sizes.
UMAP Clustering Split: Embeds molecular fingerprints into low-dimensional space with UMAP, partitions via $k$ -means to enforce global chemical dissimilarity.
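To show how a similarity-based grouping differs from exact scaffold matching, here is a simplified Butina-style (sphere-exclusion) clustering sketch. It is not the implementation used in the cited study: fingerprints are assumed to be sets of on-bit indices, the cutoff value and the name `butina_like_clusters` are illustrative, and the all-pairs loop is only practical for small datasets.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_like_clusters(fps, sim_cutoff=0.5):
    """Cluster fingerprints: densest unassigned point becomes the next centroid."""
    n = len(fps)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= sim_cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Visit candidates from most to fewest neighbors (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))

    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        # The centroid claims every neighbor not already taken by a cluster.
        members = [centroid] + sorted(neighbors[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints, not derived from real molecules.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11}, {10, 11, 12}, {20}]
print(butina_like_clusters(fps))  # [[0, 1, 2], [3, 4], [5]]
```

The resulting clusters can then be distributed over folds in the same greedy, size-balanced way as scaffold groups, so that similar molecules with different scaffolds end up on the same side of the split.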

Results under these splits show a marked increase in OOD hardness. For example, the median hit rate of Random Forest models drops from about 68% under the scaffold split to as low as 2% under the UMAP split, and the differences are statistically significant, which exposes the optimistic bias of scaffold-based evaluation.

6. Practical Implementation and Reproducibility

Leading implementations rely on cheminformatics libraries such as RDKit for scaffold extraction and SMILES canonicalization. Key steps involve:

Molecule sanitization, salt/solvent stripping, and removal of explicit hydrogens.

Extracting Bemis–Murcko scaffolds with:

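from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# mol is assumed to be a sanitized RDKit Mol, e.g. from Chem.MolFromSmiles(smiles)
scaffold_mol = MurckoScaffold.GetScaffoldForMol(mol)
scaffold_smiles = Chem.MolToSmiles(scaffold_mol, canonical=True)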

Capacity-aware, greedy assignment of scaffold groups to folds, often persisting mapping for experiment reproducibility.
Released codebases and datasets export fold_maps as CSV or dedicated dataset fields for direct integration with machine learning pipelines (Wu et al., 23 Jan 2026).
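Persisting the fold assignment can be as simple as writing one CSV row per molecule. The sketch below uses only the standard library; the column names (`smiles`, `scaffold`, `fold`) and helper names are hypothetical and do not reflect the layout of any released dataset.

```python
import csv
import io

# Hypothetical column layout for a fold map.
FIELDS = ["smiles", "scaffold", "fold"]


def write_fold_map(handle, records):
    """Write fold-map records (dicts keyed by FIELDS) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


def read_fold_map(handle):
    """Read a fold map back as a {smiles: fold} dictionary."""
    return {row["smiles"]: row["fold"] for row in csv.DictReader(handle)}


records = [
    {"smiles": "Cc1ccccc1", "scaffold": "c1ccccc1", "fold": "train"},
    {"smiles": "CCc1ccccc1", "scaffold": "c1ccccc1", "fold": "train"},
    {"smiles": "Cc1ccncc1", "scaffold": "c1ccncc1", "fold": "test"},
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_fold_map(buffer, records)
buffer.seek(0)
print(read_fold_map(buffer))
```

Storing the scaffold next to each molecule makes it easy to re-verify later that no scaffold appears in more than one fold.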

7. Recommendations and Best Practices

Recent findings indicate that exclusive reliance on Bemis–Murcko scaffold splits can mislead benchmarking and model selection, especially for tasks demanding strict OOD generalization. Recommendations include:

Avoiding sole use of scaffold splits—supplementing with chemical similarity–aware or manifold-learning–based clustering approaches (e.g., UMAP + k-means).
For early-recognition virtual screening (VS) tasks, prioritizing hit rate (HR) and MCC as primary evaluation metrics.
Validating models under multiple split regimes to better assess true generalization capacity to chemically novel regions (Guo et al., 2024).

A plausible implication is that as virtual screening libraries grow in chemical diversity (often $>10^{20}$ compounds), the limitations of scaffold splits will become increasingly consequential, necessitating more sophisticated, chemically informed partitioning strategies.


References (2)

Safe Multitask Molecular Graph Networks for Vapor Pressure and Odor Threshold Prediction (2026)

Scaffold Splits Overestimate Virtual Screening Performance (2024)
