Bemis–Murcko Scaffold Split Analysis

Updated 30 January 2026

The Bemis–Murcko scaffold split is a protocol that partitions molecules by their core structural frameworks, with the aim of out-of-distribution (OOD) evaluation.
The method employs capacity-aware and balanced k-fold algorithms to allocate entire scaffold groups into distinct training, validation, and test sets.
Empirical findings indicate that while the split limits direct scaffold overlap, high inter-fold similarity can still lead to overestimated model performance in virtual screening.

The Bemis–Murcko scaffold split is a protocol in cheminformatics and machine learning for molecular property prediction, wherein molecules are grouped and partitioned strictly according to their core structural frameworks—Bemis–Murcko scaffolds—such that no scaffold present in the training set appears in validation or testing. This split is widely used as an out-of-distribution (OOD) benchmark for evaluating model generalization to unseen chemical motifs. Despite its popularity, recent studies highlight limitations and empirical pitfalls in OOD evaluation and prospective performance, especially in virtual screening scenarios.

1. Formal Definition of the Bemis–Murcko Scaffold

A Bemis–Murcko scaffold is formally defined on the bond-atom graph representation of a molecule $G=(V, E)$, where $V$ is the set of atoms and $E$ is the set of covalent bonds. The scaffold extraction proceeds as follows:

Ring Atom Identification: Let $R=\{v\in V\mid v \text{ lies on at least one simple cycle in } G\}$. These are all atoms in rings.
Linker Atom Identification: For every unordered pair of distinct ring atoms $u,w\in R$, let $P_{u,w}$ be one of the shortest simple paths in $G$ from $u$ to $w$. Define $L=\bigcup_{u,w\in R}(V(P_{u,w}) \setminus \{u,w\})$. The non-ring atoms in $L$ are the linkers, i.e., the atoms on minimal paths connecting rings.
Scaffold Subgraph: The Bemis–Murcko scaffold $S(G)$ is the induced subgraph $S(G)=G[R\cup L]$ with edge set $E_S=\{(u,v)\in E\mid u,v\in R\cup L\}$. Equivalently, the scaffold is what remains after iteratively stripping acyclic “leaf” atoms until none are left. Tautomer and stereochemistry handling is an optional, separate standardization step.
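
The leaf-stripping view of this definition can be sketched without a chemistry toolkit. The following is a minimal illustration, assuming the molecule is given as an atom count plus a list of bonds; the helper names `adjacency_from_bonds` and `scaffold_atoms` are hypothetical and not part of RDKit. It ignores bond orders and atom types, so it does not reproduce any toolkit-specific conventions.

```python
def adjacency_from_bonds(n_atoms, bonds):
    """Build an undirected adjacency map {atom: set(neighbors)}."""
    adj = {i: set() for i in range(n_atoms)}
    for u, v in bonds:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def scaffold_atoms(adjacency):
    """Return the atoms left after repeatedly deleting atoms of degree <= 1.

    For a connected molecular graph, the survivors are the ring atoms plus
    the linker atoms between rings; an acyclic molecule yields an empty set.
    """
    adj = {a: set(nbrs) for a, nbrs in adjacency.items()}  # working copy
    stack = [a for a, nbrs in adj.items() if len(nbrs) <= 1]
    while stack:
        a = stack.pop()
        if a not in adj:  # already removed via an earlier stack entry
            continue
        for b in adj.pop(a):
            adj[b].discard(a)
            if len(adj[b]) <= 1:  # neighbor has become a leaf
                stack.append(b)
    return set(adj)


# Toy graph: two six-membered rings (atoms 0-5 and 7-12) joined by linker
# atom 6, plus a two-atom side chain (13, 14) attached to ring atom 3.
ring_a = [(i, (i + 1) % 6) for i in range(6)]
ring_b = [(7 + i, 7 + (i + 1) % 6) for i in range(6)]
bonds = ring_a + ring_b + [(0, 6), (6, 7), (3, 13), (13, 14)]
core = scaffold_atoms(adjacency_from_bonds(15, bonds))
```

In this toy graph the side-chain atoms 13 and 14 are stripped, while both rings and the linker atom 6 are kept.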

In practice, RDKit’s MurckoScaffold.GetScaffoldForMol function is used to extract scaffolds, which are then written out as canonical SMILES; no tautomer or stereochemistry standardization is applied by default (Guo et al., 2024).

2. Scaffold Grouping and Split Algorithms

The protocol assigns molecules to folds by equivalence classes defined by their scaffold SMILES, enforcing zero overlap of scaffolds between partitions:

Capacity-aware Split (80/10/10): For a dataset of $N$ molecules, molecules are grouped by extracted scaffold SMILES. Scaffold groups are sorted by descending size and allocated greedily to training, validation, or test to approximately achieve target ratios $N_{\rm train} = 0.8N$, $N_{\rm val} = 0.1N$, $N_{\rm test} = N - N_{\rm train} - N_{\rm val}$. Entire scaffold groups are forced into a single fold to ensure strict OOD evaluation (Wu et al., 23 Jan 2026).
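
One way the greedy allocation might look is sketched below. The source does not spell out the exact tie-breaking or overflow rule, so the rule used here (fill training first, then validation, and send whatever fits neither to test) is an assumption, and the function name `capacity_aware_split` is hypothetical.

```python
def capacity_aware_split(scaffold_groups, frac_train=0.8, frac_val=0.1):
    """Greedily assign whole scaffold groups to train/val/test.

    scaffold_groups maps a scaffold SMILES to a list of molecule IDs.
    Assumed rule: a group goes to train if it fits under the train
    capacity, else to val if it fits there, else to test.
    """
    n_total = sum(len(ids) for ids in scaffold_groups.values())
    train_cap = frac_train * n_total
    val_cap = frac_val * n_total
    train, val, test = [], [], []
    # Largest groups first; ties are broken by scaffold key for determinism.
    ordered = sorted(scaffold_groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, ids in ordered:
        if len(train) + len(ids) <= train_cap:
            train.extend(ids)
        elif len(val) + len(ids) <= val_cap:
            val.extend(ids)
        else:
            test.extend(ids)
    return train, val, test


# Toy input: ten molecules spread over four scaffold groups.
groups = {
    "c1ccccc1": [0, 1, 2, 3, 4, 5],
    "c1ccncc1": [6, 7],
    "C1CCCCC1": [8],
    "c1ccoc1": [9],
}
train_ids, val_ids, test_ids = capacity_aware_split(groups)
```

Because groups are never divided, the realized ratios only approximate 80/10/10; a single very large scaffold group can push a fold well away from its target.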

Balanced k-Fold Assignment: For k-fold cross-validation, scaffold groups are sorted by descending size and each is assigned to the fold with the smallest current molecule count, which balances fold sizes. Example (RDKit):

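from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# all_smiles: list of input SMILES strings, supplied by the caller
k = 7  # number of folds
scaffold_map = {}  # scaffold SMILES -> list of molecule IDs
for mol_id, smiles in enumerate(all_smiles):
    m = Chem.MolFromSmiles(smiles)
    if m is None:  # skip SMILES that RDKit cannot parse
        continue
    scaf = MurckoScaffold.GetScaffoldForMol(m)
    # non-isomeric SMILES, so stereoisomers share one scaffold key
    s_smiles = Chem.MolToSmiles(scaf, isomericSmiles=False)
    scaffold_map.setdefault(s_smiles, []).append(mol_id)

# assign scaffold groups, largest first, to the currently smallest fold
folds = [[] for _ in range(k)]
fold_sizes = [0] * k
items = sorted(scaffold_map.items(), key=lambda x: -len(x[1]))
for s_smiles, mol_ids in items:
    i = fold_sizes.index(min(fold_sizes))  # argmin over fold sizes
    folds[i].extend(mol_ids)
    fold_sizes[i] += len(mol_ids)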

(Guo et al., 2024).
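
Whichever assignment scheme is used, the zero-overlap property can be verified after the fact. The check below is a small sketch, assuming fold membership is available as lists of molecule IDs and that a mapping from molecule ID to scaffold SMILES has been kept; `scaffold_overlap` is a hypothetical helper name.

```python
def scaffold_overlap(fold_ids, scaffold_of):
    """Return the set of scaffolds that occur in more than one fold.

    fold_ids: list of folds, each a list of molecule IDs.
    scaffold_of: dict mapping molecule ID -> scaffold SMILES.
    """
    first_fold = {}  # scaffold -> index of the first fold it was seen in
    leaked = set()
    for fold_index, mol_ids in enumerate(fold_ids):
        for mol_id in mol_ids:
            scaf = scaffold_of[mol_id]
            if first_fold.setdefault(scaf, fold_index) != fold_index:
                leaked.add(scaf)
    return leaked


scaffold_of = {0: "c1ccccc1", 1: "c1ccccc1", 2: "c1ccncc1", 3: "C1CCCCC1"}
clean = scaffold_overlap([[0, 1], [2], [3]], scaffold_of)
leaky = scaffold_overlap([[0], [1, 2], [3]], scaffold_of)
```

An empty result means every scaffold group sits in exactly one fold; any returned scaffold indicates a group that was divided across folds.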

3. Motivations and Rationale in Model Evaluation

Scaffold splits are motivated by the desire for OOD evaluation: random splits intermingle similar compounds in train/test, overestimating generalization. Partitioning by Bemis–Murcko framework is hypothesized to represent realistic prospective scenarios, e.g., predicting properties for novel chemotypes in drug discovery.

An additional check reported alongside the split is that no molecule in validation/test violates chronological order with respect to its training analogues, which prevents “future-to-past” leakage (Wu et al., 23 Jan 2026). Scaffold grouping alone does not guarantee this property.

4. Empirical Findings and Limitations

Recent work demonstrates scaffold splits can themselves overestimate model performance, particularly in virtual screening:

On NCI-60 datasets with 30,000–50,000 molecules, scaffold splits led to much higher hit rates and ROC AUC for logistic regression (LR), random forest (RF), and gradient boosting (GEM) than UMAP clustering or Butina clustering splits:

| Split Type | LR Hit Rate | RF Hit Rate | GEM Hit Rate |
|------------|-------------|-------------|--------------|
| Scaffold | 78.8% | 80.2% | 75.2% |
| Butina | 10.0% | 57.9% | 45.6% |
| UMAP | 0.0% | 0.0% | 11.9% |

(Guo et al., 2024)

Even when their scaffolds differ, molecules across train/test may exhibit high Tanimoto similarity (e.g., analogues whose scaffolds differ only by a benzene-to-pyridine ring swap), undermining the intended OOD separation.
In molecular property prediction (e.g., vapor pressure, odor threshold), median max-similarity (ECFP4 Tanimoto) between validation molecules and training examples is ∼0.37–0.42 under scaffold split, with a wide IQR and a long tail (Wu et al., 23 Jan 2026).
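
The max-similarity diagnostic can be illustrated with a small sketch. It assumes each fingerprint is represented as a Python set of on-bit indices; the toy sets below stand in for real ECFP4 fingerprints, which would normally come from a cheminformatics toolkit. The function names are hypothetical.

```python
from statistics import median


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def max_similarity_to_train(query_fps, train_fps):
    """For each query fingerprint, the highest similarity to any training one."""
    return [max(tanimoto(q, t) for t in train_fps) for q in query_fps]


# Toy fingerprints (sets of on-bits), not real ECFP4 output.
train_fps = [{1, 2, 3, 4}, {5, 6, 7}]
val_fps = [{1, 2, 3, 9}, {5, 6}, {10, 11}]
max_sims = max_similarity_to_train(val_fps, train_fps)
median_max_sim = median(max_sims)
```

Reporting the median together with the spread of `max_sims` shows how many validation molecules still have a close training neighbor despite having a different scaffold.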

A plausible implication is that scaffold splits, while stricter than random splits, are insufficient for rigorous OOD benchmarking in chemically diverse libraries.

5. Implementation Details and Best Practices

RDKit’s MurckoScaffold is the de facto tool for scaffold extraction, frequently used with canonical SMILES representations. Key practices include:

Molecules are sanitized before scaffold extraction, which removes side-chain substituents.
Equivalence classes by scaffold SMILES define split membership (train/val/test).
Strict group allocation ensures all endpoints for a molecule reside in the same fold.
Chronological validation checks are applied (e.g., no test molecule predates its train analogue in publication/measurement year).
Winsorization and robust normalization (e.g., log-space, median/MAD scaling) address heavy-tailed target distributions (Wu et al., 23 Jan 2026).
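
The target preprocessing in the last item can be sketched as follows. The quantile bounds, the simple index-based quantile rule, and the function names are assumptions made for illustration; the source does not state the exact settings used.

```python
import math
from statistics import median


def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clamp values to empirical quantile bounds (simple index-based quantiles)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[round(lower_q * (n - 1))]
    hi = ordered[round(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


def robust_scale(values):
    """Center by the median and scale by the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust scaling is undefined")
    return [(v - med) / mad for v in values]


# Heavy-tailed toy targets spanning nine orders of magnitude.
raw = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 1e6]
log_targets = [math.log10(v) for v in raw]            # move to log-space
clipped = winsorize(log_targets, lower_q=0.2, upper_q=0.8)
scaled = robust_scale(clipped)
```

To avoid leaking information, the quantile bounds, median, and MAD would normally be estimated on the training fold only and then reused for validation and test.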

6. Diagnostic Analyses and Error Characterization

Scaffold split studies systematically report per-bin test error stratified by structural similarity to the training set:

For single-task vapor pressure (VP) prediction (PNA + A20/E17 under scaffold split), normalized MSE decreases with increasing train/test similarity: 0.324 (max-sim ∈ [0,0.3)), 0.241 ([0.3,0.5)), 0.194 ([0.5,0.7)), 0.162 ([0.7,1.0]).
Safe-multitask training yields uniformly better curves, indicating mild regularization from auxiliary OP prediction without harming primary task accuracy.
Fingerprint concatenation slightly worsens OOD performance under scaffold split, supporting graph-based architectures (Wu et al., 23 Jan 2026).
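
A similarity-stratified error report of this kind can be sketched as below. The bin edges follow the ranges quoted above; the sketch computes plain MSE per bin, since the normalization used in the source is not specified, and `binned_mse` is a hypothetical name.

```python
def binned_mse(max_sims, y_true, y_pred, edges=(0.0, 0.3, 0.5, 0.7, 1.0)):
    """Mean squared error per max-similarity bin.

    Bins are half-open [lo, hi) except the last, which is closed on the
    right so that a similarity of exactly 1.0 is counted. Empty bins
    yield None.
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, t, p in zip(max_sims, y_true, y_pred):
        for b in range(n_bins):
            is_last = b == n_bins - 1
            if edges[b] <= s < edges[b + 1] or (is_last and s == edges[-1]):
                sums[b] += (t - p) ** 2
                counts[b] += 1
                break
    return [total / c if c else None for total, c in zip(sums, counts)]


# Toy data: errors shrink as similarity to the training set grows.
sims = [0.1, 0.2, 0.4, 0.6, 0.9, 1.0]
truth = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
preds = [1.0, 3.0, 2.0, 1.0, 0.5, 0.5]
per_bin = binned_mse(sims, truth, preds)
```

The `sims` input would typically be the per-molecule maximum Tanimoto similarity to the training set.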

7. Alternatives and Comparison with Other Splitting Protocols

Guo et al. (2024) provide direct comparisons:

Scaffold split: Highest hit rate, but substantial overestimation of generalization performance.
Butina clustering: Intermediate hit rate; objectively superior to random splits as an evaluation protocol.
UMAP clustering: Most realistic and most challenging OOD evaluation, with near-zero hit rates for LR and RF models.
For model selection, avoiding scaffold split is recommended except in benchmarks where the scaffold-specific OOD regime is justified.

8. Summary and Implications

The Bemis–Murcko scaffold split enforces scaffold-level separation in molecular data partitions and is widely used as community practice in OOD evaluation. However, as detailed in recent empirical studies, scaffold splits may still leave high inter-fold molecular similarity, risking misestimation of real-world prospective model utility. Modern benchmarking increasingly supplements or replaces scaffold splits with manifold-clustering approaches to strengthen OOD diagnostics. The precise, reproducible implementation via RDKit and canonical SMILES is established, but evaluation metrics should always be interpreted in the context of the split protocol’s limitations (Wu et al., 23 Jan 2026, Guo et al., 2024).

References

Guo et al. (2024). Scaffold Splits Overestimate Virtual Screening Performance.

Wu et al. (2026). Safe Multitask Molecular Graph Networks for Vapor Pressure and Odor Threshold Prediction.
