DeepScaffold: Scaffold-Driven Drug Design
- DeepScaffold is a deep learning platform for scaffold-based de novo drug discovery, generating chemically valid molecules with prescribed cores.
- It integrates graph-based variational autoencoders and conditional graph generative models to convert abstract cyclic skeletons into detailed scaffolds.
- The model enforces chemical validity during generation, applies post-generation filtering, and produces molecules whose docking scores against targets such as DRD2 are comparable to those of known actives.
DeepScaffold is a deep learning-based platform designed for scaffold-based de novo drug discovery, enabling the generation of chemically valid, diverse, and property-faithful molecules that retain prescribed core scaffolds. Its methodology is grounded in graph-based variational autoencoders (VAEs), conditional graph generative models, and rigorous post-generation filtering, operating on various scaffold abstraction levels including cyclic skeletons, classical scaffolds, and user-constrained pharmacophores. DeepScaffold integrates extraction, modeling, validation, and evaluation of chemical scaffolds, addressing the challenge of producing novel compounds that both preserve specified molecular cores and optimize for desired physicochemical and bioactivity characteristics (Li et al., 2019).
1. Model Architecture and Mathematical Framework
DeepScaffold's architecture comprises three interconnected neural components:
- Cyclic Skeleton→Scaffold VAE: This module translates an abstract cyclic skeleton into a fully specified classical scaffold by decoding atom and bond types onto the prescribed topology. Each node and edge of the skeleton is augmented with a latent variable, so the likelihood of a scaffold given its skeleton is obtained by marginalizing over these latent variables. The prior and posterior over the latent variables are modeled as multivariate Gaussians and learned via a VAE objective.
- Scaffold-Based Molecule Generator: Molecule generation is posed as a stepwise, autoregressive graph construction process. At each step, the current graph state is updated by one of three actions: addition of a new atom with its connecting bond, formation of a new bond between existing atoms, or termination. Every transition is conditioned on the fixed input scaffold. The generator is realized as a dense, 20-layer GNN.
- Pharmacophore Filter: Post-generation, molecules are filtered according to user-specified side-chain constraints (e.g., H-bond donors/acceptors, heavy atom count) and validated for chemical consistency using RDKit's sanitize functionality.
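As a hedged sketch of the two probabilistic components above, the standard conditional-VAE and autoregressive forms can be written as follows. All symbols are assumptions introduced here for illustration: $C$ is the cyclic skeleton, $S$ the classical scaffold, $z$ the collection of node- and edge-level latent variables, $G_t$ the partial molecular graph after step $t$, and $a_t$ the action taken at step $t$. The exact parameterization in Li et al. (2019) may differ.

```latex
\begin{align}
p_\theta(S \mid C) &= \int p_\theta(S \mid z, C)\, p_\theta(z \mid C)\, \mathrm{d}z, \\
\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) &=
  \mathbb{E}_{q_\phi(z \mid S, C)}\!\left[\log p_\theta(S \mid z, C)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z \mid S, C) \,\|\, p_\theta(z \mid C)\right), \\
p_\theta(a_1, \dots, a_T \mid S) &= \prod_{t=1}^{T} p_\theta(a_t \mid G_{t-1}, S),
  \qquad G_0 = S.
\end{align}
```

The first two lines describe the skeleton-to-scaffold VAE (marginal likelihood and the evidence lower bound that is maximized during training); the third expresses the generator as a product of per-step action probabilities, each conditioned on the fixed scaffold.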
The core GNN applies a three-stage message passing: (1) node-to-edge broadcasting via an MLP, (2) edge-to-node aggregation (max and sum pooling), and (3) an update step combining original and gathered edge features.
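The three stages can be illustrated with a minimal, dependency-free sketch. It is not the DeepScaffold implementation: `relu_linear` is a hypothetical one-layer stand-in for the learned MLPs, the weights are fixed by hand, and the 20-layer dense connectivity is omitted.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def relu_linear(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """One-layer stand-in for a learned MLP: x -> ReLU(W x)."""
    def apply(x: Vector) -> Vector:
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return apply


def message_passing_step(node_feats, edges, edge_mlp, update_mlp, msg_dim):
    """Run one round of message passing on an undirected graph."""
    # Stage 1: node-to-edge broadcast. Every directed edge receives a
    # message computed from the concatenated features of its endpoints.
    messages: Dict[Tuple[int, int], Vector] = {}
    for u, v in edges:
        for src, dst in ((u, v), (v, u)):
            messages[(src, dst)] = edge_mlp(node_feats[src] + node_feats[dst])

    updated = []
    for i, h in enumerate(node_feats):
        # Stage 2: edge-to-node aggregation with max and sum pooling.
        incoming = [m for (_, dst), m in messages.items() if dst == i]
        if incoming:
            pooled_max = [max(col) for col in zip(*incoming)]
            pooled_sum = [sum(col) for col in zip(*incoming)]
        else:
            pooled_max = [0.0] * msg_dim
            pooled_sum = [0.0] * msg_dim
        # Stage 3: update from the original features plus both pooled vectors.
        updated.append(update_mlp(h + pooled_max + pooled_sum))
    return updated


# Three-atom path graph with scalar node features.
feats = [[1.0], [2.0], [3.0]]
path_edges = [(0, 1), (1, 2)]
out = message_passing_step(
    feats, path_edges,
    edge_mlp=relu_linear([[1.0, 1.0]]),
    update_mlp=relu_linear([[1.0, 1.0, 1.0]]),
    msg_dim=1,
)
print(out)  # [[7.0], [15.0], [13.0]]
```

Using both max and sum pooling lets the update see the strongest single neighbor signal as well as the total contribution of all neighbors.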
2. Scaffold Definitions, Extraction, and Encoding
DeepScaffold systematically formalizes molecular scaffolds using HierS (hierarchical scaffold extraction) and Bemis–Murcko (BM) frameworks:
- Datasets: Standardized ChEMBL molecules, restricted to the elements C, N, O, F, P, S, Cl, Br, and I and to QED > 0.5.
- HierS Extraction: For each molecule, unique ring systems are recursively segmented to yield a hierarchy of scaffolds. Each scaffold has a cyclic skeleton (the scaffold with atom and bond types stripped) and an associated set of molecules that contain it.
- Scaffolds with Side-Chain Specifications: User queries impose constraints (e.g., heavy atom count, H-bonding properties) at the side-chain level.
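A side-chain specification of this kind can be sketched as a simple predicate over precomputed descriptors. The class name, field names, candidate labels, and descriptor values below are all hypothetical and chosen only for illustration; in practice the descriptors would be computed from the generated structures with a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SideChainConstraint:
    """Hypothetical user constraint on the side chains attached to a scaffold."""
    max_heavy_atoms: Optional[int] = None
    min_h_donors: int = 0
    min_h_acceptors: int = 0


def satisfies(desc: Dict[str, int], c: SideChainConstraint) -> bool:
    """Check one molecule's side-chain descriptors against the constraint."""
    if c.max_heavy_atoms is not None and desc["heavy_atoms"] > c.max_heavy_atoms:
        return False
    return (desc["h_donors"] >= c.min_h_donors
            and desc["h_acceptors"] >= c.min_h_acceptors)


def filter_candidates(candidates: Dict[str, Dict[str, int]],
                      constraint: SideChainConstraint) -> List[str]:
    """Keep only the candidates whose side chains meet the constraint."""
    return [name for name, desc in candidates.items() if satisfies(desc, constraint)]


# Illustrative descriptor values, not computed from real molecules.
candidates = {
    "cand_1": {"heavy_atoms": 6, "h_donors": 1, "h_acceptors": 2},
    "cand_2": {"heavy_atoms": 14, "h_donors": 2, "h_acceptors": 3},
    "cand_3": {"heavy_atoms": 5, "h_donors": 0, "h_acceptors": 1},
}
query = SideChainConstraint(max_heavy_atoms=10, min_h_donors=1)
kept = filter_candidates(candidates, query)
print(kept)  # ['cand_1']
```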
Graph encoding is as follows:
- Cyclic skeletons: Nodes are "unknown atoms" (one-hot), edges are "unknown bonds" plus virtual bonds, with incidence graphs incorporating both atom and bond nodes.
- Classical scaffolds: Nodes encode atom type, valence, and aromaticity; edges encode bond type. The input graph is represented as an adjacency tensor and a node feature matrix.
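The two encodings can be illustrated with a simplified sketch. The vocabularies `ATOM_TYPES` and `BOND_TYPES` and the helper names are assumptions made for this example; valence features, virtual bonds, and the incidence-graph representation are left out for brevity.

```python
ATOM_TYPES = ["C", "N", "O", "*"]                        # "*" marks an unknown atom
BOND_TYPES = ["single", "double", "aromatic", "unknown"]


def one_hot(item, vocab):
    return [1 if item == v else 0 for v in vocab]


def encode(atoms, bonds):
    """Return (adjacency tensor A, node feature matrix X) as nested lists.

    atoms: list of (symbol, is_aromatic); bonds: list of (i, j, bond_type).
    A[k][i][j] == 1 when atoms i and j share a bond of type BOND_TYPES[k].
    """
    n = len(atoms)
    X = [one_hot(sym, ATOM_TYPES) + [int(arom)] for sym, arom in atoms]
    A = [[[0] * n for _ in range(n)] for _ in BOND_TYPES]
    for i, j, btype in bonds:
        k = BOND_TYPES.index(btype)
        A[k][i][j] = A[k][j][i] = 1
    return A, X


def to_cyclic_skeleton(atoms, bonds):
    """Strip atom and bond types, keeping only the topology."""
    return ([("*", False) for _ in atoms],
            [(i, j, "unknown") for i, j, _ in bonds])


# Pyridine ring as a classical scaffold: five aromatic carbons and one nitrogen.
atoms = [("C", True)] * 5 + [("N", True)]
bonds = [(i, (i + 1) % 6, "aromatic") for i in range(6)]

A, X = encode(atoms, bonds)
A_skel, X_skel = encode(*to_cyclic_skeleton(atoms, bonds))
print(X[5])       # [0, 1, 0, 0, 1]
print(X_skel[5])  # [0, 0, 0, 1, 0]
```

The skeleton keeps the same six-membered topology but every node collapses to the "unknown atom" one-hot vector, which is what the VAE is asked to decode back into concrete atom and bond types.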
3. Training Regime and Data Splits
- CSK→Scaffold VAE: The scaffold set is split 80%/20% into training and test sets. The latent dimension is 10 per node/edge. The AdaBound optimizer is used, with a learning-rate range and gradient clipping.
- Scaffold Generator: 80%/20% molecule–scaffold pairs for training and testing (scaffold overlap permitted). For privileged GPCR scaffolds, all related molecules are excluded from training. The optimizer is Adam with a scheduled learning rate decay.
- Importance Sampling and Route Uncertainty: Training is improved by sampling multiple assembly routes per molecule and penalizing route ambiguity, as in previously published graph generative models.
At every generation step, only valence- and connectivity-respecting actions are permitted, enforced via strict chemical validity checks.
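The generator's data split described above can be sketched as follows. The function name, the pair representation, and the identifiers in the example are hypothetical; the sketch only assumes that each training example is a (molecule, scaffold) pair and that molecules containing a privileged scaffold must never reach the training set.

```python
import random


def split_pairs(pairs, privileged_scaffolds, train_frac=0.8, seed=0):
    """Split (molecule_id, scaffold_id) pairs into train, test, and held-out sets.

    Every pair belonging to a molecule that contains a privileged scaffold is
    held out, so those molecules never appear in training. The remaining pairs
    are split at random, which allows scaffolds to overlap between train and test.
    """
    privileged = set(privileged_scaffolds)
    held_out_mols = {mol for mol, scaf in pairs if scaf in privileged}
    held_out = [p for p in pairs if p[0] in held_out_mols]
    rest = [p for p in pairs if p[0] not in held_out_mols]

    rng = random.Random(seed)
    rng.shuffle(rest)
    cut = int(train_frac * len(rest))
    return rest[:cut], rest[cut:], held_out


# Ten ordinary pairs plus one molecule that contains a privileged scaffold.
pairs = [(f"mol_{i}", f"scaf_{i % 3}") for i in range(10)]
pairs += [("mol_p", "scaf_priv"), ("mol_p", "scaf_0")]
train, test, held_out = split_pairs(pairs, privileged_scaffolds=["scaf_priv"])
print(len(train), len(test), len(held_out))  # 8 2 2
```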
4. Scaffold-Based Graph Generation: Algorithmic Workflow
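Molecule generation starts from the fixed scaffold and repeatedly applies one of the three actions described in Section 1 (append an atom with its connecting bond, connect two existing atoms, or terminate) until termination is selected.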
Constraints (valence, bond order, ring closure) are strictly enforced at each step, and the validity of each generated molecule is checked with RDKit's sanitization routine (Chem.SanitizeMol).
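A minimal sketch of such a constrained generation loop is shown below. It is an illustration under strong simplifying assumptions, not the published algorithm: a uniform random choice stands in for the GNN policy, only single bonds are added, the valence table `MAX_VALENCE` is a hand-written simplification, and ring-size and sanitization checks are omitted.

```python
import random

MAX_VALENCE = {"C": 4, "N": 3, "O": 2}  # simplified valence table (assumption)


def used_valence(bonds, i):
    """Sum of bond orders on atom i; bonds maps (a, b) with a < b to an order."""
    return sum(order for (a, b), order in bonds.items() if i in (a, b))


def legal_actions(atoms, bonds, max_atoms):
    """Enumerate only the actions that respect valence and connectivity."""
    open_sites = [i for i in range(len(atoms))
                  if MAX_VALENCE[atoms[i]] - used_valence(bonds, i) >= 1]
    actions = [("terminate",)]
    if len(atoms) < max_atoms:
        # Append: a new atom single-bonded to an atom with a free valence.
        actions += [("append", i, elem) for i in open_sites for elem in MAX_VALENCE]
    # Connect: a single bond between two unbonded atoms that both have a free valence.
    actions += [("connect", i, j) for i in open_sites for j in open_sites
                if i < j and (i, j) not in bonds]
    return actions


def generate(scaffold_atoms, scaffold_bonds, max_atoms=12, seed=0):
    """Grow a molecule from a fixed scaffold until 'terminate' is chosen."""
    rng = random.Random(seed)
    atoms, bonds = list(scaffold_atoms), dict(scaffold_bonds)
    while True:
        # Stand-in for sampling from the learned policy over legal actions.
        action = rng.choice(legal_actions(atoms, bonds, max_atoms))
        if action[0] == "terminate":
            return atoms, bonds
        if action[0] == "append":
            _, i, elem = action
            atoms.append(elem)
            bonds[(i, len(atoms) - 1)] = 1
        else:
            _, i, j = action
            bonds[(i, j)] = 1


# Benzene scaffold in Kekule form: each carbon has one free valence.
benzene_atoms = ["C"] * 6
benzene_bonds = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 2, (0, 5): 1}
atoms, bonds = generate(benzene_atoms, benzene_bonds, seed=1)
```

Because illegal actions are never offered to the sampler, every intermediate graph respects the valence table, and because existing atoms and bonds are never modified, the input scaffold is always a subgraph of the output.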
5. Evaluation Metrics and Property Analysis
Performance is assessed on several axes:
| Metric | Description |
|---|---|
| Validity | Fraction of generated molecules accepted by RDKit |
| Uniqueness | Fraction of valid samples not isomorphic to any other sample |
| MMD | Maximum mean discrepancy with a Tanimoto kernel, generated vs. test molecules (e.g., for scaffold 12) |
| Internal Diversity | Structural dissimilarity among the generated molecules |
| R_actives / drugs | Fraction of GPCR actives/drugs from ChEMBL that are reproduced |
| Docking Score | Median AutoDock Vina score against DRD2 (kcal/mol) |
Molecular property distributions (MW, log P, QED) of generated samples match reference data, and a low MMD indicates that the generated set is faithful to the reference distribution. Internal diversity measures how structurally varied the generated molecules are.
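The similarity-based metrics can be sketched in a few lines. This is an illustrative formulation, not necessarily the one used in the paper: fingerprints are represented as sets of on-bits, internal diversity is assumed to be one minus the mean pairwise Tanimoto similarity, MMD is the simple biased estimate of the squared discrepancy, and uniqueness is computed on canonical strings as a stand-in for graph isomorphism checks. Validity is not covered because it requires a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def uniqueness(canonical_ids):
    """Fraction of distinct entries among the (valid) generated samples."""
    return len(set(canonical_ids)) / len(canonical_ids)


def internal_diversity(fps):
    """One minus the mean pairwise Tanimoto similarity (assumed definition)."""
    pairs = list(combinations(fps, 2))
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def mmd_squared(xs, ys):
    """Biased estimate of squared MMD between two sets under a Tanimoto kernel."""
    def mean_kernel(us, vs):
        return sum(tanimoto(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy fingerprints; the bit indices are arbitrary.
generated = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
reference = [frozenset({10, 11})]

div = internal_diversity(generated)
same = mmd_squared(generated, generated)
diff = mmd_squared(generated, reference)
```

Here `same` is zero because the two sets are identical, while `diff` is positive because the toy reference set shares no bits with the generated set.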
6. Molecular Docking and Application to Drug Design
DeepScaffold's generated molecules are evaluated by docking against the dopamine D2 receptor (DRD2), using experimentally prepared receptor structures. Distributions of docking scores for generated molecules closely match those of test-set molecules and known actives, and outperform random ChEMBL decoys. Case studies for privileged GPCR scaffolds (12–14) demonstrate the model's ability to generate molecules with both desirable docking profiles and high property diversity.
7. Software Implementation and Example Use Cases
DeepScaffold offers end-to-end command-line interoperability:
- Installation: `pip install deepscaffold`
- Scaffold Extraction: Extraction from SMILES using `deepscaffold extract_scaffolds`
- Model Training (VAE and Generator): Commands for scaffold VAE and molecule generator, specifying epochs, batch size, and output directory.
- Molecule Generation: Conditionally generate valid molecules with a specified scaffold and optional pharmacophore requirements, directly outputting annotated SMILES.
Example output: For scaffold 13, one possible generated structure is c1cc2nccc2cc1C3CCN(CC3)C(=O)C.
This pipeline supports large-scale in silico generation and filtering of scaffolds and molecular libraries tailored to user-defined constraints and drug-target selectivity, leveraging modern deep learning and chemical informatics toolkits (Li et al., 2019).