ToxBench Dataset for ERα Binding Affinity

Updated 30 May 2026

ToxBench is a rigorously curated benchmark dataset of 8,770 ERα–ligand complexes with AB-FEP-derived binding free energies, enabling high-fidelity ML model development.
The dataset features diverse ligands from ChEMBL and DUD-E with non-overlapping SMILES splits and multiple receptor conformations to ensure robust benchmarking.
ToxBench supports rapid virtual screening by providing train/validation/test splits and experimental validation, significantly accelerating protein–ligand affinity prediction and toxicity profiling.

ToxBench is a large-scale, rigorously curated benchmark dataset consisting of 8,770 protein–ligand complex structures for the pharmaceutically critical target human Estrogen Receptor Alpha (ERα). Binding free energies for all complexes are calculated via absolute binding free energy perturbation (AB-FEP), providing high-fidelity labels specifically suited for ML model development and benchmarking in the context of protein–ligand affinity prediction and toxicity profiling. The dataset includes highly diverse ligands drawn from ChEMBL and DUD-E, spanning both agonist- and antagonist-like chemotypes, with all receptor conformations based on three protein crystal structures. ToxBench also provides rigorous train/validation/test splits at the SMILES level to assess model generalizability, and supplies experimental benchmarks enabling direct evaluation of method accuracy (Liu et al., 11 Jul 2025).

1. Dataset Composition and Molecular Diversity

ToxBench comprises 8,770 ERα–ligand complexes, each labeled with an AB-FEP-calculated binding free energy (ΔG_binding). The ligand catalogue includes 699 unique SMILES, integrating both 473 ChEMBL compounds (with high-quality reported binding data) and active/decoy subsets from DUD-E. Chemotypes encompass agonist- and antagonist-like scaffolds with significant chemical diversity.

Receptor templates originate from three distinct ERα conformations: PDB 1ERE (agonist), PDB 3ERT (antagonist), and PDB 1SJ0 (alternative state), each pre-processed using Schrödinger’s Protein Preparation Wizard. Ligand–protein poses are generated through induced-fit docking and molecular dynamics (MD) for ChEMBL ligands, or via Glide-WS for DUD-E structures. Each resulting complex then undergoes 1 ns of AB-FEP simulation in explicit TIP3P-like solvent (using the OPLS4 force field within Schrödinger’s FEP+ implementation).

Train/validation/test splits are constructed to avoid SMILES overlap, providing:

Split	Complexes	Unique Ligands
Training	6,144	488
Validation	1,317	102
Test	1,309	109
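
The SMILES-level split can be illustrated with a minimal sketch. It assumes the SMILES strings have already been canonicalized, and the function name `split_by_smiles`, the 70/15/15 fractions, and the toy ligand identifiers are illustrative assumptions, not the procedure used to build the published splits.

```python
import random
from collections import defaultdict

def split_by_smiles(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole SMILES groups to train/validation/test.

    `records` is a list of (smiles, complex_id) pairs; every complex that
    shares a SMILES string ends up in the same split.
    """
    groups = defaultdict(list)
    for smiles, complex_id in records:
        groups[smiles].append(complex_id)

    # Shuffle the unique ligands, not the complexes
    unique_smiles = sorted(groups)
    random.Random(seed).shuffle(unique_smiles)

    n = len(unique_smiles)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    assignment = {
        "train": unique_smiles[:n_train],
        "validation": unique_smiles[n_train:n_train + n_val],
        "test": unique_smiles[n_train + n_val:],
    }
    return {
        name: [(s, c) for s in smiles_list for c in groups[s]]
        for name, smiles_list in assignment.items()
    }

# Toy example: 10 placeholder ligands (not real SMILES), 3 complexes each
toy = [(f"LIG{i}", f"LIG{i}_pose{j}") for i in range(10) for j in range(3)]
splits = split_by_smiles(toy)
```

Because the split is made over unique ligands, complexes of the same ligand in different receptor conformations cannot leak between training and evaluation sets.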

2. AB-FEP Labeling Protocol and Theoretical Foundations

Each ToxBench binding free energy label is derived from rigorous all-atom AB-FEP, combining MD simulation with alchemical free energy perturbation. The theoretical basis follows:

$\Delta G_{\mathrm{binding}} = -k_{B} T \ln K_{\mathrm{eq}}$

where $K_{\mathrm{eq}}$ is the equilibrium association constant. The computational AB-FEP protocol adheres to Boresch et al. (2003) and Zwanzig (1954), decomposing the total free energy via

$\Delta G_{\mathrm{binding}} = \Delta G_{\mathrm{restraints}} + \Delta G_{\mathrm{decoupling}} + \Delta G_{\mathrm{unrestraints}}$

Each thermodynamic leg—application of orientational restraints, alchemical decoupling, and release of restraints—is evaluated by its own simulation. All AB-FEP runs employ 1 ns of sampling per leg and are conducted in explicit solvent. The full workflow on a single NVIDIA T4 GPU requires approximately 35 hours per complex for a 1 ns simulation (Liu et al., 11 Jul 2025).
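
As a worked example of the relation $\Delta G_{\mathrm{binding}} = -k_{B} T \ln K_{\mathrm{eq}}$, the sketch below converts a dissociation constant into a free energy in kcal/mol. It uses the molar gas constant $R$ in place of $k_B$ (per-mole units), and assumes a temperature of 298.15 K and a 1 M standard state; the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L.

    Uses dG = -RT ln(K_a) with K_a = 1/K_d relative to a 1 M standard state,
    which is the same as dG = RT ln(K_d).
    """
    if kd_molar <= 0:
        raise ValueError("K_d must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)

dg_nanomolar = delta_g_from_kd(1e-9)   # roughly -12.3 kcal/mol
dg_micromolar = delta_g_from_kd(1e-6)  # roughly -8.2 kcal/mol
```

Each tenfold change in $K_d$ shifts the free energy by about 1.36 kcal/mol at this temperature, which gives a sense of scale for the error metrics reported for the labels.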

3. Experimental Validation and Quality Metrics

A subset of 67 ChEMBL-derived ligands with high-confidence experimental affinities (converted to kcal/mol) serves to validate AB-FEP labels. Empirical “structural reorganization penalties”—e.g., +8.69 kcal/mol for 1ERE, +6.11 kcal/mol for 3ERT—are incorporated into each PDB-specific calculation, with the most favorable ΔG across all conformers reported for each ligand.
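
The per-ligand selection described above can be sketched as follows. The sketch assumes the penalty is simply added to the raw AB-FEP value for each template before taking the minimum; only the two penalties quoted in the text are included, and the raw values in the example are made up for illustration.

```python
# Penalties quoted in the text (kcal/mol); any other template would need
# its own value supplied by the user.
REORG_PENALTY = {"1ERE": 8.69, "3ERT": 6.11}

def best_corrected_dg(raw_dg_by_pdb, penalties=REORG_PENALTY):
    """Return (pdb_id, corrected dG) for the most favorable conformer."""
    corrected = {}
    for pdb_id, raw_dg in raw_dg_by_pdb.items():
        if pdb_id not in penalties:
            raise KeyError(f"no penalty supplied for {pdb_id}")
        # Assumption: the penalty is additive on the raw free energy
        corrected[pdb_id] = raw_dg + penalties[pdb_id]
    best_pdb = min(corrected, key=corrected.get)
    return best_pdb, corrected[best_pdb]

# Hypothetical raw AB-FEP outputs for one ligand (illustrative numbers only)
example = {"1ERE": -19.0, "3ERT": -15.5}
best_pdb, best_dg = best_corrected_dg(example)
```

In this toy case the 1ERE value (-19.0 + 8.69 = -10.31 kcal/mol) is more favorable than the 3ERT value (-15.5 + 6.11 = -9.39 kcal/mol), so the 1ERE result would be reported.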

AB-FEP’s agreement with experiment on this curated set is characterized by:

Root mean squared error (RMSE): 1.754 kcal/mol
Pearson correlation, $R_p$ : 0.692

This error is of the same order as the intrinsic uncertainty typical of experimental binding affinity assays (~1 kcal/mol), supporting the use of ToxBench labels for ML development (Liu et al., 11 Jul 2025).
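
For reference, the two agreement metrics can be computed as in the sketch below. The helper names and the four example values are illustrative only and are not taken from the validation set.

```python
import math

def rmse(pred, ref):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def pearson_r(pred, ref):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(ref)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)

# Made-up calculated and experimental free energies in kcal/mol
calc = [-10.2, -8.7, -11.5, -7.9]
expt = [-9.8, -9.1, -12.4, -7.0]
error = rmse(calc, expt)
correlation = pearson_r(calc, expt)
```

RMSE penalizes absolute deviations, while the Pearson coefficient measures only linear ranking agreement, so the two are usually reported together.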

4. Data Features, Representations, and Preprocessing

Each protein–ligand complex is represented by:

Atomic feature matrix $A \in \mathbb{R}^{n \times d}$ (atom types, partial charges, etc.)
Coordinate matrix $X \in \mathbb{R}^{n \times 3}$

Ligand-only baselines such as Chemprop utilize 2D molecular graphs and learned molecular fingerprints. In contrast, interaction-aware models (e.g., AEV-PLIG and DualBind) represent complexes through atomic environment vectors and learn SE(3)-invariant energy functions based directly on $(A, X)$ .
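
The $(A, X)$ representation and the invariance requirement can be illustrated with a small sketch. This is not the featurization used by AEV-PLIG or DualBind; the `Complex` container, the two placeholder feature columns, and the toy coordinates are assumptions chosen to show that interatomic distances, a common invariant input, do not change under rotation and translation.

```python
import math
from dataclasses import dataclass

@dataclass
class Complex:
    features: list  # A: n rows of d per-atom features
    coords: list    # X: n rows of (x, y, z)

def pairwise_distances(coords):
    """n x n matrix of Euclidean distances between atoms."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def rigid_transform(coords, angle, shift):
    """Rotate about the z-axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]

# Toy 3-atom "complex"; feature columns are placeholders
cplx = Complex(
    features=[[6, 0.1], [8, -0.4], [7, -0.2]],
    coords=[(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 1.5, 0.3)],
)
moved = rigid_transform(cplx.coords, angle=0.7, shift=(3.0, -2.0, 5.0))
d_before = pairwise_distances(cplx.coords)
d_after = pairwise_distances(moved)
```

A model whose output depends on $X$ only through quantities like these distances predicts the same affinity regardless of how the complex is placed in space.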

Preprocessing steps standardize SMILES, apply Epik7/LigPrep protonation, and enforce a cap for weak/non-binders ( $\Delta G > -3.0$ kcal/mol set to –3.0 kcal/mol). The non-overlapping SMILES split strategy is used for all train/validation/test divisions to rigorously control for information leakage.
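
The weak-binder cap amounts to a one-line clamp, sketched here with a hypothetical helper name and made-up label values.

```python
CAP_KCAL = -3.0  # threshold stated in the text

def cap_weak_binders(dg_values, cap=CAP_KCAL):
    """Replace any dG above `cap` (weaker binding) with `cap` itself."""
    return [min(dg, cap) for dg in dg_values]

labels = [-11.2, -4.0, -2.1, 0.8]
capped = cap_weak_binders(labels)  # [-11.2, -4.0, -3.0, -3.0]
```

Applying the same clamp to any newly added labels keeps them on the scale of the published ones.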

5. Machine Learning Benchmarks and DualBind Model

Three representative ML methods are benchmarked using ToxBench:

Chemprop (ligand-only message-passing network):
- $R_p = 0.669 \pm 0.011$
- $R^2 = 0.445 \pm 0.016$
- RMSE = […] kcal/mol

AEV-PLIG (3D interaction-aware, atomic environment vectors + GAT):
- $R_p$ = […]
- $R^2$ = […]
- RMSE = […] kcal/mol

DualBind (proposed; dual-loss framework):
- DualBind employs a supervised mean squared error (MSE) loss: […]
- and an unsupervised denoising score matching (DSM) loss: […]
- combined as: […], with […] and ligand coordinate perturbations […].
- $R_p$ = […]
- $R^2$ = […]
- RMSE = […] kcal/mol
- Inference runtime: ~126 ms per complex (A100 GPU), representing an approximately $10^6$-fold speedup relative to AB-FEP simulation (Liu et al., 11 Jul 2025).
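
For orientation, a dual-loss objective of the kind named here (a supervised MSE term plus a denoising score matching term on perturbed ligand coordinates) is commonly written as below. This is a generic sketch and not necessarily the exact formulation of Liu et al.: the symbols $E_\theta$ (learned energy), $\tilde{X}$ (complex with perturbed ligand coordinates), $\sigma$ (noise scale), and $\lambda$ (loss weight) are assumptions introduced for illustration.

$\mathcal{L}_{\mathrm{MSE}} = \left( E_\theta(A, X) - \Delta G_{\mathrm{binding}} \right)^2$

$\mathcal{L}_{\mathrm{DSM}} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)} \left\| \nabla_{\tilde{X}} E_\theta(A, \tilde{X}) - \frac{\tilde{X} - X}{\sigma^2} \right\|^2, \quad \tilde{X} = X + \epsilon$

$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda \, \mathcal{L}_{\mathrm{DSM}}$

In this form the MSE term ties the predicted energy to the AB-FEP label, while the DSM term encourages the observed pose to sit at a local minimum of the learned energy, since the energy gradient is trained to point from the clean pose toward the perturbed one.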

6. Practical Applications, Use Cases, and Integration

ToxBench-trained ML models enable high-throughput virtual screening, permitting millisecond-scale affinity estimation per compound and rapid prioritization of extensive chemical libraries. This is particularly impactful for toxicity and off-target profiling; since ERα is central to endocrine disruption, ToxBench facilitates flagging of potential endocrine disruptors. Recommended best practices include:

Adoption of provided ligand SMILES splits and 3D complex structures for reproducibility.
Use of the non-overlapping SMILES split strategy when extending to novel ligands.
Consistent handling of weak binders by capping affinities above –3 kcal/mol.
For 3D-aware ML pipelines, implementation of the dual-loss energy framework (DualBind) to balance affinity label fidelity with structural regularization (Liu et al., 11 Jul 2025).
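
The throughput argument can be made concrete with a short calculation based on the per-complex timings quoted in this article (about 35 hours for AB-FEP on a T4 GPU, about 126 ms for ML inference on an A100 GPU). The library size is a hypothetical example, and the comparison ignores that the two timings come from different hardware.

```python
ABFEP_HOURS_PER_COMPLEX = 35.0   # quoted for one NVIDIA T4 GPU
ML_SECONDS_PER_COMPLEX = 0.126   # quoted for one A100 GPU

def screening_time_seconds(n_compounds, seconds_per_compound):
    """Total sequential wall time for scoring a library on one device."""
    return n_compounds * seconds_per_compound

abfep_seconds = ABFEP_HOURS_PER_COMPLEX * 3600.0
speedup = abfep_seconds / ML_SECONDS_PER_COMPLEX  # about 1e6

library_size = 1_000_000  # hypothetical screening library
ml_hours = screening_time_seconds(library_size, ML_SECONDS_PER_COMPLEX) / 3600.0
abfep_gpu_years = (
    screening_time_seconds(library_size, abfep_seconds) / (3600.0 * 24 * 365)
)
```

Under these assumptions, scoring a million compounds with the ML model takes about as long on one GPU as a single AB-FEP calculation, whereas running AB-FEP on the whole library would take thousands of GPU-years.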

7. Limitations and Prospects for Extension

ToxBench’s current scope is constrained to ERα, which limits out-of-domain generalizability. The AB-FEP label accuracy is contingent on 1 ns MD sampling and the application of PDB-specific empirical penalties, and may benefit from extended simulation or ensemble approaches. Future enhancements may involve expansion to multi-target AB-FEP datasets, increasing the proportion of entries with experimental affinity labels, and the exploration of advanced sampling and solvation techniques. These extensions hold potential for establishing broader, more transferable benchmarks and improving the reliability of ML-predicted binding affinities across diverse protein family landscapes (Liu et al., 11 Jul 2025).

References

Liu et al. ToxBench: A Binding Affinity Prediction Benchmark with AB-FEP-Calculated Labels for Human Estrogen Receptor Alpha (11 Jul 2025)
