ADMET Benchmark Group

Updated 7 September 2025
  • ADMET Benchmark Group is a framework that systematically evaluates computational predictors for absorption, distribution, metabolism, excretion, and toxicity of drug-like molecules.
  • It curates diverse benchmark datasets from sources like ChEMBL and TDC, employing scaffold, temporal, and out-of-distribution splits to ensure robust evaluation.
  • It drives methodological advances by comparing classical models, graph neural networks, and multimodal approaches to improve predictive accuracy and generalization.

The ADMET Benchmark Group is a collective framework within the cheminformatics and biomedical AI communities that systematically evaluates computational predictors for Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties of drug-like molecules. ADMET characteristics are a principal determinant of drug candidate success and contribute to approximately half of all clinical trial failures, making accurate early-stage in silico prediction essential for efficient drug discovery pipelines (Feinberg et al., 2019). The group is associated with curated benchmark datasets, evaluation protocols, and comparative studies leveraging a wide array of molecular representation methods, machine learning strategies, and cross-validation formulations.

1. Scope and Rationale for Benchmarking

Computational ADMET modeling aims to preemptively identify compounds with problematic pharmacokinetic or safety profiles prior to expensive and time-consuming laboratory validation. Historically, fixed molecular fingerprints or manually engineered descriptors—such as circular fingerprints, atom-pair descriptors, or property lists—served as the input for classical machine learning algorithms including random forests, support vector machines, and gradient-boosted trees. However, as chemical libraries have grown in size and diversity, and as assay protocols have become more heterogeneous, benchmarks have become indispensable for fairly assessing model generalizability, robustness, and limitations across realistic chemical spaces (Wei et al., 2023).
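
To make the classical pipeline concrete, here is a minimal sketch, assuming RDKit and scikit-learn are available: Morgan (ECFP-style) bit fingerprints feed a random forest regressor. The SMILES strings and solubility-style labels are illustrative placeholders, not benchmark data.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

# Placeholder molecules and labels; real benchmarks draw curated assay
# data from sources such as ChEMBL or TDC.
smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]
y = np.array([0.5, -0.1, -1.3, 0.9])

def ecfp(smi, radius=2, n_bits=2048):
    """Morgan (ECFP-like) bit fingerprint as a numpy array."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.stack([ecfp(s) for s in smiles])
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
print(model.predict(X[:2]))
```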

The ADMET Benchmark Group therefore defines and distributes rigorously curated datasets (e.g., from ChEMBL, TDC, internal pharmaceutical company repositories), together with standardized evaluation metrics and cross-validation strategies, to facilitate direct, meaningful comparison of new algorithms and molecular featurization strategies (Broccatelli et al., 2021, Ji et al., 2022).

2. Benchmark Datasets and Evaluation Protocols

A central aspect of the group’s contributions is the careful curation of ADMET assay data and the design of realistic benchmark partitions:

  • Data Sources and Properties: Publicly available resources such as TDC (Therapeutics Data Commons) and ChEMBL are used to extract a range of ADMET endpoints—lipophilicity, solubility, CYP inhibition, membrane permeability, volume of distribution, toxicity markers, and more. Recent benchmarks such as ADMEOOD (Wei et al., 2023) include 27 properties spanning all ADME dimensions.
  • Partitioning Schemes: Recognizing the challenge of overfitting to narrow chemical subspaces, benchmarks employ not only random splits but also scaffold-based, temporal, and molecular weight–constrained splits. For example, molecules are divided by profiling date or by scaffold similarity, with test sets purposely constructed to be temporally or structurally “out-of-distribution” relative to training (Feinberg et al., 2019, Broccatelli et al., 2021); a minimal scaffold-split sketch appears below.
  • OOD (Out-of-Distribution) Partitions: Inspired by real-world assay variability and chemical novelty, state-of-the-art benchmarks such as DrugOOD (Ji et al., 2022) and ADMEOOD (Wei et al., 2023) explicitly create splits involving domain shifts (e.g., unseen assays, structural motifs, or molecule sizes) and include annotation of label noise and concept conflict drift—scenarios where identical molecules can receive conflicting assay results.

These rigorous splits enable differentiation between mere memorization and genuine chemical extrapolation in predictive models.
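
As a concrete illustration, the following is a minimal scaffold-split sketch using RDKit's Bemis-Murcko scaffolds. The grouping heuristic (largest scaffold groups assigned to train first) is one simple choice among several, and the molecules are placeholders.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Assign whole Bemis-Murcko scaffold groups to train or test so that
    no scaffold is shared across the split (largest groups go to train)."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    train, test = [], []
    n_train_target = int((1 - test_fraction) * len(smiles_list))
    for group in sorted(groups.values(), key=len, reverse=True):
        # A whole scaffold group goes to one side of the boundary.
        (train if len(train) + len(group) <= n_train_target else test).extend(group)
    return train, test

# Illustrative molecules: the first two share a benzene scaffold, so they
# are guaranteed to stay together on one side of the split.
train_idx, test_idx = scaffold_split(
    ["c1ccccc1CCN", "c1ccccc1CCO", "C1CCCCC1N", "CCOCC"], test_fraction=0.25)
print(train_idx, test_idx)
```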

3. Model Classes and Molecular Representations

Benchmark studies systematically evaluate a broad suite of modeling strategies:

  • Classical Methods: Random forests, SVMs, and gradient-boosted trees (notably XGBoost (Tian et al., 2022) and CatBoost (Notwell et al., 2023)) using fixed fingerprints (e.g., ECFP, Avalon, ErG), Mordred/RDKit/PaDEL descriptors, and/or engineered property lists remain highly competitive, especially when features are systematically combined and pipelines are tuned using AutoML approaches (Sá et al., 22 Feb 2025, Le et al., 9 Jun 2025).
  • Graph Neural Networks (GNNs): Recent benchmarks highlight end-to-end models that operate on the molecular graph, mapping atoms and bonds to node and edge features. PotentialNet (Feinberg et al., 2019) and related message passing neural networks leverage learned featurization directly from connectivity data; deeper GNN variants include GCN, GAT, MPNN, and AttentiveFP, with attention mechanisms (GAT) frequently exhibiting improved generalization to external data (Broccatelli et al., 2021).
  • Multimodal and Hybrid Methods: Novel frameworks fuse graph-based and image-based encodings, aligning learned representations via contrastive objectives (e.g., MolIG (Wang et al., 2023)) to exploit both local atomic topology and global structural context.
  • Foundation Models and Self-supervised Pretraining: Two-stage frameworks pretrain on unlabeled SMILES strings (e.g., SMILES-Mamba (Xu et al., 11 Aug 2024)) or graph-structured atomic quantum mechanical properties (Fallani et al., 10 Oct 2024), then fine-tune for task-specific ADMET prediction. These settings enable efficient utilization of large, unlabeled chemical corpora, reducing label demands and enhancing representation learning.
  • AutoML and Pipeline Optimization: Evolving, data-adaptive pipelines (e.g., Auto-ADMET (Sá et al., 22 Feb 2025)) construct optimal combinations of featurization, scaling, feature selection, and machine learning algorithms, incorporating interpretability via Bayesian network classifiers to guide and explain evolutionary search.

The table below summarizes commonly benchmarked model classes and key feature sets:

Model Class | Feature Modalities | Benchmark Outcomes
Random Forest / GBDT | ECFP, Avalon, ErG, RDKit/Mordred descriptors | State-of-the-art on several ADMET tasks
GNNs (PotentialNet, GAT, MPNN, AttentiveFP) | Atom/bond graph, learned embeddings | GAT shows best OOD generalization; robust on external data
Multimodal (MolIG) | Graph + molecular image | Outperforms single-modal baselines
Foundation models (SMILES-Mamba, Graphormer pretraining) | SMILES sequence, atomic QM properties | Top-1 performance on diverse benchmarks
AutoML (Auto-ADMET, CaliciBoost) | Dynamic selection among the above | Personalized, interpretable; best on several datasets
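
To ground the GNN entries in the table, here is a conceptual sketch of a single message passing step in plain PyTorch. It is not the implementation of PotentialNet or any other benchmarked model; the layer structure, GRU update, and sum aggregation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)  # message from (atom, neighbour) pair
        self.update = nn.GRUCell(dim, dim)      # recurrent node-state update

    def forward(self, h, edge_index):
        # h: (num_atoms, dim) node features; edge_index: (2, num_bonds)
        # with source indices in row 0 and target indices in row 1.
        src, dst = edge_index
        msgs = torch.relu(self.message(torch.cat([h[dst], h[src]], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, msgs)  # sum over neighbours
        return self.update(agg, h)

h = torch.randn(5, 64)                              # 5 atoms, 64-dim features
edges = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # bonds as directed pairs
print(MessagePassingLayer(64)(h, edges).shape)      # torch.Size([5, 64])
```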

4. Evaluation Metrics and Cross-Validation Considerations

The ADMET Benchmark Group promotes the use of multiple, chemically meaningful metrics:

  • Regression Tasks: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² capture the accuracy of continuous endpoint prediction (e.g., logD, logP, clearance).
  • Classification Tasks: Area Under the ROC Curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), and Matthews Correlation Coefficient (MCC) provide balanced performance estimates for binary or multi-class endpoints; both metric families are illustrated in the sketch after this list.
  • Error Propagation and Practical Limits: Studies that include experimental error propagation (i.e., comparing in silico predictions to independent laboratory measurements) emphasize that model performance may be constrained by the inherent reproducibility and noise in the underlying assays (Broccatelli et al., 2021). In some cases, predictive error approaches that of inter-assay variability, suggesting that further improvements are contingent upon higher-quality experimental data (Wei et al., 2023).
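
Both metric families can be computed with scikit-learn. The sketch below uses placeholder predictions purely to show the calls involved; real benchmarks compute these per endpoint and per split.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             roc_auc_score, average_precision_score,
                             matthews_corrcoef)

# Regression endpoint (e.g., logD): continuous labels vs. predictions.
y_true, y_pred = np.array([1.2, 0.3, -0.5]), np.array([1.0, 0.5, -0.2])
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Classification endpoint (e.g., CYP inhibition): binary labels vs. scores.
labels, scores = np.array([1, 0, 1, 0]), np.array([0.9, 0.2, 0.6, 0.4])
auroc = roc_auc_score(labels, scores)
auprc = average_precision_score(labels, scores)
mcc = matthews_corrcoef(labels, (scores > 0.5).astype(int))

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f} "
      f"AUROC={auroc:.2f} AUPRC={auprc:.2f} MCC={mcc:.2f}")
```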

Many benchmarks advocate for standardized nested cross-validation or stratified splits (scaffold, temporal, size), accompanied by robustness checks such as permutation importance and SHAP analyses to interpret feature contributions (Le et al., 9 Jun 2025).

5. Benchmark-Informed Methodological Advances

Adherence to strict benchmarking has driven several advances:

  • Deep Featurization: PotentialNet demonstrated that graph convolutional approaches optimizing learned atom-wise features end-to-end for specific ADMET endpoints yield higher accuracy and better extrapolation to novel chemistry than fixed fingerprints (Feinberg et al., 2019).
  • Attention Mechanisms and Multitask Learning: GATs and multi-task frameworks further boosted generalizability, particularly to chemical series divergent from the training compounds; attention-based pooling selectively focuses on relevant structural motifs (Broccatelli et al., 2021), as sketched after this list.
  • Self-supervised and Multimodal Representation Learning: Hypothesis-free pretraining, especially on large unlabeled molecular datasets, enables models to capture transferable structural information. Integration of graph and image modalities allows ADMET models to combine local and global chemical cues, improving prediction for endpoints such as membrane permeability and metabolic stability (Wang et al., 2023, Xu et al., 11 Aug 2024).
  • AutoML and Pipeline Personalization: Automated, grammar-guided pipeline construction introduces adaptability for novel chemical spaces, while Bayesian network–guided optimization both explains and expedites performance improvements (Sá et al., 22 Feb 2025).
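
A minimal sketch of the attention-based pooling idea referenced above, in plain PyTorch: a learned per-atom score weights each node's contribution to the molecule-level embedding. The module name and dimensions are illustrative, not taken from any cited model.

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # learned per-atom relevance score

    def forward(self, h):
        # h: (num_atoms, dim) node embeddings for a single molecule.
        weights = torch.softmax(self.score(h), dim=0)  # attention over atoms
        return (weights * h).sum(dim=0)                # (dim,) molecule vector

mol_embedding = AttentionReadout(64)(torch.randn(9, 64))
print(mol_embedding.shape)  # torch.Size([64])
```

Because the softmax concentrates weight on a few atoms, inspecting the learned weights is one simple way such models highlight the structural motifs driving a prediction.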

6. Out-of-Distribution Robustness and Real-World Generalization

Recent focus has turned to OOD robustness—a critical property for practical deployment:

  • Explicit OOD Benchmarks: Both DrugOOD (Ji et al., 2022) and ADMEOOD (Wei et al., 2023) curate challenging OOD splits, annotating datasets by domain (e.g., scaffold cluster, target protein, assay environment) and noise level (core, refined, general), and quantifying the drop in accuracy, Gap = AUC_ID - AUC_OOD, when models are evaluated on out-of-domain data; a minimal computation of this gap appears after this list.
  • Error Characterization: Models typically suffer substantial decreases in predictive performance under OOD or label-conflict conditions (for instance, ERM AUC dropping from 91.97% in-distribution to 83.59% OOD), highlighting the need for domain-adaptive or invariant-learning strategies (Wei et al., 2023).
  • Methodological Implications: No single algorithm universally excels across all OOD conditions; ongoing research is required to further improve model robustness and causal generalization (e.g., via invariant risk minimization, Mixup, GroupDRO).
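
The gap itself is straightforward to compute once per-domain predictions are available; a minimal sketch on placeholder data:

```python
from sklearn.metrics import roc_auc_score

# Placeholder labels and predicted scores for the two evaluation regimes.
y_id, p_id = [1, 0, 1, 1, 0], [0.95, 0.10, 0.80, 0.70, 0.30]    # in-distribution
y_ood, p_ood = [1, 0, 0, 1, 0], [0.60, 0.55, 0.70, 0.40, 0.20]  # out-of-distribution

auc_id = roc_auc_score(y_id, p_id)
auc_ood = roc_auc_score(y_ood, p_ood)
print(f"AUC_ID={auc_id:.2f}  AUC_OOD={auc_ood:.2f}  Gap={auc_id - auc_ood:.2f}")
```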

7. Impact and Future Directions

The ADMET Benchmark Group catalyzes method development by exposing critical gaps and facilitating transparent, reproducible model comparison. Its influence is tangible in several areas:

  • Software and Web Resources: Models achieving top-ranked performance (for example, XGBoost-ensembles (Tian et al., 2022) and graph-based frameworks (Feinberg et al., 2019, Zhang et al., 2022)) are now accessible via interactive servers (e.g., ADMETboost, HelixADMET), supporting adoption by the broader biomedical community.
  • Expanding Endpoints and Customization: Newer systems emphasize extensibility, allowing researchers to define and benchmark custom ADMET properties as research needs evolve (Zhang et al., 2022).
  • Integrative and Hybrid Approaches: There is increasing interest in multimodal, multi-task, and foundation model solutions that can assimilate larger, noisier, and more diverse chemical and biological information, leveraging quantum chemical pretraining, image/graph fusion, or sequence-based learning (Fallani et al., 10 Oct 2024, Wang et al., 2023, Xu et al., 11 Aug 2024).
  • Guidelines for Feature Selection: Empirical evidence confirms that including three-dimensional molecular descriptors, where available, and systematically applying feature importance analyses can lead to substantial accuracy gains, particularly for spatially dependent endpoints such as permeability (Le et al., 9 Jun 2025).

The ADMET Benchmark Group’s continued curation of realistic, challenging datasets and rigorous methodologies will remain central to advancing generalizable, interpretable, and robust ADMET prediction, thereby reducing attrition and cost in the drug discovery process.
