Multitask ADMET Data Splits

Updated 16 October 2025
  • The paper introduces a data split framework that aligns multiple ADMET endpoints to prevent cross-task leakage and boost inductive transfer in drug discovery models.
  • It employs temporal, scaffold, and cluster splitting strategies to ensure rigorous benchmarking and realistic validation of predictive models.
  • The approach integrates adaptive task weighting and loss optimization to overcome data imbalance and improve overall model performance.

Multitask ADMET data splits refer to the design and partitioning of datasets for the simultaneous predictive modeling of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties in drug discovery applications. ADMET endpoints span diverse biochemical, physical, and toxicity measurements and are frequently assayed across large compound libraries. Multitask learning leverages information sharing among these heterogeneous endpoints to improve model generalization, particularly when data for individual endpoints are scarce or imbalanced. The construction and use of multitask ADMET data splits are central to rigorous benchmarking, avoiding data leakage, and realizing benefits from inductive transfer in deep learning, tree-based, or quantum-informed models.

1. Fundamentals of Multitask ADMET Data Splits

Multitask ADMET data splits organize datasets such that multiple property endpoints for a given set of small molecules are modeled simultaneously, often using a shared representation. Rather than generating independent splits for each endpoint, multitask splits maintain aligned train, validation, and test partitions for all endpoints, ensuring that each compound's data are split consistently across tasks. This alignment is critical for preventing information leakage and for enabling effective inductive transfer between related endpoints, particularly when deploying models such as multitask neural networks (MTNNs), multitask graph neural networks (GNNs), tree-based ensembles, or quantum-enhanced architectures.
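
Because most compounds are assayed on only a subset of endpoints, shared-representation models are typically trained with a masked loss that skips unmeasured labels so sparse tasks do not contaminate the gradient. The following is a minimal PyTorch sketch of that pattern; the function name and tensor layout are illustrative assumptions rather than a published implementation.

```python
# A minimal sketch of a masked multitask regression loss, assuming labels at
# unmeasured (compound, endpoint) positions are filled with a placeholder 0
# and flagged by a 0/1 mask; names and tensor layout are illustrative.
import torch

def masked_multitask_mse(preds: torch.Tensor,
                         labels: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """preds, labels, mask: (batch, n_tasks); mask is 1 where measured."""
    sq_err = (preds - labels) ** 2 * mask            # zero out missing labels
    # Mean over measured entries per task, then average across tasks so
    # densely labeled endpoints do not dominate the objective.
    per_task = sq_err.sum(dim=0) / mask.sum(dim=0).clamp(min=1.0)
    return per_task.mean()
```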

Two primary splitting strategies are prevalent:

  • Temporal splits: Partition compounds based on the chronology of experiment dates or addition to a database, simulating prospective ADMET property prediction and evolution of chemical space (Adrian et al., 14 Oct 2025).
  • Cluster/scaffold splits: Group compounds by chemical scaffolds (e.g., Bemis–Murcko) or using clustering techniques on fingerprint vectors (e.g., PCA-reduced Morgan fingerprints) to maximize structural diversity between train and test sets (Adrian et al., 14 Oct 2025, Tian et al., 2022).

A robust multitask ADMET data split ensures that no compound in a test set has corresponding measurements in the training or validation set for any endpoint. Maintaining such alignment across properties is essential for reliable benchmarking and generalizability assessment.
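
One way to realize this guarantee is to assign each compound to a partition exactly once, at the compound level, before any per-endpoint label tables are derived. The sketch below uses RDKit Bemis–Murcko scaffolds; the 80/20 ratio and the largest-scaffolds-to-train heuristic are common conventions assumed here, not prescriptions from the cited studies.

```python
# Hedged sketch of a compound-level Bemis-Murcko scaffold split with RDKit.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def aligned_scaffold_split(smiles_list, test_frac=0.2):
    # Group compound indices by their Bemis-Murcko scaffold SMILES.
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(idx)
    train, test = [], []
    n_train = int((1 - test_frac) * len(smiles_list))
    # Largest scaffold families fill train; the long tail of rarer
    # scaffolds lands in test, so test chemotypes are novel to the model.
    for _, idxs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        (train if len(train) < n_train else test).extend(idxs)
    return train, test
```

Because each compound index is assigned exactly once, every endpoint's label table inherits the same train/test membership, which yields the cross-task alignment described above.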

2. Benchmark Datasets and Splitting Methodologies

Recent studies have introduced standardized multitask ADMET benchmark datasets and splitting procedures:

  • Published Multitask Datasets: For example, a 30-endpoint internal Merck dataset with over 800,000 compounds is split temporally (80–20) by compound addition date, while a 25-endpoint public dataset (114,112 compounds) drawn from the literature is split by clustering on Morgan fingerprints (Adrian et al., 14 Oct 2025). Other benchmarks such as the Therapeutics Data Commons (TDC) facilitate comparison across 13–22 ADMET tasks using aligned scaffold splits (Zhang et al., 4 Sep 2025, Tian et al., 2022, Notwell et al., 2023).
  • Validation Schemes:
    • Temporal validation: Training, validation, and test splits are determined by assay date, simulating real-world compound progression (Kearnes et al., 2016, Adrian et al., 14 Oct 2025); a minimal sketch follows this list.
    • Scaffold-based validation: Ensures robust assessment on novel chemotypes by splitting along core chemical scaffolds across all tasks (Tian et al., 2022, Zhang et al., 2022).
    • Maximum Dissimilarity (MD-FIS-WD): Uses both molecular descriptors and endpoint values to produce splits with representative data distributions and low subset error across tasks (Ye et al., 2018).
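
The temporal scheme above reduces to a single chronological sort applied at the compound level. A hedged sketch, assuming a pandas DataFrame with one row per compound and a date_added column; the column name and 80/20 ratio mirror the Merck-style setup but are otherwise assumptions.

```python
# Hedged sketch of a compound-level temporal (80/20) split, assuming a pandas
# DataFrame with one row per compound and a 'date_added' column.
import pandas as pd

def temporal_split(df: pd.DataFrame, date_col: str = "date_added",
                   train_frac: float = 0.8):
    """Everything registered before the cutoff date trains; the rest tests."""
    ordered = df.sort_values(date_col)
    cutoff = int(train_frac * len(ordered))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]
```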

Table: Example Multitask ADMET Data Split Strategies

| Study/Dataset | Splitting Method | Key Features |
| --- | --- | --- |
| Merck 30-endpoint (Adrian et al., 14 Oct 2025) | Temporal (80–20) | Large-scale, time-aware, industrial data |
| Public 25-endpoint (Adrian et al., 14 Oct 2025) | Cluster (PCA of fingerprints) | Literature-sourced, diverse endpoints |
| TDC 22-endpoint (Tian et al., 2022, Zhang et al., 4 Sep 2025) | Scaffold-based | Benchmark group, standardized multitask splits |

Dataset splits designed for multitask models minimize cross-task leakage and allow for comprehensive benchmarking of both model generalization and transfer effects.

3. Impact of Data Splitting on Model Performance and Inductive Transfer

The nature of multitask splits directly influences observed multitask effects and model generalization:

  • Inductive transfer and multitask benefit: The magnitude of improvement from multitask learning is dictated by both dataset size and the relatedness of ADMET endpoints. Smaller datasets benefit most, as pooling side information via multitask learning yields more robust shared features, while larger datasets exhibit diminishing returns (Kearnes et al., 2016).
  • Validation rigor: Temporal splits yield more realistic, less optimistic estimates of generalization than random or per-task splits, as highlighted in analyses of industrial datasets (Kearnes et al., 2016, Adrian et al., 14 Oct 2025).
  • Side information and endpoint relatedness: While adding extra assays (side information) has the potential to enhance prediction, benefits are highly “dataset-dependent” and contingent on the chemical or functional relevance of additional endpoints—excessively unrelated tasks can detract from performance (Kearnes et al., 2016).

Multitask data splits thus play a fundamental role in accurately measuring the effect of shared learning, transfer, and model robustness.

4. Algorithmic Treatment of Multitask Loss, Task Weighting, and Data Imbalance

Multitask ADMET modeling requires explicit strategies for balancing losses across tasks with varying data volume and difficulty:

  • Task-weighted loss functions: Common approaches scale each endpoint’s loss inversely with training set size (e.g., “task-weighted MTNNs” with cost scalars) (Kearnes et al., 2016), or adopt dynamic cost weights to address label imbalance (e.g., $\mathrm{cost} = W_1 C_1 + W_2 C_2 + W_3 C_3 + W_4 C_4$ for pharmacokinetic parameters) (Ye et al., 2018).
  • Exponential sample-aware weighting: QW-MTL implements a learnable weighting scheme where each task’s contribution is scaled via $w_t = r_t^{\mathrm{softplus}(\log\beta_t)}$, with $r_t = n_t/\sum_i n_i$ and $\mathcal{L}_{\text{total}} = \sum_t w_t \mathcal{L}_t$ (Zhang et al., 4 Sep 2025); see the sketch after this list.
  • Adaptive optimizers for gradient conflict: AIM introduces a learned policy to mediate destructive gradient interference between tasks, optimizing inter-task relationships with a differentiable augmented objective and yielding interpretability into task compatibility (Minot et al., 30 Sep 2025).
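
A minimal PyTorch reconstruction of the exponential sample-aware weighting above, derived directly from the stated formulas; it is a sketch under those definitions, not the QW-MTL reference implementation.

```python
# Reconstruction of the exponential sample-aware weighting from the formulas
# above: w_t = r_t ** softplus(log beta_t), with r_t = n_t / sum_i n_i.
import torch
import torch.nn.functional as F

class SampleAwareTaskWeights(torch.nn.Module):
    def __init__(self, task_sizes):                  # task_sizes: [n_1, ..., n_T]
        super().__init__()
        n = torch.tensor(task_sizes, dtype=torch.float)
        self.register_buffer("r", n / n.sum())       # fixed sample ratios r_t
        self.log_beta = torch.nn.Parameter(torch.zeros(len(task_sizes)))

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        # softplus keeps the exponent positive, so each w_t stays in (0, 1].
        w = self.r ** F.softplus(self.log_beta)
        return (w * task_losses).sum()               # L_total = sum_t w_t L_t
```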

Proper loss and weighting strategies are essential for effective multitask training in the presence of endpoint imbalance and heterogeneity.

5. Effects of Task Relatedness and Side Information

Performance gains in multitask ADMET models are closely tied to the chemical and biological relatedness between endpoints:

  • Quantifying relatedness: Using metrics such as label agreement among highly similar compounds, relatedness $R(\alpha, \beta)$ is estimated as $R = \frac{\max\{S(\alpha,\beta),\, D(\alpha,\beta)\}}{S(\alpha,\beta) + D(\alpha,\beta)}$, where $S$ and $D$ count label agreements and disagreements for compound pairs with Tanimoto similarity above a threshold (Kearnes et al., 2016); a sketch follows this list.
  • Diminishing returns of unrelated side tasks: Integrating hundreds of weakly related endpoints can saturate or degrade model performance, suggesting that multitask benefits are maximized when tasks are chemically or functionally coupled (Kearnes et al., 2016).
  • Negative transfer mitigation: Adaptive policy optimizers (e.g., AIM), geometric alignment frameworks, and careful selection of auxiliary endpoints are employed to minimize negative transfer and destructive interference, especially when tasks diverge (Minot et al., 30 Sep 2025, Ko et al., 3 May 2024).
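
The relatedness metric referenced in the first item above can be reconstructed as follows. The Morgan fingerprint settings, the 0.5 Tanimoto threshold, and the cross-task pairing convention are assumptions; (Kearnes et al., 2016) may define the pairing differently.

```python
# Reconstruction of R(alpha, beta) = max(S, D) / (S + D); fingerprint
# settings, threshold, and pairing convention are assumptions.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def relatedness(smiles_a, labels_a, smiles_b, labels_b, threshold=0.5):
    fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2)
    fps_a = [fp(s) for s in smiles_a]
    fps_b = [fp(s) for s in smiles_b]
    S = D = 0  # agreements / disagreements among sufficiently similar pairs
    for fa, la in zip(fps_a, labels_a):
        for fb, lb in zip(fps_b, labels_b):
            if DataStructs.TanimotoSimilarity(fa, fb) > threshold:
                S += int(la == lb)
                D += int(la != lb)
    return max(S, D) / (S + D) if (S + D) else 0.0
```

With this form, $R$ lies in $[0.5, 1]$ whenever similar pairs exist; taking the maximum means strong anti-correlation between endpoints counts toward relatedness just as strong agreement does.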

Selecting endpoints with biological or chemical affinity for multitasking is thus critical for maximizing benefit.

6. Practical Implementation and Benchmarking Considerations

Methodological advances and standardized multitask splits are enabling robust model development and acceleration:

  • Accelerated multitask pretraining and finetuning: Distributed data parallel (DDP) workflows in PyTorch scale large pretrained models (e.g., KERMT with 51M parameters) across multiple GPUs, achieving near-linear speedups, while batched graph generation (e.g., cuik-molmaker) reduces memory and runtime (Adrian et al., 14 Oct 2025).
  • Benchmarking resources: The publication of multitask and temporally split ADMET datasets provides the community with resources for fair benchmarking and enables direct comparison of multitask approaches in industrial workflows (Adrian et al., 14 Oct 2025).
  • Evaluation metrics: Performance is quantified by endpoint-specific metrics (Pearson $r^2$ for regression, AUC for classification, and mean absolute error or rank correlation), enabling rigorous statistical assessment of multitask improvement (Adrian et al., 14 Oct 2025, Tian et al., 2022, Zhang et al., 4 Sep 2025); a masked per-task computation is sketched after this list.
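
As referenced in the final bullet, a hedged sketch of per-endpoint evaluation with unmeasured labels masked out, using SciPy and scikit-learn; array layout and names are illustrative.

```python
# Hedged sketch of per-endpoint evaluation; the metrics follow the text above.
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

def per_task_metrics(y_true, y_pred, mask, is_classification):
    """y_true, y_pred, mask: (n_compounds, n_tasks) numpy arrays."""
    out = {}
    for t in range(y_true.shape[1]):
        keep = mask[:, t].astype(bool)               # drop unmeasured labels
        yt, yp = y_true[keep, t], y_pred[keep, t]
        if is_classification[t]:
            out[t] = roc_auc_score(yt, yp)           # AUC for classification
        else:
            out[t] = pearsonr(yt, yp)[0] ** 2        # Pearson r^2 for regression
    return out
```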

These practices collectively facilitate model reproducibility, scalability, and industrial applicability.

7. Limitations, Future Directions, and Domain-Specific Implications

Key limitations and research trajectories include:

  • Dataset dependency and scalability challenges: Multitask effects are not universally positive; gains depend on data size, endpoint relatedness, and weighting strategy. Computational complexity scales with task number, requiring efficient algorithms and architectural choices (Kearnes et al., 2016, Adrian et al., 14 Oct 2025, Ko et al., 3 May 2024).
  • Tailored approaches: Optimal model architecture, task aggregation, and side information inclusion must be adapted to each ADMET use-case, as evidenced by endpoint-specific trends and diminishing multitask returns (Kearnes et al., 2016).
  • Diagnostic insight and automated curriculum: Interpretable policy matrices (e.g., AIM) may enable automated grouping of synergistic endpoints and design of curricula for progressive multitask training (Minot et al., 30 Sep 2025).
  • Benchmark evolution: Ongoing publication and expansion of multitask ADMET data splits will enable further advancement in benchmarking, generalization studies, and diagnostic tool development.

A plausible implication is that, despite sizable gains from multitask learning, future models must judiciously balance shared representation, endpoint selection, adaptive loss strategies, and computational efficiency to realize domain-resilient ADMET prediction in both academic and industrial settings.
