Therapeutics Data Commons (TDC)

Updated 23 February 2026
  • Therapeutics Data Commons (TDC) is a unified, open-science platform that curates, benchmarks, and structures ML datasets and tasks for drug discovery.
  • It provides 66 AI-ready datasets organized into 22 learning tasks covering single-instance, multi-instance, and generative challenges with diverse modalities.
  • The platform addresses real-world issues like distributional shift and data fragmentation through strategic data splits, thorough evaluation metrics, and community extensibility.

Therapeutics Data Commons (TDC) is a unified, open-science platform that systematically structures, curates, and benchmarks machine learning datasets and tasks spanning the full therapeutic development pipeline. TDC provides the infrastructure and resources—including datasets, task definitions, data-processing functions, task-appropriate splits, evaluation protocols, model benchmarks, and community tools—required to advance machine learning methods for drug discovery, design, and development. By standardizing data and ML task formulation, TDC addresses major bottlenecks in the translation of computational methods to real-world therapeutic challenges, specifically distributional shift, diversity of modalities, and the critical need for robust generalization to novel compounds and targets (Huang et al., 2021).

1. Objectives and Motivation

The genesis of TDC is rooted in persistent challenges in therapeutic machine learning: a lack of AI-ready, standardized datasets; fragmented and heterogeneous data curation across multiple sources; and a dearth of unified, task-representative evaluation protocols. Drug development remains costly (13–15 years, >$2 billion per molecule), yet most ML approaches lack reproducibility and true generalization due to poorly benchmarked data and absence of deployment-mimicking splits (e.g., scaffolds, chronology, unseen targets). TDC aims to operationalize data into scientifically valid ML challenges by:

  • Curating datasets once and exposing them via a stable Python interface.
  • Formalizing biologically and clinically relevant predictive/generative tasks across modalities (small molecules, proteins, clinical outcomes).
  • Emphasizing evaluation under real-world generalization regimes through meaningful splits and multi-metric assessment.
  • Enabling community benchmarking and extensibility (adding new tasks, evaluation strategies, or data types).

These principles establish TDC as an anchor framework for rigorous, reproducible ML/AI research in therapeutics (Huang et al., 2021).

2. Dataset Hierarchy and Task Structure

TDC encompasses 66 AI-ready datasets grouped into 22 learning tasks, hierarchically organized by problem type, task, and dataset:

  • Problem types:
    • Single-instance prediction: e.g., ADME property prediction, toxicity, quantum properties.
    • Multi-instance prediction: e.g., drug–target interaction (DTI), drug–drug interaction (DDI), protein–protein interaction (PPI), gene-disease association.
    • Generation: molecule generation, retrosynthesis, chemical reaction outcome.
  • Learning tasks (examples):
    • ADMET properties: e.g., Caco2 permeability (Caco2_Wang), aqueous solubility (Solubility_AqSolDB), lipophilicity, CYP inhibition.
    • DTI: BindingDB_Kd, DAVIS, KIBA (split by cold-start, scaffold, temporal).
    • Drug synergy, antibody paratope/epitope prediction, peptide–MHC binding, retrosynthesis.
    • Generative tasks: MOSES, ZINC, ChEMBL for molecule generation.
  • Dataset scale:
    • Dataset sizes range from hundreds of samples (e.g., Caco2_Wang, 906 samples) to nearly two million (ChEMBL).
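The three-level organization (problem type → task → dataset) can be pictured as a nested mapping; a minimal illustrative sketch using dataset names from the lists above (structure and lookup helper are hypothetical, and the real catalog spans 22 tasks and 66 datasets):

```python
# Illustrative sketch of TDC's hierarchy: problem type -> task -> datasets.
# Names are examples drawn from the text; the full catalog is much larger.
TDC_HIERARCHY = {
    "single_instance_prediction": {
        "ADME": ["Caco2_Wang", "Solubility_AqSolDB"],
    },
    "multi_instance_prediction": {
        "DTI": ["BindingDB_Kd", "DAVIS", "KIBA"],
    },
    "generation": {
        "MolGen": ["MOSES", "ZINC", "ChEMBL"],
    },
}

def find_task(dataset_name):
    """Return (problem_type, task) for a dataset name, or None if absent."""
    for problem, tasks in TDC_HIERARCHY.items():
        for task, datasets in tasks.items():
            if dataset_name in datasets:
                return problem, task
    return None

assert find_task("DAVIS") == ("multi_instance_prediction", "DTI")
```

Hierarchical lookup of this kind is what lets a single loader interface resolve any dataset name to its task context.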

This structure permits fine-grained comparison within and across therapeutic modalities and problem types (Huang et al., 2021).

3. Data Functions and Splitting Strategies

TDC exposes a suite of 33 data utilities to facilitate transformation and robust ML workflow orchestration:

  • Format conversion: SMILES/SELFIES ↔ graph/3D representations; Morgan/RDKit fingerprints; feature extraction for non-structural data.
  • Processing: balancing, re-labeling, unit conversion, database querying.
  • Visualization: label density, scaffold distribution.
  • Negative sampling: for interaction tasks.
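Negative sampling for interaction tasks can be sketched as drawing entity pairs absent from the positive set; a minimal pure-Python illustration (function name and data shapes are hypothetical, not the TDC utility):

```python
import random

def sample_negatives(positive_pairs, drugs, targets, n_neg, seed=0):
    """Draw drug-target pairs not present in the positive set.

    positive_pairs: set of (drug, target) tuples with observed interactions.
    Returns n_neg pairs assumed non-interacting -- the standard, if noisy,
    approximation that absence of evidence can serve as a negative label.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < n_neg:
        pair = (rng.choice(drugs), rng.choice(targets))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)

pos = {("d1", "t1"), ("d2", "t2")}
neg = sample_negatives(pos, ["d1", "d2"], ["t1", "t2"], n_neg=2)
# every sampled pair lies outside the positive set
assert all(p not in pos for p in neg)
```

Callers must ensure `n_neg` does not exceed the number of available non-positive pairs, or the loop will not terminate.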

TDC provides five empirically validated split strategies to mimic deployment and out-of-distribution (OOD) scenarios:

  • Random: any dataset; baseline that does not stress generalization.
  • Scaffold: small-molecule tasks; partitions by Murcko scaffold to enforce chemotype novelty in the test set.
  • Cold-start: multi-instance tasks (e.g., DTI); places out-of-distribution entities (unseen proteins or drugs) in validation/test.
  • Combinatorial: drug-pair tasks; evaluates unseen combinations (e.g., for synergy prediction).
  • Temporal: longitudinal or patent-dated assays; simulates prospective performance and tests time-related drift.
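A cold-start split, for instance, holds out whole entities rather than individual records, so every test pair involves a target never seen in training; a minimal sketch of that idea (illustrative, not the TDC implementation):

```python
import random

def cold_start_split(pairs, frac_test=0.25, entity_index=1, seed=0):
    """Split (drug, target, label) records so test targets are unseen in train.

    entity_index selects which element is held out cold (1 = target).
    """
    rng = random.Random(seed)
    entities = sorted({p[entity_index] for p in pairs})
    rng.shuffle(entities)
    n_test = max(1, int(len(entities) * frac_test))
    test_entities = set(entities[:n_test])
    train = [p for p in pairs if p[entity_index] not in test_entities]
    test = [p for p in pairs if p[entity_index] in test_entities]
    return train, test

pairs = [("d1", "t1", 5.2), ("d2", "t1", 6.0), ("d1", "t2", 4.1), ("d3", "t3", 7.3)]
train, test = cold_start_split(pairs)
train_targets = {p[1] for p in train}
# no test target appears anywhere in the training set
assert all(p[1] not in train_targets for p in test)
```

Splitting at the entity level, rather than the record level, is exactly what makes the test set out-of-distribution with respect to targets.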

This toolkit is essential for evaluating generalization beyond trivial random splits, especially on tasks such as lead optimization or target hopping (Huang et al., 2021).

4. Evaluation Protocols and Benchmark Groups

TDC unifies benchmarking via 23 metrics, implemented in standardized Evaluator classes:

  • Regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), R², Pearson and Spearman correlation.
  • Binary classification: AUROC, AUPRC, accuracy, precision, recall, PR@K, RP@K.
  • Multi-class/multi-label: Micro/Macro-F1, Cohen’s Kappa.
  • Generative tasks: validity, uniqueness, novelty, diversity (Tanimoto), KL divergence, Fréchet ChemNet Distance (FCD).
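The Evaluator pattern puts each metric behind one call signature, dispatched by name; a minimal pure-Python sketch with two of the listed metrics (illustrative of the pattern, not the TDC class itself):

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def spearman(y_true, y_pred):
    """Spearman correlation = Pearson correlation of rank vectors (ties ignored for brevity)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a) ** 0.5
        vb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (va * vb)
    return pearson(ranks(y_true), ranks(y_pred))

class Evaluator:
    """Name-dispatched metric, mirroring the unified-interface idea."""
    _METRICS = {"mae": mae, "spearman": spearman}

    def __init__(self, name):
        self.fn = self._METRICS[name.lower()]

    def __call__(self, y_true, y_pred):
        return self.fn(y_true, y_pred)

ev = Evaluator("spearman")
assert abs(ev([1, 2, 3, 4], [10, 20, 30, 40]) - 1.0) < 1e-9  # perfect monotone agreement
```

Standardizing the call signature is what lets leaderboard code evaluate any model under any prescribed metric without per-metric glue.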

Public leaderboards are organized into 29 “benchmark groups” (e.g., ADMET_Group for 22 endpoint properties with scaffold split, DTI_DG for DTI with temporal split and multi-metric evaluation). Each group prescribes datasets, splits, and evaluation criteria, enforcing apples-to-apples comparison and discouraging cherry-picking.

For example, on the 22 ADMET datasets with scaffold splits, models are ranked by MAE/Spearman for regression and AUROC/AUPRC for classification, with group average cross-endpoint metrics determining leaderboard ranking (Huang et al., 2021).

5. Generation Oracles and Realistic Molecule Design

TDC operationalizes molecule generation evaluation through 17 oracle functions:

  • Simple structural heuristics: QED, penalized LogP, synthetic accessibility (SA).
  • Distribution-learning tasks: rediscovery, scaffold hops (GuacaMol).
  • Docking meta-oracle: standardized access to AutoDock Vina, smina, QuickVina2, PSOVina, DOCK6 via pyscreener for binding-affinity evaluation.
  • Retrosynthesis-based synthesizability: ASKCOS, Molecule.one, IBM RXN for chemo-informatic feasibility.
  • Target-specific bioactivity classifiers: random forest models for GSK3β and JNK3, and an SVM for DRD2.

This enables generative models to be optimized and scored against practical properties—beyond LogP or QED—such as docking affinity and synthetic accessibility, crucial for advancing drug-like molecule generation (Huang et al., 2021).

6. Empirical Insights and Performance Analysis

Analysis of TDC leaderboards reveals persistent gaps between SoTA models and deployment-quality solutions:

  • ADMET scaffold-split benchmarks: Graph-neural networks with self-supervised pretraining (e.g., ContextPred, AttrMasking) deliver gains relative to SMILES-CNNs, yet expert-crafted descriptors (RDKit2D, Morgan fingerprints) frequently outperform deep methods on structurally novel test sets (e.g., Caco2_Wang MAE: 0.393 for RDKit2D vs. 0.401 for AttentiveFP).
  • DTI with temporal split: out-of-distribution Pearson correlation degrades from ~0.70 in-distribution to 0.42–0.43 even with domain-generalization methods (ERM, MMD, CORAL), indicating a substantial inability to generalize to proteins and drugs that appear after the training period.
  • Generative docking benchmarks: with a limited oracle budget (≤1,000 calls), no generative or ML method surpasses the best hit already present in the dataset (−12.08 kcal/mol), and a trade-off between molecule novelty and synthesizability emerges as calls increase. Only at 5,000+ calls do some methods (e.g., Graph-GA) exceed the prior best virtual-screening score (−14.81).
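The limited-oracle-call regime can be enforced by wrapping any scoring function in a call counter; a minimal sketch (class and names are illustrative, not TDC code):

```python
class BudgetExceeded(RuntimeError):
    pass

class BudgetedOracle:
    """Wrap a scoring function and refuse calls past a fixed budget.

    Mirrors the limited-oracle-call regime: an optimizer must make every
    evaluation count rather than brute-force the search space.
    """
    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.calls = 0

    def __call__(self, molecule):
        if self.calls >= self.budget:
            raise BudgetExceeded(f"oracle budget of {self.budget} exhausted")
        self.calls += 1
        return self.score_fn(molecule)

# toy stand-in for an expensive docking score (string length, purely illustrative)
oracle = BudgetedOracle(score_fn=len, budget=3)
scores = [oracle(s) for s in ["CCO", "CCCC", "c1ccccc1"]]
assert scores == [3, 4, 8] and oracle.calls == 3
```

Counting calls at the wrapper level, rather than trusting each method to self-report, keeps budget comparisons honest across submissions.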

These results demonstrate the value of TDC’s design: by simulating generalization and prospective deployment conditions, it uncovers where ML generalizes and where it fails (Huang et al., 2021).

7. Challenges, Extensions, and Best Practices

TDC is iteratively addressing key open challenges:

  • Distributional shift: Scaffold, cold-start, combinatorial, and temporal splits modularize major OOD scenarios (novel chemotypes, new targets, time drift).
  • Modality/multi-scale generalization: Tasks span small molecules, peptides, antibodies, proteins, gene-editing, and clinical context, forcing the development of methods that integrate graphs, sequences, and tabular/omics.
  • Community extensibility: Python API for reproducible data loading, benchmarking, and extending with new datasets, evaluation strategies, or modalities.

Recommended best practices include: using a split type that reflects deployment (scaffold, cold-start); reporting multiple metrics to capture multi-faceted performance; benchmarking across full task groups rather than single datasets; leveraging oracles relevant to the downstream goal; and establishing strong domain-specific baselines before exploring complex ML.
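A split reflecting deployment can be as simple as ordering records by date and training only on the past; a minimal temporal-split sketch (illustrative, not the TDC implementation):

```python
def temporal_split(records, cutoff_year):
    """Train on records dated strictly before cutoff_year; test on the rest.

    records: iterable of (year, features, label) tuples. This mimics
    prospective deployment: the model never sees future measurements.
    """
    train = [r for r in records if r[0] < cutoff_year]
    test = [r for r in records if r[0] >= cutoff_year]
    return train, test

records = [(2016, "x1", 0.3), (2018, "x2", 0.7), (2019, "x3", 0.5), (2021, "x4", 0.9)]
train, test = temporal_split(records, cutoff_year=2019)
# every training record predates every test record
assert max(r[0] for r in train) < min(r[0] for r in test)
```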

As the field progresses, TDC’s open-source repository supports the addition of new modalities (e.g., PROTACs, ADCs), integrating single-cell or clinical data (per PyTDC (Velez-Arce et al., 8 May 2025)), and adopting large-scale multimodal and knowledge-augmented approaches (as exemplified by MedEx (Jones et al., 14 Aug 2025) and Otter-Knowledge (Lam et al., 2023)) to further drive innovation and robust benchmarking (Huang et al., 2021, Velez-Arce et al., 8 May 2025, Jones et al., 14 Aug 2025, Lam et al., 2023).


For comprehensive documentation, datasets, standardized code listings, benchmark submissions, and community extensions, see https://tdcommons.ai (Huang et al., 2021).
