DeepTox: Deep Learning for Toxicity Prediction
- DeepTox is an ensemble-based deep learning system that integrates multiple feed-forward neural networks to predict binary molecular toxicity endpoints using engineered descriptors.
- Its multitask architecture and ensemble averaging approach enable robust predictions across twelve toxicity assays, outperforming traditional QSAR models on the Tox21 benchmark.
- DeepTox’s success underscores the lasting impact of descriptor-based methods, setting a high performance baseline despite recent advances in graph and pretrained models.
DeepTox is an ensemble-based computational system that marked a key milestone in molecular toxicity prediction by winning the 2015 Tox21 Data Challenge. It represents one of the earliest and most influential applications of deep learning to quantitative structure–activity relationship (QSAR) modeling, specifically targeting the prediction of twelve binary toxicity endpoints for small molecules. DeepTox employs only engineered molecular descriptors as inputs, harnessing a multitask architecture comprising multiple feed-forward neural networks and aggregating predictions via ensemble averaging. Notably, reevaluations indicate that DeepTox remains among the top-performing models on the original Tox21 benchmark, calling into question the extent of progress in the field over the past decade (Ebner et al., 18 Nov 2025).
1. Model Architecture and Multitask Design
Each constituent network within the DeepTox ensemble adheres to a standard feed-forward multitask structure. The input is a fixed-length descriptor vector $\mathbf{x} \in \mathbb{R}^d$. Successive hidden layers (sizes $h_1, \dots, h_L$, with nonlinearities such as ReLU) allow the network to model complex mappings from input features to outputs. The output layer comprises 12 sigmoid units, each estimating the probability $p_t(\mathbf{x})$ for a specific toxicity assay $t$. Multiple such networks ($M$ in total) are trained independently, each using different bootstrapped samples, random initializations, and descriptor subset selections to cultivate diversity. During inference, predictions for each endpoint are averaged across the ensemble:

$$\hat{p}_t(\mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} p_t^{(m)}(\mathbf{x}) \quad \text{for } t = 1, \dots, 12.$$
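The multitask forward pass and ensemble averaging described above can be illustrated with a minimal NumPy sketch. All dimensions, the single hidden layer, and the random-weight initialization are illustrative only and do not reproduce the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: descriptor dimension, hidden units, assays, ensemble size.
# DeepTox itself used several (and larger) hidden layers.
D, H, T, M = 9385, 64, 12, 5

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_net(rng):
    """One randomly initialized feed-forward net (one hidden layer for brevity)."""
    return {
        "W1": rng.normal(0, 0.01, (D, H)), "b1": np.zeros(H),
        "W2": rng.normal(0, 0.01, (H, T)), "b2": np.zeros(T),
    }

def forward(net, x):
    """Map a descriptor vector to 12 per-assay probabilities p_t(x)."""
    h = relu(x @ net["W1"] + net["b1"])
    return sigmoid(h @ net["W2"] + net["b2"])

# Ensemble prediction: average per-assay probabilities over M networks.
nets = [make_net(rng) for _ in range(M)]
x = rng.normal(size=D)
p_hat = np.mean([forward(n, x) for n in nets], axis=0)
```

The arithmetic mean over networks directly mirrors the averaging formula above; each element of `p_hat` is the ensemble estimate for one assay.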
2. Molecular Descriptor Inputs
DeepTox exclusively utilizes engineered descriptors, eschewing any learned or graph-based representations. The feature vector comprises four concatenated blocks:
| Descriptor Type | Dimensionality | Description |
|---|---|---|
| ECFP6 count fingerprints | 8192 | Counts of circular substructures, radius 3 |
| MACCS keys | 166 | Binary structural substructure keys |
| RDKit physicochemical feats | 200 | Real-valued calculated properties |
| Toxicity pattern features | 827 | Binary/integer encodings of structural toxicity alerts |
Resulting in $d = 8192 + 166 + 200 + 827 = 9385$ features. Prior to training, continuous descriptors (e.g., RDKit features) may be normalized to zero mean and unit variance using training set statistics:

$$x'_j = \frac{x_j - \mu_j}{\sigma_j} \quad \text{for } j \text{ in the RDKit feature indices.}$$
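The descriptor concatenation and per-block normalization can be sketched as follows. The function name `build_features` and the toy batch are illustrative; only the block dimensionalities come from the table above:

```python
import numpy as np

def build_features(ecfp6, maccs, rdkit_feats, tox_patterns, mu=None, sigma=None):
    """Concatenate the four descriptor blocks; z-score only the continuous RDKit block.

    At training time (mu is None) the statistics are computed from the given data;
    at inference time the stored training-set mu/sigma are passed in.
    """
    if mu is None:
        mu = rdkit_feats.mean(axis=0)
        sigma = rdkit_feats.std(axis=0) + 1e-8  # guard against zero variance
    rdkit_norm = (rdkit_feats - mu) / sigma
    X = np.concatenate([ecfp6, maccs, rdkit_norm, tox_patterns], axis=1)
    return X, mu, sigma

# Toy batch with shapes matching the table: 8192 + 166 + 200 + 827 = 9385 features.
rng = np.random.default_rng(1)
n = 4
X, mu, sigma = build_features(
    rng.integers(0, 5, (n, 8192)).astype(float),   # ECFP6 counts
    rng.integers(0, 2, (n, 166)).astype(float),    # MACCS keys
    rng.normal(size=(n, 200)),                      # RDKit physicochemical
    rng.integers(0, 2, (n, 827)).astype(float),    # toxicity alerts
)
```

Only the real-valued RDKit block is standardized; the count and binary blocks are passed through unchanged.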
3. Loss Function and Optimization Procedure
Networks are optimized with a multitask binary cross-entropy objective. Given $y_{i,t} \in \{0, 1\}$ as the true label for molecule $i$ and assay $t$, with $\hat{y}_{i,t}$ the predicted probability, the per-example loss is

$$\mathcal{L}_i = -\sum_{t=1}^{12} m_{i,t}\left[\, y_{i,t} \log \hat{y}_{i,t} + (1 - y_{i,t}) \log(1 - \hat{y}_{i,t}) \,\right],$$

where the mask $m_{i,t} \in \{0, 1\}$ excludes unknown or missing labels from the sum. Training uses stochastic gradient descent, notably the Adam optimizer with its standard hyperparameters (learning rate $\eta$, moment coefficients $\beta_1, \beta_2$, and stability constant $\epsilon$), and applies weight decay (L2 regularization). Early stopping based on a holdout validation set is used to mitigate overfitting. Hyperparameters such as hidden layer size, dropout probability, and regularization magnitude are determined by cross-validation.
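The masked multitask cross-entropy can be sketched in a few lines of NumPy. The toy labels and predictions below are illustrative, not from the Tox21 data:

```python
import numpy as np

def masked_bce(y_true, y_pred, mask):
    """Multitask BCE averaged over known labels; mask[i, t] = 1 where assay t is labeled."""
    eps = 1e-12  # numerical guard for log(0)
    ll = y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps)
    return -(mask * ll).sum() / mask.sum()

# Two molecules, three assays; two labels are unknown (masked out).
y_true = np.array([[1., 0., 1.], [0., 1., 0.]])
mask   = np.array([[1., 1., 0.], [1., 0., 1.]])
y_pred = np.array([[0.9, 0.1, 0.5], [0.2, 0.5, 0.1]])
loss = masked_bce(y_true, y_pred, mask)
```

Because the mask multiplies the log-likelihood elementwise, predictions at unlabeled positions contribute exactly zero gradient, which is how missing Tox21 labels are handled without imputation.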
4. Ensemble Training, Aggregation, and Inference
The ensemble comprises $M$ independently trained networks, each trained on variations of the input data (via bootstrap resampling) and utilizing different random seed initializations and descriptor subsets. For an input $\mathbf{x}$, each network produces probability estimates $p_t^{(m)}(\mathbf{x})$ for all 12 endpoints. Final predictions per assay are obtained by arithmetic averaging over the $M$ networks. This ensemble approach achieves robust and competitive toxicity prediction performance, notably outperforming single models and traditional QSAR pipelines when evaluated on the original Tox21 cluster-based test split.
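The per-network diversification (bootstrap rows, descriptor subsets, and distinct seeds) can be sketched as below. The function name, the subset fraction, and the seeding scheme are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def bootstrap_ensemble_setup(n_samples, n_features, M, subset_frac=0.8, seed=0):
    """Draw one bootstrap sample, one descriptor subset, and one init seed per network."""
    rng = np.random.default_rng(seed)
    configs = []
    for m in range(M):
        rows = rng.integers(0, n_samples, n_samples)  # sample with replacement
        cols = rng.choice(n_features, int(subset_frac * n_features), replace=False)
        configs.append({"rows": rows, "cols": np.sort(cols), "init_seed": seed + m})
    return configs

# One config per ensemble member; each net would then be trained on
# X[rows][:, cols] with its own initialization seed.
configs = bootstrap_ensemble_setup(n_samples=100, n_features=9385, M=5)
```

Training each network on a different bootstrap sample and feature subset decorrelates the members' errors, which is what makes the subsequent probability averaging effective.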
5. Benchmark Performance and Comparative Evaluation
Performance is evaluated by computing the area under the receiver operating characteristic curve (ROC-AUC) for each endpoint and averaging (macro-AUC). Key results on the Tox21 test set illustrate the comparative performance of DeepTox and representative contemporary methods:
| Model | Overall Macro-AUC | Notable Points |
|---|---|---|
| DeepTox (2016) | 0.846 | Top performer; ensemble method |
| SNN (2017) | 0.844 ± 0.002 | Single self-normalizing network |
| Random Forest | 0.829 ± 0.001 | Competitive tree-based ensemble |
| XGBoost | 0.823 ± 0.010 | Boosted tree ensemble |
| Chemprop (MPNN) | 0.815 ± 0.005 | Graph-based message passing NN |
| GIN (GNN) | 0.811 ± 0.003 | Graph isomorphism network |
| TabPFN | 0.807 ± 0.005 | Pretrained transformer for tables |
| GPT-OSS (zero-shot) | 0.703 ± 0.033 | Zero-shot LLM |
DeepTox’s performance matches or surpasses modern graph-based, classical, and pretrained models in this setting (Ebner et al., 18 Nov 2025).
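The macro-AUC metric used throughout the table can be sketched without external dependencies via the rank (Mann–Whitney) formulation of ROC-AUC, averaging per-assay scores over labeled entries only. The helper names and toy data are illustrative:

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of correctly ordered positive/negative pairs."""
    pos = scores[y_true == 1][:, None]
    neg = scores[y_true == 0][None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def macro_auc(Y, P, mask):
    """Per-assay AUC over known labels, then unweighted mean across assays."""
    aucs = []
    for t in range(Y.shape[1]):
        known = mask[:, t] == 1
        aucs.append(roc_auc(Y[known, t], P[known, t]))
    return float(np.mean(aucs))

# Toy example: two assays, perfectly ranked predictions -> macro-AUC of 1.0.
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
P = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.9], [0.2, 0.1]])
mask = np.ones_like(Y)
score = macro_auc(Y, P, mask)
```

Macro averaging weights each assay equally regardless of how many labeled molecules it has, which matters on Tox21 where label coverage varies substantially across the twelve endpoints.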
6. Significance, Limitations, and Field Stagnation
DeepTox’s success established a high baseline for toxicity prediction under data-constrained, descriptor-based regimes. As of the 2025 reevaluation, descriptor-based multitask networks remain highly competitive. Advances such as graph neural networks and pretrained foundation models, though promising, have not consistently surpassed DeepTox on the original Tox21 data. This suggests:
- Descriptor-based approaches remain difficult to outperform for small-to-moderate datasets like Tox21.
- Graph and representation learning methods require careful tuning and potentially access to larger or more diverse data to excel.
- Genuine progress in toxicity prediction may be limited without new assay modalities or larger datasets; the field exhibits evidence of a performance plateau since 2015.
A plausible implication is that innovation in small-molecule toxicity prediction may be reaching diminishing returns on legacy benchmarks such as Tox21 (Ebner et al., 18 Nov 2025).
7. Lasting Impact and Benchmarking Considerations
DeepTox catalyzed widespread adoption of deep learning in pharmaceutical research pipelines and remains a strong baseline for toxicity prediction. The persistence of its performance underscores the necessity for rigorous and reproducible benchmarking on unaltered datasets. The original Tox21 dataset’s subsequent modifications in major benchmarks (e.g., MoleculeNet, Open Graph Benchmark) have introduced label imputation and other changes, complicating fair comparison across studies. The creation of reproducible leaderboards with the authentic dataset and public APIs, as exemplified in recent work, is essential to accurately measure methodological progress and maintain scientific rigor in the field (Ebner et al., 18 Nov 2025).