- The paper introduces D-MPNN, a model that uses directed bond-level message passing to improve molecular property prediction.
- It benchmarks learned molecular representations against traditional descriptors on 35 public and proprietary datasets, demonstrating significant performance gains.
- Ablation studies and hyperparameter optimization validate the model's design choices, achieving up to a 37% improvement in key performance metrics.
Analyzing Learned Molecular Representations for Property Prediction
In "Analyzing Learned Molecular Representations for Property Prediction," Yang et al. investigate the performance of message-passing neural networks (MPNNs) versus traditional descriptors and molecular fingerprints in predicting molecular properties. Their work benchmarks these methods extensively across 35 datasets, including both public and proprietary data, highlighting the strengths and potential of graph convolutional neural networks (GCNNs).
Background and Motivation
Molecular property prediction is a cornerstone of cheminformatics, playing a crucial role in diverse applications such as drug discovery and materials science. Traditional approaches relying on expert-crafted descriptors and machine learning models like support vector machines and random forests have demonstrated significant utility. However, recent advancements in deep learning, particularly GCNNs, have shown promise by learning task-specific molecular representations directly from the molecular graph. Prior work has yielded conflicting results regarding the superiority of learned representations versus fixed descriptors, motivating this paper's comprehensive comparison.
Methodology
The authors introduce a graph convolutional model that operates on directed edges and combines learned representations with fixed molecular descriptors. Their model, the Directed Message Passing Neural Network (D-MPNN), improves on previous MPNN approaches by centering message passing on bonds rather than atoms: each directed bond excludes its own reverse bond when aggregating messages, which prevents information from being passed straight back to where it came from — the redundant loops inherent in atom-based updates.
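The bond-centered update can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: the learned weight matrix is dropped (treated as the identity), and the graph is encoded with precomputed index lists (`rev` mapping each directed bond to its reverse, `incoming` listing the bonds that arrive at each bond's source atom), which are assumptions for the sketch.

```python
import numpy as np

def d_mpnn_messages(h0, rev, incoming, T=3):
    """Toy directed bond-level message passing.

    h0       : (n_bonds, d) initial hidden state per *directed* bond v->w
    rev      : rev[b] is the index of the reverse bond w->v for bond b = v->w
    incoming : incoming[b] lists the bonds k->v arriving at the source atom of b
    """
    h = h0.copy()
    for _ in range(T):
        # message to bond v->w: sum over bonds k->v, EXCLUDING the reverse w->v,
        # so information is never bounced straight back along the same bond
        m = np.stack([
            h[incoming[b]].sum(axis=0) - h[rev[b]]
            for b in range(len(h))
        ])
        h = np.maximum(0.0, h0 + m)  # ReLU(h0 + W m), with W omitted for brevity
    return h
```

On a 3-atom chain, a terminal bond receives no message at all, since its only neighbor is its own reverse — exactly the loop-avoidance property described above.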
The evaluation comprises 19 public datasets spanning quantum mechanics, physical chemistry, biophysics, and physiology, as well as 16 proprietary datasets from Amgen, Novartis, and BASF. These datasets encompass a wide range of regression and classification tasks, providing a rigorous benchmark for assessing model performance. The evaluation setup includes both scaffold-based and random splits to mimic real-world application scenarios more closely.
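A scaffold-based split groups molecules by their core ring system so that whole scaffolds never straddle the train/test boundary. The sketch below assumes the scaffold keys (e.g. Murcko scaffold SMILES, typically computed with RDKit) are supplied as precomputed strings; the largest-groups-first heuristic is a common convention, not necessarily the paper's exact procedure.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8):
    """Split molecule indices so no scaffold appears in both train and test.

    scaffolds : list of scaffold keys (one per molecule); computing them,
                e.g. via RDKit's MurckoScaffold, is assumed to happen upstream.
    """
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    train, test = [], []
    cutoff = frac_train * len(scaffolds)
    # fill the train set with the largest scaffold groups first
    for members in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(members) <= cutoff else test).extend(members)
    return train, test
```

Because entire scaffolds land on one side of the split, the test set probes generalization to unseen chemotypes, which is why the authors treat it as closer to real-world deployment than a random split.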
Numerical Results
The D-MPNN consistently matches or outperforms traditional baseline models and previous GCNN architectures on the benchmark datasets. Key results include:
- Public Datasets: The D-MPNN shows superior performance on 11 out of 19 datasets, with significant improvements noted especially in the QM9, ESOL, and FreeSolv datasets.
- Proprietary Datasets: D-MPNN outperforms existing models on 15 out of 16 industrial datasets, underscoring its applicability in real-world industrial workflows.
For example, on the Amgen dataset predicting Rat Plasma Protein Binding (rPPB), the D-MPNN achieves a lower root-mean-square error (RMSE) than both the random forest and feed-forward network baselines.
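As a reminder, RMSE — the regression metric used for these comparisons — is the square root of the mean squared residual; the numbers below are made up purely to exercise the definition.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```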
Model Features and Ablations
Various ablation studies substantiate the efficacy of the D-MPNN's design choices:
- Message Passing Paradigm: Analysis shows that directed bond-centered message passing offers better performance compared to atom-centered or undirected bond-centered methods, especially in preventing information loss and redundant updates.
- Incorporation of Molecular Descriptors: The inclusion of RDKit-calculated features enhances performance, particularly in datasets with smaller sample sizes, suggesting these features provide valuable a priori chemical knowledge.
- Hyperparameter Optimization and Ensembling: Bayesian optimization and model ensembling further improve the D-MPNN's predictive accuracy, with some datasets showing up to a 37% improvement in performance metrics.
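Two of the ingredients above — descriptor augmentation and ensembling — reduce to simple operations on top of the learned model. The sketch below uses toy linear "models" and random vectors as stand-ins; in the paper the fixed features are RDKit-computed descriptors and the ensemble members are independently trained D-MPNNs, so everything concrete here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(learned, descriptors):
    """Concatenate the learned graph embedding with fixed molecular
    descriptors before the feed-forward readout."""
    return np.concatenate([learned, descriptors], axis=-1)

def ensemble_predict(models, x):
    """Average the predictions of independently trained models."""
    return np.mean([m(x) for m in models], axis=0)

# toy stand-ins: 5 linear models with different random weights
d = 8
weights = [rng.normal(size=(d,)) for _ in range(5)]
models = [lambda x, W=W: x @ W for W in weights]

x = augment(rng.normal(size=(5,)), rng.normal(size=(3,)))  # 5 learned + 3 fixed dims
y = ensemble_predict(models, x)
```

Averaging reduces the variance contributed by any single training run, which is consistent with the paper's observation that ensembling adds a further accuracy gain on top of hyperparameter optimization.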
Practical and Theoretical Implications
The empirical findings suggest that hybrid representations combining learned graph-based features and fixed descriptors provide a robust framework for molecular property prediction. Practically, this enables better generalization in industrial settings, as evidenced by the consistent performance across diverse proprietary datasets. Theoretically, the paper underscores the importance of scaffold-based split evaluations as proxies for chronological splits, offering a more realistic measure of a model's generalization capability to novel chemical spaces.
Future Prospects
Despite the promising results, the research highlights areas for further exploration:
- Integration of 3D Structural Information: Current models primarily utilize 2D graph structures; incorporating 3D coordinates is expected to improve performance further, particularly in datasets where spatial configuration is critical.
- Pretraining and Transfer Learning: Adopting pretraining approaches on large molecular databases could enhance model performance on limited data tasks, offering a pathway for more generalized chemical understanding.
- Handling Extreme Class Imbalance: The MUV dataset results indicate the need for more robust techniques to address datasets with severe class imbalances.
Conclusion
The paper by Yang et al. provides a compelling case for the efficacy of hybrid molecular representations using advanced GCNN models for property prediction. By rigorously benchmarking across public and private datasets, the research establishes D-MPNN as a formidable tool in cheminformatics, ready for integration into industrial practice. Future advancements are poised to further optimize these models, enhancing their applicability and accuracy in increasingly complex chemical discovery tasks.