N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules (1806.09206v2)

Published 24 Jun 2018 in cs.LG and stat.ML

Abstract: Machine learning techniques have recently been adopted in various applications in medicine, biology, chemistry, and material engineering. An important task is to predict the properties of molecules, which serves as the main subroutine in many downstream applications such as virtual screening and drug design. Despite the increasing interest, the key challenge is to construct proper representations of molecules for learning algorithms. This paper introduces the N-gram graph, a simple unsupervised representation for molecules. The method first embeds the vertices in the molecule graph. It then constructs a compact representation for the graph by assembling the vertex embeddings in short walks in the graph, which we show is equivalent to a simple graph neural network that needs no training. The representations can thus be efficiently computed and then used with supervised learning methods for prediction. Experiments on 60 tasks from 10 benchmark datasets demonstrate its advantages over both popular graph neural networks and traditional representation methods. This is complemented by theoretical analysis showing its strong representation and prediction power.

Citations (173)

Summary

The paper introduces the N-gram graph method, an unsupervised approach that efficiently represents molecules using vertex embeddings from short walks.
It leverages n-gram based graph representations to capture walk-count statistics, offering a simpler alternative to complex graph neural networks.
Empirical tests across 60 tasks demonstrate that the method achieves competitive or superior predictive performance while remaining computationally efficient.

An Overview of the N-Gram Graph Method for Molecules

The paper "N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules" proposes an unsupervised, graph-based representation of molecules. The work targets molecular property prediction, a key step in applications such as drug discovery and virtual screening.

Introduction and Motivation

Predicting the properties of molecules has become increasingly relevant, with applications in chemistry, biology, and materials engineering. Traditional drug discovery often relies on physical screening, which is reliable but costly and time-intensive. Virtual screening with machine learning algorithms promises to speed up this process by enabling rapid predictions over vast molecular databases. The primary challenge lies in constructing representations of molecules that learning algorithms can use effectively.

Existing representation techniques fall broadly into two categories: chemical fingerprints and graph neural networks (GNNs). Chemical fingerprints, notably Morgan fingerprints, are simple and efficient, and can be used with a variety of machine learning methods. By contrast, GNNs are computationally more involved and require substantial labeled data for training, which limits their flexibility across tasks.

The N-Gram Graph Approach

The authors introduce the N-gram graph, a simple and efficient unsupervised representation for molecules. The method first embeds each vertex based on its attributes, then assembles these vertex embeddings along short walks in the graph. The authors show that this construction is equivalent to a simple graph neural network that needs no training, so the representation is cheap to compute.

The representation enumerates "n-grams," defined as walks of length $n$ in the molecule graph. Each walk is embedded as the element-wise product of the embeddings of its vertices, the walk embeddings are summed over all walks of the same length, and the sums for the different values of $n$ are concatenated into a compact graph-level representation. Because the representation is unsupervised, it can be reused across machine learning tasks without retraining for each one.
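The construction can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy graph, and the vertex embeddings are made up for illustration. It reads "length $n$" as a walk visiting $n$ vertices (so a 1-gram is a single vertex), assumes an unweighted 0/1 adjacency matrix, and counts walks as ordered sequences, so each undirected walk appears once per direction. The first function enumerates walks directly; the second computes the same sums with a neighbour-aggregation recursion that has no trainable parameters, which is one way to see the correspondence with a training-free graph neural network.

```python
from itertools import product
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def ngram_embedding_by_enumeration(emb: Matrix, adj: Matrix, n: int) -> List[float]:
    """Sum, over every walk visiting n vertices, the element-wise product
    of the embeddings of the vertices on that walk (brute force)."""
    num_vertices, dim = len(emb), len(emb[0])
    total = [0.0] * dim
    for walk in product(range(num_vertices), repeat=n):
        # Keep the tuple only if consecutive vertices are joined by an edge.
        if all(adj[u][v] for u, v in zip(walk, walk[1:])):
            for k in range(dim):
                term = 1.0
                for v in walk:
                    term *= emb[v][k]
                total[k] += term
    return total


def ngram_graph(emb: Matrix, adj: Matrix, max_n: int) -> List[float]:
    """Same quantity via a training-free message-passing recursion;
    returns the concatenation of the n-gram embeddings for n = 1..max_n."""
    num_vertices, dim = len(emb), len(emb[0])
    # state[i] sums the products over all current-length walks ending at i.
    state = [[float(x) for x in row] for row in emb]
    representation: List[float] = []
    for n in range(1, max_n + 1):
        if n > 1:
            new_state = []
            for i in range(num_vertices):
                # Aggregate the states of i's neighbours ...
                agg = [0.0] * dim
                for j in range(num_vertices):
                    if adj[j][i]:
                        for k in range(dim):
                            agg[k] += state[j][k]
                # ... then multiply element-wise by i's own embedding.
                new_state.append([agg[k] * emb[i][k] for k in range(dim)])
            state = new_state
        # Sum-pool over vertices to get the n-gram embedding for this n.
        representation.extend(sum(row[k] for row in state) for k in range(dim))
    return representation


# Toy path graph 0 - 1 - 2 with made-up 2-dimensional vertex embeddings.
toy_emb = [[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]]
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_repr = ngram_graph(toy_emb, toy_adj, max_n=3)
print(toy_repr)  # prints [6.0, 5.0, 18.0, 8.0, 54.0, 20.0]
```

The brute-force version grows exponentially with $n$, whereas the recursion costs one pass over the adjacency structure per value of $n$, which is why the representation stays cheap for short walks.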

Theoretical Analysis and Empirical Evaluation

The paper's theoretical analysis shows that N-gram graph representations preserve the count statistics of walks in the graph, which underpins their representation and prediction power. As a consequence, a classifier built on the representation can be competitive with classifiers that use the walk count statistics directly.
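One way to see the link to walk counts is the following sketch, under a simplifying assumption that is not necessarily the paper's exact setting: every vertex $i$ has one of $K$ discrete types $t(i)$, and all vertices of type $k$ share the embedding $w_k$. The symbols $t$, $w_k$, $c_s$, and $f_{(n)}$ are illustrative notation. The embedding of a walk then depends only on its sequence of types, so grouping walks by type sequence gives

$$
f_{(n)} \;=\; \sum_{p \,\in\, n\text{-grams}} \; \bigodot_{i \in p} w_{t(i)} \;=\; \sum_{s \in \{1,\dots,K\}^{n}} c_s \,\bigl(w_{s_1} \odot \cdots \odot w_{s_n}\bigr),
$$

where $\odot$ is the element-wise product and $c_s$ is the number of n-grams whose type sequence is $s$. The n-gram embedding $f_{(n)}$ is therefore a linear function of the count vector $(c_s)_s$; for $n = 1$ it reduces to $f_{(1)} = W c_{(1)}$ with $W = [w_1, \dots, w_K]$. Recovering the counts from $f_{(n)}$ requires further conditions on the embeddings and the counts, which this sketch does not cover.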

Empirical evaluations across 60 tasks from 10 benchmark datasets demonstrate the N-gram graph's advantages over both classic representation methods and popular graph neural networks. It offers a computationally efficient alternative with better or comparable predictive performance. The authors also note that vertex embeddings transfer across datasets, and that even randomly initialized vertex embeddings can still give reasonable performance.

Implications and Future Directions

The N-gram graph method not only provides a new avenue for efficient molecular representation but also has implications for the design of unsupervised learning approaches for complex graph-structured data beyond molecules, such as social networks or other domains where data can be represented as vertices and edges.

Future research may explore its application to broader kinds of structured data, better ways to pre-train vertex embeddings, and hyperparameter tuning to further improve its effectiveness and efficiency. Integrating multi-task learning into the framework could also widen its scope within drug discovery.

In conclusion, the paper presents a compelling and well-founded alternative to the current methods of molecular representation. The simplicity and efficiency of the N-gram graph method, complemented by its strong theoretical validation and empirical performance, make it a worthy candidate for extensive deployment in virtual screening and related predictive tasks in computational chemistry.
