A Deep Generative Model for Fragment-Based Molecule Generation (2002.12826v1)

Published 28 Feb 2020 in stat.ML and cs.LG

Abstract: Molecule generation is a challenging open problem in cheminformatics. Currently, deep generative approaches addressing the challenge belong to two broad categories, differing in how molecules are represented. One approach encodes molecular graphs as strings of text, and learns their corresponding character-based language model. Another, more expressive, approach operates directly on the molecular graph. In this work, we address two limitations of the former: generation of invalid and duplicate molecules. To improve validity rates, we develop a language model for small molecular substructures called fragments, loosely inspired by the well-known paradigm of Fragment-Based Drug Design. In other words, we generate molecules fragment by fragment, instead of atom by atom. To improve uniqueness rates, we present a frequency-based masking strategy that helps generate molecules with infrequent fragments. We show experimentally that our model largely outperforms other language model-based competitors, reaching state-of-the-art performances typical of graph-based approaches. Moreover, generated molecules display molecular properties similar to those in the training sample, even in absence of explicit task-specific supervision.

A Deep Generative Model for Fragment-Based Molecule Generation

In cheminformatics, generating novel molecules with desirable properties remains a formidable challenge. Two primary methodologies have emerged for addressing it with deep generative models: one encodes molecular graphs as sequences of characters, and the other operates directly on molecular graphs. The latter typically holds state-of-the-art results due to its expressiveness but is computationally intensive. This paper introduces an approach that overcomes two limitations of character-based language models (LMs): the generation of invalid molecules and the generation of duplicate molecules.

The proposed solution is a fragment-based language model for molecule generation, drawing inspiration from Fragment-Based Drug Design (FBDD). Instead of generating molecules atom by atom, this model generates them fragment by fragment. The primary advantage of this approach is a higher validity rate, as fragments are inherently chemically meaningful substructures. To further enhance the uniqueness of the generated molecules, the authors introduce a frequency-based masking strategy known as Low-Frequency Masking (LFM), which encourages the generation of less common fragments.

Methodology

Molecule Fragmentation

The approach begins by fragmenting molecules into smaller substructures using the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm. This method ensures chemically significant bonds are identified and broken appropriately, leading to a sequence of fragments that can later be recombined to form the original molecule.

Fragment Embedding

Each fragment is embedded into a continuous vector space using a skip-gram model. This enables the transformation of a sequence of fragments into a sequence of vector representations, thus maintaining the contextual proximity of fragments that frequently appear together.
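
To make the skip-gram idea concrete, the sketch below shows how (target, context) training pairs could be extracted from one fragment sequence. It is a minimal illustration, not the authors' code: the function name `skipgram_pairs`, the `window` parameter, and the placeholder fragment strings are all assumptions, and the embedding model that would consume these pairs is omitted.

```python
from typing import List, Tuple


def skipgram_pairs(fragments: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair every fragment with each neighbour at most `window` positions away."""
    pairs = []
    for i, target in enumerate(fragments):
        lo = max(0, i - window)
        hi = min(len(fragments), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a fragment is not its own context
                pairs.append((target, fragments[j]))
    return pairs


# Hypothetical fragment sequence for one molecule (placeholder strings,
# with '*' standing in for an attachment point).
molecule = ["*c1ccccc1", "*C(*)=O", "*N*", "*C"]
pairs = skipgram_pairs(molecule, window=1)
for target, context in pairs:
    print(target, "->", context)
```

A skip-gram model trained on such pairs learns to predict the context fragment from the target, which is what places fragments that co-occur in similar neighbourhoods close together in the embedding space.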

Encoder-Decoder Architecture

The core of the model is an encoder-decoder architecture utilizing Gated Recurrent Units (GRUs). The encoder compresses the sequence of fragment embeddings into a latent representation. The decoder, initialized by sampling from a Gaussian distribution in the latent space, reconstructs the molecule by generating fragment sequences. The model is trained using an objective function that minimizes the Kullback-Leibler (KL) divergence for the encoder and the negative log-likelihood for the decoder.
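
The two terms of the training objective can be sketched numerically as follows. The summary only states that a KL divergence and a negative log-likelihood are minimized; the specific choices here (a diagonal Gaussian posterior, a standard normal prior, an unweighted sum of the two terms, and all function names) are assumptions made for illustration, and the GRU encoder and decoder themselves are left out.

```python
import math
import random
from typing import List


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), assuming a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    """Reparameterised sample z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def sequence_nll(step_probs: List[float]) -> float:
    """Negative log-likelihood of the true fragments, given the probability
    the decoder assigned to the correct fragment at each step."""
    return -sum(math.log(p) for p in step_probs)


def total_loss(mu: List[float], log_var: List[float], step_probs: List[float]) -> float:
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return sequence_nll(step_probs) + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
z = sample_latent([0.0, 0.0], [0.0, 0.0], rng)  # would initialise the decoder
print(total_loss([1.0], [0.0], [0.5, 0.5]))
```

The KL term vanishes when the encoder's output matches the prior exactly, so at generation time the decoder can be seeded with a plain Gaussian sample instead of an encoded molecule.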

Low-Frequency Masking (LFM)

The LFM strategy is employed to reduce the generation of duplicate molecules. Fragments whose frequency falls below a specified threshold are masked: each is replaced with a token that encodes its frequency and its connection points. During generation, these tokens are replaced with actual fragments, which promotes diversity.
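
The masking and unmasking steps can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the token format `<frequency_attachments>`, counting `*` characters as connection points, and choosing the replacement fragment uniformly at random are all hypothetical choices, as are the function names and the toy data.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_lfm_vocab(
    sequences: List[List[str]], threshold: int
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Map each fragment rarer than `threshold` to a mask token, and each
    mask token back to the fragments that share it."""
    counts = Counter(frag for seq in sequences for frag in seq)
    token_of: Dict[str, str] = {}
    fragments_of: Dict[str, List[str]] = defaultdict(list)
    for frag, freq in counts.items():
        if freq < threshold:
            # Hypothetical token format: frequency and number of '*' markers.
            token = f"<{freq}_{frag.count('*')}>"
            token_of[frag] = token
            fragments_of[token].append(frag)
    return token_of, dict(fragments_of)


def mask(seq: List[str], token_of: Dict[str, str]) -> List[str]:
    """Replace low-frequency fragments with their mask tokens."""
    return [token_of.get(frag, frag) for frag in seq]


def unmask(seq: List[str], fragments_of: Dict[str, List[str]], rng: random.Random) -> List[str]:
    """Replace each mask token with one of its fragments (uniform choice is an assumption)."""
    return [rng.choice(fragments_of[t]) if t in fragments_of else t for t in seq]


# Toy training data: "*C" is frequent, the other fragments each occur once.
train = [["*C", "*c1ccccc1"], ["*C", "*N*"], ["*C", "*O*"]]
token_of, fragments_of = build_lfm_vocab(train, threshold=2)
masked = mask(train[1], token_of)
restored = unmask(masked, fragments_of, random.Random(0))
print(masked, restored)
```

Because several rare fragments collapse onto one token, the model sees that token often enough to learn when to emit it, and each emission can then resolve to a different rare fragment.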

Results

The experimental validation demonstrates that the fragment-based model substantially improves over character-based LMs by achieving perfect validity rates for both the ZINC and PCBA datasets. Specifically, the model variant utilizing LFM achieves near-state-of-the-art uniqueness rates (0.998 for ZINC and 0.972 for PCBA) and maintains high novelty. This is achieved despite utilizing an inherently less expressive intermediate representation compared to graph-based approaches.
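
For readers unfamiliar with the uniqueness and novelty metrics, the sketch below shows one common way to compute them. The summary does not define these metrics, so the definitions used here are assumptions: uniqueness is the fraction of distinct molecules among valid samples, and novelty is the fraction of distinct valid samples absent from the training set. The inputs are assumed to be already validated, canonical SMILES strings, and the example data is invented.

```python
from typing import Iterable, List


def uniqueness(valid_samples: List[str]) -> float:
    """Fraction of distinct molecules among the valid generated samples."""
    if not valid_samples:
        return 0.0
    return len(set(valid_samples)) / len(valid_samples)


def novelty(valid_samples: List[str], training_set: Iterable[str]) -> float:
    """Fraction of distinct generated molecules that do not appear in training."""
    unique = set(valid_samples)
    if not unique:
        return 0.0
    return len(unique - set(training_set)) / len(unique)


# Invented example: four valid samples, one duplicate, one seen in training.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO"]
print(uniqueness(generated), novelty(generated, training))
```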

The distributions of various molecular properties and structural features in the generated samples closely resemble those of the training data, maintaining chemical realism without explicit task-specific supervision.

Implications and Future Directions

The proposed fragment-based language model is a significant step forward in improving the validity and diversity of molecules generated by LM-based generative models. This enhances the practical applicability of such models in drug discovery pipelines, particularly in the initial phases of virtual screening and lead optimization. Future research can build on these findings by extending the approach to more complex tasks, such as molecule optimization, which might require sophisticated strategies to maintain high uniqueness while ensuring smooth transitions in the latent space.

Another promising direction is to adapt the fragment-based paradigm to graph-based molecular generators, potentially combining the efficiency and scalability of the proposed method with the expressiveness of graph-based models to achieve superior performance.

In summary, this paper presents a compelling advancement in fragment-based molecule generation, setting a new milestone in leveraging deep generative models for cheminformatics applications.

Authors (3)
  1. Marco Podda (10 papers)
  2. Davide Bacciu (107 papers)
  3. Alessio Micheli (30 papers)
Citations (48)