A Deep Generative Model for Fragment-Based Molecule Generation
Generating novel molecules with desirable properties remains a formidable challenge in cheminformatics. Deep generative models address it in two main ways: one encodes molecular graphs as sequences of characters, and the other operates directly on molecular graphs. The latter typically holds state-of-the-art results due to its expressiveness but is computationally intensive. This paper introduces an approach that targets two limitations of character-based language models (LMs): the generation of invalid molecules and the generation of duplicate molecules.
The proposed solution is a fragment-based language model for molecule generation, inspired by Fragment-Based Drug Design (FBDD). Instead of generating molecules atom by atom, the model generates them fragment by fragment. The primary advantage of this approach is a higher validity rate, since fragments are inherently chemically meaningful substructures. To further increase the uniqueness of the generated molecules, the authors introduce a frequency-based masking strategy called Low-Frequency Masking (LFM), which encourages the generation of less common fragments.
Methodology
Molecule Fragmentation
The approach begins by fragmenting molecules into smaller substructures using the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm. This method ensures chemically significant bonds are identified and broken appropriately, leading to a sequence of fragments that can later be recombined to form the original molecule.
Fragment Embedding
Each fragment is embedded into a continuous vector space using a skip-gram model. A sequence of fragments thus becomes a sequence of vectors, and fragments that frequently appear together lie close to each other in the embedding space.
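To make the skip-gram idea concrete, the sketch below shows how (center, context) training pairs could be formed from one fragment sequence. It covers only pair construction, not the embedding training itself. The function name `skipgram_pairs`, the `window` parameter, and the placeholder fragment strings are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def skipgram_pairs(fragments: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for one fragment sequence.

    Each fragment is paired with every other fragment that lies at most
    `window` positions away from it in the sequence.
    """
    pairs = []
    for i, center in enumerate(fragments):
        lo = max(0, i - window)
        hi = min(len(fragments), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, fragments[j]))
    return pairs


# Hypothetical fragment sequence; "*" marks an attachment point.
molecule = ["*c1ccccc1", "*C(=O)*", "*N*", "*C"]
pairs = skipgram_pairs(molecule, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A skip-gram model is then trained to predict the context fragment from the center fragment, which is what pulls co-occurring fragments together in the embedding space.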
Encoder-Decoder Architecture
The core of the model is an encoder-decoder architecture built from Gated Recurrent Units (GRUs). The encoder compresses the sequence of fragment embeddings into a latent representation. The decoder, initialized with a sample drawn from a Gaussian distribution in the latent space, reconstructs the molecule by generating a sequence of fragments. The training objective combines two terms: a Kullback-Leibler (KL) divergence term that regularizes the encoder's latent distribution, and the negative log-likelihood of the fragments produced by the decoder.
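The two-term objective can be illustrated with a small numerical sketch. It assumes a diagonal Gaussian latent distribution parameterized by `mu` and `logvar` and a standard normal prior, for which the KL term has a closed form. These assumptions, the `kl_weight` parameter, and all function names are illustrative and are not confirmed by the summary above.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def negative_log_likelihood(target_probs: Sequence[float]) -> float:
    """Sum of -log p over the probabilities the decoder assigned
    to the correct fragment at each step."""
    return -sum(math.log(p) for p in target_probs)


def vae_loss(mu, logvar, target_probs, kl_weight: float = 1.0) -> float:
    """Reconstruction term plus (optionally weighted) KL term.

    kl_weight is a hypothetical knob, not a parameter from the paper.
    """
    return negative_log_likelihood(target_probs) + kl_weight * kl_to_standard_normal(mu, logvar)


# Toy values: a 2-dimensional latent code and a 2-fragment target sequence.
loss = vae_loss(mu=[1.0, 0.0], logvar=[0.0, 0.0], target_probs=[0.5, 0.25])
print(f"loss = {loss:.4f}")
```

When the encoder outputs exactly the prior (zero mean, unit variance), the KL term vanishes and only the reconstruction term remains.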
Low-Frequency Masking (LFM)
The LFM strategy is employed to reduce the generation of duplicates. Fragments whose frequency falls below a specified threshold are masked: each is replaced with a token that encodes its frequency and its connection points. During generation, these tokens are replaced with actual fragments, thereby promoting diversity.
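A minimal sketch of the masking idea follows. Several details are assumptions made for illustration: the token format `<mask_freq_points>`, counting `*` characters as a stand-in for connection points, and replacing a mask token with a fragment drawn uniformly at random from those that share it. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_lfm_vocab(
    corpus: List[List[str]], threshold: int
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Map each fragment rarer than `threshold` to a mask token."""
    counts = Counter(frag for mol in corpus for frag in mol)
    frag_to_token: Dict[str, str] = {}
    token_to_frags: Dict[str, List[str]] = defaultdict(list)
    for frag, freq in counts.items():
        if freq < threshold:
            # Assumption: "*" marks a connection point in the fragment string.
            token = f"<mask_{freq}_{frag.count('*')}>"
            frag_to_token[frag] = token
            token_to_frags[token].append(frag)
    return frag_to_token, dict(token_to_frags)


def mask_sequence(mol: List[str], frag_to_token: Dict[str, str]) -> List[str]:
    """Replace rare fragments with their mask tokens before training."""
    return [frag_to_token.get(frag, frag) for frag in mol]


def unmask_sequence(
    tokens: List[str], token_to_frags: Dict[str, List[str]], rng: random.Random
) -> List[str]:
    """Swap each mask token for one of the fragments it stands for."""
    return [rng.choice(token_to_frags[t]) if t in token_to_frags else t for t in tokens]


# Hypothetical corpus: "*C" is frequent, the other fragments occur once.
corpus = [["*C", "*c1ccccc1"], ["*C", "*N*"], ["*C", "*O*"]]
frag_to_token, token_to_frags = build_lfm_vocab(corpus, threshold=2)
masked = mask_sequence(["*C", "*N*"], frag_to_token)
restored = unmask_sequence(masked, token_to_frags, random.Random(0))
print(masked, restored)
```

Because several rare fragments collapse into one token, the model learns over a smaller vocabulary, while the replacement step at generation time spreads output across the rare fragments that token represents.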
Results
The experimental validation demonstrates that the fragment-based model substantially improves over character-based LMs by achieving perfect validity rates for both the ZINC and PCBA datasets. Specifically, the model variant utilizing LFM achieves near-state-of-the-art uniqueness rates (0.998 for ZINC and 0.972 for PCBA) and maintains high novelty. This is achieved despite utilizing an inherently less expressive intermediate representation compared to graph-based approaches.
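For readers unfamiliar with these metrics, the sketch below computes validity, uniqueness, and novelty under one common convention: validity over all generated samples, uniqueness over valid samples, and novelty over unique valid samples. The summary does not state the exact definitions used in the paper, so these are assumptions. The `is_valid` callback is a hypothetical stand-in for a real chemical validity check, and the inputs are assumed to be canonicalized strings.

```python
from typing import Callable, Dict, List


def generation_metrics(
    generated: List[str],
    training: List[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty for generated molecules."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)  # unique valid molecules unseen in training
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data: one duplicate, one invalid string, one molecule seen in training.
metrics = generation_metrics(
    generated=["CCO", "CCO", "CCN", "bad"],
    training=["CCO"],
    is_valid=lambda s: s != "bad",  # placeholder validity check
)
print(metrics)
```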
The distributions of various molecular properties and structural features in the generated samples strongly resemble those of the training data, so the model maintains chemical realism without explicit task-specific supervision.
Implications and Future Directions
The proposed fragment-based language model improves both the validity and the diversity of molecules generated by LM-based generative models. This makes such models more practical for drug discovery pipelines, particularly in the initial phases of virtual screening and lead optimization. Future research can extend the approach to more complex tasks such as molecule optimization, which may require more sophisticated strategies to maintain high uniqueness while ensuring smooth transitions in the latent space.
Another promising direction is to adapt the fragment-based paradigm to graph-based molecular generators, potentially combining the efficiency and scalability of the proposed method with the expressiveness of graph-based models to achieve superior performance.
In summary, this paper shows that generating molecules fragment by fragment, combined with Low-Frequency Masking, addresses the validity and uniqueness problems of LM-based molecule generation.