Systematic Use of Model Likelihoods to Navigate Large De Novo Molecular Design Libraries

Determine systematic procedures for using model likelihoods to navigate large libraries of de novo molecular designs generated as SMILES strings by chemical language models (including LSTM-, GPT-, and S4-based models), with the goal of enabling robust evaluation and prioritization of candidates for prospective drug discovery studies.

Background

Generative drug discovery often yields very large libraries of candidate molecules, making ranking and selection challenging and prone to subjective bias. The authors propose leveraging model likelihoods, which quantify how well a generated SMILES sequence aligns with the learned probability distribution, as a model-intrinsic, cost-efficient score to rationalize library analysis.
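To make the idea of a likelihood score concrete, the following is a minimal sketch, not the authors' procedure. It assumes a chemical language model has already assigned a probability to each token of a generated SMILES string (including an end-of-sequence token), so that the sequence log-likelihood is the sum of the per-token log-probabilities. The function names, the `toy_library` dictionary, and all probability values are hypothetical and chosen only for illustration.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum the log-probabilities the model assigned to each generated token."""
    return sum(math.log(p) for p in token_probs)


def rank_by_likelihood(library):
    """Return (smiles, log_likelihood) pairs, most likely design first."""
    scored = [
        (smiles, sequence_log_likelihood(probs))
        for smiles, probs in library.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-token probabilities (made-up values, not from the paper).
# Each list has one entry per SMILES token plus one for the end token.
toy_library = {
    "CCO": [0.9, 0.8, 0.7, 0.95],
    "c1ccccc1": [0.9] * 9,
    "CC(=O)N": [0.9, 0.8, 0.3, 0.5, 0.6, 0.9, 0.4, 0.9],
}

ranked = rank_by_likelihood(toy_library)
for smiles, log_lik in ranked:
    print(f"{smiles}\t{log_lik:.3f}")
```

Because every additional token contributes a non-positive term, longer SMILES strings tend to receive lower summed log-likelihoods; whether and how to correct for length is one of the choices a systematic procedure would have to settle.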

While likelihoods have been introduced in prior de novo design work, the paper highlights that it remains unresolved how to employ them systematically for navigating and triaging large design libraries. Addressing this gap would help tune exploration–exploitation trade-offs, identify low-quality repetitive generations, and improve the robustness of molecule selection for follow-up experiments.

References

Albeit likelihoods have already been introduced for de novo design, an open question remains as to how they can be used to systematically navigate large design libraries.

The Jungle of Generative Drug Discovery: Traps, Treasures, and Ways Out (2501.05457 - Özçelik et al., 24 Dec 2024) in Results and discussion, Subsection “Navigating large design libraries with likelihoods”