Assess performance of open-source chemistry LLMs on negative SMARTS constraints

Determine whether the open-source chemistry language models BioT5 and MoleculeSTM, when used within the MolLEO framework, can satisfy negative matching constraints in molecular generation, i.e., tasks that require excluding specified SMARTS-defined scaffolds or substructures (e.g., generating molecules that do not contain the scaffold [#7]-c1n[c;h1]nc2[c;h1]c(-[#8])[c;h0][c;h1]c12). Quantify their performance on tasks such as deco_hop and scaffold_hop.

Background

The paper evaluates MolLEO, which integrates chemistry-aware LLMs into evolutionary algorithms, across multiple single-objective and multi-objective molecular optimization tasks. Among the single-objective tasks, deco_hop and scaffold_hop require precise control over substructure presence and absence using SMARTS patterns, including negative constraints (ensuring certain motifs are not present).

While MolLEO (GPT-4) shows strong overall results, the open-source variants using BioT5 and MoleculeSTM exhibit only small gains on these substructure tasks. The authors note that these models may not have been trained extensively on molecular descriptions containing SMARTS and explicitly state uncertainty about their capability to handle negative matching constraints, motivating a targeted assessment of this capability.
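One way such a targeted assessment could be set up is to measure, over a batch of generated SMILES strings, the fraction of valid molecules that do not match the forbidden SMARTS pattern. The sketch below is an illustration only, not the paper's evaluation protocol and not the scoring function of deco_hop or scaffold_hop. The names `make_rdkit_matcher` and `negative_constraint_rate` are hypothetical, and the matcher assumes RDKit is installed (it is imported lazily, so the counting logic runs without it).

```python
from typing import Callable, Dict, Iterable, Optional

# A matcher returns True if the molecule contains the forbidden pattern,
# False if it does not, and None if the SMILES string cannot be parsed.
Matcher = Callable[[str], Optional[bool]]


def make_rdkit_matcher(smarts: str) -> Matcher:
    """Build a SMARTS matcher backed by RDKit (assumed to be installed)."""
    from rdkit import Chem  # lazy import: only needed when this helper is used

    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError(f"Could not parse SMARTS: {smarts!r}")

    def contains_forbidden(smiles: str) -> Optional[bool]:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None  # invalid SMILES, excluded from the rate
        return mol.HasSubstructMatch(pattern)

    return contains_forbidden


def negative_constraint_rate(
    smiles_list: Iterable[str], contains_forbidden: Matcher
) -> Dict[str, Optional[float]]:
    """Count valid molecules and those that avoid the forbidden pattern."""
    n_valid = 0
    n_satisfied = 0
    for smiles in smiles_list:
        hit = contains_forbidden(smiles)
        if hit is None:
            continue  # skip unparsable outputs
        n_valid += 1
        if not hit:
            n_satisfied += 1
    rate = n_satisfied / n_valid if n_valid else None
    return {"n_valid": n_valid, "n_satisfied": n_satisfied, "rate": rate}
```

With RDKit available, a matcher for the scaffold above could be built with `make_rdkit_matcher("[#7]-c1n[c;h1]nc2[c;h1]c(-[#8])[c;h0][c;h1]c12")` and passed to `negative_constraint_rate` together with each model's generated molecules. Reporting the number of invalid outputs alongside the rate may also be informative, since a model could trivially avoid a scaffold by emitting unparsable strings.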

References

Also, it is unclear how well these models perform with negative matching (e.g., This molecule does not contain the scaffold [#7]-c1n[c;h1]nc2[c;h1]c(-[#8])[c;h0][c;h1]c12).

Efficient Evolutionary Search Over Chemical Space with Large Language Models (2406.16976 - Wang et al., 23 Jun 2024) in Section 4.2 Empirical Study (Single-objective results), paragraph discussing deco_hop and scaffold_hop (following Table 1)