Prompt-Based IAT Analogue Discovery
- Prompt-based IAT analogue discovery uses transformer-based chemical language models and diverse SMILES canonicalizations to identify structurally distinct yet functionally similar molecules.
- The method decouples the embedding processes of query and database molecules by employing alternative canonicalization strategies, thereby revealing non-obvious analogues beyond traditional similarity metrics.
- Empirical evaluations show that OEChem-canonicalized queries return hits with lower Tanimoto coefficients while surfacing more functionally validated analogues than the RDKit Atom 0 baseline.
Prompt-based IAT analogues are molecules structurally and functionally related to IAT (or any other target molecule) that are identified using a chemical language model (CLM) together with a prompt engineering strategy: canonicalizing the query SMILES with a different algorithm than the one used for the database. As detailed by Kosonocky et al. (2023), decoupling the chemical embedding of the query from that of the database enables the discovery of nontrivial functional analogues that traditional structure-based similarity searches are unlikely to retrieve.
1. Model Architecture and Embedding Pipeline
A transformer-based chemical language model, ChemBERTa (a BERT encoder), forms the core of the approach. ChemBERTa comprises 12 transformer layers (hidden size 768, 12 attention heads) and operates over a SMILES token vocabulary with a maximum sequence length of 512. Each input SMILES string is tokenized into integer IDs, prepended with the “[CLS]” token and appended with “[SEP]”, then processed through learned token and positional embeddings. The output is a sequence of final-layer hidden states $H = (h_{\text{[CLS]}}, h_1, \dots, h_{\text{[SEP]}})$, with the “[CLS]” token embedding ($h_{\text{[CLS]}} \in \mathbb{R}^{768}$) serving as a continuous vector representation of the full molecule. Postprocessing includes L2 normalization, $\hat{h} = h_{\text{[CLS]}} / \lVert h_{\text{[CLS]}} \rVert_2$.
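The postprocessing step can be sketched in a few lines. This is a minimal illustration, assuming the final-layer hidden states are already available as a list of per-token vectors with “[CLS]” at position 0; the function names and the toy 4-dimensional values are hypothetical, and a real ChemBERTa output would have hidden size 768.

```python
import math

def cls_embedding(hidden_states):
    """Return the final-layer vector at position 0, where [CLS] sits."""
    return hidden_states[0]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

# Toy final-layer output (hypothetical values): three tokens
# ([CLS], one SMILES token, [SEP]) with hidden size 4 instead of 768.
toy_hidden_states = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],  # SMILES token
    [0.5, 0.5, 0.0, 0.0],  # [SEP]
]

embedding = l2_normalize(cls_embedding(toy_hidden_states))
print(embedding)
```

Because every stored vector has unit length, the later cosine similarity search reduces to a plain dot product.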
2. Prompt Engineering via Alternative SMILES Canonicalizations
Prompt engineering in this context denotes presenting the CLM with query SMILES strings canonicalized differently from those in the database, compelling the model to rely on learned chemical semantics instead of syntactic overlap.
- Database: Canonicalized using RDKit “Atom 0” (default root atom) with chirality stripped; embedded and normalized via ChemBERTa.
- Query: Subjected to one of three canonicalization schemes:
  - RDKit Atom 0: Baseline identical to the database.
  - RDKit Atom n: The rooted variant, obtained by changing the root atom, whose embedding is least similar (minimum cosine similarity) to the Atom 0 embedding.
  - OEChem: Canonicalization using OpenEye OEChem v2.3.0, which employs an algorithm distinct from RDKit's.
- Both database and query canonicalizations drop isomeric notation and enforce a 512-token length constraint.
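The "RDKit Atom n" selection amounts to an argmin over rooted variants. The sketch below assumes the rooted SMILES strings have already been generated (with RDKit this could be done through the `rootedAtAtom` and `isomericSmiles=False` arguments of `Chem.MolToSmiles`) and that an `embed` callable returns a vector for each string. The helper names and the toy lookup table standing in for the model are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_atom_n(rooted_smiles, embed):
    """Pick the rooted variant least similar to the Atom 0 form.

    rooted_smiles[i] is the SMILES written starting from atom i, so
    rooted_smiles[0] is the baseline. Returns (root_index, smiles, similarity).
    """
    baseline = embed(rooted_smiles[0])
    best = None
    for idx, smi in enumerate(rooted_smiles[1:], start=1):
        sim = cosine(baseline, embed(smi))
        if best is None or sim < best[2]:
            best = (idx, smi, sim)
    return best

# Hand-written rooted SMILES of ethanol and made-up 2-D vectors that
# stand in for model embeddings (assumption: illustration only).
toy_vectors = {"CCO": [1.0, 0.0], "C(C)O": [0.8, 0.6], "OCC": [0.0, 1.0]}
variants = ["CCO", "C(C)O", "OCC"]

root, smiles, similarity = pick_atom_n(variants, toy_vectors.__getitem__)
print(root, smiles, similarity)
```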
3. Mathematical Similarity Metrics
Several quantitative measures are employed:
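- Cosine Similarity: For normalized embeddings $\hat{h}_q, \hat{h}_d$ of a query and a database molecule, the similarity is $\cos(\hat{h}_q, \hat{h}_d) = \hat{h}_q \cdot \hat{h}_d$, with range $[-1, 1]$.
- Fingerprint Tanimoto: For Morgan fingerprints $A, B$ (sets of on-bits), $T(A, B) = |A \cap B| / |A \cup B|$ provides a structure-based baseline ($T \in [0, 1]$).
- Gestalt String Similarity: For SMILES strings $s_1, s_2$, $G(s_1, s_2) = 2K_m / (|s_1| + |s_2|)$, with $K_m$ the recursively determined count of matching substring characters.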
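All three measures can be sketched with the Python standard library alone. This is an illustrative sketch, not the reference implementation: fingerprints are represented as sets of on-bit indices (an assumption), and the gestalt score uses `difflib.SequenceMatcher`, whose `ratio()` returns twice the number of matched characters divided by the total length of both strings.

```python
import math
from difflib import SequenceMatcher

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; a dot product if both are unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gestalt_similarity(s1, s2):
    """Gestalt pattern matching ratio, 2 * matches / (len(s1) + len(s2))."""
    return SequenceMatcher(None, s1, s2, autojunk=False).ratio()

# Toy inputs (hypothetical bit indices; two SMILES spellings of ethanol).
print(cosine_similarity([0.6, 0.8], [0.8, 0.6]))
print(tanimoto({1, 5, 9}, {5, 9, 12}))
print(gestalt_similarity("CCO", "OCC"))
```

The last call illustrates why string overlap is a weak proxy for chemistry: two valid SMILES for the same molecule score well below 1.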
4. Workflow for Discovery of Prompt-Based Analogues
The systematic procedure is as follows:
- Database Preparation: Assemble a large-scale (~10 million) achiral SMILES dataset; canonicalize all entries using RDKit Atom 0; tokenize, embed, extract the “[CLS]” embedding, L2-normalize, and persist in chunks.
- Query Preparation: For IAT (or any query), obtain its SMILES and produce three canonicalizations (RDKit Atom 0, RDKit Atom n, OEChem). Each is tokenized, embedded, and normalized.
- Similarity Search: For each query variant, compute the cosine similarity to all database embeddings and retrieve the top $k$ (e.g., $k = 20$) highest scorers.
- Post-Analysis:
  - Compute the fingerprint Tanimoto coefficient of each hit against the query to assess structural similarity.
  - Curate patent/literature records (e.g., via PubChem) to determine functional similarity.
  - Label hits as:
    - Structural analogue: Tanimoto $\geq 0.60$
    - Functional analogue: supported by literature
    - Structurally Distinct Functional Analogue (SDFA): functional analogue with Tanimoto $< 0.60$
    - Non-Derivative Functional Analogue (NDFA): SDFA that is not an obvious substructure derivative
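The search and labelling steps can be sketched as follows. This is a minimal sketch under stated assumptions: the database is streamed as chunks of `(mol_id, unit_vector)` pairs, functional evidence and the "obvious derivative" judgement come from manual curation and are passed in as booleans, and treating Tanimoto at or above 0.60 as a structural analogue is an assumption taken as the complement of the SDFA cutoff. All names and toy values are hypothetical.

```python
import heapq

def top_k_hits(query_vec, db_chunks, k=20):
    """Stream database chunks and keep the k highest cosine scores.

    Vectors are assumed L2-normalized, so the dot product is the cosine.
    Returns a list of (score, mol_id) sorted from best to worst.
    """
    heap = []  # min-heap holding at most k (score, mol_id) pairs
    for chunk in db_chunks:
        for mol_id, vec in chunk:
            score = sum(q * d for q, d in zip(query_vec, vec))
            if len(heap) < k:
                heapq.heappush(heap, (score, mol_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, mol_id))
    return sorted(heap, reverse=True)

def label_hit(tanimoto, is_functional, is_obvious_derivative, cutoff=0.60):
    """Assign the analogue labels used in the post-analysis step."""
    labels = []
    if tanimoto >= cutoff:  # assumption: complement of the SDFA cutoff
        labels.append("structural analogue")
    if is_functional:
        labels.append("functional analogue")
        if tanimoto < cutoff:
            labels.append("SDFA")
            if not is_obvious_derivative:
                labels.append("NDFA")
    return labels

# Toy database of unit vectors split into two chunks (hypothetical IDs).
chunks = [
    [("mol_a", [1.0, 0.0]), ("mol_b", [0.0, 1.0])],
    [("mol_c", [0.6, 0.8]), ("mol_d", [-1.0, 0.0])],
]
hits = top_k_hits([0.6, 0.8], chunks, k=2)
print([mol_id for _, mol_id in hits])
print(label_hit(0.45, is_functional=True, is_obvious_derivative=False))
```

Keeping only a size-$k$ heap means memory stays constant regardless of how many chunks the database is split into.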
5. Empirical Examples and Effectiveness
Concrete applications to several bioactive and dye molecules demonstrate the method's utility:
| Query Molecule | Canonicalization* | Example SDFA (hit) | Tanimoto Coefficient |
|---|---|---|---|
| Penicillin G | RDKit Atom n | metalloprotease inhibitor (CID 59069194) | 0.41 |
| Penicillin G | OEChem | MKNK inhibitor (non-lactam) | 0.45 |
| Nirmatrelvir | OEChem | RSV inhibitor (CID 137371163) | 0.35 |
| LSD | OEChem | dihydroergotamine analogue (CID 18634859) | 0.39 |
| Acid Blue 25 FA | OEChem | porphyrins (CID 20314677) | 0.52 |
| Avobenzone | OEChem | Ir(III) OLED dopant (CID 123262979) | 0.08 |
| 2-diphenylaminocarbazole | OEChem | OLED scaffold (CID 91515383) | 0.13 |
*Canonicalization = query SMILES variant used as prompt
A salient result is that OEChem prompts consistently surface functionally validated analogues with low structural similarity, as corroborated by patent or scientific literature; e.g., Penicillin G's OEChem prompt retrieves a non-lactam MKNK inhibitor with Tanimoto = 0.45 (Kosonocky et al., 2023).
6. Quantitative Evaluation and Statistical Significance
Evaluation across eight queries and top 20 hits per query yields the following metrics (mean ± 95% CI):
| Query Prompt | Mean Tanimoto | SDFA per 20 hits | NDFA per 20 hits |
|---|---|---|---|
| RDKit Atom 0 | 0.62 ± 0.03 | 1.1 ± 0.6 | 0.6 ± 0.4 |
| RDKit Atom n | 0.45 ± 0.04 | 2.8 ± 1.0 | 1.8 ± 0.7 |
| OEChem | 0.32 ± 0.05 | 4.4 ± 1.2 | 3.6 ± 1.0 |
Two-sided t-tests indicate statistically significant increases in SDFA and NDFA rates for OEChem versus RDKit Atom 0. Functional annotations are supported by independent patent and literature evidence for each top hit.
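The summary statistics in the table can be illustrated with a short sketch. The per-query counts below are made up for illustration and are not the study's data. The sketch also assumes that the confidence interval uses Student's t with 7 degrees of freedom (eight queries) and that the comparison is a Welch (unequal variance) test; the source only states that two-sided t-tests were used.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 7 degrees of freedom.
T_CRIT_95_DF7 = 2.365

def mean_ci(values, t_crit):
    """Return (mean, half-width) of a t-based confidence interval."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width

def welch_t(x, y):
    """Welch's t statistic for the difference in means of two samples."""
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    standard_error = math.sqrt(var_x / len(x) + var_y / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / standard_error

# Hypothetical SDFA counts per 20 hits for eight queries (not real results).
baseline_counts = [1, 0, 2, 1, 1, 0, 2, 1]
alternative_counts = [4, 5, 3, 6, 4, 5, 3, 4]

mean, half_width = mean_ci(baseline_counts, T_CRIT_95_DF7)
print(f"{mean:.2f} ± {half_width:.2f}")
print(welch_t(alternative_counts, baseline_counts))
```

Converting the statistic to a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(x, y, equal_var=False)` is one option if SciPy is available.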
7. Implications and Practical Guidance
Prompt-based analogue discovery exploits the differential encoding induced by alternative SMILES canonicalizations of the query while keeping the database embedding set fixed. This forces the chemical language model to rely on deeper, functionally relevant chemical features rather than token-level SMILES overlap, thereby identifying structurally distinct scaffolds with preserved function. Researchers aiming to discover IAT analogues should:
- Replace the query SMILES with that of IAT.
- Generate OEChem and RDKit Atom n canonical forms.
- Embed each form and compute its cosine similarity against the prepared database (RDKit Atom 0).
- Inspect the top hits' patent/literature records for functional validation.
This approach facilitates the identification of novel, non-obvious functional analogues unattainable via conventional fingerprint or SMILES string comparison methods (Kosonocky et al., 2023).
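Before manual curation it can help to merge the per-prompt hit lists and flag hits that the baseline prompt did not retrieve. The sketch below is a bookkeeping aid rather than part of the published method; the function names and molecule IDs are hypothetical.

```python
def merge_hits(hits_by_prompt):
    """Map each hit to the rank at which every prompt retrieved it.

    hits_by_prompt: {prompt_name: [mol_id, ...]} with lists in rank order.
    Returns {mol_id: {prompt_name: rank}} with ranks starting at 1.
    """
    merged = {}
    for prompt, ranked_ids in hits_by_prompt.items():
        for rank, mol_id in enumerate(ranked_ids, start=1):
            merged.setdefault(mol_id, {})[prompt] = rank
    return merged

def missed_by_baseline(merged, baseline="RDKit Atom 0"):
    """Hits surfaced only by alternative prompts, sorted by ID."""
    return sorted(m for m, ranks in merged.items() if baseline not in ranks)

# Hypothetical ranked hit lists for one query.
hits_by_prompt = {
    "RDKit Atom 0": ["mol_1", "mol_2", "mol_3"],
    "RDKit Atom n": ["mol_2", "mol_4", "mol_1"],
    "OEChem": ["mol_5", "mol_4", "mol_6"],
}

merged = merge_hits(hits_by_prompt)
print(missed_by_baseline(merged))
```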