
Stepping Back to SMILES Transformers for Fast Molecular Representation Inference (2112.13305v1)

Published 26 Dec 2021 in cs.CE

Abstract: At the intersection of molecular science and deep learning, tasks like virtual screening have driven the need for a high-throughput molecular representation generator on large chemical databases. However, as SMILES strings are the most common storage format for molecules, using deep graph models to extract molecular features from raw SMILES data requires a SMILES-to-graph conversion, which significantly slows down the whole process. Directly deriving molecular representations from SMILES is feasible, yet there is a performance gap between existing unpretrained SMILES-based models and graph-based models on large-scale benchmarks, while pretrained models are resource-intensive to train. To address this issue, we propose ST-KD, an end-to-end SMILES Transformer for molecular representation learning boosted by Knowledge Distillation. To enable knowledge transfer from graph Transformers to ST-KD, we have redesigned the attention layers and introduced a pre-transformation step that tokenizes the SMILES strings and injects structure-based positional embeddings. Without expensive pretraining, ST-KD shows competitive results on the recent standard molecular datasets PCQM4M-LSC and QM9, with 3-14× the inference speed of existing graph models.
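
To make the tokenization step mentioned in the abstract concrete, here is a minimal sketch of splitting a SMILES string into tokens with a regular expression. It is an illustration only: the pattern and the function name `tokenize_smiles` are assumptions rather than the tokenizer used in ST-KD, the pattern covers only a common subset of SMILES syntax, and the structure-based positional embeddings described in the abstract are not shown.

```python
import re
from typing import List

# Assumed token pattern (not taken from the paper): bracket atoms, two-letter
# halogens, organic-subset atoms, aromatic atoms, two-digit ring closures,
# single-digit ring closures, and bond/branch/stereo symbols.
_SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#$:~\-+\\/().@*]"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = _SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify full coverage.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    for s in ["CC(=O)O", "c1ccccc1", "ClCCBr", "[NH4+]"]:
        print(s, tokenize_smiles(s))
```

Listing `Br` and `Cl` before the single-letter atoms keeps two-letter halogens from being split into `C` + `l` or `B` + `r`, and the round-trip check turns unsupported input into an explicit error instead of silently dropping characters.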

Authors (4)
  1. Wenhao Zhu (32 papers)
  2. Ziyao Li (11 papers)
  3. Lingsheng Cai (3 papers)
  4. Guojie Song (39 papers)
Citations (5)
