SALSA: Semantically-Aware Latent Space Autoencoder (2310.02744v1)

Published 4 Oct 2023 in cs.LG

Abstract: In deep learning for drug discovery, chemical data are often represented as simplified molecular-input line-entry system (SMILES) sequences, which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are defined by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not respect the structural similarities between molecules. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA), a transformer-autoencoder modified with a contrastive task tailored specifically to learn graph-to-graph similarity between molecules. Formally, the contrastive objective is to map structurally similar molecules (separated by a single graph edit) to nearby codes in the latent space. To accomplish this, we generate a novel dataset comprising sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We compare SALSA to its ablated counterparts and show empirically that the composed training objective (reconstruction and contrastive task) leads to a higher-quality latent space that is more 1) structurally aware, 2) semantically continuous, and 3) property aware.
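
The composed objective pairs SMILES reconstruction with a supervised contrastive term computed over full sets of positives (molecules separated by a single graph edit). As a minimal sketch of such a term, assuming a PyTorch encoder that produces one latent code per molecule and integer "set" ids marking which molecules are structural neighbors, a SupCon-style loss could look like the following; the function name, the set_ids labeling scheme, and the temperature value are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(codes, set_ids, temperature=0.1):
    """SupCon-style loss over a batch of latent codes.

    codes:   (N, d) latent vectors, one per molecule in the batch
    set_ids: (N,)   integer tensor giving the structural set each molecule
                    came from; molecules sharing an id are treated as positives
    """
    z = F.normalize(codes, dim=1)                    # compare directions on the unit hypersphere
    sim = z @ z.t() / temperature                    # pairwise scaled cosine similarities

    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (set_ids.unsqueeze(0) == set_ids.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))  # drop self-similarity from the denominator
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)  # avoid -inf * 0 on the diagonal below

    # average over each anchor's full set of positives, as in supervised contrastive learning
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_counts

    # anchors with no positive in the batch contribute nothing
    return per_anchor[pos_mask.any(dim=1)].mean()
```

In training, this term would be combined with the autoencoder's SMILES reconstruction loss (presumably with a weighting coefficient) to form the composed objective described in the abstract; the batch would be built so that each sampled set contributes several single-edit neighbors sharing the same set id.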
