A Python library for efficient computation of molecular fingerprints (2403.19718v1)
Abstract: Machine learning solutions are very popular in chemoinformatics, where they have numerous applications, such as novel drug discovery and molecular property prediction. Molecular fingerprints are algorithms commonly used to vectorize chemical molecules as a preprocessing step in such solutions. However, despite their popularity, no existing library implements them efficiently for large datasets by utilizing modern multicore architectures. In addition, most existing libraries do not offer an intuitive interface, or one that is compatible with other machine learning tools. In this project, we created a Python library that computes molecular fingerprints efficiently and provides a comprehensive interface that lets users easily incorporate the library into their existing machine learning workflows. The library uses parallelism to perform computation on large datasets, which makes tasks such as hyperparameter tuning feasible in a reasonable time. We describe the tools used to implement the library and assess its time performance on example benchmark datasets. Additionally, we show that, using molecular fingerprints, very simple models can achieve results comparable to state-of-the-art ML solutions.