A Python library for efficient computation of molecular fingerprints (2403.19718v1)

Published 27 Mar 2024 in q-bio.QM and cs.LG

Abstract: Machine learning solutions are very popular in chemoinformatics, where they have numerous applications, such as novel drug discovery and molecular property prediction. Molecular fingerprints are algorithms commonly used to vectorize chemical molecules as a preprocessing step in such solutions. However, despite their popularity, no libraries implement them efficiently for large datasets by utilizing modern multicore architectures. Moreover, most existing libraries do not provide an intuitive interface, or one that is compatible with other machine learning tools. In this project, we created a Python library that computes molecular fingerprints efficiently and provides a comprehensive interface that lets users easily incorporate it into their existing machine learning workflows. The library uses parallelism to perform computation on large datasets, which makes tasks such as hyperparameter tuning feasible in a reasonable time. We describe the tools used in the implementation of the library and assess its time performance on example benchmark datasets. Additionally, we show that with molecular fingerprints we can achieve results comparable to state-of-the-art ML solutions even with very simple models.
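
The two ideas in the abstract, a fit/transform-style interface and chunked parallel computation, can be illustrated with a minimal sketch. This is not the library's API or implementation: the class `ToyFingerprintTransformer`, the function `toy_fingerprint`, and the parameters `n_bits` and `n_jobs` are hypothetical names chosen for illustration, and the "fingerprint" here is only a hash of SMILES character n-grams, not a chemically meaningful descriptor. Threads from the standard library are used only to keep the example self-contained; the abstract does not say which parallel backend the library uses.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib


def toy_fingerprint(smiles, n_bits=64, ngram=2):
    """Hash character n-grams of a SMILES string into a fixed-length bit list.

    Toy stand-in for a real molecular fingerprint (assumption for illustration).
    """
    bits = [0] * n_bits
    # Strings shorter than `ngram` still contribute one (short) fragment.
    for i in range(max(len(smiles) - ngram + 1, 1)):
        fragment = smiles[i:i + ngram]
        # crc32 is deterministic across runs, unlike the built-in hash().
        bits[zlib.crc32(fragment.encode("utf-8")) % n_bits] = 1
    return bits


class ToyFingerprintTransformer:
    """Hypothetical transformer following the fit/transform convention."""

    def __init__(self, n_bits=64, n_jobs=2):
        self.n_bits = n_bits
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def _transform_chunk(self, chunk):
        return [toy_fingerprint(s, self.n_bits) for s in chunk]

    def transform(self, X):
        X = list(X)
        if not X:
            return []
        # Split the input into at most n_jobs contiguous chunks.
        n_chunks = min(self.n_jobs, len(X))
        size = -(-len(X) // n_chunks)  # ceiling division
        chunks = [X[i:i + size] for i in range(0, len(X), size)]
        with ThreadPoolExecutor(max_workers=self.n_jobs) as pool:
            # map() preserves chunk order, so row order matches the input.
            results = list(pool.map(self._transform_chunk, chunks))
        return [row for chunk in results for row in chunk]


if __name__ == "__main__":
    molecules = ["CCO", "c1ccccc1", "CC(=O)O"]
    features = ToyFingerprintTransformer(n_bits=64, n_jobs=2).fit(molecules).transform(molecules)
    print(len(features), len(features[0]))
```

Because each molecule is processed independently, the input can be split into chunks without changing the result. For CPU-bound fingerprint algorithms, a process-based backend would likely be needed to benefit from multiple cores.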

