Understanding the Structure of QM7b and QM9 Quantum Mechanical Datasets Using Unsupervised Learning (2309.15130v1)
Abstract: This paper explores the internal structure of two quantum mechanics datasets (QM7b, QM9), each composed of several thousand organic molecules described in terms of electronic properties. Understanding the structure and characteristics of this kind of data is important when predicting atomic composition from the properties in inverse molecular design. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. These analyses revealed that, for both datasets, the intrinsic dimensionality is several times smaller than the number of descriptive dimensions. The QM7b data is composed of well-defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers and an inner core region that concentrates clustered inlier objects. A significant relationship exists between the number of atoms in a molecule and its outlier/inlier nature. Despite these structural differences, the predictability of variables of interest for inverse molecular design is high. This is exemplified with models that estimate the number of atoms in the molecule both from the original properties and from lower-dimensional embedding spaces.
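The claim that intrinsic dimensionality can be much smaller than the number of descriptive dimensions can be illustrated on synthetic data. The following is a minimal sketch, assuming a TwoNN-style estimator based on the ratio of each point's second to first nearest-neighbor distance. The abstract does not state which estimators were used, so this is not the authors' procedure. The data is synthetic (a 2-D plane embedded in 10 coordinates), not QM7b or QM9, and the names `two_nn_dimension` and `points` are hypothetical.

```python
import math
import random


def two_nn_dimension(points):
    """TwoNN-style maximum-likelihood estimate of intrinsic dimension.

    For each point, take the ratio of the distances to its second and
    first nearest neighbors; the estimate is n / sum(log(ratio)).
    """
    log_ratios = []
    for i, p in enumerate(points):
        # Track the two smallest distances from p to any other point.
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip exact duplicates, which would give an undefined ratio
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


# Synthetic example (an assumption for illustration only): two latent
# variables (u, v) are mapped linearly into 10 descriptive coordinates.
rng = random.Random(0)
points = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    points.append([math.cos(k) * u + math.sin(k) * v for k in range(10)])

estimate = two_nn_dimension(points)
print(f"descriptive dimensions: {len(points[0])}, estimated intrinsic dimension: {estimate:.2f}")
```

The estimate should land close to 2 even though each point has 10 coordinates, which is the kind of gap between descriptive and intrinsic dimension that the abstract reports. The brute-force neighbor search is quadratic in the number of points, so a real dataset of QM9's size would need a spatial index or a library implementation.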