Supervised Manifold Learning via Random Forest Geometry-Preserving Proximities (2307.01077v1)
Abstract: Manifold learning approaches seek the intrinsic, low-dimensional data structure within a high-dimensional space. Mainstream manifold learning algorithms, such as Isomap, UMAP, $t$-SNE, Diffusion Map, and Laplacian Eigenmaps do not use data labels and are thus considered unsupervised. Existing supervised extensions of these methods are limited to classification problems and fall short of uncovering meaningful embeddings due to their construction using order non-preserving, class-conditional distances. In this paper, we show the weaknesses of class-conditional manifold learning quantitatively and visually and propose an alternate choice of kernel for supervised dimensionality reduction using a data-geometry-preserving variant of random forest proximities as an initialization for manifold learning methods. We show that local structure preservation using these proximities is near universal across manifold learning approaches and global structure is properly maintained using diffusion-based algorithms.
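As a rough illustration of how forest-derived proximities can act as a kernel, the sketch below computes the original leaf co-occurrence proximity: the fraction of trees in which two samples land in the same terminal node. This is not the geometry-preserving variant described in the abstract; it is swapped in here only because it is simpler to state. The function name `leaf_cooccurrence_proximity` and the toy leaf assignments are hypothetical. With scikit-learn, a matrix of leaf ids of this shape is what `RandomForestClassifier.apply(X)` returns.

```python
from typing import List, Sequence


def leaf_cooccurrence_proximity(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Classic random forest proximity (not the geometry-preserving variant).

    leaves[i][t] is the id of the terminal node that sample i reaches in tree t.
    Returns an n x n symmetric matrix whose (i, j) entry is the fraction of
    trees in which samples i and j share a terminal node.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Count the trees where both samples fall in the same leaf.
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments: 4 samples, 3 trees (made-up values).
toy_leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 5, 3],
    [2, 5, 2],
]
toy_prox = leaf_cooccurrence_proximity(toy_leaves)
```

The resulting symmetric matrix could then be passed to a manifold learner that accepts a precomputed kernel, for example scikit-learn's `SpectralEmbedding(affinity="precomputed")`.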
- K. R. Moon, D. van Dijk, Z. Wang, S. Gigante, D. B. Burkhardt, W. S. Chen, K. Yim, A. v. d. Elzen, M. J. Hirn, R. R. Coifman, N. B. Ivanova, G. Wolf, and S. Krishnaswamy, “Visualizing structure and transitions in high-dimensional biological data,” Nat. Biotechnol., vol. 37, no. 12, pp. 1482–1492, Dec 2019. [Online]. Available: https://doi.org/10.1038/s41587-019-0336-3
- L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001.
- J. S. Rhodes, A. Cutler, and K. R. Moon, “Geometry- and accuracy-preserving random forest proximities,” 2022. [Online]. Available: https://arxiv.org/abs/2201.12682
- J. B. Tenenbaum, V. Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000. [Online]. Available: https://doi.org/10.1126/science.290.5500.2319
- R. R. Coifman and S. Lafon, “Diffusion maps,” Appl. Comput. Harmon. Anal., vol. 21, no. 1, pp. 5–30, 2006, special Issue: Diffusion Maps and Wavelets. [Online]. Available: https://doi.org/10.1016/j.acha.2006.04.006
- L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/vandermaaten08a.html
- M. Vlachos, C. Domeniconi, D. Gunopulos, G. Kollios et al., “Non-linear dimensionality reduction techniques for classification and visualization,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’02. New York, NY, USA: Association for Computing Machinery, 2002, pp. 645–651. [Online]. Available: https://doi.org/10.1145/775047.775143
- D. de Ridder, O. Kouropteva, O. Okun, M. Pietikäinen, and R. P. W. Duin, “Supervised locally linear embedding,” in Artificial Neural Networks and Neural Information Processing — ICANN/ICONIP 2003, O. Kaynak, E. Alpaydin, E. Oja, and L. Xu, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 333–341.
- B. Ribeiro, A. Vieira, and J. Carvalho das Neves, “Supervised Isomap with dissimilarity measures in embedding learning,” in Progress in Pattern Recognition, Image Analysis and Applications, J. Ruiz-Shulcloper and W. G. Kropatsch, Eds. Berlin, Heidelberg: Springer, 2008, pp. 389–396. [Online]. Available: https://doi.org/10.1007/978-3-540-85920-8_48
- L. Hajderanj, I. Weheliye, and D. Chen, “A new supervised t-SNE with dissimilarity measure for effective data visualization and classification,” in Proceedings of the 2019 8th International Conference on Software and Information Engineering, ser. ICSIE ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 232–236. [Online]. Available: https://doi.org/10.1145/3328833.3328853
- S. Zhang, “Enhanced supervised locally linear embedding,” Pattern Recognit. Lett., vol. 30, no. 13, pp. 1208–1218, 2009. [Online]. Available: https://doi.org/10.1016/j.patrec.2009.05.011
- Q. Jiang and M. Jia, “Supervised Laplacian eigenmaps for machinery fault classification,” in 2009 WRI World Congress on Computer Science and Information Engineering, vol. 7, pp. 116–120, 2009. [Online]. Available: http://doi.org/10.1109/CSIE.2009.765
- L. Hajderanj, D. Chen, and I. Weheliye, “The impact of supervised manifold learning on structure preserving and classification error: A theoretical study,” IEEE Access, vol. 9, pp. 43909–43922, 2021. [Online]. Available: https://doi.org/10.1109/ACCESS.2021.3066259
- M. Bohanec and V. Rajkovič, “Knowledge acquisition and explanation for multi-attribute decision making,” in 8th International Workshop “Expert Systems and Their Applications,” 1988.
- L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform manifold approximation and projection for dimension reduction,” arXiv, vol. abs/1802.03426, 2018. [Online]. Available: https://arxiv.org/abs/1802.03426
- L. McInnes, J. Healy, N. Saul, and L. Grossberger, “UMAP: Uniform manifold approximation and projection,” The Journal of Open Source Software, vol. 3, no. 29, p. 861, 2018.
- L. Breiman and A. Cutler, “Random forests,” https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm, accessed: 03/02/2023.
- T. Hastie, R. Tibshirani, and J. Friedman, “Random forests,” in The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY: Springer New York, 2009, pp. 587–604. [Online]. Available: https://doi.org/10.1007/978-0-387-84858-7_15
- Y. Lin and Y. Jeon, “Random forests and adaptive nearest neighbors,” J. Am. Stat. Assoc., vol. 101, no. 474, pp. 578–590, 2006. [Online]. Available: https://doi.org/10.1198/016214505000001230
- N. Mohamed Amine Mairech, “Datac’ept : Life expectancy prediction,” 2019. [Online]. Available: https://kaggle.com/competitions/datacept-life-expectancy-prediction
- J. S. Rhodes, A. Cutler, G. Wolf, and K. R. Moon, “Random forest-based diffusion information geometry for supervised visualization and data exploration,” 2021 IEEE Statistical Signal Processing Workshop (SSP), pp. 331–335, 2021. [Online]. Available: https://doi.org/10.1109/SSP49050.2021.9513749
- D. Dua and C. Graff, “UCI machine learning repository,” 2017, (Accessed on 03/02/2023). [Online]. Available: http://archive.ics.uci.edu/ml
- R. Gorman and T. J. Sejnowski, “Analysis of hidden units in a layered network trained to classify sonar targets,” Neural Netw., vol. 1, no. 1, pp. 75–89, 1988. [Online]. Available: https://doi.org/10.1016/0893-6080(88)90023-8