Unsupervised Learning via Network-Aware Embeddings (2309.10408v1)
Abstract: Data clustering, the task of grouping observations according to their similarity, is a key component of unsupervised learning, with real-world applications in diverse fields such as biology, medicine, and social science. In these fields, the data often comes with complex interdependencies between the dimensions of analysis: for instance, the characteristics and opinions people hold are distributed over a complex social network. Current clustering methods are ill-suited to this complexity: deep learning can approximate these dependencies, but cannot take an explicit map of them as an input of the analysis. In this paper, we aim to fix this blind spot in the unsupervised learning literature. We create network-aware embeddings by estimating the network distance between numeric node attributes via the generalized Euclidean distance. Unlike all methods in the literature that we know of, we do not cluster the nodes of the network, but rather its node attributes. In our experiments, we show that these network-aware embeddings are always beneficial for the learning task; that our method scales to large networks; and that it provides actionable insights in applications in a variety of fields such as marketing, economics, and political science. Our method is fully open source, and data and code are available to reproduce all results in the paper.
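The abstract's central idea, a network distance between numeric node attributes, can be illustrated with a minimal sketch. It assumes the usual formulation of the generalized Euclidean distance, sqrt((a - b)^T L^+ (a - b)), where L^+ is the pseudoinverse of the graph Laplacian. The function names, the dense NumPy representation, and the toy path graph are illustrative assumptions, not the authors' released code.

```python
import numpy as np


def laplacian_pinv(adjacency):
    """Pseudoinverse of the combinatorial Laplacian L = D - A (dense, small graphs only)."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)


def generalized_euclidean(adjacency, a, b):
    """Network distance between two node-attribute vectors a and b (one value per node)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(diff @ laplacian_pinv(adjacency) @ diff))


def pairwise_attribute_distances(adjacency, X):
    """Distance matrix between the columns of X (rows = nodes, columns = attributes).

    Uses the identity d(i, j)^2 = G_ii + G_jj - 2 G_ij with G = X^T L^+ X.
    """
    X = np.asarray(X, dtype=float)
    G = X.T @ laplacian_pinv(adjacency) @ X
    g = np.diag(G)
    squared = g[:, None] + g[None, :] - 2.0 * G
    # Clip tiny negative values caused by floating-point round-off.
    return np.sqrt(np.clip(squared, 0.0, None))


# Toy example (assumption): a path graph 0 - 1 - 2 - 3 with one attribute per node.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
X = np.eye(4)  # attribute k is concentrated entirely on node k
D = pairwise_attribute_distances(A, X)
```

In this toy case, attributes concentrated on nodes that are far apart along the path end up farther apart in `D` than attributes on adjacent nodes. A matrix like `D` could then be passed to any clustering routine that accepts precomputed distances, so that the attributes, rather than the nodes, are grouped.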