Scalable Community Search with Accuracy Guarantee on Attributed Graphs (2402.17242v3)
Abstract: Given an attributed graph $G$ and a query node $q$, Community Search over Attributed Graphs (CS-AG) aims to find a structure- and attribute-cohesive subgraph of $G$ that contains $q$. Although CS-AG has been widely studied, existing methods still face three challenges. (1) Exact methods based on graph traversal are time-consuming, especially on large graphs. Tailored indices can improve efficiency, but they introduce non-negligible storage and maintenance overhead. (2) Approximate methods with a loose approximation ratio provide only a coarse-grained evaluation of a community's quality, rather than a reliable evaluation with an accuracy guarantee at runtime. (3) Attribute cohesiveness metrics often ignore the important correlation with the query node $q$. We formally define our CS-AG problem atop a $q$-centric attribute cohesiveness metric that considers both textual and numerical attributes, for the $k$-core model on homogeneous graphs. We show that the problem is NP-hard. To solve it, we first propose an exact baseline with three pruning strategies. We then propose an index-free, sampling-estimation-based method that quickly returns an approximate community with an accuracy guarantee in the form of a confidence interval. Once a result satisfying a user-desired error bound is reached, the method terminates early. We extend it to heterogeneous graphs, the $k$-truss model, and size-bounded CS. Comprehensive experimental studies on ten real-world datasets show its superiority: it is at least 1.54$\times$ (41.1$\times$ on average) faster in response time and achieves a reliable relative error of attribute cohesiveness (within a user-specified error bound).
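To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes an undirected graph stored as a dict of neighbor sets and textual attributes stored as keyword sets. The function names (`k_core_containing`, `jaccard_distance`, `estimate_mean_distance`) are hypothetical. The $q$-centric cohesiveness is stood in for by the mean Jaccard distance to $q$, and the confidence interval comes from Hoeffding's inequality, a generic bound swapped in for illustration. It bounds the absolute error, whereas the abstract speaks of a relative error.

```python
import math
import random
from collections import deque


def k_core_containing(adj, q, k):
    """Return the connected k-core that contains q, or an empty set."""
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    queue = deque(v for v in adj if degree[v] < k)
    # Peel nodes whose remaining degree is below k.
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue  # already peeled (a node can be queued twice)
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    queue.append(u)
    if q not in alive:
        return set()
    # Keep only the connected component of q among the survivors.
    component, frontier = {q}, deque([q])
    while frontier:
        v = frontier.popleft()
        for u in adj[v]:
            if u in alive and u not in component:
                component.add(u)
                frontier.append(u)
    return component


def jaccard_distance(a, b):
    """Jaccard distance between two keyword sets, a value in [0, 1]."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def estimate_mean_distance(nodes, attrs, q, error_bound,
                           confidence=0.95, batch=50, seed=0):
    """Sample nodes with replacement and stop once the Hoeffding
    half-width of the confidence interval is within error_bound.

    Returns (estimate, half_width, number_of_samples).
    """
    if error_bound <= 0:
        raise ValueError("error_bound must be positive")
    rng = random.Random(seed)
    pool = sorted(nodes)  # fixed order so a given seed is reproducible
    delta = 1.0 - confidence
    total, n = 0.0, 0
    while True:
        for _ in range(batch):
            v = rng.choice(pool)
            total += jaccard_distance(attrs[v], attrs[q])
            n += 1
        # Hoeffding: P(|mean - mu| >= t) <= 2 * exp(-2 * n * t^2)
        half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        if half_width <= error_bound:
            return total / n, half_width, n


if __name__ == "__main__":
    # Toy data (assumed): a triangle 1-2-3 with a pendant node 4.
    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    attrs = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"a"}, 4: {"c"}}
    community = k_core_containing(adj, q=1, k=2)
    print(community, estimate_mean_distance(community, attrs, 1, 0.05))
```

Because the Hoeffding half-width depends only on the number of samples, the stopping point in this sketch is determined by `error_bound` and `confidence` alone. An interval computed from the observed samples could stop at a different point.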