
Cyber-Security Knowledge Graph Generation by Hierarchical Nonnegative Matrix Factorization (2403.16222v2)

Published 24 Mar 2024 in cs.AI

Abstract: Much of human knowledge in cybersecurity is encapsulated within the ever-growing volume of scientific papers. As this textual data continues to expand, document organization methods become increasingly crucial for extracting actionable insights hidden within large text datasets. Knowledge Graphs (KGs) store factual information in a structured manner, providing explicit, interpretable knowledge that includes domain-specific information from the cybersecurity scientific literature. One of the challenges in constructing a KG from scientific literature is the extraction of an ontology from unstructured text. In this paper, we address this challenge and introduce a method for building a multi-modal KG by extracting a structured ontology from scientific papers. We demonstrate this concept in the cybersecurity domain. One modality of the KG represents observable information from the papers, such as the categories in which they were published or the authors. The second modality uncovers latent (hidden) patterns in the text, such as named entities, topics or clusters, and keywords, extracted through hierarchical and semantic non-negative matrix factorization (NMF). We illustrate this concept by consolidating more than two million scientific papers uploaded to arXiv into the cyber-domain, using hierarchical and semantic NMF, and by building a cyber-domain-specific KG.
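
To make the idea of hierarchical NMF in the abstract concrete, here is a minimal sketch and not the authors' implementation. It assumes plain Lee-Seung multiplicative updates, a fixed number of topics `k` per level, and a toy document-term matrix. It omits the semantic (co-occurrence) term and the automatic model selection that the abstract's "semantic NMF" implies. The names `nmf`, `hierarchical_nmf`, and the document ids are hypothetical.

```python
import numpy as np

def nmf(X, k, iters=200, seed=0):
    """Approximate nonnegative X (docs x terms) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius loss.
    This is a generic NMF, assumed here for illustration only.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + 1e-3   # document-topic weights
    H = rng.random((k, X.shape[1])) + 1e-3   # topic-term weights
    eps = 1e-9                               # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def hierarchical_nmf(X, doc_ids, k=2, depth=2, min_docs=2):
    """Recursively split documents by dominant topic.

    Returns a nested dict: {"docs": [...], "children": [...]}.
    Each level assigns a document to the topic with its largest W entry,
    then factorizes each resulting cluster again.
    """
    node = {"docs": list(doc_ids), "children": []}
    if depth == 0 or len(doc_ids) < max(k, min_docs):
        return node
    W, _ = nmf(X, k)
    labels = W.argmax(axis=1)
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        # Recurse only on proper, non-empty subsets to guarantee termination.
        if 0 < len(idx) < len(doc_ids):
            child_ids = [doc_ids[i] for i in idx]
            node["children"].append(
                hierarchical_nmf(X[idx], child_ids, k, depth - 1, min_docs)
            )
    return node

# Toy document-term counts: two documents per vocabulary block (assumed data).
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 2.0],
    [0.0, 0.0, 2.0, 3.0],
])
doc_ids = ["d0", "d1", "d2", "d3"]

W, H = nmf(X, 2)
tree = hierarchical_nmf(X, doc_ids, k=2, depth=2)
print(tree)
```

In a knowledge graph built this way, each node of `tree` could become a topic vertex linked to its documents, with the top-weighted terms in `H` serving as keyword vertices. That mapping is an assumption for illustration and is not described in the abstract.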

Authors (9)
  1. Ryan Barron
  2. Maksim E. Eren
  3. Manish Bhattarai
  4. Nicholas Solovyev
  5. Kim Rasmussen
  6. Boian S. Alexandrov
  7. Charles Nicholas
  8. Cynthia Matuszek
  9. Selma Wanna
Citations (2)
