
Threat Behavior Textual Search by Attention Graph Isomorphism (2404.10944v2)

Published 16 Apr 2024 in cs.IR

Abstract: Cyber attacks cause over $1 trillion in losses every year. An important task for cyber security analysts is attack forensics, which entails understanding malware behaviors and attack origins. However, existing automated or manual malware analysis can only disclose a subset of behaviors due to inherent difficulties (e.g., malware cloaking and obfuscation). As such, analysts often resort to text search techniques to identify existing malware reports based on the symptoms they observe, exploiting the fact that malware samples share many similarities, especially those from the same origin. In this paper, we propose a novel malware behavior search technique that is based on graph isomorphism at the attention layers of Transformer models. We also compose a large dataset collected from various agencies to facilitate such research. Our technique outperforms state-of-the-art methods, such as those based on sentence embeddings and keywords, by 6-14%. In a case study of 10 real-world malware samples, our technique correctly attributes 8 of them to their ground-truth origins, whereas searching with Google succeeds in only 3 cases.

