ProvG-Searcher: A Graph Representation Learning Approach for Efficient Provenance Graph Search (2309.03647v2)
Abstract: We present ProvG-Searcher, a novel approach for detecting known APT behaviors within system security logs. Our approach leverages provenance graphs, a comprehensive graph representation of event logs, to capture and depict data provenance relations by mapping system entities as nodes and their interactions as edges. We formulate the task of searching provenance graphs as a subgraph matching problem and employ a graph representation learning method. The central component of our search methodology is the embedding of subgraphs into a vector space where subgraph relationships can be directly evaluated. We achieve this through the use of order embeddings, which reduce subgraph matching to straightforward comparisons between a query and precomputed subgraph representations. To address the challenges posed by the size and complexity of provenance graphs, we propose a graph partitioning scheme and a behavior-preserving graph reduction method. Overall, our technique offers significant computational efficiency, allowing most of the search computation to be performed offline and leaving only a lightweight comparison step for query execution. Experimental results on standard datasets demonstrate that ProvG-Searcher achieves superior performance, with an accuracy exceeding 99% in detecting query behaviors and a false positive rate of approximately 0.02%, outperforming other approaches.
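The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes the common order-embedding formulation, in which a query is treated as a subgraph of a target when every coordinate of the query embedding is no larger than the corresponding coordinate of the target embedding. The function names, the threshold, and the three-dimensional embedding values below are hypothetical and chosen only for illustration; in the actual system the embeddings would come from a trained graph neural network, and the exact penalty used by ProvG-Searcher may differ.

```python
def order_violation(query_emb, target_emb):
    """Sum of squared amounts by which the query exceeds the target, per coordinate.

    A value of 0 means the query embedding lies coordinate-wise below the
    target embedding, which the order-embedding view reads as
    "query is a subgraph of target".
    """
    return sum(max(0.0, q - t) ** 2 for q, t in zip(query_emb, target_emb))


def search(query_emb, index, threshold=0.0):
    """Return the names of precomputed subgraphs whose violation is within threshold."""
    return [
        name
        for name, emb in index.items()
        if order_violation(query_emb, emb) <= threshold
    ]


# Hypothetical embeddings of graph partitions, computed offline (made-up values).
index = {
    "partition_a": [0.9, 0.7, 0.8],
    "partition_b": [0.2, 0.9, 0.1],
    "partition_c": [0.5, 0.6, 0.4],
}

# Hypothetical embedding of a query graph describing a known behavior.
query = [0.4, 0.5, 0.3]

matches = search(query, index)
print(matches)
```

With these made-up values, `partition_a` and `partition_c` dominate the query in every coordinate and are returned, while `partition_b` is rejected because the query exceeds it in the first and third coordinates. The point of the design is that the expensive part, producing the embeddings in `index`, happens offline, and query time reduces to an element-wise comparison per stored subgraph.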