- The paper introduces HiGraph, a dataset featuring over 200 million Control Flow Graphs nested within 595,000 Function Call Graphs for enhanced malware detection.
- It employs a dual-layer hierarchical graph structure to capture both local and global semantic relationships, improving robustness against obfuscation.
- Empirical analysis reveals that malware samples exhibit higher PageRank and cyclomatic complexity, guiding the development of advanced graph neural network models.
HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis
Introduction
The paper "HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis" addresses a key limitation of graph-based malware detection: the lack of datasets that capture the hierarchical structure inherent in software binaries. Existing datasets often rely on flat graph representations that fail to model the semantic relationships between function calls and the lower-level control flow. HiGraph comprises over 200 million Control Flow Graphs (CFGs) nested within approximately 595,000 Function Call Graphs (FCGs). This hierarchical organization supports malware detection approaches that remain robust under code obfuscation and software evolution.
Figure 1: Evolution of CryptoLocker ransomware, illustrating persistent malicious behavior detectable via graph-based analysis.
Dataset Construction
HiGraph was built from Android applications collected from the AndroZoo repository, spanning 2012 to 2022. The dataset comprises 595,211 distinct applications, each labeled as benign or malicious based on an analysis of VirusTotal reports. AVClass2 was then used for more granular classification into malware families.
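The summary does not say how VirusTotal reports were turned into labels. A minimal sketch of one common style of rule, thresholding the number of antivirus engines that flag a sample, is shown below. The function name, the three-way outcome, and the `malicious_min` parameter are all assumptions made for illustration, not the paper's procedure.

```python
def label_from_report(positives, malicious_min):
    """Map a count of flagging antivirus engines to a coarse label.

    Both the rule and `malicious_min` are illustrative assumptions;
    the labeling criteria actually used for HiGraph may differ.
    """
    if positives == 0:
        return "benign"
    if positives >= malicious_min:
        return "malicious"
    # Samples flagged by only a few engines are ambiguous under this rule.
    return "uncertain"


# The threshold of 4 is arbitrary and chosen only for the example.
labels = [label_from_report(p, malicious_min=4) for p in (0, 2, 9)]
```

A rule of this kind leaves a gray zone between clearly benign and clearly malicious samples; whether such samples are dropped or kept is a dataset design choice.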
The core of HiGraph is its dual-level graph structure, consisting of CFGs and FCGs. CFGs were extracted at the function level, capturing intra-procedural control logic, while FCGs detailed inter-procedural function call relationships within each application. This hierarchical graph architecture exceeds prior dataset scales, as substantiated by detailed comparative metrics with existing corpora.
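The nesting of CFGs inside an FCG can be pictured with a small data structure. The sketch below is illustrative only: the class names, field names, and toy application are hypothetical and do not reflect HiGraph's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ControlFlowGraph:
    """Intra-procedural graph: nodes are basic blocks, edges are jumps."""
    blocks: List[str]
    edges: List[Tuple[int, int]]  # (source block index, target block index)


@dataclass
class FunctionCallGraph:
    """Inter-procedural graph: nodes are functions, edges are calls."""
    functions: List[str]
    calls: List[Tuple[int, int]]  # (caller index, callee index)
    # Each function node carries its own CFG; this nesting is the hierarchy.
    cfgs: Dict[int, ControlFlowGraph] = field(default_factory=dict)

    def cfg_of(self, name: str) -> ControlFlowGraph:
        """Return the CFG attached to the function with the given name."""
        return self.cfgs[self.functions.index(name)]


# Toy application with two functions (all names are illustrative).
app = FunctionCallGraph(
    functions=["onCreate", "sendData"],
    calls=[(0, 1)],
    cfgs={
        0: ControlFlowGraph(blocks=["entry", "exit"], edges=[(0, 1)]),
        1: ControlFlowGraph(
            blocks=["entry", "loop", "exit"],
            edges=[(0, 1), (1, 1), (1, 2)],
        ),
    },
)
```

One application corresponds to one FCG, and each function node in it points to exactly one CFG, which is why the CFG count is far larger than the FCG count.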
Figure 2: Overview of the HiGraph construction pipeline, illustrating dataset collection, labeling, and graph extraction processes.
Empirical Analysis
The empirical analysis of HiGraph reveals structural differences between benign and malicious samples. Malware tends to exhibit higher PageRank values and higher cyclomatic complexity, indicating a more centralized and logically complex architecture in both CFGs and FCGs. These structural differences can guide the development of detection models.
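For reference, both metrics can be computed from a plain edge list. The sketch below uses the standard definitions: cyclomatic complexity as M = E - N + 2P (edges, nodes, connected components) and PageRank by power iteration. The damping factor of 0.85 and the iteration count are conventional defaults assumed here, and how the paper aggregates per-node scores into per-application statistics is not specified in this summary.

```python
def cyclomatic_complexity(num_nodes, edges, components=1):
    """McCabe's metric M = E - N + 2P for a control flow graph."""
    return len(edges) - num_nodes + 2 * components


def pagerank(num_nodes, edges, damping=0.85, iterations=100):
    """PageRank by power iteration on a directed graph given as an edge list."""
    out_links = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        out_links[src].append(dst)

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in range(num_nodes) if not out_links[n])
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        for src in range(num_nodes):
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy CFG: three basic blocks with one self-loop, so one decision point.
complexity = cyclomatic_complexity(3, [(0, 1), (1, 1), (1, 2)])

# Toy FCG: functions 0 and 1 both call function 2.
ranks = pagerank(3, [(0, 2), (1, 2)])
```

In the toy FCG, the function called by both others receives the highest score, which is the sense in which high PageRank marks a central function node.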
Temporal analysis traces the evolution of malware families over a decade and shows a rapid emergence of new families peaking around 2019-2020. Because HiGraph covers this period consistently, it is well suited to studying concept drift in malware detection.
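A concept drift study typically trains on older samples and evaluates on newer ones. The sketch below shows such a time-based split together with a count of families that appear only after the cutoff. The record layout (`year`, `family` keys), the placeholder family names, and the cutoff year are assumptions for illustration; they are not taken from the dataset.

```python
def temporal_split(samples, cutoff_year):
    """Split samples into those before the cutoff year and those from it onward."""
    train = [s for s in samples if s["year"] < cutoff_year]
    test = [s for s in samples if s["year"] >= cutoff_year]
    return train, test


def unseen_families(train, test):
    """Families present in the test period but absent from the training period."""
    return {s["family"] for s in test} - {s["family"] for s in train}


# Placeholder records; the family names are not real malware families.
samples = [
    {"year": 2014, "family": "family_a"},
    {"year": 2016, "family": "family_b"},
    {"year": 2019, "family": "family_a"},
    {"year": 2020, "family": "family_c"},
]

train, test = temporal_split(samples, cutoff_year=2018)
new_families = unseen_families(train, test)
```

Unlike a random split, this arrangement never lets a model see samples from the future, so a drop in test performance reflects drift rather than leakage.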
Figure 3: Monthly distribution of top malware families, highlighting dynamic trends in prevalence over time.
Evaluating HiGraph for Malware Detection
The hierarchical graph structure of HiGraph lends itself to effective malware detection through advanced graph neural networks (GNNs). Hi-GNN, a proposed model exploiting HiGraph's hierarchical nature, significantly outperforms traditional GNNs in both binary and multi-class classification tasks. Hi-GNN's architecture, integrating separate CFG and FCG encoders, captures both local logic and global structural features, resulting in superior detection capabilities.
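The two-level idea can be illustrated without any learned parameters. In the sketch below, a CFG encoder turns each function's basic-block features into one vector, and those vectors become the node features of the FCG encoder. Plain mean aggregation stands in for trained GNN layers, so this is not the Hi-GNN architecture itself; the function names and the toy feature vectors are hypothetical.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def message_pass(features, edges):
    """One round of mean aggregation over each node and its neighbors.

    Edges are treated as undirected; a trained GNN layer would also
    apply learned weights and a nonlinearity, which are omitted here.
    """
    neighbors = [[n] for n in range(len(features))]  # each node includes itself
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)
    return [
        mean_vectors([features[m] for m in neighbors[n]])
        for n in range(len(features))
    ]


def encode_cfg(block_features, cfg_edges):
    """Lower level: pool basic-block features into one function embedding."""
    return mean_vectors(message_pass(block_features, cfg_edges))


def encode_app(cfgs, call_edges):
    """Upper level: function embeddings become FCG node features."""
    function_embeddings = [encode_cfg(feats, edges) for feats, edges in cfgs]
    return mean_vectors(message_pass(function_embeddings, call_edges))


# Toy application: function 0 has two blocks, function 1 has one block.
cfgs = [
    ([[1.0, 0.0], [0.0, 1.0]], [(0, 1)]),
    ([[1.0, 1.0]], []),
]
app_embedding = encode_app(cfgs, call_edges=[(0, 1)])
```

The resulting application vector would feed a classifier head for binary or multi-class prediction. Keeping the two encoders separate lets local control logic and global call structure be modeled at their own scales.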
Figure 4: Correlation between FCG average PageRank and CFG cyclomatic complexity for malware applications, emphasizing central and complex function nodes.
Conclusion
HiGraph opens a new avenue for malware analysis grounded in hierarchical graph learning. By capturing both intra- and inter-procedural semantics, it enables the development and evaluation of malware detection models that are resilient to temporal change and obfuscation. With open access and comprehensive documentation, HiGraph gives the community a resource for designing robust cybersecurity solutions.