HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis

Published 2 Sep 2025 in cs.LG, cs.AI, cs.CR, and cs.SI | (2509.02113v1)

Abstract: The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single-level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce HiGraph, the largest public hierarchical graph dataset for malware analysis, comprising over 200M Control Flow Graphs (CFGs) nested within 595K Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.

Summary

The paper introduces HiGraph, a dataset featuring over 200 million Control Flow Graphs nested within 595,000 Function Call Graphs for enhanced malware detection.
It employs a dual-layer hierarchical graph structure to capture both local and global semantic relationships, improving robustness against obfuscation.
Empirical analysis reveals that malware samples exhibit higher PageRank and cyclomatic complexity, guiding the development of advanced graph neural network models.

Introduction

The paper "HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis" addresses a key limitation in graph-based malware detection: the lack of datasets that capture the hierarchical structure inherent in software. Existing datasets typically use flat, single-level graph representations, which fail to model the semantic relationship between function-level interactions and the lower-level control flow inside each function. HiGraph contains over 200 million Control Flow Graphs (CFGs) nested within approximately 595,000 Function Call Graphs (FCGs). This hierarchical organization is intended to support detection approaches that remain robust under code obfuscation and malware evolution.

Figure 1: Evolution of CryptoLocker ransomware illustrates persistent malicious behavior detectable via graph-based analysis.

Dataset Construction

HiGraph was built from Android applications collected from the AndroZoo repository, spanning 2012 to 2022. The dataset comprises 595,211 distinct applications, each labeled benign or malicious based on an analysis of its VirusTotal reports. AVClass2 was then used for finer-grained classification into malware families.

The core of HiGraph is its two-level graph structure, consisting of CFGs and FCGs. CFGs were extracted per function and capture intra-procedural control logic, while each application's FCG records the inter-procedural call relationships between its functions. The paper compares HiGraph with existing corpora and reports that it exceeds them in scale.
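The nesting described above can be pictured as a call graph whose nodes each carry their own control flow graph. The following is a minimal sketch of such a container, not the dataset's actual storage format: the class names, field names, and example function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ControlFlowGraph:
    """Intra-procedural graph of one function: basic blocks and branch edges."""
    num_blocks: int
    edges: list = field(default_factory=list)  # (source_block, target_block) pairs


@dataclass
class FunctionCallGraph:
    """Inter-procedural graph of one application; each node carries a CFG."""
    functions: dict = field(default_factory=dict)  # function name -> ControlFlowGraph
    calls: list = field(default_factory=list)      # (caller, callee) pairs

    def add_function(self, name, cfg):
        self.functions[name] = cfg

    def add_call(self, caller, callee):
        # Both endpoints must already be registered as functions.
        if caller not in self.functions or callee not in self.functions:
            raise KeyError("unknown function in call edge")
        self.calls.append((caller, callee))


# Hypothetical two-function application.
app = FunctionCallGraph()
app.add_function("onCreate", ControlFlowGraph(3, [(0, 1), (0, 2)]))
app.add_function("sendData", ControlFlowGraph(2, [(0, 1)]))
app.add_call("onCreate", "sendData")
```

Keeping the CFG as an attribute of each FCG node, rather than merging everything into one flat graph, is what lets a model treat function internals and call structure as separate levels.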

Figure 2: Overview of the HiGraph construction pipeline, illustrating dataset collection, labeling, and graph extraction processes.

Empirical Analysis

The empirical analysis of HiGraph reveals distinct structural characteristics in benign and malicious samples. Malware often exhibits higher PageRank values and higher cyclomatic complexity, indicating a more centralized and logically complex architecture across the FCG and CFG levels. These structural differences can guide the design of detection models that exploit them.
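For readers unfamiliar with the two metrics, the sketch below computes them on toy graphs. It uses the standard definitions (McCabe's cyclomatic complexity M = E - N + 2P and power-iteration PageRank with a damping factor of 0.85); the paper's exact computation settings are not given in this summary, so the parameter values and the toy graphs here are assumptions.

```python
def cyclomatic_complexity(num_nodes, edges, components=1):
    """McCabe's metric M = E - N + 2P for a control flow graph."""
    return len(edges) - num_nodes + 2 * components


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph; returns node -> score."""
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for dst in out_links[v]:
                new_rank[dst] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Toy CFG of a function with a single if/else: 4 blocks, 4 edges.
cfg_complexity = cyclomatic_complexity(4, [(0, 1), (0, 2), (1, 3), (2, 3)])

# Toy FCG in which two functions both call a shared helper "c".
fcg_rank = pagerank(["a", "b", "c"], [("a", "c"), ("b", "c")])
most_central = max(fcg_rank, key=fcg_rank.get)
```

Here the if/else function has complexity 2 (one decision point), and the shared helper `c` receives the highest PageRank because every other function calls it.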

Temporal analysis traces the evolution of malware families over the decade and shows a rapid emergence of new families, peaking around 2019-2020. Because HiGraph's samples span this whole period, the dataset is well suited to studying concept drift in malware detection.
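A common way to study concept drift with a time-stamped dataset is to train on earlier samples and evaluate on later ones. The sketch below shows such a split; it is a generic illustration rather than the paper's evaluation protocol, and the record fields (`app`, `year`, `label`) and the cutoff year are hypothetical.

```python
def temporal_split(samples, cutoff_year):
    """Return (train, test): train strictly before cutoff_year, test from it on."""
    train = [s for s in samples if s["year"] < cutoff_year]
    test = [s for s in samples if s["year"] >= cutoff_year]
    return train, test


# Hypothetical sample records; the real dataset schema may differ.
samples = [
    {"app": "a", "year": 2013, "label": "benign"},
    {"app": "b", "year": 2016, "label": "malicious"},
    {"app": "c", "year": 2019, "label": "malicious"},
    {"app": "d", "year": 2021, "label": "benign"},
]
train, test = temporal_split(samples, cutoff_year=2018)
```

Splitting by time instead of at random prevents a detector from being tested on families it could only have seen by looking into the future.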

Figure 3: Monthly distribution of top malware families, highlighting dynamic trends in prevalence over time.

Evaluating HiGraph for Malware Detection

The hierarchical structure of HiGraph is well suited to malware detection with graph neural networks (GNNs). Hi-GNN, a proposed model that exploits this hierarchy, is reported to significantly outperform traditional GNNs on both binary and multi-class classification tasks. Its architecture integrates separate CFG and FCG encoders, so it captures both local control logic and global structural features.
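The two-encoder idea can be illustrated with a deliberately simplified, non-learned stand-in. This is not the Hi-GNN architecture, whose layers are not described in this summary: the sketch assumes mean pooling as the CFG encoder and a single round of neighbor averaging as the FCG encoder, and all function names and feature values are hypothetical.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode_cfg(block_features):
    """Stand-in CFG encoder: pool basic-block features into one function vector."""
    return mean_pool(block_features)


def encode_fcg(function_vectors, calls):
    """Stand-in FCG encoder: one round of neighbor averaging, then graph pooling."""
    neighbors = {name: [name] for name in function_vectors}  # each node includes itself
    for caller, callee in calls:
        neighbors[caller].append(callee)
        neighbors[callee].append(caller)
    updated = {
        name: mean_pool([function_vectors[m] for m in members])
        for name, members in neighbors.items()
    }
    return mean_pool(list(updated.values()))


def encode_app(cfg_features, calls):
    """Encode every function's CFG first, then combine them over the call graph."""
    function_vectors = {name: encode_cfg(blocks) for name, blocks in cfg_features.items()}
    return encode_fcg(function_vectors, calls)


# Hypothetical app: function "f" has two basic blocks, "g" has one; "f" calls "g".
cfg_features = {
    "f": [[1.0, 0.0], [3.0, 0.0]],
    "g": [[0.0, 4.0]],
}
embedding = encode_app(cfg_features, [("f", "g")])
```

In a trained model, both stand-ins would be replaced by learned GNN layers, but the data flow is the same: block features become function embeddings, which then become node features of the call graph.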

Figure 4: Correlation between FCG average PageRank and CFG cyclomatic complexity for malware applications, emphasizing central and complex function nodes.

Conclusion

HiGraph opens a new avenue for malware analysis grounded in hierarchical graph learning. By capturing both intra- and inter-procedural semantics, it supports the development and evaluation of detection models that are resilient to temporal change and obfuscation. With open access and comprehensive documentation, HiGraph gives the community a resource for building and benchmarking robust detectors.
