Malware Classification based on Call Graph Clustering (1008.4365v1)

Published 25 Aug 2010 in cs.CR

Abstract: Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, and enable the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DBSCAN. Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.

Summary

The paper classifies malware by representing samples as call graphs and applying clustering, finding DBSCAN effective for grouping by structural similarity.
Experiments demonstrate the method accurately identifies malware families based on structural patterns, aiding automated classification and generic signatures.
The approach offers a framework for assessing software similarity that may apply beyond malware, with potential for integration into live sample streams and automated protection.

Malware Classification Based on Call Graph Clustering

The paper by Joris Kinable and Orestis Kostakis introduces an approach to malware classification based on call graph clustering, aimed at automatic and robust detection as the number of malicious software samples grows rapidly. Samples are abstracted into call graphs, which allows structural similarities between different samples to be detected.

Key Elements of the Approach

The core premise is to represent malware samples as call graphs, in which the functions of the executable are vertices and function calls are directed edges. This abstraction removes certain variations between samples and exposes structural patterns that are useful for clustering and detection. Samples can then be grouped not merely by surface signatures but by deeper structural resemblance.
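The representation described above can be sketched as a small directed-graph structure. This is a minimal illustration, not the authors' implementation: the `CallGraph` class, its method names, and the sample function names (which only mimic disassembler-style labels) are all assumptions made for the example.

```python
from collections import defaultdict


class CallGraph:
    """Directed graph: vertices are function names, edges are caller -> callee."""

    def __init__(self):
        self.vertices = set()
        self.successors = defaultdict(set)

    def add_call(self, caller, callee):
        # Register both endpoints and the directed edge between them.
        self.vertices.update((caller, callee))
        self.successors[caller].add(callee)

    def edges(self):
        # Flatten the adjacency sets into a set of (caller, callee) pairs.
        return {(u, v) for u, vs in self.successors.items() for v in vs}


# Hypothetical sample: two local functions calling three external ones.
g = CallGraph()
g.add_call("start", "sub_401000")
g.add_call("sub_401000", "CreateFileA")
g.add_call("sub_401000", "WriteFile")
g.add_call("start", "ExitProcess")

print(len(g.vertices), "functions,", len(g.edges()), "calls")
```

In a real pipeline the calls would come from a disassembler's output rather than being added by hand.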

Call Graph Representation: Malware samples are transformed into directed graphs through static analysis, with vertices for both local functions (defined in the binary itself) and external functions (imported system or library calls). This process relies on disassembly tools, such as IDA Pro, and must contend with the obfuscation techniques that malware may employ.
Graph Matching and Similarity: The paper defines graph similarity through graph edit distance (GED), the minimum total cost of the edit operations required to transform one graph into another. Because computing GED exactly is intractable in general, the paper approximates it with a variant of Simulated Annealing adapted to call graphs, which is reported to be faster and more accurate than implementations based on Munkres' algorithm.
Clustering Algorithms: The research explores several clustering techniques, notably $k$-medoids and DBSCAN, to group similar malware samples. The analysis finds that while $k$-medoids recovered clusters corresponding to manually labeled families reasonably well, the density-based approach of DBSCAN was more effective, particularly at identifying dense, well-separated clusters of samples.
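To make the similarity step concrete, the sketch below computes the edit cost induced by one given vertex matching and turns it into a score in [0, 1]. It is an illustration under stated assumptions: every edit operation costs 1, vertex names serve as labels, and the cost is normalized by the total number of vertices and edges in both graphs. The function names are hypothetical. The cost of any single matching is only an upper bound on GED; searching for a matching that minimizes it (the part the summary attributes to Simulated Annealing) is not shown.

```python
def matching_cost(g_vertices, g_edges, h_vertices, h_edges, matching):
    """Edit cost induced by a partial, injective vertex matching from G to H."""
    matched_g = set(matching)
    matched_h = set(matching.values())
    # Unmatched vertices are deleted from G or inserted into H.
    vertex_cost = len(g_vertices - matched_g) + len(h_vertices - matched_h)
    # Matched vertices whose names differ need a relabel (assumption: name == label).
    relabel_cost = sum(1 for u, v in matching.items() if u != v)
    # Map G's edges into H; an edge with an unmatched endpoint cannot be preserved.
    mapped = {(matching[u], matching[v]) for u, v in g_edges
              if u in matching and v in matching}
    preserved = mapped & h_edges
    # Every non-preserved edge is either deleted from G or inserted into H.
    edge_cost = (len(g_edges) - len(preserved)) + (len(h_edges) - len(preserved))
    return vertex_cost + relabel_cost + edge_cost


def similarity(g_vertices, g_edges, h_vertices, h_edges, matching):
    """Normalized score: 1.0 for identical graphs under the matching, lower otherwise."""
    size = len(g_vertices) + len(g_edges) + len(h_vertices) + len(h_edges)
    cost = matching_cost(g_vertices, g_edges, h_vertices, h_edges, matching)
    return 1.0 - cost / size


# Hypothetical toy graphs that differ in one vertex label and one edge.
G_V, G_E = {"a", "b", "c"}, {("a", "b"), ("b", "c")}
H_V, H_E = {"a", "b", "d"}, {("a", "b"), ("a", "d")}
phi = {"a": "a", "b": "b", "c": "d"}

print(matching_cost(G_V, G_E, H_V, H_E, phi))
print(similarity(G_V, G_E, H_V, H_E, phi))
```

Here the matching preserves the edge a→b, relabels c as d, and replaces the edge b→c with a→d, so the cost counts one relabel, one edge deletion, and one edge insertion.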

Experimental Results and Implications

Experiments on collections of real-world malware indicate that call graph clustering can discern malware families accurately. The $k$-medoids approach had difficulty with determining the optimal number of clusters and with cluster cohesion, whereas DBSCAN gave more reliable classifications by focusing on sample density, identifying and isolating the significant clusters.
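The density-based behaviour described above can be illustrated with a compact DBSCAN over a precomputed distance matrix. This is a textbook-style sketch rather than the paper's code: the distance values, the `eps` and `min_pts` settings, and the `families` labels are invented for the example, and taking distance as one minus the similarity score is an assumption.

```python
def dbscan(dist, eps, min_pts):
    """Return a cluster id (0, 1, ...) per point, or -1 for noise."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def neighbours(i):
        # Includes i itself, because dist[i][i] == 0.
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:  # only core points expand the cluster
                queue.extend(reach)
    return labels


# Hypothetical data: two tight families of three samples each, plus one outlier.
families = [0, 0, 0, 1, 1, 1, 2]
n = len(families)
dist = [[0.0 if i == j else (0.1 if families[i] == families[j] else 0.8)
         for j in range(n)] for i in range(n)]

labels = dbscan(dist, eps=0.2, min_pts=3)
print(labels)
```

Unlike $k$-medoids, this procedure does not take the number of clusters as input, and the isolated sample is reported as noise instead of being forced into a family.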

The paper’s implications are both theoretical and practical. Theoretically, it presents a framework for assessing similarity between programs that may be applicable beyond malware detection. Practically, the proposed automation helps classify new samples against existing families, enabling more generic and proactive malware detection. This could make malware analysis and the creation of generic signatures more efficient, reducing the reliance on manual pattern recognition by human analysts.

Future Directions

The paper points to further development in several directions, such as integrating the methodology into systems that process a live stream of malware samples for real-time classification. Automating the generation of protection against newly recognized malware families is another promising direction.

In summary, by treating malware classification as a graph comparison and clustering problem, this research contributes to work on automated and precise malware detection. Future research could refine call graph analysis techniques and explore larger-scale integration and real-time applications of these findings.
