
Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features (1508.03096v2)

Published 13 Aug 2015 in cs.CR

Abstract: Malware remains a serious problem for corporations, government agencies, and individuals, as attackers continue to use it as a tool to effect frequent and costly network intrusions. Machine learning holds the promise of automating the work required to detect newly discovered malware families, and could potentially learn generalizations about malware and benign software that support the detection of entirely new, unknown malware families. Unfortunately, few proposed machine learning based malware detection methods have achieved the low false positive rates required to deliver deployable detectors. In this paper we introduce a deep neural network malware classifier that achieves a usable detection rate at an extremely low false positive rate and scales to real world training example volumes on commodity hardware. Specifically, we show that our system achieves a 95% detection rate at 0.1% false positive rate (FPR), based on more than 400,000 software binaries sourced directly from our customers and internal malware databases. We achieve these results by directly learning on all binaries, without any filtering, unpacking, or manually separating binary files into categories. Further, we confirm our false positive rates directly on a live stream of files coming in from Invincea's deployed endpoint solution, provide an estimate of how many new binary files we expected to see a day on an enterprise network, and describe how that relates to the false positive rate and translates into an intuitive threat score. Our results demonstrate that it is now feasible to quickly train and deploy a low resource, highly accurate machine learning classification model, with false positive rates that approach traditional labor intensive signature based methods, while also detecting previously unseen malware.

Deep Neural Network-Based Malware Detection Using Two-Dimensional Binary Program Features

The paper "Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features" presents a novel approach to malware detection leveraging deep learning techniques. Authored by Joshua Saxe and Konstantin Berlin, the paper addresses the persistent challenge of malware detection in computer security by proposing a method that combines static feature extraction with deep neural networks, achieving high accuracy with low false positive rates.

Summary of Methodology

The authors introduce a classification framework comprising three main components: feature extraction, a deep neural network classifier, and a score calibrator. The feature extraction involves constructing a 1024-dimensional vector from four types of features derived from binary files. These features include:

  • Byte/Entropy Histogram Features: Utilizes a two-dimensional histogram based on byte occurrences and entropy levels.
  • PE Import Features: Considers the import address table of binaries.
  • PE Metadata Features: Extracts numerical fields from the PE files of Windows binaries.
  • String Features: Analyzes textual strings within the binaries.
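The byte/entropy histogram is the least familiar of the four feature types, so a minimal sketch may help. It slides a window over the raw bytes, computes the entropy of each window, and counts (window entropy, byte value) pairs in a two-dimensional histogram. The function name, the 1024-byte window, the 256-byte step, and the 16 x 16 binning are assumptions chosen for illustration; they are not values stated in this summary.

```python
import numpy as np

def byte_entropy_histogram(data: bytes, window: int = 1024,
                           step: int = 256, bins: int = 16) -> np.ndarray:
    """Flattened bins x bins histogram of (window entropy, byte value) pairs.

    Window size, step, and bin count are illustrative assumptions.
    """
    buf = np.frombuffer(data, dtype=np.uint8)
    hist = np.zeros((bins, bins), dtype=np.float64)
    if buf.size == 0:
        return hist.ravel()
    # Slide a fixed-size window over the file; a short file yields one window.
    last_start = max(buf.size - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = buf[start:start + window]
        counts = np.bincount(chunk, minlength=256)
        probs = counts[counts > 0] / chunk.size
        entropy = -np.sum(probs * np.log2(probs))  # bits per byte, in [0, 8]
        # Entropy selects the row; each byte value in the window selects a column.
        row = min(int(entropy / 8.0 * bins), bins - 1)
        cols = chunk.astype(np.int64) * bins // 256
        hist[row] += np.bincount(cols, minlength=bins)
    # Normalize so files of different lengths are comparable.
    return (hist / hist.sum()).ravel()
```

With these assumed settings the result has 256 entries, so a file of repeated zero bytes puts all of its mass in the lowest-entropy, lowest-byte-value bin, while packed or encrypted regions push mass toward the high-entropy rows.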

The neural network has four layers and uses dropout to counter overfitting and PReLU activations to speed up convergence. The threat score is computed through Bayesian calibration, which converts the raw network output into a realistic probability that a file is malware by combining the classifier's scores with the empirical rate at which malware is expected on the monitored network.
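A minimal NumPy sketch of the forward pass of such a network follows. It reads "four layers" as an input layer, two hidden layers, and a single sigmoid output unit; that reading, the hidden widths, the 0.5 dropout rate, the fixed PReLU slope, and all function names are assumptions made for illustration rather than details given in this summary.

```python
import numpy as np

def prelu(x, alpha):
    # PReLU: identity for positive inputs, slope alpha for negative ones.
    # alpha is normally learned; here it is a fixed scalar for simplicity.
    return np.where(x > 0, x, alpha * x)

def dropout(x, rate, rng, training):
    # Inverted dropout: zero units at random in training, rescale the rest.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def init_params(rng, sizes=(1024, 1024, 1024, 1), alpha=0.25):
    # sizes = (input, hidden, hidden, output); the widths are assumptions.
    hidden = []
    for n_in, n_out in zip(sizes[:-2], sizes[1:-1]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        hidden.append((W, np.zeros(n_out), alpha))
    W_out = rng.normal(0.0, np.sqrt(1.0 / sizes[-2]),
                       size=(sizes[-2], sizes[-1]))
    return {"hidden": hidden, "out": (W_out, np.zeros(sizes[-1]))}

def forward(x, params, rng, training=False, rate=0.5):
    # Hidden layers: affine map, PReLU, then dropout (active only in training).
    h = x
    for W, b, alpha in params["hidden"]:
        h = dropout(prelu(h @ W + b, alpha), rate, rng, training)
    W_out, b_out = params["out"]
    # Sigmoid output: an uncalibrated score in (0, 1) per input row.
    return 1.0 / (1.0 + np.exp(-(h @ W_out + b_out)))
```

The output of `forward` is a raw score, not yet the calibrated threat score described above; calibration is a separate step applied afterwards.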

Evaluation and Results

The framework was evaluated on a dataset of over 400,000 binaries sourced from Invincea's customers and internal malware databases. The results show a 95% detection rate at a false positive rate of 0.1%. The system also scales to this training volume on commodity hardware.
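To make the "detection rate at a fixed false positive rate" figure concrete, the sketch below picks a score threshold from benign scores, measures the detection rate at that threshold, and then does two pieces of arithmetic: the expected number of false alarms per day and a single-threshold Bayes posterior. All function names are hypothetical, and the daily file volume and malware prior used in the usage note are made-up inputs, not figures from the paper. The posterior is a deliberately simplified stand-in for the Bayesian calibration mentioned in the methodology.

```python
import numpy as np

def threshold_at_fpr(benign_scores, target_fpr):
    # Choose a threshold so that at most target_fpr of benign scores exceed it.
    s = np.sort(np.asarray(benign_scores, dtype=np.float64))
    allowed = min(int(np.floor(target_fpr * s.size)), s.size - 1)
    # Files are flagged when score > threshold, so use the (allowed+1)-th largest.
    return s[s.size - allowed - 1]

def rate_above(scores, threshold):
    # Fraction of scores flagged at this threshold (TPR on malware, FPR on benign).
    return float(np.mean(np.asarray(scores) > threshold))

def expected_false_alarms(fpr, new_benign_files_per_day):
    # Expected daily false positives for a given volume of new benign files.
    return fpr * new_benign_files_per_day

def posterior_malware(tpr, fpr, prior):
    # Bayes' rule: P(malware | flagged) from one operating point and a prior.
    flagged = tpr * prior + fpr * (1.0 - prior)
    return tpr * prior / flagged
```

For example, `expected_false_alarms(0.001, 1000)` and `posterior_malware(0.95, 0.001, 0.01)` show how the reported operating point would play out for a hypothetical network that sees 1,000 new benign files per day and where 1% of new files are malicious.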

The authors also conducted experiments comparing various feature sets independently and in combination, clearly demonstrating that combining all feature sets enhances performance significantly. The cross-validated model not only outperformed individual feature sets but also maintained robustness under time-split tests, illustrating its capacity to generalize across newly emerged malware.
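The difference between cross-validation and a time-split test is easy to miss, so here is a minimal sketch of the latter. Training uses only samples seen before a cutoff date and testing uses only samples seen on or after it, so the test set imitates malware that did not exist at training time. The `first_seen` field, the function name, and the dates in the usage note are hypothetical; the summary does not say which timestamp or cutoff the authors used.

```python
from datetime import date

def time_split(samples, cutoff):
    # Train on the past, test on the future: no test sample predates the cutoff.
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test
```

Calling `time_split(samples, date(2020, 1, 1))` on a list of dicts with a `first_seen` date returns the two partitions; unlike a random cross-validation fold, the test partition never contains files older than anything in the training partition.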

Implications and Future Work

The findings imply substantial potential for deployment in real-world applications, offering an efficient and scalable solution that rivals traditional signature-based methods while providing the flexibility and adaptability of machine learning models. The authors suggest continuous updates with new data to maintain the classifier's accuracy, leveraging neural networks' capabilities for incremental learning.

The framework's success underlines the feasibility of integrating machine learning models into layered network defense strategies. As computational resources and data availability increase, these models could further adapt and improve over time, bolstering defense mechanisms against evolving cyber threats.

In conclusion, this paper contributes a significant advancement in the automated detection of malware, setting a foundation for future exploration and refinement within the domain of AI-driven cybersecurity solutions.

Authors (2)
  1. Joshua Saxe (15 papers)
  2. Konstantin Berlin (12 papers)
Citations (601)