
Automated software vulnerability detection with machine learning

Published 14 Feb 2018 in cs.SE, cs.LG, and stat.ML (arXiv:1803.04497v2)

Abstract: Thousands of security vulnerabilities are discovered in production software each year, either reported publicly to the Common Vulnerabilities and Exposures database or discovered internally in proprietary code. Vulnerabilities often manifest themselves in subtle ways that are not obvious to code reviewers or the developers themselves. With the wealth of open source code available for analysis, there is an opportunity to learn the patterns of bugs that can lead to security vulnerabilities directly from data. In this paper, we present a data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs. We first compile a large dataset of hundreds of thousands of open-source functions labeled with the outputs of a static analyzer. We then compare methods applied directly to source code with methods applied to artifacts extracted from the build process, finding that source-based models perform better. We also compare the application of deep neural network models with more traditional models such as random forests and find the best performance comes from combining features learned by deep models with tree-based models. Ultimately, our highest performing model achieves an area under the precision-recall curve of 0.49 and an area under the ROC curve of 0.87.

Citations (153)

Summary

  • The paper introduces a hybrid approach that integrates source-based and build-based feature extraction to detect vulnerabilities in C/C++ programs.
  • It demonstrates that models combining lexical features (e.g., TextCNN with word2vec) and structural insights yield superior precision-recall performance.
  • The combined model approach effectively synthesizes control flow graphs with token embeddings to enhance vulnerability detection accuracy.

Automated Software Vulnerability Detection with Machine Learning

This paper presents a robust approach to leveraging machine learning techniques for automated detection of software vulnerabilities, particularly within C and C++ programs. The approach integrates source-based and build-based feature extraction to optimize the detection of potential security flaws at the function level.

Introduction

Software vulnerabilities are persistent challenges in sustaining secure systems, often exploited to cause significant disruption and damage. As traditional static and dynamic analysis tools provide limited rule-based detection, the shift toward data-driven methodologies offers a broader scope for identifying vulnerabilities by learning from extensive open-source datasets. This paper introduces a machine learning framework that capitalizes on the available open-source C/C++ code to discern vulnerability patterns beyond the capabilities of static analysis tools.

Methodology

Feature Extraction

Build-Based Features: Utilizing LLVM and Clang, the build-based approach extracts features from the program's intermediate representation (IR) during compilation. These include control flow graphs (CFGs), opcode vectors, and use-def matrices, which capture operational details and data flows conducive to vulnerability detection. The CFG representation provides a macro-level view of possible execution flows, while opcode vectors reflect micro-level operations (Figure 1).

Figure 1: A control flow graph extracted from a simple example code snippet.
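To make the opcode-vector idea concrete, here is a minimal sketch that counts opcode occurrences in textual LLVM IR. The parsing heuristic and the IR snippet are illustrative assumptions, not the paper's actual Clang-based pipeline, which operates on the compiler's in-memory IR.

```python
from collections import Counter

def opcode_vector(ir_text, vocab):
    """Count opcodes in textual LLVM IR using a rough heuristic:
    the opcode is the word after '=' on instruction lines that produce
    a value, or the first word otherwise. Labels and 'define' lines
    are skipped."""
    counts = Counter()
    for line in ir_text.splitlines():
        line = line.strip()
        if not line or line.endswith(':') or line.startswith(('define', '}', ';')):
            continue
        parts = line.split()
        op = parts[parts.index('=') + 1] if '=' in parts else parts[0]
        counts[op] += 1
    return [counts.get(op, 0) for op in vocab]

ir = """
define i32 @f(i32 %x) {
entry:
  %a = add i32 %x, 1
  %c = icmp sgt i32 %a, 0
  br i1 %c, label %then, label %end
then:
  ret i32 %a
end:
  ret i32 0
}
"""
vocab = ['add', 'icmp', 'br', 'ret']
vec = opcode_vector(ir, vocab)
print(vec)  # -> [1, 1, 1, 2]
```

A fixed vocabulary of opcodes yields fixed-length vectors, which is what lets these features feed a conventional classifier such as a random forest.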

Source-Based Features: A custom lexer processes the source code into tokens, categorized as literals, operators, names, and so on. The paper employs bag-of-words and word2vec embeddings to convert these tokens into structured inputs for machine learning models, so that semantic relationships between tokens can be learned (Figure 2).

Figure 2: An example illustrating the lexing process.

Figure 3: A word2vec embedding of tokens from C/C++ source code.
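The lexing-plus-bag-of-words step can be sketched as follows. The token categories, regex, and keyword list here are illustrative assumptions rather than the authors' exact lexer; the key idea is that identifiers and literals are collapsed into generic tokens so models learn code structure rather than naming.

```python
import re

# Illustrative token handling: keywords kept verbatim, identifiers and
# number literals collapsed, operators/punctuation passed through.
KEYWORDS = {'if', 'else', 'for', 'while', 'return', 'int', 'char'}
TOKEN_RE = re.compile(r'\d+|\w+|[^\s\w]')

def lex(code):
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            tokens.append(tok)        # keywords kept verbatim
        elif tok.isdigit():
            tokens.append('<num>')    # numeric literals collapsed
        elif tok[0].isalpha() or tok[0] == '_':
            tokens.append('<name>')   # identifiers anonymized
        else:
            tokens.append(tok)        # operators and punctuation
    return tokens

def bag_of_words(tokens, vocab):
    """Fixed-length count vector over a token vocabulary."""
    return [tokens.count(w) for w in vocab]

code = "int f(int x) { if (x > 0) return x; return 0; }"
toks = lex(code)
print(toks)
print(bag_of_words(toks, ['int', 'return', '<name>', '<num>']))
```

A word2vec model would instead be trained on these token sequences so that each token maps to a dense vector, the representation consumed by the TextCNN described below.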

Model Implementation

The evaluation centers on three source-based models: bag-of-words features with an extremely randomized trees (extra-trees) classifier, a TextCNN operating on word2vec embeddings, and a hybrid model that feeds TextCNN-derived features into an extra-trees classifier, which achieves the best performance. For build-based features, handcrafted vectors from CFGs and opcodes feed a random forest model. Finally, a combined model fuses both source- and build-based features (Figure 4).

Figure 4: Illustration of the three types of source-based models tested.
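The hybrid model's core move, taking features learned by a neural network and classifying them with a tree ensemble, can be sketched with scikit-learn. To keep the example self-contained, the TextCNN's penultimate-layer outputs are replaced by synthetic 128-dimensional vectors (two overlapping Gaussian clusters); in the paper these would come from the trained CNN.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for the TextCNN's penultimate-layer features: one vector per
# function, synthetic here so the example runs without a deep-learning
# framework. Positive examples are shifted so there is signal to learn.
n, d = 1000, 128
y = rng.integers(0, 2, n)
cnn_features = rng.normal(0.0, 1.0, (n, d)) + y[:, None] * 0.5

X_tr, X_te, y_tr, y_te = train_test_split(
    cnn_features, y, random_state=0)

# Extra-trees classifier on the learned features, as in the hybrid model.
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

The design intuition is that the CNN learns a useful representation of token sequences while the tree ensemble supplies a robust, well-calibrated decision boundary on top of it.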

Experimental Results

The experiments use two primary datasets, drawn from Debian packages and GitHub repositories, with duplicates removed across training, validation, and test sets. Source-based models, with their richer lexical input, outperform build-based models, and the hybrid TextCNN plus extra-trees model achieves the highest area under the precision-recall curve (PR AUC) (Figure 5).

Figure 5: Results of testing different source-based models on a combined Debian+Github source code dataset.
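The paper reports both PR AUC (0.49) and ROC AUC (0.87) because vulnerability data is heavily imbalanced, and the two metrics can diverge sharply in that regime. The following sketch, on synthetic detector scores rather than the paper's data, illustrates the gap using standard scikit-learn metrics.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, imbalanced labels (~5% positive, like vulnerability data)
# and noisy detector scores that are shifted upward for positives.
y_true = (rng.random(10000) < 0.05).astype(int)
scores = y_true * 0.6 + rng.normal(0.0, 0.5, 10000)

# average_precision_score summarizes the precision-recall curve;
# roc_auc_score summarizes the ROC curve.
pr_auc = average_precision_score(y_true, scores)
roc_auc = roc_auc_score(y_true, scores)
print(f"PR AUC = {pr_auc:.2f}, ROC AUC = {roc_auc:.2f}")
```

With few positives, even a modest false-positive rate floods the precision denominator, so PR AUC sits well below ROC AUC, which is why reporting both gives a more honest picture of a vulnerability detector.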

Build-based models achieve lower, though still substantial, AUC values, reinforcing the conclusion that features drawn directly from source code offer richer context and information for vulnerability detection.

Comparative Analysis

To validate the benefit of combining methodologies, the paper also presents consolidated testing on identical datasets, showing that the combined model couples build-based positional and structural insight with source-based lexical information to achieve the best detection performance (Figure 6).

Figure 6: Comparison of build only, source only, and combined models on the Github dataset.
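Mechanically, combining the two views amounts to concatenating each function's source-based and build-based feature vectors before classification. The dimensions and synthetic data below are illustrative assumptions, not the paper's actual feature sizes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical per-function feature blocks: lexical (source-based)
# and structural (build-based), both carrying some label signal.
n = 500
y = rng.integers(0, 2, n)
source_feats = rng.normal(0.0, 1.0, (n, 64)) + y[:, None] * 0.4
build_feats = rng.normal(0.0, 1.0, (n, 32)) + y[:, None] * 0.4

# Concatenate the two views so one tree-based model can draw on both.
X = np.hstack([source_feats, build_feats])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("combined feature shape:", X.shape)
```

Because tree ensembles handle heterogeneous features without rescaling, simple concatenation is a natural way to merge lexical counts with structural CFG statistics.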

Conclusions

This research delineates a clear path toward enhancing software vulnerability detection through machine learning by integrating language-level source features with build-level IR insights. While the current results are framed by static-analysis labels, future work should diversify the training labels, for example by incorporating real-world exploits, to enrich model learning. This direction promises faster code reviews and more efficient vulnerability management, augmenting conventional analysis tools with predictive accuracy and breadth. Additionally, extending the models to other programming languages holds potential for broader impact.

Overall, this paper lays a foundational framework for using machine learning to predict vulnerability presence in software, underscoring the value of hybrid model approaches for comprehensive analysis.
