SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities (1807.06756v3)

Published 18 Jul 2018 in cs.LG, cs.AI, cs.CR, and stat.ML

Abstract: The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that has yet to be tackled, as manifested by the many vulnerabilities reported on a daily basis. This calls for machine learning methods for vulnerability detection. Deep learning is attractive for this purpose because it alleviates the requirement to manually define features. Despite the tremendous success of deep learning in other application domains, its applicability to vulnerability detection is not systematically understood. In order to fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities in C/C++ programs with source code. The framework, dubbed Syntax-based, Semantics-based, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate syntax and semantic information pertinent to vulnerabilities. Our experiments with 4 software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, 7 are unknown and have been reported to the vendors, and the other 8 have been "silently" patched by the vendors when releasing newer versions of the pertinent software products.



In the paper "SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities," the authors present a systematic framework that uses deep learning to identify vulnerabilities in C/C++ source code. The framework, named SySeVR, represents program code in a way that preserves the syntax and semantic information relevant to vulnerability detection.

Framework Overview

SySeVR addresses a gap noted in the abstract: the applicability of deep learning to vulnerability detection is not systematically understood. The framework aims to represent programs so that models can generalize across many kinds of vulnerabilities. To do so, it introduces Syntax-based Vulnerability Candidates (SyVCs) and Semantics-based Vulnerability Candidates (SeVCs), which together capture the syntactic and semantic information that neural networks learn from.

Key Components and Methodology

SyVCs Extraction: The authors identify four main types of syntax characteristics, namely Library/API Function Call (FC), Array Usage (AU), Pointer Usage (PU), and Arithmetic Expression (AE), drawing on existing tools and vulnerability databases. Code elements that match these characteristics are extracted automatically from Abstract Syntax Trees (ASTs) and serve as candidate locations of vulnerabilities.
SyVC to SeVC Transformation: This step enriches each SyVC with semantic information by adding the statements related to it through data and control dependencies, which are captured using Program Dependency Graphs (PDGs). The resulting SeVC keeps the relevant code together with the surrounding statements that affect it or are affected by it.
Vector Representation and Model Training: SeVCs are encoded into vector representations using word2vec, facilitating the training of different types of deep learning models, including CNNs, RNNs, and specifically bidirectional RNNs such as BLSTM and BGRU. These models leverage the comprehensive information encapsulated in SeVCs to detect vulnerabilities more effectively than traditional methods.
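
The SyVC-to-SeVC step can be illustrated with a minimal Python sketch. It assumes a toy five-statement program and hand-written dependence edges; the names STATEMENTS, DATA_DEPS, CONTROL_DEPS, and slice_statements are hypothetical and are not taken from the paper's implementation. Starting from the statement that contains a SyVC (here, a library function call), it collects every statement connected to it by data or control dependencies, in both directions.

```python
from collections import deque

# Toy program: statement id -> source text (illustrative, not from the paper).
STATEMENTS = {
    1: "char buf[8];",
    2: "int n = read_len();",
    3: "if (n > 0)",
    4: "    strncpy(buf, src, n);",
    5: "log_done();",
}

# Hand-written dependence edges (source statement -> dependent statements).
# A real pipeline would obtain these from a program dependence graph builder.
DATA_DEPS = {1: [4], 2: [3, 4]}
CONTROL_DEPS = {3: [4]}


def _reach(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def _invert(edges):
    """Reverse every edge so that dependents point back to their sources."""
    inverted = {}
    for src, dsts in edges.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return inverted


def slice_statements(syvc_stmt, use_control=True):
    """Collect statements that affect, or are affected by, `syvc_stmt`."""
    graphs = [DATA_DEPS, CONTROL_DEPS] if use_control else [DATA_DEPS]
    edges = {}
    for graph in graphs:
        for src, dsts in graph.items():
            edges.setdefault(src, []).extend(dsts)
    forward = _reach(syvc_stmt, edges)             # statements affected by the SyVC
    backward = _reach(syvc_stmt, _invert(edges))   # statements the SyVC depends on
    return sorted(forward | backward)


if __name__ == "__main__":
    for stmt_id in slice_statements(4):
        print(stmt_id, STATEMENTS[stmt_id])
```

With these toy edges, slice_statements(4) returns [1, 2, 3, 4], while slice_statements(4, use_control=False) returns [1, 2, 4]. Dropping control dependencies loses the guarding if statement, which is the kind of context a SeVC is meant to preserve.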

Experimental Validation

The authors report extensive experiments on a newly constructed dataset of vulnerabilities collected from the National Vulnerability Database (NVD) and the Software Assurance Reference Dataset (SARD). The dataset covers 126 types of vulnerabilities, which span multiple syntax characteristic categories.

The results indicate that SySeVR-enabled models, especially those leveraging BGRU, significantly outperform existing state-of-the-art vulnerability detection mechanisms, including commercial tools like Checkmarx and open-source solutions such as Flawfinder and RATS. The detection accuracy benefits greatly from the enriched context provided by both data and control dependencies, as evidenced by the experiments showing reduced false-negative rates when both types of dependencies are considered.
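
Rates such as the false-negative rate mentioned above are computed from a confusion matrix. The sketch below uses the standard definitions of these metrics; the function name detection_metrics and the example counts are illustrative assumptions and are not results from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute common detection metrics from confusion-matrix counts.

    tp: vulnerable samples flagged as vulnerable
    fp: non-vulnerable samples flagged as vulnerable
    tn: non-vulnerable samples flagged as non-vulnerable
    fn: vulnerable samples flagged as non-vulnerable
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "fpr": fp / (fp + tn),        # false-positive rate
        "fnr": fn / (tp + fn),        # false-negative rate
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


if __name__ == "__main__":
    # Made-up counts, used only to show the arithmetic.
    for name, value in detection_metrics(tp=80, fp=10, tn=90, fn=20).items():
        print(f"{name}: {value:.3f}")
```

A lower false-negative rate means fewer vulnerable samples are missed, which is the improvement the summary attributes to using both data and control dependencies.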

Implications and Future Directions

The framework suggests that deep learning can be applied more broadly in software security, and it may inspire future models built on higher-level abstract representations of code. Because SySeVR works with different types of neural networks and can target various syntax and semantic features, it offers a foundation for further work on automated vulnerability detection.

The paper also identifies areas for enhancement, such as the need for even wider coverage of vulnerability types and improvement in the precision of vulnerability location prediction. Moreover, leveraging co-training or ensemble methods could address potential limitations in ground-truth label generation and enhance model robustness.

In conclusion, SySeVR advances the automated detection of software vulnerabilities by integrating syntax and semantic information into deep learning-based detection. Its experiments on 4 software products found 15 vulnerabilities that are not reported in the National Vulnerability Database, which illustrates the practical value of the approach.


Authors (6)

Zhen Li (334 papers)
Deqing Zou (12 papers)
Shouhuai Xu (65 papers)
Hai Jin (83 papers)
Yawei Zhu (2 papers)
Zhaoxuan Chen (2 papers)

Citations (439)
