- The paper presents an empirical user study evaluating the real-world effectiveness and limitations of an IDE-integrated, AI-powered tool (DeepVulGuard) for vulnerability detection and repair.
- Key findings indicate that high false-positive rates in real-world code, despite reasonable benchmark precision, undermine developer trust and the perceived usefulness of current AI detection methods in complex codebases.
- Practical application is hindered by AI-suggested fixes that lack code-specific context and by manual scan initiation that disrupts developer workflows, highlighting the need for better integration and customization.
An Empirical Study on AI-Powered Vulnerability Detection in the IDE
The paper "Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection and Repair in the IDE" presents an empirical investigation into the practical application of AI-based tools for identifying and fixing software vulnerabilities. This study is significant as it transitions the exploration of deep learning models from theoretical benchmark tests to real-world software development environments, offering insights into their practical utility and limitations.
Overview of DeepVulGuard
The researchers introduced DeepVulGuard, an integrated development environment (IDE) tool that employs state-of-the-art deep learning models, namely CodeBERT and GPT-4, to detect security vulnerabilities in source code and suggest fixes. Integrated into Visual Studio Code, the tool scans code for vulnerabilities, localizes faults, classifies vulnerability types, and supports developers with natural-language explanations, fix suggestions, and an interactive chat interface.
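The summary does not describe the extension's internals, but the scan, localize, classify, explain, and fix flow it reports can be sketched roughly as below. All names here (`Alert`, `scan_file`, the `detector` and `explainer` objects) are illustrative assumptions, not DeepVulGuard's actual API; the real tool wraps fine-tuned CodeBERT for detection and GPT-4 for explanations and fixes.

```python
# Hypothetical sketch of a scan -> localize -> classify -> explain -> fix
# pipeline like the one described for DeepVulGuard. Names are invented.
from dataclasses import dataclass

@dataclass
class Alert:
    file: str
    line: int          # localized fault line
    vuln_type: str     # e.g. "CWE-89: SQL Injection"
    confidence: float  # detection model score
    explanation: str   # natural-language rationale
    suggested_fix: str # candidate patched code

def scan_file(path: str, detector, explainer) -> list[Alert]:
    """Run the detection model, then ask the LLM to explain and propose a fix."""
    source = open(path, encoding="utf-8").read()
    alerts = []
    for finding in detector.detect(source):  # assumed CodeBERT-based detector
        if finding.score < 0.5:              # assumed confidence threshold
            continue
        alerts.append(Alert(
            file=path,
            line=finding.line,
            vuln_type=finding.cwe,
            confidence=finding.score,
            explanation=explainer.explain(source, finding),       # GPT-4-backed, assumed
            suggested_fix=explainer.propose_fix(source, finding), # GPT-4-backed, assumed
        ))
    return alerts
```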
Methodology
To assess the effectiveness of DeepVulGuard, a user study was conducted with 17 professional software developers at Microsoft. Participants applied the tool to 24 real-world projects, scanning over 1.7 million lines of code and producing 170 alerts and 50 fix suggestions. The study examined several dimensions of tool use: usefulness, speed, trust, relevance, and workflow integration. The researchers combined quantitative survey responses with qualitative interview feedback to build a comprehensive picture of the tool's performance in real-world scenarios.
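For a sense of scale, the reported figures imply roughly one alert per 10,000 lines scanned. The quick arithmetic below uses only the numbers above; the fix-to-alert ratio assumes each suggestion maps to a single alert, which the summary does not state explicitly.

```python
# Back-of-the-envelope scale of the study, using only the reported figures.
lines_scanned = 1_700_000
alerts = 170
fix_suggestions = 50

print(alerts / (lines_scanned / 1_000))  # 0.1 -> about one alert per 10 KLOC
print(fix_suggestions / alerts)          # ~0.29 -> fixes offered for ~29% of alerts
```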
Key Findings
Precision and False Positives
The study highlighted that while DeepVulGuard showed promise, its practical application was limited by a high rate of false positives. The models achieved 80% precision but only 32% recall on the SVEN benchmark dataset. In real-world settings the false-positive rate was higher still, because the analysis failed to incorporate context such as interprocedural behavior and environment-specific details. In other words, despite acceptable benchmark precision, the models often produced irrelevant or incorrect alerts, undermining developers' trust in the tool and its perceived usefulness.
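To make these numbers concrete: 80% precision means one in five raised alerts is spurious even on the benchmark, and 32% recall means roughly two thirds of true vulnerabilities go unflagged. The worked example below assumes a hypothetical benchmark slice of 100 true vulnerabilities to show the implied counts.

```python
# What 80% precision / 32% recall imply on an assumed slice of
# 100 true vulnerabilities (the count is illustrative only).
precision, recall = 0.80, 0.32
true_vulns = 100

true_positives  = recall * true_vulns            # 32 vulnerabilities correctly flagged
total_alerts    = true_positives / precision     # 40 alerts raised in total
false_positives = total_alerts - true_positives  # 8 spurious alerts
missed          = true_vulns - true_positives    # 68 vulnerabilities never flagged

print(true_positives, total_alerts, false_positives, missed)
```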
Fix Suggestions
For fix suggestions, participants rated only 25% as useful or minimally problematic. Many suggested fixes were not tailored to the developers' specific codebase, preventing straightforward application: developers often found them incompatible with existing project patterns, or found that they took an incorrect approach to fixing the vulnerability.
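As a hypothetical illustration of the mismatch participants described (the project helper and both snippets below are invented, not taken from the study): the AI-suggested fix is sound in isolation, but it bypasses the project's own query helper, so a maintainer could not apply it as-is.

```python
# Hypothetical project convention: all queries go through a shared helper.
def run_query(db, template: str, params: tuple):
    """Project-wide helper: central place for logging, retries, parameterization."""
    return db.execute(template, params)

# Vulnerable code flagged by the scanner (string-built SQL):
def get_user(db, name):
    return db.execute("SELECT * FROM users WHERE name = '" + name + "'")

# A generic AI-suggested fix: correct in isolation, but it parameterizes the
# raw db.execute call instead of routing through run_query, clashing with the
# project's conventions and inviting a rewrite by the maintainer anyway.
def get_user_fixed(db, name):
    return db.execute("SELECT * FROM users WHERE name = ?", (name,))
```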
Workflow Integration and Recommendations
A significant barrier to the tool's adoption was the need to initiate scans manually, which disrupted developers' workflows. Participants preferred a background-scanning mode that would run automatically alongside code editing or build processes. This feedback suggests that such automation could make the tool more usable and fit it more seamlessly into existing development practices.
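As a sketch of the background-scanning behavior participants asked for, the snippet below wires a file watcher to rescan source files whenever they change, using the third-party watchdog library. The `scan` callable is assumed to behave like the `scan_file` sketch above with its models already bound; this is one plausible wiring, not the authors' design.

```python
# Minimal background-scan sketch: rescan a source file whenever it is saved.
# Requires the third-party `watchdog` package; `scan` is a hypothetical
# callable mapping a file path to a list of alerts.
import time
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class RescanOnSave(FileSystemEventHandler):
    def __init__(self, scan):
        self.scan = scan

    def on_modified(self, event):
        # Re-run detection on any saved Python file and surface the alerts.
        if not event.is_directory and event.src_path.endswith(".py"):
            for alert in self.scan(event.src_path):
                print(f"{alert.file}:{alert.line} {alert.vuln_type}")

def watch(root: str, scan):
    observer = Observer()
    observer.schedule(RescanOnSave(scan), root, recursive=True)
    observer.start()  # scans now run in the background as files change
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```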
Implications and Future Directions
The study underscores the potential of AI-driven tools like DeepVulGuard to improve software security, while revealing critical areas for improvement. The high false-positive rate and the lack of contextually relevant fix suggestions point to the need for more robust detection mechanisms and deeper integration of contextual information into analysis models. Participants' interest in interactive features, such as chat interfaces for refining fixes, also highlights the importance of user-centered design in tool development.
From a broader perspective, the findings suggest that while AI can significantly enhance vulnerability detection and mitigation, practical deployment requires addressing the nuanced demands and workflows of software developers. Future research should focus on refining AI models to reduce false positives and improve precision across diverse codebases, and on generating fixes that are compatible with a wide range of programming environments and styles.
The availability of DeepVulGuard's code and data opens opportunities for further research and experimentation, fostering advancements in AI-driven security tools that can be both effective and practical for real-world software engineering.