
eyeballvul: a future-proof benchmark for vulnerability detection in the wild (2407.08708v2)

Published 11 Jul 2024 in cs.CR, cs.AI, and cs.LG

Abstract: Long contexts of recent LLMs have enabled a new use case: asking models to find security vulnerabilities in entire codebases. To evaluate model performance on this task, we introduce eyeballvul: a benchmark designed to test the vulnerability detection capabilities of LLMs at scale, that is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories. The benchmark consists of a list of revisions in different repositories, each associated with the list of known vulnerabilities present at that revision. An LLM-based scorer is used to compare the list of possible vulnerabilities returned by a model to the list of known vulnerabilities for each revision. As of July 2024, eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55GB in size.

An Evaluation of "eyeballvul: a future-proof benchmark for vulnerability detection in the wild"

Introduction

The paper "eyeballvul: a future-proof benchmark for vulnerability detection in the wild" introduces a benchmark designed to evaluate the effectiveness of LLMs in detecting software vulnerabilities at scale. This benchmark, called eyeballvul, sources and updates vulnerability data weekly from open-source repositories. It aims to fill the gap in evaluating model performance for large-scale security vulnerability detection in real-world codebases, leveraging LLMs.

Benchmark Design and Creation

Benchmark Attributes

The benchmark is notable for several key attributes:

  • Real-world vulnerabilities: It sources vulnerabilities directly from CVE records in open-source repositories.
  • Realistic detection settings: Unlike many traditional classification-based datasets, eyeballvul emulates a practical deployment scenario in which a model is asked to report vulnerabilities across an entire codebase.
  • Scale and diversity: The benchmark comprises over 24,000 vulnerabilities across more than 6,000 revisions from over 5,000 repositories, totaling approximately 55GB in size.
  • Future-proofing: Weekly updates ensure that the benchmark remains current, addressing potential training data contamination issues.

Development Procedure

The benchmark construction involves a detailed process:

  1. Downloading CVEs related to open-source repositories from the OSV dataset.
  2. Grouping CVEs by repository, and for each CVE, identifying affected versions.
  3. Selecting, with Google's CP-SAT solver, a minimal representative set of versions that together contain all known vulnerabilities (see the sketch after this list).
  4. Switching to the specific repository revisions and computing their sizes and language breakdowns.
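
Step 3 above is a minimum set-cover problem: pick the fewest versions such that every known vulnerability appears in at least one of them. Below is a minimal, hypothetical sketch of how this could be expressed with Google's CP-SAT solver (via the `ortools` package); the data structure and function names are illustrative assumptions rather than the paper's actual code.

```python
# Hedged sketch: choose a minimal set of versions covering all vulnerabilities,
# using Google's CP-SAT solver. The input maps each vulnerability ID to the set
# of versions it is known to affect; names are illustrative, not from the paper.
from ortools.sat.python import cp_model

def select_minimal_versions(affected_versions: dict[str, set[str]]) -> set[str]:
    model = cp_model.CpModel()
    all_versions = sorted({v for vs in affected_versions.values() for v in vs})
    # One boolean per candidate version: 1 means the version is kept.
    keep = {v: model.NewBoolVar(f"keep_{v}") for v in all_versions}
    # Every vulnerability must be covered by at least one kept version.
    for versions in affected_versions.values():
        model.AddBoolOr([keep[v] for v in versions])
    # Objective: keep as few versions as possible.
    model.Minimize(sum(keep.values()))
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    assert status in (cp_model.OPTIMAL, cp_model.FEASIBLE)
    return {v for v in all_versions if solver.Value(keep[v])}

# Example: both CVEs affect "v2", so "v2" alone is selected.
print(select_minimal_versions({"CVE-A": {"v1", "v2"}, "CVE-B": {"v2", "v3"}}))
```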

Evaluation Methodology

Processing and Running LLMs

To test different LLMs, the paper processes each revision by splitting the codebase into manageable chunks and instructing the model to report any vulnerabilities it finds in each chunk. Certain file types are excluded and size limits are imposed so that every chunk fits within the model's context window.
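
As a rough illustration of this chunking step, the sketch below splits a repository's files into size-bounded groups. The excluded extensions and the size budget are assumptions for illustration; the paper's actual filters and limits may differ, and a real implementation would count tokens rather than bytes.

```python
# Hypothetical sketch: group source files into chunks that fit a size budget,
# skipping file types unlikely to contain vulnerabilities. The thresholds and
# exclusion list are illustrative assumptions, not the paper's exact settings.
from pathlib import Path

EXCLUDED_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".lock"}
MAX_CHUNK_BYTES = 400_000  # stand-in for "fits in the model's context window"

def chunk_repository(repo_root: str) -> list[list[Path]]:
    chunks: list[list[Path]] = []
    current: list[Path] = []
    current_size = 0
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or path.suffix in EXCLUDED_SUFFIXES:
            continue
        size = path.stat().st_size
        if size > MAX_CHUNK_BYTES:
            continue  # a single oversized file would never fit in one chunk
        if current and current_size + size > MAX_CHUNK_BYTES:
            chunks.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```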

Scoring Vulnerabilities

An LLM-based scorer aligns the candidate vulnerabilities returned by a model with the known vulnerabilities at each revision, from which precision and recall are computed. The authors note that this matching is inherently difficult because CVE descriptions are often terse and imprecise.
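
Once the judge has decided which submitted leads correspond to which known vulnerabilities, the metric computation itself is straightforward. The sketch below shows the assumed bookkeeping; the hard part, semantically matching leads against terse CVE descriptions, is what the paper delegates to an LLM and is not shown here.

```python
# Hedged sketch of precision/recall/F1 once lead-to-CVE matching is known.
# The matching step itself (the LLM-judged part) is assumed to happen elsewhere.
def precision_recall_f1(num_leads: int, num_known: int, num_matched: int):
    precision = num_matched / num_leads if num_leads else 0.0
    recall = num_matched / num_known if num_known else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# e.g. 20 leads submitted, 10 known vulnerabilities, 3 correct matches
print(precision_recall_f1(20, 10, 3))  # (0.15, 0.3, 0.2)
```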

Results and Analysis

Overall Performance

The evaluation of seven state-of-the-art long-context LLMs, including Claude 3 Haiku, GPT-4 Turbo, and Gemini 1.5 Pro, demonstrates varying levels of performance:

  • The highest F1 scores were 14.1% (Claude 3 Opus) and 13.1% (Claude 3.5 Sonnet).
  • The models showed different precision and recall tradeoffs, with Gemini 1.5 Pro exhibiting high precision but low recall, and GPT-4o showing the opposite trend.
  • Absolute performance remained modest, highlighting significant room for improvement in both precision (best at 19.6%) and recall (best at 14.1%).

Vulnerability Types and Severities

The analysis revealed that models performed better on superficial vulnerabilities, such as injection vulnerabilities, while showing weaknesses in detecting memory corruption vulnerabilities (e.g., out-of-bounds write/read, use after free). The detected vulnerabilities also tended to be more severe than average, with an overrepresentation of critical issues.

Costs

The primary cost driver for vulnerability detection using these models is false positives. The paper estimated that the false positive rates led to substantial downstream costs due to misallocation of developer resources.

Implications and Future Work

Practical and Theoretical Impact

This benchmark provides a robust way to evaluate LLMs in real-world vulnerability detection, promoting advancements in practical security tools. The development of models and tool chains that reduce false positives and enhance lead deduplication could notably improve detection efficiency.

Future Directions

Future research could focus on:

  • Refining the scoring mechanism to improve accuracy and address the data quality issues inherent in CVEs.
  • Exploring how LLMs can be better integrated with existing tools like fuzzing or symbolic execution to improve precision and recall.
  • Developing automated ways to identify new, unpublished vulnerabilities as models advance.

Conclusion

Eyeballvul represents a comprehensive and continuously updated benchmark that challenges long-context LLMs in the domain of vulnerability detection at scale. While current results indicate significant room for improvement, the benchmark's extensive dataset and realistic setting provide a critical foundation for ongoing advancements. This work underscores the dual-use nature of such technologies, advocating for developments that benefit defenders disproportionately more than attackers by applying LLMs effectively in cybersecurity.

Author: Timothee Chauvin