An Evaluation of "eyeballvul: a future-proof benchmark for vulnerability detection in the wild"
Introduction
The paper "eyeballvul: a future-proof benchmark for vulnerability detection in the wild" introduces a benchmark designed to evaluate the effectiveness of LLMs in detecting software vulnerabilities at scale. This benchmark, called eyeballvul, sources and updates vulnerability data weekly from open-source repositories. It aims to fill the gap in evaluating model performance for large-scale security vulnerability detection in real-world codebases, leveraging LLMs.
Benchmark Design and Creation
Benchmark Attributes
The benchmark is notable for several key attributes:
- Real-world vulnerabilities: It sources vulnerabilities directly from CVE records in open-source repositories.
- Realistic detection setting: unlike many traditional classification-based datasets, eyeballvul emulates a practical deployment scenario in which a model scans a codebase and reports the potential vulnerabilities it finds.
- Scale and diversity: The benchmark comprises over 24,000 vulnerabilities across more than 6,000 revisions from over 5,000 repositories, totaling approximately 55GB in size.
- Future-proofing: Weekly updates ensure that the benchmark remains current, addressing potential training data contamination issues.
Development Procedure
The benchmark construction involves a detailed process:
- Downloading CVEs related to open-source repositories from the OSV dataset.
- Grouping CVEs by repository, and for each CVE, identifying affected versions.
- Selecting a minimal representative set of versions that covers all vulnerabilities, using Google's CP-SAT solver (a sketch of this step follows the list).
- Checking out each selected repository revision and computing its size and language breakdown.
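The version-selection step is essentially a minimum set cover problem. Below is a minimal sketch of how it could be expressed with Google's CP-SAT solver; the input structure `affected_revisions` (CVE id mapped to the revisions it affects) and the function name are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal set-cover sketch with CP-SAT: pick the fewest revisions so that
# every CVE is covered by at least one revision it affects.
from ortools.sat.python import cp_model

def select_minimal_revisions(affected_revisions: dict[str, set[str]]) -> set[str]:
    all_revisions = set().union(*affected_revisions.values())
    model = cp_model.CpModel()
    chosen = {rev: model.NewBoolVar(rev) for rev in all_revisions}

    # Each CVE must be represented by at least one of the revisions it affects.
    for cve, revisions in affected_revisions.items():
        model.AddBoolOr([chosen[rev] for rev in revisions])

    # Objective: include as few revisions as possible.
    model.Minimize(sum(chosen.values()))

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status not in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        raise RuntimeError("No covering set of revisions found")
    return {rev for rev, var in chosen.items() if solver.Value(var)}
```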
Evaluation Methodology
Processing and Running LLMs
To test different LLMs, the paper splits each revision into manageable chunks of the codebase and instructs the models to report vulnerabilities in each chunk. Certain file types are excluded, and size limits ensure that each chunk fits within the models' context windows. A sketch of this chunking step is shown below.
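As a rough illustration of such chunking, the sketch below packs repository files into groups under a token budget. The excluded extensions, the budget, and the characters-per-token heuristic are illustrative assumptions rather than the paper's exact parameters.

```python
# Hypothetical sketch of splitting a repository revision into context-sized chunks.
from pathlib import Path

EXCLUDED_EXTENSIONS = {".lock", ".svg", ".png", ".jpg", ".pdf"}  # assumed examples
MAX_CHUNK_TOKENS = 100_000  # assumed budget below the model's context window

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude characters-per-token heuristic

def chunk_repository(root: str) -> list[list[tuple[str, str]]]:
    chunks, current, current_tokens = [], [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix in EXCLUDED_EXTENSIONS:
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        tokens = estimate_tokens(text)
        # Start a new chunk when adding this file would exceed the budget.
        if current and current_tokens + tokens > MAX_CHUNK_TOKENS:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append((str(path), text))
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
```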
Scoring Vulnerabilities
An LLM-based scorer aligns the potential vulnerabilities returned by a model with the known vulnerabilities for that revision, from which metrics such as precision and recall are computed (as sketched below). The authors note that this alignment is difficult even for an LLM judge, because CVE descriptions are often terse and imprecise.
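A minimal sketch of the metric computation, assuming a judge (an LLM in the paper) has already matched submitted leads to known vulnerabilities; the data shapes and function name are illustrative.

```python
# Illustrative scoring helper: precision, recall and F1 from judged matches.
def score(num_leads: int, num_known: int,
          matched_pairs: set[tuple[int, int]]) -> tuple[float, float, float]:
    matched_leads = {lead for lead, _ in matched_pairs}   # leads judged correct
    matched_known = {vuln for _, vuln in matched_pairs}   # known vulns that were found
    precision = len(matched_leads) / num_leads if num_leads else 0.0
    recall = len(matched_known) / num_known if num_known else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```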
Results and Analysis
Overall Performance
The evaluation of seven state-of-the-art long-context LLMs such as Claude 3 Haiku, GPT-4 Turbo, and Gemini 1.5 Pro demonstrates varying levels of performance:
- The highest F1 scores were 14.1% (Claude 3 Opus) and 13.1% (Claude 3.5 Sonnet).
- The models showed different precision and recall tradeoffs, with Gemini 1.5 Pro exhibiting high precision but low recall, and GPT-4o showing the opposite trend.
- Absolute performance remained modest, highlighting significant room for improvement in both precision (best at 19.6%) and recall (best at 14.1%).
Vulnerability Types and Severities
The analysis revealed that models performed better on more superficial classes, such as injection vulnerabilities, while struggling to detect memory corruption vulnerabilities (e.g., out-of-bounds write/read, use-after-free). The vulnerabilities that were detected also tended to be more severe than average, with an overrepresentation of critical issues.
Costs
The primary cost driver for vulnerability detection with these models is false positives: the paper estimates that high false positive rates lead to substantial downstream costs, since developer time is misallocated to triaging spurious reports. A rough illustration of this cost model follows.
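As a back-of-the-envelope illustration of why false positives dominate; all numbers below are assumptions, not figures from the paper.

```python
# Hypothetical triage-cost estimate driven by false positives.
def triage_cost(num_leads: int, precision: float,
                minutes_per_lead: float = 15.0, hourly_rate: float = 100.0) -> float:
    false_positives = num_leads * (1.0 - precision)
    return false_positives * (minutes_per_lead / 60.0) * hourly_rate

# Example: 1,000 reported leads at 20% precision leave ~800 false positives,
# i.e. ~200 hours of review (~$20,000) under these assumed parameters.
print(f"${triage_cost(1000, 0.20):,.0f}")
```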
Implications and Future Work
Practical and Theoretical Impact
This benchmark provides a robust way to evaluate LLMs on real-world vulnerability detection, promoting advances in practical security tools. Models and toolchains that reduce false positives and improve lead deduplication could notably increase detection efficiency.
Future Directions
Future research could focus on:
- Refining the scoring mechanism to improve accuracy and address the data quality issues inherent in CVEs.
- Exploring how LLMs can be better integrated with established techniques such as fuzzing or symbolic execution to improve precision and recall.
- Developing automated ways to identify new, unpublished vulnerabilities as models advance.
Conclusion
Eyeballvul is a comprehensive, continuously updated benchmark that challenges long-context LLMs on vulnerability detection at scale. While current results leave significant room for improvement, the benchmark's extensive dataset and realistic setting provide a solid foundation for ongoing progress. The work also underscores the dual-use nature of such technologies, advocating for advances in LLM-based security tooling that disproportionately benefit defenders over attackers.