
Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?

Published 29 Jul 2025 in cs.CR and cs.SE (arXiv:2507.21817v4)

Abstract: Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Prior work found that current vulnerability datasets suffer from issues including label inaccuracy rates of 20%-71%, extensive duplication, and poor coverage of critical Common Weakness Enumerations (CWEs). These issues create a significant generalization gap where models achieve misleading In-Distribution (ID) accuracies (testing on splits from the same dataset) by exploiting spurious correlations rather than learning true vulnerability patterns. To address these limitations, we present a three-part solution. First, we introduce BenchVul, a manually curated and balanced test dataset covering the MITRE Top 25 Most Dangerous CWEs, to enable fair model evaluation. Second, we construct a high-quality training dataset, TitanVul, comprising 38,548 functions by aggregating seven public sources and applying deduplication and validation using a novel multi-agent LLM pipeline. Third, we propose a Realistic Vulnerability Generation (RVG) pipeline, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation reveals that In-Distribution (ID) performance does not reliably predict Out-of-Distribution (OOD) performance on BenchVul. For example, a model trained on BigVul achieves the highest ID accuracy (0.703) but fails on BenchVul's real-world samples (0.493 OOD accuracy). Conversely, a model trained on our TitanVul achieves the highest OOD performance on both the real-world (0.881) and synthesized (0.785) portions of BenchVul, improving upon the next-best performing dataset by 5.3% and 11.8% respectively, despite a modest ID score (0.590). Augmenting TitanVul with our RVG further boosts this leading OOD performance, improving accuracy on real-world data by 5.8% (to 0.932).

Summary

  • The paper demonstrates that a model trained on TitanVul reaches 0.881 OOD accuracy on BenchVul's real-world samples despite a modest 0.590 ID score, rising to 0.932 when the training data is augmented with RVG.
  • It introduces BenchVul, a rigorously curated benchmark for the Top 25 CWEs, effectively addressing issues like label inaccuracies and data duplication.
  • The methodology employs a multi-agent verification framework to generate realistic vulnerability examples and enhance training data quality.


Introduction

The paper examines the efficacy of LLMs in detecting vulnerabilities classified under the MITRE Top 25 Most Dangerous Common Weakness Enumerations (CWEs). Despite advancements in automated vulnerability detection, practical deployment remains limited due to substantial data quality issues in prevalent datasets. These issues, including high rates of label inaccuracy and extensive duplication, create a significant "generalization gap" by allowing models to learn spurious correlations rather than genuine vulnerability patterns. The authors propose a multi-faceted solution to bridge this gap, involving the creation of a new benchmark, BenchVul, and a high-quality training dataset, TitanVul, alongside a framework for realistic vulnerability generation.

Methodology

BenchVul Construction: BenchVul is a manually curated test benchmark covering the MITRE Top 25 Most Dangerous CWEs. Candidate functions are aggregated from multiple public sources and then pass through rigorous deduplication and filtering to ensure data quality. The benchmark targets a balanced distribution of vulnerabilities, including underrepresented but critical CWE types.
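To make the curation step concrete, here is a minimal sketch of a dedup-and-balance pass in the spirit of the pipeline described above. All names (FunctionRecord, the normalization rules, the per-CWE cap) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: deduplicate function records, then cap each CWE class
# so the resulting benchmark stays balanced. Illustrative only.
import hashlib
import re
from dataclasses import dataclass

@dataclass
class FunctionRecord:
    code: str   # function source text
    cwe: str    # e.g. "CWE-79"
    label: int  # 1 = vulnerable, 0 = non-vulnerable

def normalize(code: str) -> str:
    """Strip comments and collapse whitespace so near-duplicates hash alike."""
    code = re.sub(r"//.*?$|/\*.*?\*/", "", code, flags=re.S | re.M)
    return re.sub(r"\s+", " ", code).strip()

def dedup_and_balance(records, per_cwe_cap=50):
    """Drop normalized duplicates, then cap each CWE for a balanced split."""
    seen, per_cwe, kept = set(), {}, []
    for r in records:
        digest = hashlib.sha256(normalize(r.code).encode()).hexdigest()
        if digest in seen:
            continue  # duplicate (up to normalization)
        seen.add(digest)
        if per_cwe.get(r.cwe, 0) >= per_cwe_cap:
            continue  # keep the CWE distribution balanced
        per_cwe[r.cwe] = per_cwe.get(r.cwe, 0) + 1
        kept.append(r)
    return kept
```

Hashing a normalized form of the source catches near-duplicates that differ only in comments or whitespace, which is exactly the kind of duplication prior datasets are known to contain.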

TitanVul Dataset: TitanVul is a comprehensive training dataset of 38,548 functions aggregated from seven public sources. A novel multi-agent LLM verification framework performs deduplication and validation, retaining only genuine, self-contained vulnerabilities at the function level and thereby raising the quality of the training data.
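A hedged sketch of what one validation pass of such a framework could look like is shown below. `ask_llm` is a placeholder for any chat-completion client, and the role prompts and majority vote are assumptions rather than the paper's actual agent design.

```python
# Illustrative multi-agent validation pass: several role-conditioned
# "agents" vote on whether a function genuinely exhibits the claimed CWE.
from collections import Counter

ROLES = [
    "You are a security auditor. Answer VULNERABLE or SAFE.",
    "You are a developer reviewing a patch. Answer VULNERABLE or SAFE.",
    "You are a CWE specialist. Answer VULNERABLE or SAFE.",
]

def ask_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def validate_function(code: str, claimed_cwe: str) -> bool:
    """Keep a sample only if a majority of agents agree it is genuine."""
    question = (
        f"Does this function contain a genuine, self-contained instance of "
        f"{claimed_cwe}?\n\n{code}"
    )
    votes = Counter(ask_llm(role, question).strip().upper() for role in ROLES)
    return votes["VULNERABLE"] > len(ROLES) // 2
```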

Realistic Vulnerability Generation (RVG): The RVG framework synthesizes realistic, context-aware vulnerability examples for CWE types that are typically underrepresented. It simulates development workflows and uses a multi-agent system to maintain realism and relevance.

Figure 1: Overview of the BenchVul construction pipeline for the MITRE Top 25 Most Dangerous CWEs.
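As an illustration only, an RVG-style generation step might chain a few role-conditioned prompts, simulating a short development workflow before injecting the target weakness. The three-stage chain and the `ask_llm` stub below are assumptions about how such a pipeline could look, not the paper's implementation.

```python
# Hedged sketch of simulated-workflow vulnerability generation.
def ask_llm(system_prompt: str, user_prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def generate_vulnerable_example(target_cwe: str, domain: str) -> dict:
    # Stage 1: a "product owner" agent produces a realistic feature request.
    spec = ask_llm(
        "You are a product owner.",
        f"Write a one-paragraph feature request for a {domain} service.",
    )
    # Stage 2: a "developer" agent implements the feature.
    safe_impl = ask_llm(
        "You are a senior developer.",
        f"Implement this feature as a single function:\n{spec}",
    )
    # Stage 3: a plausible coding mistake introduces the target CWE.
    vulnerable_impl = ask_llm(
        "You are simulating a realistic coding mistake.",
        f"Rewrite the function so it plausibly introduces {target_cwe}, "
        f"keeping the surrounding logic intact:\n{safe_impl}",
    )
    # Pairing the safe and vulnerable variants yields a labeled example.
    return {"cwe": target_cwe, "safe": safe_impl, "vulnerable": vulnerable_impl}
```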

Evaluation

The empirical analysis demonstrates that models trained on standard datasets suffer significant performance degradation when evaluated on independent data, revealing a substantial generalization gap. For instance, a model trained on BigVul achieves the highest ID accuracy (0.703) yet drops to 0.493 on BenchVul's real-world samples, and a model trained on PrimeVul degrades similarly (0.567 ID to 0.337 OOD). In contrast, a model trained on TitanVul generalizes markedly better, reaching 0.881 on the real-world portion and 0.785 on the synthesized portion of BenchVul despite a modest ID accuracy of 0.590. Supplementing TitanVul with RVG-generated data further boosts real-world accuracy by 5.8%, to 0.932.

Figure 2: Overview of the multi-agent LLM verification pipeline used to construct TitanVul.
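The ID-versus-OOD comparison underlying these numbers can be expressed in a few lines. The sketch below assumes a generic classifier with a `predict` method and labeled samples; both are stand-ins rather than the paper's evaluation harness.

```python
# Hedged sketch of the ID-vs-OOD accuracy comparison: accuracy on a held-out
# split of the training dataset (ID) versus accuracy on BenchVul (OOD).
def accuracy(model, samples) -> float:
    correct = sum(model.predict(s.code) == s.label for s in samples)
    return correct / len(samples)

def generalization_gap(model, id_test_split, benchvul) -> float:
    """Positive gap = the model looks better in-distribution than it really is."""
    return accuracy(model, id_test_split) - accuracy(model, benchvul)
```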

Implications and Future Work

The findings imply that high-quality, verified datasets like TitanVul can significantly enhance the generalization capability of LLMs in vulnerability detection. The research also highlights the need for reliable benchmarks such as BenchVul to test real-world applicability. Future work involves extending the datasets to cover more CWEs, exploring inter-procedural vulnerability detection, and applying these benchmarks to industrial-scale projects.

Conclusion

This paper contributes significantly to vulnerability detection research by addressing the fundamental challenges of dataset quality that affect model generalization. The introduction of BenchVul and TitanVul, along with the RVG framework, provides a new standard for evaluating and training LLMs, promising more robust and reliable vulnerability detection systems. The research opens avenues for future exploration into expanding dataset coverage and enhancing real-world applicability of automated vulnerability detection systems.
