TitanVul: Benchmark for Vulnerability Detection

Updated 24 March 2026

TitanVul is a large-scale benchmark for vulnerability detection in C/C++ code, featuring rigorously curated function-level pairs and multi-agent LLM validation.
It aggregates data from seven public datasets and applies two-phase deduplication followed by LLM-driven validation to ensure high-quality, balanced coverage of critical CWE classes.
TitanVul supports model evaluation on both in-distribution and out-of-distribution splits, and reported results show significant improvements from context-aware detection methods.

TitanVul is a large-scale, balanced vulnerability detection benchmark and training set for C/C++ code, designed to address historical limitations of label noise, duplication, and insufficient coverage across critical Common Weakness Enumeration (CWE) classes. By aggregating and rigorously filtering function-level vulnerability/fix pairs from diverse public datasets, TitanVul enables robust development, benchmarking, and generalization assessment for ML and LLM approaches to vulnerability detection. The corpus features multi-agent LLM-verified security labels, high coverage of real-world CWE types, and is accompanied by strong evaluation protocols utilizing both in-distribution (ID) and out-of-distribution (OOD) splits.

1. Construction and Composition

TitanVul is constructed by aggregating seven public vulnerability datasets: BigVul, CleanVul, CVEfixes, DiverseVul, PrimeVul, SafeCoder, and VulnPatchPairs (Li et al., 29 Jul 2025). Each record is a function-level code example, where the “vulnerable” class is drawn from a function before a security fix and the “non-vulnerable” (fix) is the same function after patching. The initial pool contains 304,726 function pairs.

Deduplication proceeds in two phases:

Complete-pair duplication: Exact vulnerable/fix pairs from multiple sources are compared using AST normalization and subtree matching, removing 22,807 pairs (7.48%) while preserving the copy with maximal metadata.
Self-identical duplication: Pairs in which the vulnerable function is identical to its fix are removed, eliminating 181,183 pairs (64.28%).
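
The two deduplication phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source describes AST normalization and subtree matching, whereas this sketch substitutes a much cruder comment and whitespace normalization followed by hashing. The record keys (`vulnerable`, `fixed`, `metadata`) are hypothetical.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Crude stand-in for AST normalization: drop comments, collapse whitespace.

    Assumption: regex stripping is good enough for a toy example; it will
    mishandle comment markers that appear inside string literals.
    """
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                     # line comments
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Hash of the normalized function text."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Apply both phases to a list of dicts with hypothetical keys
    'vulnerable', 'fixed', and 'metadata'."""
    # Phase 1: complete-pair duplicates. Keep the copy with the most metadata.
    best = {}
    for pair in pairs:
        key = (fingerprint(pair["vulnerable"]), fingerprint(pair["fixed"]))
        kept = best.get(key)
        if kept is None or len(pair["metadata"]) > len(kept["metadata"]):
            best[key] = pair
    # Phase 2: self-identical pairs, where the "fix" changes nothing.
    return [
        p for p in best.values()
        if fingerprint(p["vulnerable"]) != fingerprint(p["fixed"])
    ]
```

Hashing normalized text only catches duplicates that differ in comments or layout; structural matching of the kind the source describes would also catch renamed identifiers and similar variations.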

After deduplication, the merged pool totals 100,736 distinct pairs; subsequent validation (see below) yields a final corpus of 38,548 vulnerable functions, each uniquely paired with its security-fixed variant, for 77,096 binary-labeled function examples. A time-aware split ensures that validation and test records post-date all training items.
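
A time-aware split of this kind can be sketched as follows. The split fractions and the `timestamp` key are assumptions for illustration; the source does not state the proportions used.

```python
def time_aware_split(records, train_frac=0.8, val_frac=0.1):
    """Chronological split: train on the oldest records, then validate and
    test on progressively newer ones.

    Assumptions: each record is a dict with a comparable 'timestamp' value,
    and the 80/10/10 proportions are illustrative only.
    """
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    train = ordered[:train_end]
    val = ordered[train_end:val_end]
    test = ordered[val_end:]
    return train, val, test
```

Records that share a timestamp exactly at a cut point can land on both sides of it, so a stricter version would move each cut forward past any ties.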

A balanced experimental subset—used for direct benchmarking—comprises 13,038 functions (6,519 per class: vulnerable and non-vulnerable), after restricting code length to ≤1,024 tokens to align with common LLM context windows (Li et al., 6 Feb 2026).

2. Data Cleaning and Validation

To address prior issues of label noise and misaligned fixes, TitanVul employs a multi-agent LLM-driven validation pipeline (Li et al., 29 Jul 2025). For each candidate pair, three roles are instantiated as prompt-chained LLM agents:

Vulnerability Auditor: Reviews the code diff, CWE assignment, commit message, and associated CVE (if present), providing arguments and evidence for whether the patch constitutes a security fix.
Vulnerability Critic: Challenges the Auditor's reasoning, identifying weak spots, false positives (e.g., refactorings or stylistic changes), and requests clarification.
Vulnerability Consensus: Aggregates the prior outputs, assigning a “possibility score” in {0,1,2,3}; only examples scoring ≥2 are admitted.
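
The control flow of the three roles can be sketched in Python with the LLM calls abstracted away. Everything here is an assumption except the three roles, the 0 to 3 score range, and the admission threshold of 2: the prompts are placeholders, and each agent is modeled as a plain callable from prompt text to reply text.

```python
from typing import Callable, Tuple

# Stand-in for an LLM call: takes a prompt string, returns the reply string.
Agent = Callable[[str], str]


def validate_pair(
    diff: str,
    commit_message: str,
    cwe: str,
    auditor: Agent,
    critic: Agent,
    consensus: Agent,
    threshold: int = 2,
) -> Tuple[bool, int]:
    """Run Auditor -> Critic -> Consensus and apply the admission threshold.

    The prompt wording is hypothetical; only the chaining order and the
    score >= 2 rule come from the description above.
    """
    audit = auditor(
        f"CWE: {cwe}\nCommit message: {commit_message}\nDiff:\n{diff}\n"
        "Argue, with evidence, whether this patch is a security fix."
    )
    critique = critic(
        f"Auditor assessment:\n{audit}\n"
        "Challenge the reasoning and point out possible false positives."
    )
    verdict = consensus(
        f"Auditor:\n{audit}\nCritic:\n{critique}\n"
        "Reply with a single possibility score: 0, 1, 2, or 3."
    )
    score = int(verdict.strip())
    if score not in (0, 1, 2, 3):
        raise ValueError(f"score out of range: {score}")
    return score >= threshold, score
```

Passing the agents in as callables keeps the chaining logic testable with stubs and independent of any particular model API.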

All agents operate with explicit “chain-of-thought” prompting for structured reasoning. Human quality assurance follows: six researchers manually audited 400 randomly sampled pairs (post-LLM), verifying semantic validity, CWE-fix alignment, and functional self-containment. The audit found 94% validity with inter-rater Cohen's κ = 0.424, indicating moderate agreement.
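
For reference, Cohen's κ for a single pair of raters can be computed as in the sketch below. How the six auditors' judgments were paired or aggregated is not stated in the source, so the two-rater form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Because κ corrects for chance agreement, it can be only moderate when labels are heavily skewed toward one class, even if the raters agree on most items.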

3. Vulnerability and Metadata Coverage

TitanVul covers the MITRE Top 25 Most Dangerous CWEs as well as additional classes. Vulnerable-function counts for a selection of classes are:

CWE ID	Vulnerability Type	Vulnerable Examples
787	Out-of-bounds write	1,846
20	Improper input validation	1,734
119	Buffer overflow	1,520
125	Out-of-bounds read	1,432
79	Cross-site scripting	968
...	(full list in data)	...

Each TitanVul record includes: the vulnerable (pre-fix) function, its fixed variant, CWE identifiers, CVE reference (if present), commit message, dataset source, and timestamp. This extensive metadata facilitates context-aware modeling, time-aware splits, and reproducibility.
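
One way to represent such a record in Python is sketched below. The field names and types are hypothetical, since the released schema is not given here; only the list of fields comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TitanVulRecord:
    """Hypothetical in-memory shape of one vulnerable/fixed pair."""
    vulnerable_function: str      # function body before the security fix
    fixed_function: str           # same function after patching
    cwe_ids: Tuple[str, ...]      # e.g. ("CWE-787",)
    commit_message: str
    source_dataset: str           # which of the aggregated datasets it came from
    timestamp: datetime           # used for time-aware splitting
    cve_id: Optional[str] = None  # present only when a CVE is associated

    def as_examples(self) -> List[Tuple[str, int]]:
        """Expand the pair into two binary-labeled examples (1 = vulnerable)."""
        return [(self.vulnerable_function, 1), (self.fixed_function, 0)]
```

Expanding each pair into one positive and one negative example is what keeps the resulting classification set balanced by construction.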

4. Benchmarking Protocols and Evaluation Metrics

TitanVul supports both standard function-level benchmarking and context-augmented (inter-procedural) evaluation. For binary classification, the primary evaluation metric is accuracy, defined as

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

where TP, TN, FP, FN are true/false positive/negative counts.

Precision, recall, and $F_1$ are also computed:

$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2\, \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
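
These definitions translate directly into code. The sketch below is a plain-Python illustration; returning 0.0 when a denominator is zero is a convention chosen here, not something the source specifies.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumption: undefined ratios (zero denominator) are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For example, with the made-up counts TP = 8, TN = 6, FP = 4, FN = 2, accuracy is 14/20 = 0.70, precision is 8/12 ≈ 0.667, recall is 8/10 = 0.80, and $F_1$ is 8/11 ≈ 0.727.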

TitanVul is constructed to support both In-Distribution (ID) evaluations (where train/test are drawn from the same distribution) and Out-of-Distribution (OOD) evaluations, where model generalization to novel, real-world vulnerabilities is assessed using external datasets such as BenchVul (Li et al., 29 Jul 2025).

5. Model Performance and Generalization

Performance on TitanVul reveals significant distinctions between context-free and context-aware methodologies (Li et al., 6 Feb 2026):

Zero-shot GPT-4.1 baseline: Accuracy = 51.36%, Precision = 50.90%, Recall = 76.74%, F₁ = 61.21%.
Function-level fine-tuned baselines:
- CodeBERT: Accuracy = 54.41%, F₁ = 54.91%.
- UniXcoder: Accuracy = 63.68%, F₁ = 66.10%.
CPRVul (context profiling + reasoning fine-tuning):
- Qwen2.5-7B: Accuracy = 68.46%, F₁ = 67.81%.
- Qwen2.5-32B: Accuracy = 73.76%, F₁ = 74.54%.

Relative to UniXcoder, CPRVul (Qwen2.5-32B) shows a substantial accuracy gain (+10.08 points; 63.68% → 73.76%) and $F_1$ uplift (+8.44 points; 66.10% → 74.54%) on the TitanVul benchmark.

OOD performance is a critical differentiator. When models trained on TitanVul are evaluated on the independent BenchVul benchmark, the Qwen2.5-Coder-1.5B model obtains:

$\text{Acc}_{\text{ID}}$ (TitanVul): 0.590 ± 0.003
$\text{Acc}_{\text{OOD}}$ (BenchVul Real): 0.881 ± 0.026
$\text{Acc}_{\text{OOD}}$ (BenchVul Synth): 0.785 ± 0.007

For comparison, training on BigVul yields substantially higher ID but much lower OOD accuracy ($\text{Acc}_{\text{ID}} = 0.703$, $\text{Acc}_{\text{OOD}} = 0.493$), demonstrating TitanVul's superior generalization power (Li et al., 29 Jul 2025).

Realistic Vulnerability Generation (RVG) augmentation, which adds 100 synthetic context-rich samples per CWE, further boosts OOD real-world accuracy from 0.881 to 0.932 (+5.1 points, a 5.8% relative gain), with outsize effects for rare CWEs.

6. Context-Aware Modeling and Structured Reasoning

Context-aware approaches leveraging TitanVul systematically outperform function-only models. The CPRVul framework exemplifies this by:

Context Extraction: For each target function, constructing a Code Property Graph (CPG) to extract callers, callees, and related global variables.
Profiling and Ranking: Each context element is summarized by LLMs (e.g., GPT-4.1) into concise, security-relevant profiles and assigned a relevance score.
Input Assembly: The model receives the function, selected context profiles, code diffs, CVE/CWE descriptions, and commit messages.
Structured Reasoning: The model is prompted to emit a JSON trace containing “observation,” “security_reasoning,” “impact,” “is_vulnerable,” and a “confidence_score,” which is used as supervision for fine-tuning.
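
The structured reasoning trace can be checked mechanically before it is used as supervision. The sketch below validates the five field names listed above; the expected types (strings, a boolean, a numeric score) are assumptions, as the source names the fields but not their types.

```python
import json

# Field names come from the description above; the types are assumed.
REQUIRED_FIELDS = {
    "observation": str,
    "security_reasoning": str,
    "impact": str,
    "is_vulnerable": bool,
    "confidence_score": (int, float),
}


def parse_trace(raw: str) -> dict:
    """Parse a model-emitted JSON trace and check that every field is present
    with the assumed type. Raises ValueError on any problem."""
    trace = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(trace, dict):
        raise ValueError("trace must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in trace:
            raise ValueError(f"missing field: {name}")
        value = trace[name]
        # bool is a subclass of int, so reject it explicitly for the score.
        if name == "confidence_score" and isinstance(value, bool):
            raise ValueError("confidence_score must be numeric")
        if not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    return trace
```

Rejecting malformed traces early keeps incomplete or mistyped reasoning out of the fine-tuning data.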

This pipeline, applied directly to TitanVul, achieves significant accuracy improvements. Case studies demonstrate that key cues—such as missing NULL checks in callers and data flow from globals—are essential for correct classification. A plausible implication is that comprehensive inter-procedural context and explicit reasoning are necessary for closing the generalization gap observed in prior datasets.

7. Implications and Significance

TitanVul represents a shift in vulnerability detection benchmarking by combining large scale, curated diversity, high-quality annotation, and deep context metadata (Li et al., 29 Jul 2025, Li et al., 6 Feb 2026). Unlike prior datasets, which may suffer from high duplication, noisy labeling, and functional misalignment, TitanVul enables models to generalize robustly to real-world settings. Its multi-agent LLM pipeline and RVG augmentation mitigate label noise and coverage gaps, especially for underrepresented CWEs.

The empirical finding that modest ID accuracy can coincide with leading OOD generalization runs counter to standard validation heuristics and underscores the need for OOD-aware evaluation protocols in vulnerability detection research. The dataset also serves as the foundation for structured reasoning approaches such as CPRVul, whose results suggest that pairing curated, high-impact context with explicit security reasoning is key to top-tier vulnerability classification performance.

References

Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses? (2025)

Beyond Function-Level Analysis: Context-Aware Reasoning for Inter-Procedural Vulnerability Detection (2026)
