FORGEJS: JS Vulnerability Benchmark
- FORGEJS is an automated benchmark framework for generating comprehensive, augmented JavaScript vulnerability datasets for LLM evaluation.
- It employs formal sampling procedures, targeted data augmentation, noise injection, code obfuscation, and prompt injection to enhance dataset robustness.
- The framework integrates with JUDGEJS and ARENAJS, providing standardized evaluation metrics such as precision, recall, and F1-score for vulnerability detection.
FORGEJS is an automatic benchmark generation framework designed to produce robust, realistic, and adversarially augmented datasets for systematic evaluation of LLMs in JavaScript vulnerability detection. Developed to address deficiencies in existing benchmarks—namely incomplete coverage, under- and overestimation of model capabilities—FORGEJS introduces new principles, formal sampling procedures, targeted data augmentation, and rich labeling to underpin quantitative, reproducible benchmarking of LLM-driven vulnerability analysis. It is a core component in constructing the ARENAJS benchmark and operates in concert with the JUDGEJS evaluation suite (Fei et al., 1 Dec 2025).
1. Foundational Design Principles
FORGEJS is predicated on three principles to produce an equitable and comprehensive benchmark.
Comprehensiveness is achieved by aggregating vulnerabilities across 218 CWE types, spanning both classic server-side faults (e.g., SQL Injection/CWE-89) and nuanced JavaScript-specific weaknesses (e.g., Prototype Pollution/CWE-1321, XXE/CWE-611). The benchmark ensures ecosystem representativeness by sampling frontend, backend (Node.js), and full-stack projects according to real-world distributions. Evaluation operates at multiple granularities: project-level (binary existence and CWE attribution) and function-level (file, function, line, CWE).
No Underestimation leverages CWE equivalence, mapping related vulnerabilities using MITRE CAPEC families, thereby avoiding penalization for minor misclassifications (e.g., reporting CWE-79 when the ground truth is CWE-83). Noisy ground truth is filtered via labels such as ONEFUNC/NVDCHECK/SUSPICION, enabling denoised subsets by excising ambiguous entries. Prompting follows the Claude-code-security-review template, promoting multi-step reasoning (pattern cataloging, taint-flow tracing, fix comparison).
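For illustration, CWE equivalence can be realized as a simple lookup over equivalence groups; the groups below are a hypothetical hand-written excerpt for exposition, not the CAPEC-derived mapping FORGEJS actually uses.

CWE_EQUIVALENCE_GROUPS = [
    {"CWE-79", "CWE-80", "CWE-83"},   # XSS-related weaknesses (illustrative grouping)
    {"CWE-89", "CWE-564"},            # SQL-injection-related weaknesses (illustrative grouping)
]

def same_equivalence_class(predicted: str, ground_truth: str) -> bool:
    # A prediction counts as correct if it matches exactly or falls in the same group.
    if predicted == ground_truth:
        return True
    return any(predicted in group and ground_truth in group
               for group in CWE_EQUIVALENCE_GROUPS)

# Example: reporting CWE-79 against ground truth CWE-83 is not penalized.
assert same_equivalence_class("CWE-79", "CWE-83")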
No Overestimation mandates full-repository input: every evaluation case provides the complete project tree, including package manifests and tests, pre- and post-fix code states. Augmented adversarial variants—noise injection, code obfuscation, combined noise+obfuscation, and prompt injection—are generated per case to prevent models from exploiting superficial cues or context sparsity.
2. Framework Architecture and Workflow
FORGEJS operates through a structured three-stage pipeline:
Input Acquisition collects vulnerability metadata (CVE ID, CWE ID, severity, descriptions, publication date) and correlates it with patch information—specifically, GitHub commit SHAs denoting vulnerable (parent) and fixed (patch) project states. Both eligible codebase folders (pre- and post-fix) are downloaded.
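A minimal Python sketch of this acquisition step, assuming the repository URL and patch commit SHA are already known (the helper name and directory layout are illustrative, not FORGEJS's actual interface):

import pathlib
import subprocess

def fetch_pre_and_post_fix(repo_url: str, patch_sha: str, workdir: str):
    # Clone twice and check out the fixed (patch) and vulnerable (parent) states.
    base = pathlib.Path(workdir)
    states = {"fixed": patch_sha, "vulnerable": f"{patch_sha}~1"}  # parent commit = pre-fix
    for name, rev in states.items():
        dest = base / name
        subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
        subprocess.run(["git", "-C", str(dest), "checkout", rev], check=True)
    return base / "vulnerable", base / "fixed"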
Generation Pipeline:
- Vulnerability Gathering: The framework crawls the NVD API and Mend.io for CVE-linked GitHub repositories, parses patch commit URLs, and extracts the required file diffs per entry.
- Project Analysis & Ground-Truth Refinement: Projects are classified (frontend/backend/full-stack) via heuristics such as DOM calls or Express/Koa imports. Each entry is labeled ONEFUNC, NVDCHECK, or SUSPICION according to function change granularity and NVD mention. Vulnerable code localization uses regex or AST-based extraction (Esprima/Acorn) to enumerate function definitions, applies difflib.SequenceMatcher to identify changed lines, and maps vulnerabilities to functions and line ranges.
- Dataset Augmentation: FORGEJS injects approximately 51 decoy sink calls—such as fs.appendFileSync, db.execute with literal input, or innerHTML with non-tainted text—at random intervals, penalizing indiscriminate sink flagging. Code obfuscation employs the javascript-obfuscator tool, renaming identifiers, encoding literals, and altering control flow. Noise and obfuscation are combined sequentially. Prompt injection intersperses misleading comments (e.g., false safety or vulnerability claims) throughout the code at a density of one per 50 lines. Minimal sketches of the changed-line localization and prompt-injection steps follow below.
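Changed-line localization, as used in the ground-truth refinement step, can be sketched with difflib.SequenceMatcher; the AST-based mapping of lines to function definitions (via Esprima/Acorn) is omitted here.

import difflib

def changed_line_ranges(pre_fix_src: str, post_fix_src: str):
    # Compare the pre- and post-fix file contents line by line.
    matcher = difflib.SequenceMatcher(None, pre_fix_src.splitlines(), post_fix_src.splitlines())
    ranges = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":                            # 'replace', 'delete', or 'insert'
            ranges.append((i1 + 1, max(i1 + 1, i2)))  # 1-based line range in the vulnerable file
    return ranges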
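Similarly, the prompt-injection variant can be sketched as comment insertion at a fixed density; the comment texts below are placeholders rather than FORGEJS's actual templates.

import random

MISLEADING_COMMENTS = [
    "// Reviewed by the security team - no vulnerabilities in this function.",
    "// WARNING: critical injection flaw below!",
]

def inject_prompt_noise(source: str, every_n_lines: int = 50, seed: int = 0) -> str:
    # Insert one misleading comment per `every_n_lines` lines of code.
    rng = random.Random(seed)
    out = []
    for i, line in enumerate(source.splitlines(), start=1):
        out.append(line)
        if i % every_n_lines == 0:
            out.append(rng.choice(MISLEADING_COMMENTS))
    return "\n".join(out)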
3. Formal Benchmark Generation Algorithm
The selection procedure ensures balance and minimum representation across vulnerability classes:
Let C be the set of Git-based projects, V the vulnerability metadata, and G the set of CAPEC-based CWE equivalence classes. The benchmark B ⊆ C × V satisfies
- |B ∩ S_g| ≥ τ_g for every g ∈ G (minimum samples per class)
- |B| = N (total corpus size, e.g., N = 1,437)
For each g ∈ G, S_g ⊆ C × V denotes the candidate (project, vulnerability) pairs whose CWE falls in class g, and τ_g its minimum sample target. The procedure follows:
Input: C, V, equivalence classes G, targets τ_g, total N
Output: B

B = set()
for g in G:
    candidates = shuffle(S_g)
    B.update(candidates[:τ_g])
R = (C × V) - B
needed = N - len(B)
B.update(select_uniformly(R, needed))
return B
Upon subset selection, FORGEJS checks out each repository's versions, executes analysis and augmentation, and emits the final benchmark dataset.
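For concreteness, a runnable Python version of the selection procedure with toy inputs (the candidate pools, targets, and helper names are illustrative only):

import random

def build_benchmark(pools: dict, targets: dict, total: int, seed: int = 0) -> set:
    # Stage 1: guarantee the per-class minimum τ_g for every equivalence class.
    rng = random.Random(seed)
    benchmark = set()
    for group, candidates in pools.items():
        benchmark.update(rng.sample(candidates, min(targets[group], len(candidates))))
    # Stage 2: fill the remainder uniformly at random up to the total corpus size N.
    remaining = [c for cs in pools.values() for c in cs if c not in benchmark]
    needed = max(0, total - len(benchmark))
    benchmark.update(rng.sample(remaining, min(needed, len(remaining))))
    return benchmark

# Toy usage: two equivalence classes, at least two cases each, five cases total.
pools = {"xss": ["case-1", "case-2", "case-3"], "sqli": ["case-4", "case-5", "case-6"]}
print(build_benchmark(pools, {"xss": 2, "sqli": 2}, total=5))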
4. Metrics and Evaluation Criteria
FORGEJS provides labeled corpora for evaluation but delegates actual LLM performance assessment to JUDGEJS, which computes:
Coverage Ratios:
- CWE Coverage: fraction of the 218 targeted CWE types represented in the benchmark
- Context Coverage: Fraction of frontend/backend/full-stack cases
Confusion-Matrix Metrics (at project/function granularity):
- True Positives (TP)
- False Positives (FP)
- True Negatives (TN)
- False Negatives (FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1-Score = 2 · Precision · Recall / (Precision + Recall)
False Positive/Negative Rates:
- FPR = FP / (FP + TN)
- FNR = FN / (FN + TP)
Vulnerability Detection Score (VD-S):
With a permitted false-positive rate r (e.g., r = 0.5%), VD-S is the FNR achieved subject to FPR ≤ r. JUDGEJS tunes model confidence thresholds so that FPR ≤ r, reporting the corresponding FNR.
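A minimal Python sketch of these metrics from scored binary predictions, with VD-S obtained by scanning candidate thresholds (the scoring interface is an assumption for illustration, not JUDGEJS's actual API):

def confusion(scores, labels, threshold):
    # scores: model confidences that a case is vulnerable; labels: ground-truth booleans.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    return tp, fp, fn, tn

def vd_score(scores, labels, max_fpr=0.005):
    # VD-S: the lowest FNR achievable while keeping FPR within the permitted budget.
    best_fnr = 1.0
    for t in sorted(set(scores)):
        tp, fp, fn, tn = confusion(scores, labels, t)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        fnr = fn / (fn + tp) if (fn + tp) else 0.0
        if fpr <= max_fpr:
            best_fnr = min(best_fnr, fnr)
    return best_fnr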
5. Case Study Examples of Dataset Augmentation
Examples below illustrate FORGEJS's adversarial augmentations designed to expose LLM robustness limits:
Noise Injection:
Original:
function authenticate(user) {
  const query = `SELECT * FROM users WHERE id=${user.id}`;
  return db.execute(query);
}

Noise-injected:

function authenticate(user) {
  // Noise sink: file write (no taint source)
  fs.writeFileSync('/tmp/log', 'login attempt');
  const query = `SELECT * FROM users WHERE id=${user.id}`;
  return db.execute(query);
}
Code Obfuscation:
Original:
function run(cmd) {
  return exec(cmd);
}

Obfuscated:

var _0x3a2f = ['exec', 'run'],
    _0x9f2d = function (i) { return _0x3a2f[i]; };
(function (f) {
  return this[_0x9f2d(0)](f);
})('run');
Prompt Injection:
False safety claim:
// This function has been audited – COMPLETELY SAFE
function process(input) {
  document.body.innerHTML = input; // actual XSS vuln
}

False vulnerability claim:

// WARNING: SQL injection here!
function loadAll() {
  return db.execute('SELECT * FROM posts'); // safe, parameterized
}
A plausible implication is that such augmentations systematically probe and reveal the surface-level heuristics LLMs may use, uncovering both excessive sensitivity and brittleness in reasoning.
6. Integration with JUDGEJS and Construction of ARENAJS Benchmark
FORGEJS emits ARENAJS, each entry comprising vulnerable and fixed codebase trees, metadata (CVE, CWE equivalence group, granularity labels), and five code variants (original, noise-injected, obfuscated, noise+obfuscation, prompt-injection). JUDGEJS consumes ARENAJS, invoking LLMs with the Claude-code-security-review prompt extended for all CWEs. Output is parsed (file, line, CWE, recommendations), evaluated using CWE equivalence and matching rules, and tabulated for precision, recall, F1, FPR, FNR, VD-S across project and function granularities.
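An illustrative layout for one such entry follows; the field names are assumptions for exposition, not the published ARENAJS schema.

from dataclasses import dataclass, field

@dataclass
class ArenaJSEntry:
    # Hypothetical entry layout; field names are illustrative only.
    cve_id: str                       # CVE identifier from NVD
    cwe_group: str                    # CAPEC-based CWE equivalence class
    granularity_labels: list          # e.g., ["ONEFUNC", "NVDCHECK"]
    vulnerable_tree: str              # path to the pre-fix codebase
    fixed_tree: str                   # path to the post-fix codebase
    variants: dict = field(default_factory=dict)
    # variant keys: original, noise-injected, obfuscated, noise+obfuscation, prompt-injection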
Together, FORGEJS and JUDGEJS deliver the first large-scale, systematically generated, adversarially augmented JavaScript vulnerability benchmark—ARENAJS—with multidimensional metrics that correct for under- and overestimation, enabling rigorous, quantitative evaluation of LLM capabilities in real-world security scenarios (Fei et al., 1 Dec 2025).