ARENAJS: Benchmark for JavaScript Vulnerabilities

Updated 8 December 2025
  • ARENAJS is a systematic benchmark designed to assess LLMs’ capability in detecting security vulnerabilities in JavaScript using 218 CWE types and real-world project contexts.
  • It leverages the ForgeJS pipeline for automated dataset creation and employs JudgeJS for rigorous, multi-step evaluation with adversarial noise, obfuscation, and prompt injections.
  • The framework exposes substantial reasoning gaps in LLMs, demonstrating significant performance drops under simulated adversarial conditions and highlighting challenges in precision and robustness.

ARENAJS is the first systematic, large-scale benchmark specifically designed to evaluate the ability of LLMs to detect vulnerabilities in JavaScript code. It comprehensively addresses the limitations of prior efforts by combining broad Common Weakness Enumeration (CWE) coverage, rigorous ground-truth labeling, and robust adversarial testing, all within the context of real-world projects. ARENAJS is constructed automatically by the ForgeJS pipeline and evaluated using JudgeJS, an open evaluation platform. This framework exposes critical reasoning gaps and robustness failures of state-of-the-art LLMs operating in this domain (Fei et al., 1 Dec 2025).

1. Guiding Principles and Conceptual Foundation

ARENAJS is formally governed by three foundational principles:

  1. Comprehensiveness: It covers 218 distinct CWEs drawn from the National Vulnerability Database (NVD) and Mend.io, representing both common and long-tail categories such as prototype pollution (CWE-1321), XML external entity injection (CWE-611), and regular expression denial of service (ReDoS). Multi-dimensionality is central: the benchmark is stratified by project type (frontend, backend, full-stack) and evaluates not only binary vulnerability detection but also localization and fine-grained correctness of CWE assignment.
  2. No Underestimation: Ground-truth labeling uses CAPEC-based CWE equivalence classes, so semantically equivalent predictions are not penalized due to granularity mismatch (e.g., CWE-79 and CWE-83 for cross-site scripting). The denoised split ensures labels are tightly bound to explicit vulnerability-fixing commits. JudgeJS employs advanced prompting to maximize the chance of a model demonstrating multi-step reasoning.
  3. No Overestimation: By mandating evaluation on whole project repositories (not isolated code snippets) and requiring models to distinguish between pre-patch and post-patch states, ARENAJS explicitly measures false positives and resilience. Robustness is further stressed by adversarial augmentations that automatically inject semantics-preserving noise, obfuscation, and misleading prompts.

2. Automatic Benchmark Construction with ForgeJS

ForgeJS orchestrates a three-stage, fully automated process to generate ARENAJS:

  • Vulnerability Information Gathering: CVE metadata (IDs, CWE classes, patch SHAs) is harvested from the NVD API and Mend.io, with links to corresponding GitHub commits. Vulnerable (pre-patch) and fixed (post-patch) code bases are extracted for each entry.
  • Project Analysis and Annotation:
    • Projects are classified as frontend, backend, or full-stack using package metadata and framework imports.
    • Labeling differentiates between the full set (all labels) and a denoised subset, restricted to commits that affect only a single function (ONEFUNC) or whose vulnerable location is explicitly stated in the NVD entry (NVDCHECK).
    • AST-based extraction (esprima/acorn) and Python difflib identify changed functions and precise line ranges, yielding strong ground truth for both file-level and function-level assessment (a minimal sketch of the diff step follows this list).
  • Adversarial Dataset Augmentation:

    1. Noise Injection: 51 types of safe taint sinks (e.g., benign APIs) are inserted to test whether the model relies on keyword matching.
    2. Code Obfuscation: semantics-preserving obfuscation is applied using tools such as javascript-obfuscator.
    3. Combined Noise and Obfuscation: both perturbations are applied together to maximize the disturbance.
    4. Prompt Injection: deceptive comments (e.g., "This function is security-audited") are scattered through the code to lure models into misclassification.
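As an illustration of the annotation step referenced above, the following is a minimal sketch of extracting changed line ranges between pre-patch and post-patch file versions with Python's difflib. It is a sketch under assumed inputs, not the ForgeJS code; the actual pipeline additionally maps such ranges to enclosing functions with a JavaScript AST parser (esprima/acorn).

```python
import difflib

def changed_line_ranges(pre_patch_src: str, post_patch_src: str):
    """Return 1-based, inclusive (start, end) line ranges in the pre-patch
    file that the fixing commit modified or removed, derived from a line
    diff of the two versions."""
    matcher = difflib.SequenceMatcher(
        None,
        pre_patch_src.splitlines(),
        post_patch_src.splitlines(),
    )
    ranges = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            # difflib indices are 0-based and end-exclusive; convert them to
            # the 1-based, inclusive line numbers used for annotations here.
            ranges.append((i1 + 1, i2))
    return ranges

if __name__ == "__main__":
    pre = "function render(input) {\n  el.innerHTML = input;\n}\n"
    post = "function render(input) {\n  el.textContent = input;\n}\n"
    print(changed_line_ranges(pre, post))  # [(2, 2)]
```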

3. Dataset Structure and Statistical Properties

ARENAJS contains 1,437 unique JavaScript projects, each present in five variants (Original, Noise, Obfuscated, Noise+Obfuscation, Prompt Injection):

| Split | Projects | Frontend | Backend | Full-stack | CWE Types |
| --- | --- | --- | --- | --- | --- |
| Full (complete) | 1,200 | 440 | 575 | 185 | 218 |
| Denoised (clean) | 237 | 87 | 114 | 36 | 128 |
  • Sources: 1,152 from CVE-linked commits, 285 from Mend.io.
  • Context Split: 527 frontend, 689 backend, 221 full-stack.
  • CWE Coverage: 218 unique CWEs spanning OWASP Top 10 and specialized categories.

Each sample is represented as a repository pair (vulnerable/fixed), fully contextualized to preserve call graphs and dependencies. The denoised split ensures high-quality labeling that tightly reflects ground truth fixes.
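The exact record layout is not specified here; the following dataclass is a purely hypothetical sketch of what one ARENAJS entry could look like, assuming each entry bundles the repository pair with its labels and adversarial variants. All field names are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ArenaJSSample:
    """Hypothetical layout of one benchmark entry (illustrative only)."""
    cve_id: str               # vulnerability identifier, e.g. from NVD
    cwe_ids: list             # ground-truth CWE classes attached to the fix
    project_type: str         # "frontend" | "backend" | "full-stack"
    vulnerable_repo: str      # path to the pre-patch repository snapshot
    fixed_repo: str           # path to the post-patch repository snapshot
    changed_functions: list   # function-level ground truth from the fixing commit
    denoised: bool            # True if the entry is in the ONEFUNC/NVDCHECK subset
    variants: dict = field(default_factory=dict)
    # variant name -> repository path, e.g. "noise", "obfuscated",
    # "noise_obfuscation", "prompt_injection"
```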

4. Ground-Truth Labeling and Validation Strategies

To guarantee accuracy and fairness in benchmarking:

  • CWE Equivalence Classes: Model outputs are mapped to CAPEC-based equivalence classes, avoiding penalization for correct but granularly mismatched predictions.
  • Denoised Split: Excludes commits with incidental non-security changes, minimizing labeling noise (a small sketch of the selection criteria follows this list).
  • Strong Prompting: JudgeJS’ evaluation protocol ensures models have explicit opportunity for multi-step deduction, e.g., taint tracking and pattern identification.
  • Project Context Requirement: Models always receive the entire repository snapshot pre- and post-patch; performance cannot be artificially inflated by trivial snippet classification.
  • Robustness Testing: All model outputs are subjected to variants testing resilience to surface perturbations.
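As a small illustration of the denoised-split criteria, the following predicate sketches the selection logic described above; the field names are assumptions, not the ForgeJS schema.

```python
def in_denoised_split(sample: dict) -> bool:
    """Keep a sample only when its label is tightly bound to the fix: either
    the fixing commit touches exactly one function (ONEFUNC) or the vulnerable
    location is explicitly named in the NVD entry (NVDCHECK)."""
    one_func = len(sample.get("changed_functions", [])) == 1
    nvd_check = bool(sample.get("nvd_named_location"))
    return one_func or nvd_check
```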

5. Evaluation Mechanisms and Metrics (JudgeJS)

JudgeJS provides a comprehensive and reproducible evaluation protocol:

  • Model Invocation: Seven commercial LLMs are tested using the “Claude Code Security Review” prompt tailored to 218 CWEs. Each run supplies repository-wide metadata, with temperature set to 0.7 and minimum confidence 0.8. Model responses are required in structured JSON, including details such as affected lines, CWEs, severity, description, exploit scenario, and recommendation.
  • Canonicalized Result Matching: CWE predictions are validated under the equivalence relation

\mathrm{CWE}_{pred} \equiv \mathrm{CWE}_{gt} \iff (\mathrm{CWE}_{pred} = \mathrm{CWE}_{gt}) \lor \{\mathrm{CWE}_{pred}, \mathrm{CWE}_{gt}\} \subseteq g \text{ for some } g \in \mathcal{G}

ensuring correctness regardless of granularity.

  • Metric Definitions:

    • Project-Level True Positive (\mathrm{TP}_{proj}): requires

      (\mathrm{GT}_{vuln} = \mathrm{true}) \land (\mathrm{Pred}_{vuln} = \mathrm{true}) \land (\mathrm{CWE}_{pred} \equiv \mathrm{CWE}_{gt})

    • Function-Level Matching: requires vulnerability presence, CWE equivalence, a matching file base name, and overlapping normalized function names for a valid TP.
    • Precision, Recall, F1: computed per their standard definitions (a minimal sketch of these computations follows this list).
    • Robustness (VD-S): reports the false negative rate at a fixed maximal false positive rate r,

      \mathrm{VD\text{-}S} = \mathrm{FNR} \mid \mathrm{FPR} \le r,\quad \mathrm{FPR} = \frac{FP}{FP+TN},\quad \mathrm{FNR} = \frac{FN}{TP+FN}
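The matching rule and the metrics above can be made concrete with a short sketch. It is illustrative only: the equivalence-class grouping, the Outcome container, and the thresholded VD-S search are assumptions, not the JudgeJS implementation.

```python
from dataclasses import dataclass

# Illustrative CAPEC-style equivalence classes (assumed grouping, not the
# official ARENAJS table); any two CWEs in the same set count as equivalent.
EQUIV_CLASSES = [
    {"CWE-79", "CWE-83"},   # cross-site scripting variants
    {"CWE-22", "CWE-23"},   # path-traversal variants
]

def cwe_equivalent(pred: str, gt: str) -> bool:
    """CWE_pred == CWE_gt, or both fall inside one equivalence class g in G."""
    return pred == gt or any({pred, gt} <= g for g in EQUIV_CLASSES)

@dataclass
class Outcome:
    """Confusion-matrix counts for one model at one operating point."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0

def score(o: Outcome) -> dict:
    """Precision, recall, F1, plus the FPR/FNR terms used by VD-S."""
    precision = o.tp / (o.tp + o.fp) if (o.tp + o.fp) else 0.0
    recall = o.tp / (o.tp + o.fn) if (o.tp + o.fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = o.fp / (o.fp + o.tn) if (o.fp + o.tn) else 0.0
    fnr = o.fn / (o.tp + o.fn) if (o.tp + o.fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr, "fnr": fnr}

def vd_s(outcomes_by_threshold, r: float = 0.005) -> float:
    """VD-S: the false negative rate at an operating point whose FPR stays
    within the cap r (0.5% here); 1.0 if no operating point satisfies the cap."""
    scores = [score(o) for o in outcomes_by_threshold]
    feasible = [s for s in scores if s["fpr"] <= r]
    return min((s["fnr"] for s in feasible), default=1.0)
```

For example, `cwe_equivalent("CWE-83", "CWE-79")` returns True, so a model predicting CWE-83 on a project labeled CWE-79 is not penalized for the granularity mismatch.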

6. Empirical Findings and Model Assessment

Performance of leading LLMs on ARENAJS (Original/full split), and impact of dataset perturbations:

| Model | Original F1 | Noise | Obfuscated | Noise+Obfuscation | Prompt Injection | VD-S (FPR ≤ 0.5%) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5 | 32.1 | 17.2 | 22.9 | 17.8 | 25.7 | 57.2 |
| GPT-5-Mini | 29.8 | 13.1 | 25.6 | 10.9 | 25.4 | 73.6 |
| GPT-5-Codex | 34.7 | 15.2 | 26.7 | 9.8 | 27.2 | 70.9 |
| DeepSeek-v3.1 | 26.3 | 5.8 | 22.4 | 4.2 | 27.5 | 77.0 |
| Gemini-2.5-Pro | 36.6 | 20.0 | 33.5 | 17.0 | 34.9 | 61.6 |
| Gemini-Flash | 30.1 | 17.3 | 26.1 | 13.1 | 30.2 | 68.6 |
| Claude-4.5 | 35.9 | 4.2 | 17.2 | 3.9 | 27.1 | 65.2 |

Notable findings include:

  • Reasoning Disparity: All models show an 8–18 percentage point gap between project-level and function-level F1, demonstrating over-reliance on file names, imports, or comments, with limited ability to perform semantic reasoning or dataflow tracing.
  • Robustness Failures: Under the hardest perturbation (noise plus obfuscation), F1 scores of the top models fall by half or more (e.g., Gemini-2.5-Pro from 36.6 to 17.0, Claude-4.5 from 35.9 to 3.9).
  • Low Industrial Suitability: At an industrially required FPR threshold (≤ 0.5%), models miss between 57% and 77% of vulnerabilities. Under perturbations, VD-S exceeds 80% in some cases.

7. Open Problems and Future Research Directions

ARENAJS reveals that, as of the 2025 benchmark evaluation, LLM-based vulnerability detection in JavaScript is hindered by:

  • Surface Pattern Reliance: Models match on superficial cues (identifiers, keywords) and lack deep taint-flow or prototype chain analysis.
  • Lack of Robustness: Even moderate obfuscation or injected noise can severely degrade detection rates.
  • Unfavorable Precision–Recall Balance: Industrial usage demands stringent false positive rates, but at such thresholds, recall is essentially unusable.

Key research avenues indicated include:

  1. Hybrid Analyses: Integrating LLMs with traditional static or symbolic analysis to enable actual program flow reasoning.
  2. Program-Structure-Aware Pretraining: Introducing AST, CFG, or CPG modalities during pretraining to encode semantic context.
  3. Adversarial Fine-Tuning: Including obfuscated, noisy, and prompt-injected cases during training for improved robustness.
  4. Expansion and Enrichment: Adding executable exploits, time-based partitions for contamination prevention, and advanced client-side cases to the benchmark.
  5. Dynamic Thresholding: Developing deployment-aware abstention logic or certitude scoring to optimize for industrially viable tradeoffs.

ARENAJS, together with ForgeJS and JudgeJS, thus constitutes an open, rigorous paradigm for reproducible and comprehensive assessment of LLMs in JavaScript security analysis, and delineates the path toward LLMs with practical value in vulnerability detection (Fei et al., 1 Dec 2025).
