
Real-World Obfuscated JavaScript Dataset

Updated 23 December 2025
  • The real-world obfuscated JavaScript dataset is a curated collection of both malicious and benign code samples employing diverse obfuscation techniques.
  • It relies on systematic collection and preprocessing methods, including syntax validation and stratified sampling, to ensure comprehensive coverage across techniques.
  • Rich metadata with ground-truth labels and complexity metrics enables precise benchmarking for deobfuscation, malware detection, and security analysis research.

A real-world obfuscated JavaScript dataset is an empirically derived collection of JavaScript code samples representative of scripts encountered in the wild that have undergone various obfuscation transformations. Such datasets are designed to support research and benchmarking in code deobfuscation, malware detection, and program analysis. Datasets of this class characteristically include both malicious and benign code, cover a diverse range of obfuscation techniques, and provide detailed ground-truth metadata for use in empirical evaluation of static and dynamic analysis tools, code LLMs, and security research methods.

1. Composition and Scope

Several benchmark datasets now define the state of the art in real-world JavaScript obfuscation research. The JSIMPLIFIER dataset is the largest cohesive release at present, comprising 44,421 samples: 23,212 labeled malicious (MalJS) and 21,209 labeled benign (BenignJS). This 52.3% / 47.7% split supports balanced evaluation for both detection and deobfuscation tasks (Zhou et al., 16 Dec 2025). Samples span July 2019 – July 2025 and are drawn from VirusTotal-like feeds, malware campaign repositories (hundreds of campaigns, e.g., SocGholish, JSFireTruck), the top 1K Tranco domains, trending GitHub repositories, and web page crawls.
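The reported class balance follows directly from the sample counts. A minimal check, using only the figures quoted above:

```python
# Counts as reported for the JSIMPLIFIER dataset in the text above.
malicious = 23_212
benign = 21_209
total = malicious + benign

# Shares in percent, rounded to one decimal place as in the text.
malicious_share = round(100 * malicious / total, 1)
benign_share = round(100 * benign / total, 1)

print(total, malicious_share, benign_share)
```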

Other notable datasets include JsDeObsBench, which contains 36,260 synthetic obfuscated variants (derived from CodeNet judge solutions) and 4,515 real-world malware samples (Chen et al., 25 Jun 2025). ObscuraX is a 55M-pair parallel corpus over seven languages, of which the TypeScript/JavaScript fraction is 5.4M source–obfuscated pairs; it is limited to identifier renaming, optionally including imports (Paul et al., 27 Mar 2025). WebEye (2018) offers ≈21,676 malicious JavaScript-bearing payloads from a two-month global web crawl, enriched with HTTP/HTML context (Vierthaler et al., 2018).

Dataset Composition Table

| Dataset | Total Samples | Malicious Samples | Benign Samples | Technique Coverage | Access/License |
|---|---|---|---|---|---|
| JSIMPLIFIER | 44,421 | 23,212 | 21,209 | 20, wide spectrum | Zenodo, CC BY-NC-SA 4.0 |
| JsDeObsBench | 40,775 | 4,515 | 36,260 | 7, prevalent | GitHub, Apache 2.0 |
| ObscuraX | 5,403,241* | none (all benign) | all benign | identifier renaming | Project page, MIT-style |
| WebEye | ~21,676 | ~21,676 | N/A | obfuscated malware | By request, research license |

*TypeScript/JavaScript subset only.

2. Data Acquisition and Preprocessing

Collection methodologies span direct aggregation from security telemetry (JSIMPLIFIER), systematic application of obfuscators to clean corpora (JsDeObsBench), and large-scale parallel mining of open-source code (ObscuraX). JSIMPLIFIER's malicious component was gathered by a major cybersecurity partner, utilizing VirusTotal-like feeds and campaign repositories over several years. Files undergo syntax validation (Meriyah), deduplication, obfuscation scoring, and stratified sampling for coverage across score ranges 1–35.
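The deduplication and stratified-sampling steps can be sketched as follows. This is an illustration under stated assumptions, not the JSIMPLIFIER implementation: the source does not say how duplicates are detected or how score strata are defined, so exact-match SHA-256 hashing, a bin width of 5, and the function names `deduplicate` and `stratified_sample` are all hypothetical choices.

```python
import hashlib
import random
from collections import defaultdict


def deduplicate(sources):
    """Drop byte-identical scripts, keeping the first occurrence.

    Assumption: duplicates are exact matches, detected by SHA-256.
    """
    seen = set()
    unique = []
    for src in sources:
        digest = hashlib.sha256(src.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(src)
    return unique


def stratified_sample(scored, per_bin, bin_width=5, seed=0):
    """Sample up to `per_bin` items from each obfuscation-score bin.

    `scored` is a list of (sample_id, score) pairs with scores in 1..35.
    The bin width is a hypothetical parameter, not taken from the source.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sample_id, score in scored:
        bins[(score - 1) // bin_width].append(sample_id)
    chosen = []
    for key in sorted(bins):
        members = bins[key]
        # Small bins contribute everything they have.
        k = min(per_bin, len(members))
        chosen.extend(rng.sample(members, k))
    return chosen
```

Sampling per bin rather than from the pooled set keeps rare, heavily obfuscated scripts from being crowded out by the more common low-score ones.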

Benign sets in JSIMPLIFIER derive primarily from the top 1K Tranco domains (only scripts with an obfuscation score ≥1 are retained) and are augmented with 2,000 manually verified clean files from highly starred GitHub repositories and PublicWWW crawls. Ground-truth labels rely on AV consensus for malicious samples, and on manual code review combined with provenance from predominantly reputable domains for benign samples.

ObscuraX employs a Tree-sitter-driven syntax tree pipeline that renames up to 90% of the identifiers in each file, covering variables, functions, classes, and (optionally) imports, while guaranteeing semantics-preserving transformations. JsDeObsBench constructs synthetic obfuscated samples by chaining up to seven transformations from the JavaScript-Obfuscator tool, with correct behavior verified by execution-based test cases.
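To make the idea of partial identifier renaming concrete, here is a toy sketch. It is not the ObscuraX pipeline: it uses whole-word regex substitution instead of a syntax tree, the caller supplies the identifier list, and the function name `rename_identifiers` and the `VAR_n` placeholder scheme are hypothetical.

```python
import random
import re


def rename_identifiers(source, identifiers, fraction=0.9, seed=0):
    """Rename a random subset of the given identifiers in `source`.

    Toy version: whole-word regex substitution, so it would also rewrite
    matches inside string literals, comments, and property names. A
    syntax-tree-based pipeline avoids those false matches.
    """
    rng = random.Random(seed)
    names = sorted(set(identifiers))
    k = round(len(names) * fraction)
    chosen = rng.sample(names, k)
    # Hypothetical placeholder scheme: VAR_0, VAR_1, ...
    mapping = {name: f"VAR_{i}" for i, name in enumerate(sorted(chosen))}
    if not mapping:
        return source, mapping
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    renamed = pattern.sub(lambda m: mapping[m.group(1)], source)
    return renamed, mapping


js = "function add(a, b) { var total = a + b; return total; }"
obfuscated, mapping = rename_identifiers(js, ["add", "a", "b", "total"], fraction=1.0)
print(obfuscated)
```

Returning the mapping alongside the rewritten source is what makes the result a parallel pair: the original can be recovered by inverting the mapping.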

3. Obfuscation Techniques and Complexity

Comprehensive real-world JavaScript obfuscation is characterized by a diverse arsenal of transformations. JSIMPLIFIER enumerates 20 major techniques, classified as: lexical-level (identifier renaming, indirect property access, etc.), syntactic-level (function/assignment transformations, AAEncode, JSF*CK), semantic-level (string-array encoding, eval-based packing, control-flow flattening, dead code), and multi-layer (commercial/proprietary obfuscators and LLM-driven obfuscation). Average complexity for malicious samples is 8.32 techniques per file (range 0–16).

In contrast, JsDeObsBench restricts coverage to seven canonical methods, each implemented with well-specified tooling parameters (e.g., name/string obfuscation, control-flow flattening, debug-protection and self-defending logic). ObscuraX is strictly limited to identifier renaming and import masking, resulting in lower semantic distortion but systematic coverage of surface-form obfuscation.

4. Dataset Structure and Metadata

Datasets provide rich metadata to facilitate downstream analysis. JSIMPLIFIER archives each sample alongside a JSON file specifying sample ID, label, obfuscation score, techniques applied (from T0–T19), file size, text and AST entropy, control-flow graph (CFG) nodes, and data-dependence graph (DDG) edges. Additional exports include Esprima/Babel-format ASTs for original and deobfuscated versions, control/data-flow graph representations (Joern/Neo4j), and various complexity metrics (Halstead, cyclomatic complexity, entropy).
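A per-sample metadata record of the kind described above could be modeled roughly as follows. The field names, label strings, and JSON layout are assumptions made for illustration; the source lists what the metadata contains, not its schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SampleMetadata:
    # All field names are hypothetical; only the categories come from the text.
    sample_id: str
    label: str                 # assumed values: "malicious" or "benign"
    obfuscation_score: int
    techniques: List[str]      # technique IDs in the range T0..T19
    file_size_bytes: int
    text_entropy: float
    ast_entropy: float
    cfg_nodes: int
    ddg_edges: int

    def validate(self):
        """Reject labels and technique IDs outside the expected sets."""
        if self.label not in {"malicious", "benign"}:
            raise ValueError(f"unexpected label: {self.label!r}")
        allowed = {f"T{i}" for i in range(20)}
        unknown = set(self.techniques) - allowed
        if unknown:
            raise ValueError(f"unknown technique IDs: {sorted(unknown)}")
        return self


def load_metadata(text):
    """Parse one JSON metadata record and validate it."""
    return SampleMetadata(**json.loads(text)).validate()
```

Validating technique IDs at load time catches records that reference a technique outside the 20-entry taxonomy before they reach any per-technique statistics.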

JsDeObsBench pairs each obfuscated sample with its ground-truth original, executable test cases, and metadata including the combination of obfuscations applied and behavioral fingerprints. ObscuraX provides parallel (source, obfuscated) pairs and records the obfuscation parameters used for each sample.

5. Evaluation Metrics and Benchmarking

Multi-dimensional benchmarks are critical for rigorous deobfuscation research. JSIMPLIFIER benchmarks deobfuscators by coverage (20/20 technique handling), parser-based correctness, CFG and DDG semantic preservation (similarity scores ≈93.78% and ≈95.84%, respectively), Halstead Length Reduction (HLR = 0.8820 on CombiBench), entropy metrics (35–50% median reduction), and LLM-based readability improvement (+466.94%, scored on a 0–10 scale across four LLMs). For MalJS, average cyclomatic complexity drops from 14.7 before deobfuscation to 3.9 after, while Halstead effort is reduced by 92.9% (Zhou et al., 16 Dec 2025).
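Two of these quantities are easy to illustrate. The sketch below computes character-level Shannon entropy and a generic relative reduction. Whether the cited work measures entropy at the character level, and whether its reduction metrics use exactly this ratio, are assumptions; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def relative_reduction(before, after):
    """Fractional reduction from `before` to `after` (0.5 means halved)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before


# Cyclomatic complexity figures quoted above for MalJS: 14.7 -> 3.9.
cc_reduction = relative_reduction(14.7, 3.9)
print(round(cc_reduction, 3))
```

Applied to the quoted cyclomatic complexity averages, the ratio comes out at roughly 73.5%.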

JsDeObsBench introduces a modular evaluation pipeline with four measures: syntax correctness (Esprima parse rate), execution correctness (test case pass rate), simplification (Halstead length reduction), and CodeBLEU for structural similarity. One-shot prompting boosts syntax and execution accuracy by 11% and 14%, respectively. LLMs outperform baselines in code simplification but struggle with semantically equivalent variable renaming (Chen et al., 25 Jun 2025).
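The first two measures are aggregate rates over per-sample outcomes. A minimal sketch, assuming a sample counts as execution-correct only if it parses and passes all of its test cases (the source does not state the exact criterion); `SampleResult` and `summarize` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    parses: bool        # deobfuscated output parsed without error
    tests_passed: int   # number of test cases that passed
    tests_total: int    # number of test cases run


def summarize(results):
    """Return syntax and execution correctness rates over all samples."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    syntax_ok = sum(r.parses for r in results)
    # Assumption: execution-correct means parsing plus passing every test.
    exec_ok = sum(
        r.parses and r.tests_total > 0 and r.tests_passed == r.tests_total
        for r in results
    )
    return {"syntax_rate": syntax_ok / n, "execution_rate": exec_ok / n}


results = [
    SampleResult(True, 3, 3),
    SampleResult(True, 2, 3),
    SampleResult(False, 0, 3),
    SampleResult(True, 3, 3),
]
summary = summarize(results)
print(summary)
```

Dividing both rates by the full sample count, rather than only by the samples that parse, keeps the execution rate from looking better simply because more outputs failed to parse.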

ObscuraX emphasizes extrinsic evaluation: downstream improvements in LLM pre-training (perplexity reduction, pass@k, ROUGE), relying on correct mapping from obfuscated to source code (Paul et al., 27 Mar 2025). WebEye provides 58 static features for each HTTP/JS sample, enabling classical and ML-based malware classification (Vierthaler et al., 2018).

6. Accessibility, Licensing, and Intended Use

JSIMPLIFIER’s dataset is available via Zenodo and as a GitHub submodule, distributed under a CC BY-NC-SA 4.0 license. Usage requires identity verification and a declaration of intended purpose (to constrain potential misuse in malware development). JsDeObsBench is fully open (Apache 2.0), with scripts, leaderboard, and Docker-based pipeline for reproducible benchmarks. ObscuraX is MIT-style licensed, facilitating Code-LM development and evaluation. WebEye is accessible by research request to Fraunhofer AISEC, with no direct open download link.

Intended uses include JavaScript deobfuscator and malware detector development; benchmarking of both traditional and LLM-based deobfuscation models; code complexity and entropy studies; evaluation of LLM-guided variable renaming; and ongoing validation of code security tools. Noted limitations include potential overrepresentation of certain malware families, benign script bias toward high-profile frameworks, and the absence of cryptographically packed scripts requiring external keys. The proposed mitigations are community-driven campaign augmentation, inclusion of additional benign corpora, and continual acquisition of fresh samples.

7. Impact and Research Integration

The release of large-scale, richly annotated obfuscated JavaScript datasets that balance malicious and benign real-world code directly advances empirical research in deobfuscation, secure code analysis, and security ML. JSIMPLIFIER establishes new benchmarks for comprehensive technique coverage and multi-faceted evaluation, enabling rigorous and reproducible comparison across deobfuscators. JsDeObsBench catalyzes LLM-based evaluation pipelines by linking execution correctness with code structure analysis. ObscuraX underpins advances in syntax-semantics disentanglement for Code-LMs. WebEye remains a reference for malware classifier training and deployment validation, built from automatically collected real HTTP traffic with rich contextual enrichment.

Dataset-driven research consistently highlights the same core challenges: preserving semantics under complex obfuscations, disentangling obfuscation from malware functionality, and generalizing across ever-evolving obfuscators in the wild. These releases provide both the data foundations and the empirical standards for the next generation of JavaScript security and program analysis research (Zhou et al., 16 Dec 2025, Chen et al., 25 Jun 2025, Paul et al., 27 Mar 2025, Vierthaler et al., 2018).
