
JSIMPLIFIER: JavaScript Deobfuscation Dataset

Updated 25 January 2026
  • JSIMPLIFIER is a large-scale, richly annotated JavaScript dataset designed for rigorous deobfuscation and security research.
  • It comprises 44,421 samples—spanning malicious and benign scripts—with detailed metrics and annotations across 20 obfuscation techniques.
  • The dataset supports benchmarking and developing ML models and code analysis tools for effective malware detection and code simplification.

JSIMPLIFIER is a large-scale, richly annotated dataset designed to support rigorous research in JavaScript deobfuscation, malware detection, and code simplification. Constructed as part of the study “From Obfuscated to Obvious: A Comprehensive JavaScript Deobfuscation Tool for Security Analysis” (Zhou et al., 16 Dec 2025), it comprises 44,421 samples, covering malicious scripts collected in the wild (MalJS) and benign code (BenignJS), with fine-grained annotations for 20 obfuscation techniques. Samples are supplied both as raw source and as structured representations (AST dumps), together with per-sample metrics, obfuscation indicators, and LLM-assigned readability scores. The dataset is intended as a resource for developing and benchmarking security tools and deobfuscation algorithms.

1. Composition and Collection Methodology

JSIMPLIFIER contains 44,421 total JavaScript samples, partitioned as follows:

Partition | Number of Samples | Collection Source/Method
MalJS | 23,212 | Wild JavaScript from a leading cybersecurity partner (July 2019–July 2025)
BenignJS | 21,209 | Web crawls (Top-1K Tranco domains) and manual curation (GitHub, PublicWWW)

MalJS was constructed by mining over 10.6 million JavaScript blobs, keeping only those that parse as valid scripts under Meriyah, and deduplicating content after normalizing strings and stripping whitespace, which yielded 4,470,565 unique scripts prior to sampling. An obfuscation complexity score was computed for each script; 700 scripts were then selected for each score from 1 to 33, and all available high-complexity outliers were included (score 34: 3 samples; score 35: 109 samples). This gives 33 × 700 + 3 + 109 = 23,212 samples, spread evenly across the obfuscation spectrum apart from the sparse tail.
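The deduplication and per-score sampling described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' pipeline: the normalization rule (blanking string literals, removing all whitespace), the SHA-256 fingerprint, and the function names `normalize`, `dedupe`, and `sample_per_score` are assumptions made for the example.

```python
import hashlib
import random
import re
from collections import defaultdict

# Assumed normalization: replace every quoted string literal with an empty one.
STRING_RE = re.compile(r'''("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')''')


def normalize(src: str) -> str:
    """Blank out string literals, then strip all whitespace (assumed rule)."""
    without_strings = STRING_RE.sub('""', src)
    return re.sub(r"\s+", "", without_strings)


def dedupe(scripts):
    """Keep the first script for each normalized fingerprint."""
    seen, unique = set(), []
    for src in scripts:
        digest = hashlib.sha256(normalize(src).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(src)
    return unique


def sample_per_score(scored, quota=700, capped_scores=range(1, 34), seed=0):
    """scored: iterable of (script_id, obfuscation_score) pairs.

    Scores in capped_scores are down-sampled to `quota`; every other score
    (the high-complexity tail) is kept in full.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for script_id, score in scored:
        buckets[score].append(script_id)
    picked = []
    for score in sorted(buckets):
        ids = buckets[score]
        if score in capped_scores and len(ids) > quota:
            ids = rng.sample(ids, quota)
        picked.extend(ids)
    return picked
```

With 700 scripts for each of the 33 capped scores plus the 3 and 109 tail samples, this selection rule reproduces the reported partition size of 23,212.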

BenignJS combines 19,209 scripts scraped from the Top 1,000 Tranco domains (obfuscation score ≥ 1) with 2,000 “clean” scripts sourced from 1,000 popular GitHub repositories and 1,000 PublicWWW snapshots, collected between April and May 2025 and manually validated as unobfuscated, for a total of 21,209.

Labeling is sourced from established threat intelligence feeds and manual triage. MalJS samples carry the “wild malicious” tag. BenignJS scripts are selected either for their presence on popular, low-risk domains or by explicit manual auditing.

2. Obfuscation Taxonomy and Coverage

JSIMPLIFIER employs a taxonomy encompassing 20 obfuscation transforms, grouped into four distinct conceptual tiers:

  1. Lexical-level (5 techniques): identifier renaming (T0), indirect property access (T1), arithmetization (T2), string encoding (T3), boolean encoding (T4)
  2. Syntactic-level (6 techniques): expression-to-IIFE (T5), assignment-to-function (T6), string reversal (T7), AAEncode (T8), JJEncode (T9), JSF*CK (T10)
  3. Semantic-level (7 techniques): arrayizing strings (T11), string-array encoding (T12), JSON encoding (T13), RegExp encoding (T14), eval-packing (T15), control-flow flattening (T16), dead-code insertion (T17)
  4. Multi-layer (2 techniques): commercial OB combinators (T18), LLM-generated obfuscation (T19)

Each sample is annotated with a 20-bit flag vector indicating the presence of T0–T19. The MalJS partition contains at least one instance of every technique, with a mean of 8.32 techniques per script (range: 0–16). All lexical and syntactic obfuscation types are also represented in BenignJS (score ≥ 1). The CombiBench subset (1,296 samples) is specifically curated for multi-layer, cross-tier obfuscation, with each sample annotated by a “combination ID” that specifies the set of obfuscations applied.
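A 20-bit flag vector of this kind can be packed into and unpacked from an integer as in the sketch below. The bit order (bit i set means technique Ti is present) and the helper names `encode_flags` and `decode_flags` are assumptions for illustration; the dataset's actual serialization may differ.

```python
# Technique names in T0..T19 order, taken from the taxonomy above.
TECHNIQUES = [
    # Lexical-level (T0-T4)
    "identifier renaming", "indirect property access", "arithmetization",
    "string encoding", "boolean encoding",
    # Syntactic-level (T5-T10)
    "expression-to-IIFE", "assignment-to-function", "string reversal",
    "AAEncode", "JJEncode", "JSF*CK",
    # Semantic-level (T11-T17)
    "arrayizing strings", "string-array encoding", "JSON encoding",
    "RegExp encoding", "eval-packing", "control-flow flattening",
    "dead-code insertion",
    # Multi-layer (T18-T19)
    "commercial OB combinators", "LLM-generated obfuscation",
]


def encode_flags(indices):
    """Pack technique indices into an integer (assumed: bit i <-> Ti)."""
    flags = 0
    for i in indices:
        if not 0 <= i < len(TECHNIQUES):
            raise ValueError(f"unknown technique index: {i}")
        flags |= 1 << i
    return flags


def decode_flags(flags):
    """Return (label, name) pairs for every technique whose bit is set."""
    return [
        (f"T{i}", name)
        for i, name in enumerate(TECHNIQUES)
        if (flags >> i) & 1
    ]
```

For example, `decode_flags(encode_flags([0, 3, 16]))` lists identifier renaming, string encoding, and control-flow flattening.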

3. Dataset Statistics and Metrics

JSIMPLIFIER includes per-sample metrics covering size, complexity, structure, and readability:

  • File sizes: MalJS average: 391.78 KB; BenignJS: 41.40 KB; CombiBench: 8.06 KB.
  • Obfuscation complexity: Computed as the sum of tier weights over the techniques present in a sample (lexical = 1, syntactic = 2, semantic = 3, multi-layer = 4). Because MalJS is sampled with a fixed quota per score, its histogram is uniform over scores 1–33, with a sparse high-complexity tail at 34–35.
  • Token/AST distributions: For MalJS, mean token count ≈ 6,800 (median ≈ 3,100, σ ≈ 8,300). Mean AST node count ≈ 1,350 (median ≈ 860, σ ≈ 940), as extracted using Esprima/Babel and Joern.
  • Entropy measures:
    • Text level: $H_{text}(X) = H_{char}(X) + H_{word}(X) = -\sum_i p(c_i) \log_2 p(c_i) - \sum_j p(w_j) \log_2 p(w_j)$
    • AST structure: $H_{AST}(X) = w_1 H_{node\_num} + w_2 H_{edge\_num} + w_3 H_{node\_degree} + w_4 H_{node\_depth}$, with weights $[w_1, w_2, w_3, w_4] = [0.61, 0.79, 1.58, 1.02]$.
  • Code complexity: Halstead length (HLoC) and Halstead effort (H_Effort) are recorded per sample; simplification is quantified by Halstead Length Reduction (HLR), Halstead Effort Reduction (HER), and the change in cyclomatic complexity ΔCC (computed from Joern control-flow graphs).
  • LLM-based readability: Four LLMs (Claude-3.7-Sonnet, Gemini-2.5-Pro, DeepSeek-R1, GPT-o3) rated code on a 0–10 scale. Obfuscated code averages between 1.02 and 1.81; deobfuscated code averages between 6.21 and 7.83; the mean improvement is +466.94%.
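The complexity score and the two entropy measures in the list above can be illustrated with a minimal sketch. The tier boundaries, tier weights, and AST weights come from the text; the function names, the `\w+` word tokenizer, and the choice to take the four AST component entropies as precomputed inputs are assumptions, since the source does not specify them.

```python
import math
import re
from collections import Counter

# (technique indices, tier weight) for lexical, syntactic, semantic, multi-layer.
TIERS = [(range(0, 5), 1), (range(5, 11), 2), (range(11, 18), 3), (range(18, 20), 4)]

# Weights w1..w4 for node count, edge count, node degree, node depth.
AST_WEIGHTS = (0.61, 0.79, 1.58, 1.02)


def complexity_score(technique_indices):
    """Sum the tier weight of each distinct technique index (0..19)."""
    score = 0
    for i in set(technique_indices):
        for members, weight in TIERS:
            if i in members:
                score += weight
                break
        else:
            raise ValueError(f"unknown technique index: {i}")
    return score


def shannon(items):
    """Shannon entropy in bits of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def text_entropy(src):
    """H_text = H_char + H_word; words are \\w+ runs (assumed tokenizer)."""
    return shannon(src) + shannon(re.findall(r"\w+", src))


def ast_entropy(h_node_num, h_edge_num, h_node_degree, h_node_depth):
    """Weighted sum of four precomputed AST component entropies."""
    parts = (h_node_num, h_edge_num, h_node_degree, h_node_depth)
    return sum(w * h for w, h in zip(AST_WEIGHTS, parts))
```

Under these weights, a sample exhibiting all 20 techniques would score 5·1 + 6·2 + 7·3 + 2·4 = 46, so the observed maximum of 35 sits well below the theoretical ceiling.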

4. Dataset Splits, Usage, and Limitations

Recommended splits for reproducible experiments are stratified by obfuscation score: 70% train, 15% validation, 15% test, with CombiBench held out as a dedicated multi-layer challenge set.
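One way to realize such a score-stratified split is sketched below using only the standard library. The function name `stratified_split`, the fixed seed, and the rounding rule for per-score group sizes are assumptions; CombiBench samples are assumed to be excluded from the input so that they remain held out.

```python
import random
from collections import defaultdict


def stratified_split(records, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split (sample_id, obfuscation_score) pairs into train/val/test.

    Each score bucket is shuffled and cut independently, so every split
    keeps roughly the same score distribution as the full set.
    """
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for sample_id, score in records:
        by_score[score].append(sample_id)

    train, val, test = [], [], []
    for score in sorted(by_score):
        ids = sorted(by_score[score])  # fixed order before shuffling
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

Sorting the identifiers before shuffling with a seeded generator keeps the split reproducible regardless of the order in which records are read.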

Potential limitations are acknowledged:

  • MalJS collection is derived from a single (albeit extensive) corporate intelligence source, and may not fully capture rare, bespoke obfuscator variants.
  • The temporal span of MalJS (2019–2025) could bias the corpus toward ES5-era features; some ES6+ constructs are intentionally downgraded during preprocessing.
  • BenignJS sampling from top domains may skew distributions toward popular frameworks (e.g., jQuery, React), under-representing niche or emergent libraries.

A plausible implication is that, although the dataset is broad by current standards, models evaluated on underrepresented obfuscators or on the newest JavaScript language features may need supplementary data to generalize.

5. Format, Annotation, and Access

JSIMPLIFIER offers multiple modalities for each sample:

  • raw_js/: source .js files
  • ast_json/: AST dumps output via Esprima/Babel (JSON format)
  • metadata.json: per-sample record containing sample_id, source, size in bytes, token/AST counts, obfuscation score, 20-bit technique_flags, complexity score, and both entropy measures (text and AST)
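Reading the per-sample records might look like the sketch below. The field names `sample_id` and `source` appear in the description above, but the file layout (a top-level JSON array, or an object keyed by sample ID), the example `source` values, and the helper names are assumptions that should be checked against the released files.

```python
import json
from collections import Counter
from pathlib import Path


def load_metadata(path):
    """Load per-sample records from metadata.json.

    Assumed layout: a JSON array of objects, or an object mapping
    sample IDs to record objects.
    """
    with Path(path).open(encoding="utf-8") as fh:
        records = json.load(fh)
    if isinstance(records, dict):
        records = list(records.values())
    return records


def count_by_source(records):
    """Count records per `source` value (for example MalJS vs. BenignJS)."""
    return Counter(rec.get("source", "unknown") for rec in records)
```

A quick sanity check after download is to compare these counts against the partition sizes reported in Section 1.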

The dataset is available under a CC-BY-4.0 license for data and MIT-style for code, facilitating open research use. Access is provided via Zenodo (DOI: 10.5281/zenodo.17531661) and GitHub (https://github.com/XingTuLab/JSIMPLIFIER, datasets/ subdirectory). Citation is as follows: Zhou et al., “From Obfuscated to Obvious: A Comprehensive JavaScript Deobfuscation Tool for Security Analysis,” NDSS 2026 (Zhou et al., 16 Dec 2025).

6. Applications and Research Utility

JSIMPLIFIER is engineered to supply security researchers and software analysis practitioners with a comprehensive corpus for reproducible evaluation and comparative benchmarking of deobfuscation tools, malware detectors, and automated code analysis systems. By providing real-world and synthetic samples across every major obfuscation type, combined with detailed metrics and rich annotations, JSIMPLIFIER establishes new benchmarks for assessing code readability, structural simplification, and adversarial robustness in JavaScript security research. Key supported use cases include:

  • Training and validation of machine-learning models for automated deobfuscation and malware classification
  • Development of AST-based analysis and code normalization techniques
  • Robustness evaluation for detection tools against complex, multi-layer obfuscation seen in the wild
  • Quantitative analysis of code complexity, entropy, and human or LLM-based readability improvements after processing

The dataset thus directly addresses the need for standardized, real-world testbeds for security-oriented JavaScript analysis at scale (Zhou et al., 16 Dec 2025).
