
EMBER Dataset for Malware Analysis

Updated 29 November 2025
  • EMBER dataset is a canonical benchmark family offering comprehensive labeled repositories for ML-based malware detection and analysis.
  • It supports multiple file formats and tasks, including binary classification, family tagging, behavior identification, and semantic ontology integration.
  • Its rigorous feature extraction pipelines and standardized protocols enable reproducible evaluations and robust research in malware analytics.

The EMBER dataset series constitutes a canonical family of benchmarks for machine-learning-driven malware analysis, beginning with its original release for static Windows PE file classification and evolving into multi-format, multi-task repositories supporting advanced experimental protocols. EMBER provides feature-rich, openly licensed resources for malware detection, family classification, behavior tagging, and semantic data representation. This encyclopedic entry catalogs the EMBER concept, its expansion to EMBER2024 and EMBERSim, and its integration with semantic ontologies for explainable security analytics.

1. Genesis and Motivation

The foundational EMBER dataset, introduced by Anderson & Roth in 2018 as the "Endgame Malware BEnchmark for Research," filled a crucial gap in the availability of large, fully labeled corpora for machine learning on portable executables (PE) (Anderson et al., 2018). EMBER was explicitly designed to serve the information security research community by providing a benchmark analogous to ImageNet, but for static malware detection.

Key motivations included:

  • Enabling robust supervised learning with balanced malicious/benign samples.
  • Facilitating research into concept drift through temporal partitioning of data.
  • Supporting semi-supervised representation learning via an unlabeled cohort.
  • Driving reproducibility and comparative evaluation by releasing the full feature extraction pipeline, vectorization routines, and baseline model code.

The resulting corpus catalyzed ML-based static malware analysis and rapidly became the de facto reference for PE malware research.

2. Dataset Structures, Splitting, and Labeling Protocols

EMBER (2018/2019)

The original EMBER release comprises 1.1 million PE samples with supervised and unsupervised splits (Anderson et al., 2018, Corlatescu et al., 2023):

  • Training: 900,000 instances (300k malicious, 300k benign, 300k unlabeled).
  • Testing: 200,000 (100k malicious, 100k benign).
  • Temporal split: training set drawn from early periods, test set from the final two months, enabling realistic concept drift studies.

Labels encode binary class (malicious/benign), with –1 indicating unlabeled samples. The intentional inclusion of unlabeled data models the scarcity of reliable ground-truth labels observed in operational deployments.
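The label convention above can be sketched in a few lines. This is a minimal illustrative helper, not EMBER's actual loading code (the released tooling reads JSONL metadata); the flat-list storage here is an assumption for clarity.

```python
# Minimal sketch of EMBER's label convention: 0 = benign, 1 = malicious,
# -1 = unlabeled (reserved for semi-supervised experiments).
# Hypothetical flat-list storage; the real dataset ships JSONL metadata.

def split_by_label(samples, labels):
    """Partition samples into a supervised cohort and an unlabeled cohort."""
    supervised, unlabeled = [], []
    for sample, label in zip(samples, labels):
        if label == -1:
            unlabeled.append(sample)            # semi-supervised pool
        else:
            supervised.append((sample, label))  # usable for supervised training
    return supervised, unlabeled

X = ["s0", "s1", "s2", "s3"]
y = [1, -1, 0, -1]
sup, unl = split_by_label(X, y)
print(len(sup), len(unl))  # → 2 2
```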

EMBER2024

The EMBER2024 expansion encompasses 3,239,255 files covering six file formats: Win32 PE, Win64 PE, .NET, APK (Android), ELF (Linux), and PDF (Joyce et al., 5 Jun 2025):

  • Time-windowing: Files collected daily over 64 weeks (Sep 2023–Dec 2024), with benign/malicious assignments based on late VirusTotal consensus.
  • Deduplication: TLSH (fuzzy hashing) is employed with strict distance cutoffs to minimize near-duplicates (false-positive rate ≈0.0018%).
  • Splits:
    • Training: 2,626,000 files (weeks 1–52).
    • Testing: 606,000 files (weeks 53–64).
    • Challenge set: 6,315 malicious samples that initially evaded detection by all AV engines (0 of ≈70 VirusTotal detections at first scan, later ≥5).

The inclusion of a "challenge" subset with proven evasion characteristics supports dedicated evaluation of detection robustness on adversarial samples.
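The cutoff-based deduplication described above can be sketched as a greedy loop. The real pipeline computes TLSH distances over fuzzy hashes of the raw files; `fuzzy_distance` below is a stand-in invented for this sketch, purely to show the shape of the dedup logic.

```python
# Sketch of cutoff-based near-duplicate removal, as EMBER2024 does with
# TLSH distances. `fuzzy_distance` is a toy stand-in (NOT TLSH): a real
# pipeline would call tlsh.diff() on TLSH digests of the file bytes.

def fuzzy_distance(a: bytes, b: bytes) -> int:
    """Toy distance: byte mismatches over the common prefix plus the
    length difference. Purely illustrative."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a[:n], b[:n])) + abs(len(a) - len(b))

def deduplicate(files, cutoff):
    """Keep a file only if it is farther than `cutoff` from every file
    already kept (quadratic scan; fine for a sketch)."""
    kept = []
    for f in files:
        if all(fuzzy_distance(f, k) > cutoff for k in kept):
            kept.append(f)
    return kept

corpus = [b"MZ\x90\x00AAAA", b"MZ\x90\x00AAAB", b"\x7fELF....", b"%PDF-1.7"]
print(len(deduplicate(corpus, cutoff=2)))  # → 3 (near-duplicate PE pair collapses)
```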

3. Feature Representations and Extraction Pipelines

EMBER Static Feature Design

Raw feature vectors are statically extracted from PE binary structure, with eight primary categories (Anderson et al., 2018, Švec et al., 18 Mar 2024):

  • General file traits: byte size, virtual size, import/export/symbol counts, debug/resources/relocations/signature/TLS flags.
  • COFF & optional headers: timestamps, machine types, PE magic, subsystem, DLL characteristics, version and sizing fields.
  • Imports/exports: DLLs and function names aggregated through feature hashing (DLLs: 256 bins, DLL:Func: 1024 bins, exports: 128 bins).
  • Section metadata: name, size, virtual size, entropy, code/data flags, entry point location; hashed into 50-bin vectors.
  • Byte-histogram: 256 bins, normalized as $\hat h_i = h_i / \sum_{j=0}^{255} h_j$ for $i = 0, \dots, 255$.
  • Byte-entropy histogram: 16 × 16 joint bins computed over sliding windows (length 2048, step 1024).
  • String statistics: counts and entropy of printable substrings, length distributions, counts of path/URL/registry patterns, MZ indicators.

Feature hashing is adopted throughout for compact numerical representation of high-cardinality sets. The canonical EMBER vector is 2,351 dimensions. Extraction is implemented in Python atop LIEF, with re-vectorization and incremental sample addition supported programmatically.
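The two core ideas above, hashing variable-size string sets into fixed bins and normalizing a byte histogram, can be sketched as follows. EMBER's released pipeline uses scikit-learn's FeatureHasher atop LIEF-parsed structures; the md5-mod-bins trick here is a stdlib stand-in, not EMBER's actual hash.

```python
import hashlib

def hash_features(names, n_bins):
    """Hash a variable-size set of strings (e.g. imported DLL names) into
    a fixed-length count vector. EMBER uses scikit-learn's FeatureHasher;
    md5-mod-bins here is a stdlib stand-in for illustration."""
    vec = [0.0] * n_bins
    for name in names:
        idx = int(hashlib.md5(name.lower().encode()).hexdigest(), 16) % n_bins
        vec[idx] += 1.0
    return vec

def byte_histogram(data: bytes):
    """Normalized 256-bin byte histogram: h_i / sum_j h_j."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = sum(counts) or 1  # guard against empty input
    return [c / total for c in counts]

dll_vec = hash_features(["kernel32.dll", "user32.dll"], n_bins=256)
hist = byte_histogram(b"MZ\x90\x00" * 64)
print(sum(dll_vec), round(sum(hist), 6))  # → 2.0 1.0
```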

EMBER2024 Feature Set

EMBER2024 introduces EMBER v3—a 2,568-dimensional vector, augmented for multi-format compatibility (Joyce et al., 5 Jun 2025):

  • Core header fields and entropy features (4 dims).
  • Byte and byte-entropy histograms (256 + 256 dims).
  • Extensive string feature patterns (≈80 dimensions).
  • DOS, COFF, Optional, Section, Data Directory, Rich header fields (with coverage for non-PE formats).
  • Authenticode signature and parsing warning features (88 flag indicators).
  • Imports/exports are again feature-hashed.

For non-PE files (ELF, APK, PDF), only the first 696 dimensions (the format-agnostic, densely populated features) are guaranteed to be meaningful; the remaining PE-specific fields are zero.

4. Tasks, Benchmarks, and Evaluation Protocols

Supported Tasks

EMBER2018 and EMBER2024 both support multi-faceted evaluation (Anderson et al., 2018, Joyce et al., 5 Jun 2025):

  • Binary classification: malware vs. benign.
  • Family classification: multiclass (EMBER2024: 6,787 families, 2,538 with ≥10 instances, largest families >10,000).
  • Behavior tagging: multi-label, 118 behavior tags (e.g., ransomware, worm).
  • File property prediction: multi-label (30 tags).
  • Packer recognition (52 packer types), exploited vulnerability (293 CVE tags), threat group attribution (43 APT groups).

EMBER2024 adheres to stratified splits and provides per-task label sets.
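The per-class stratification mentioned above can be sketched as a simple split routine. This is a generic illustration of stratified splitting, not EMBER2024's exact procedure (which also uses temporal windows); the sampling details here are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_frac, seed=0):
    """Per-class stratified split: each label contributes the same
    fraction of its samples to the test set. Generic sketch; EMBER2024's
    actual splits additionally respect collection-week boundaries."""
    by_label = defaultdict(list)
    for s, y in zip(samples, labels):
        by_label[y].append(s)
    rng = random.Random(seed)
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test += [(s, y) for s in group[:cut]]
        train += [(s, y) for s in group[cut:]]
    return train, test

X = list(range(100))
y = [i % 2 for i in X]  # two balanced classes
tr, te = stratified_split(X, y, test_frac=0.2)
print(len(tr), len(te))  # → 80 20
```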

Benchmark Results

EMBER:

  • LightGBM (100 trees, 31 leaves/tree, <10,000 parameters): ROC AUC = 0.99911, TPR ≈ 92.99% @ FPR 0.1%, TPR ≈ 98.20% @ FPR 1% (Anderson et al., 2018).
  • MalConv (featureless CNN): ROC AUC = 0.99821, TPR ≈ 92.2% @ FPR 0.1%, TPR ≈ 97.3% @ FPR 1%.

EMBER2024:

  • LightGBM (64 leaves, 500 rounds): ROC AUC = 0.9949 on test, TPR = 94.48% @ FPR = 1%.
  • PR AUC on challenge set (evasive malware): 0.5722, demonstrating increased difficulty for AV-resistant files.
  • Family classification: accuracy = 67.97%, weighted F1 = 0.6664, macro F1 = 0.4371.
  • Multi-label tag predictors: average AUC 0.7462 (file property), 0.8310 (packer); lower macro-precision/recall due to label imbalance.

Evaluation Metrics

Protocols standardize on TPR, FPR, Precision, Recall, F1, ROC and PR AUC throughout (Anderson et al., 2018, Joyce et al., 5 Jun 2025).
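The "TPR at a fixed FPR" figures quoted above (e.g. TPR @ FPR 0.1%) can be computed directly from scores and labels. A minimal sketch, assuming binary labels and higher-is-more-malicious scores; production evaluations would typically use sklearn.metrics.roc_curve instead.

```python
def tpr_at_fpr(labels, scores, target_fpr):
    """TPR at the highest threshold whose FPR does not exceed target_fpr.
    labels: 1 = malicious, 0 = benign; scores: higher = more malicious."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Number of false positives permitted at the target FPR.
    k = int(target_fpr * len(neg))
    # Threshold sits just above the (k+1)-th highest benign score.
    threshold = neg[k] if k < len(neg) else float("-inf")
    tp = sum(1 for s in pos if s > threshold)
    return tp / len(pos)

y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.25, 0.2, 0.1, 0.05]
print(tpr_at_fpr(y, s, target_fpr=0.0))  # → 0.75
```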

5. Augmentation, Semantic Representation, and Advanced Usage

EMBERSim: Similarity Search Augmentation

EMBERSim extends EMBER2018 as a large-scale similarity benchmark for ML-based neighbor retrieval and evaluation (Corlatescu et al., 2023):

  • Leaf-prediction similarity: utilizes XGBoost classifiers, generating per-sample leaf-index vectors across ensemble trees, and defines leaf proximity as the fraction of joint leaf assignments,

$$\text{LeafSimilarity}(x_1, x_2) = \frac{1}{T} \sum_{i=1}^{T} \mathbf{1}\left[x_1^{(i)} = x_2^{(i)}\right]$$

where $x^{(i)}$ denotes the leaf index reached by sample $x$ in tree $i$ of a $T$-tree ensemble.

  • Top-100 neighbor indices per sample are precomputed.
  • Integration with AVClass2 for automated tagging: FILE, CLASS, FAM, BEH labels derived from VirusTotal detection names.
  • Co-occurrence enrichment: additional tags assigned when co-occurrence frequencies surpass predefined thresholds.
  • Full feature, label, tag, index matrices released in CSV, Parquet, and NPZ formats; reproducible pipelines for all augmentation steps.
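The leaf-similarity measure itself reduces to a short function over leaf-index vectors. A sketch assuming those vectors have already been obtained (e.g. via XGBoost's predict with pred_leaf=True, which returns one leaf index per tree).

```python
def leaf_similarity(leaves_a, leaves_b):
    """EMBERSim-style leaf-prediction similarity: fraction of trees in
    which the two samples land in the same leaf. Inputs are per-sample
    leaf-index vectors, one index per tree (e.g. from XGBoost's
    predict(..., pred_leaf=True))."""
    assert len(leaves_a) == len(leaves_b), "vectors must cover the same trees"
    matches = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return matches / len(leaves_a)

# Two samples routed through a 4-tree ensemble: same leaf in trees 0 and 2.
print(leaf_similarity([3, 7, 1, 4], [3, 2, 1, 9]))  # → 0.5
```

Ranking all candidates by this score and keeping the 100 highest yields the precomputed top-100 neighbor lists described above.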

Semantic Data Representation: PE Malware Ontology

A PE Malware Ontology, motivated by EMBER, represents features as OWL 2 classes and properties for explainable ML (Švec et al., 18 Mar 2024):

  • Core: 195 classes, 6 object properties, 9 data properties (e.g., has_section, has_file_feature, has_action).
  • Full mapping from EMBER's static features to ontology elements (FileFeature, SectionFeature, SectionFlag, Action classes).
  • Concept definitions and semantic restrictions formalized in Manchester OWL, e.g.,

$$\geqslant 2\ \mathit{has\_section}.\bigl( \exists\,\mathit{has\_section\_feature}.\{\mathit{high\_entropy}\} \sqcap \exists\,\mathit{has\_section\_feature}.\{\mathit{nonstandard\_section\_name}\} \bigr)$$

  • OWL-encoded data releases for various sample sizes (1k–800k), facilitating concept learning and explainable classification.
  • DL-Learner concept learning results yield interpretable, albeit less precise, classifiers (F1: 0.68–0.77 vs. ML baselines at 0.90+).

This suggests that semantic enrichment supports structured reasoning, explainability, and symbolic ML over EMBER-derived corpora.

6. Technical Impact and Research Use-Cases

The EMBER series is extensively leveraged for:

  • Comparative ML benchmarking (new models vs. canonical LightGBM/MalConv) (Anderson et al., 2018).
  • Semi-supervised and unsupervised representation learning.
  • Temporal drift and adversarial robustness analyses (with explicit unlabeled and challenge sets).
  • Feature engineering and vectorizer experimentation (hash bucket sizing, alternative encodings).
  • Family/behavior/packer/group labeling for interpretability and threat intelligence.
  • Ontology-driven explainable analytics, concept learning, and neuro-symbolic integration (Švec et al., 18 Mar 2024).
  • Similarity search benchmarking in malware retrieval and clustering (Corlatescu et al., 2023).

Recommendations for advanced use include extending feature pipelines to dynamic analysis, pre-training representation models on unlabeled samples, and applying robust/online learning strategies to new malware families.

7. Distribution, Reproducibility, and Future Expansion

The EMBER family provides open-source codebases for feature extraction, vectorization, and model training (Python modules; Rust for metadata and file acquisition in EMBER2024). Complete splits, feature vectors, tag matrices, and similarity indices are distributed in standardized formats (JSONL, CSV, Parquet/NPZ, OWL/RDF), with reproducibility scripts supporting exact pipeline replication given raw binaries and API-keyed access to VirusTotal.

A plausible implication is that as new binary formats and attack vectors enter the threat landscape, future EMBER releases will continue to increase in scope and semantic richness, underpinning both subsymbolic and symbolic ML research in malware analysis.
