Android Malware Datasets Overview
- Android malware datasets are structured collections of Android applications labeled as malicious or benign, often enriched with family and behavioral annotations.
- They incorporate diverse feature modalities such as static, dynamic, visual, and longitudinal data to support detection, attribution, and concept drift analysis.
- Robust evaluation protocols and bias control techniques in these datasets enable reproducible research and accurate benchmarking in malware security studies.
Android malware datasets are structured, curated corpora of Android applications labeled according to their malicious or benign status, often accompanied by family or behavioral annotations and feature representations suited to machine learning, security analysis, and the benchmarking of detection or classification systems. These datasets underlie almost all empirical progress in Android malware research, supporting detection, attribution, family clustering, dynamic analysis, the study of concept drift, the extraction of attack tactics and techniques, and the reproducibility of results in the academic community.
1. Historical Evolution and Dataset Taxonomy
Early Android malware datasets, such as the Malware Genome Project, focused on small-scale, manually curated collections of malicious APKs labeled by family, often compiled from security forums and early malware reports. Later efforts produced larger-scale, semi-automatically labeled datasets, leveraging sources such as VirusTotal for label assignment; examples include Drebin (5,560 malware samples, 179 families, 2010–2012 window), AndroZoo (over 12 million apps with regular updates), VirusShare, and the Android Malware Dataset (AMD) (Kouliaridis et al., 2020).
Dataset labels evolved in granularity: many datasets offer only binary labels (“malware” vs. “benign”), while others provide family labels (e.g., Drebin and AMD) or, more recently, multi-label behavioral annotations derived from frameworks such as MITRE ATT&CK (Arikkat et al., 20 Mar 2025). A further trend is the inclusion of both static (manifest features, permissions, code) and dynamic (runtime traces, API calls, network traffic, device state) aspects, as in CCCS-CIC-AndMal2020 (400K apps, static and dynamic features, 14 categories, 180 families) or the MalVis and LAMDA datasets, which add new modalities and longitudinal scope (Fiky et al., 2021, Makkawy et al., 17 May 2025, Haque et al., 24 May 2025).
A representative taxonomy:
Dataset Name | Year(s) | Label Granularity | Modality | Size |
---|---|---|---|---|
Malware Genome Project | 2010–11 | Family (manual) | Static | ~1K malware |
Drebin | 2010–12 | Family, Binary | Static | 5.6K malware |
AMD | 2010–16 | Family, Binary | Static | 24.5K malware |
VirusShare | 2012–20 | Binary | Static | 100K+ |
AndroZoo | 2012–20 | Binary | Static | 12M+ |
CCCS-CIC-AndMal2020 | 2020 | Category, Family | Static, Dyn | 400K |
LAMDA | 2013–25 | Family, Singleton | Static | 1M+ |
MalVis | 2025 | Family, Binary | Visual | 1.3M images |
2. Dataset Construction, Labeling, and Bias
Labeling strategies have major implications for both research validity and detector performance. Most recent datasets apply VirusTotal thresholds to determine malware status (e.g., a sample flagged by ≥2 engines is labeled “malicious”), but this parameter varies widely: Drebin uses 2 of 10 selected engines, AMD ≥28 engines, Piggybacking ≥1, and TESSERACT 4 (Lin et al., 2022). These variations can yield accuracy/recall differences of up to 21.5% on the same data.
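As an illustration of this thresholding step, the following minimal sketch labels a sample from a simplified VirusTotal-style report; the report structure and the default threshold are assumptions for illustration, not a prescription from any particular dataset.

```python
def label_from_vt_report(report: dict, threshold: int = 2) -> str:
    """Assign a label from per-engine verdicts, e.g. {"EngineA": "Trojan", "EngineB": None}.

    A sample is labeled 'malware' when at least `threshold` engines flag it;
    datasets use very different thresholds, which is the source of bias noted above.
    """
    detections = sum(1 for verdict in report.values() if verdict is not None)
    return "malware" if detections >= threshold else "benign"


# Hypothetical report: two engines flag the sample, so it passes a threshold of 2
print(label_from_vt_report({"EngineA": "Trojan.AndroidOS", "EngineB": "Adware", "EngineC": None}))
```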
Dataset family composition is crucial: certain families dominate (e.g., the top 3 families in MalGenome constitute 70% of samples), and experiments show that detection accuracy decreases as the number of families increases, with up to 20% variation depending on which families are included. Poor partitioning, where train/test splits have imbalanced or disjoint family coverage, can result in dramatic overestimates or underestimates of detection performance. Deliberate deduplication and stratified sampling have become standard recommendations to control such biases (Surendran, 2021, Lin et al., 2022, Alam et al., 11 Sep 2024).
Additionally, “contaminant” samples (malware hiding in the benign set, or vice versa) can lead to significantly underestimated error rates if they are not systematically removed. Positive and Unlabeled (PU) learning frameworks, such as PUDroid, directly address this issue by calibrating label confidence and automatically flagging possible contaminants (Sun et al., 2017).
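PUDroid's exact procedure is not reproduced here, but the general positive-unlabeled idea can be sketched with a simple bagging scheme: classifiers are repeatedly trained on the known malware against bootstrap samples of the nominally benign pool, and averaged out-of-bag scores highlight "benign" samples that behave suspiciously like malware. The function name and classifier choice below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pu_contaminant_scores(X_pos, X_unlabeled, n_rounds=100, seed=0):
    """Average out-of-bag malware scores for nominally benign samples (numpy arrays).

    High scores indicate possible contaminants (malware hiding in the benign set).
    """
    rng = np.random.default_rng(seed)
    n_pos, n_unl = len(X_pos), len(X_unlabeled)
    y = np.concatenate([np.ones(n_pos), np.zeros(n_pos)])    # balanced labels each round
    score_sum, score_cnt = np.zeros(n_unl), np.zeros(n_unl)
    for _ in range(n_rounds):
        idx = rng.choice(n_unl, size=n_pos, replace=True)     # bootstrap "negatives"
        clf = DecisionTreeClassifier(random_state=0).fit(
            np.vstack([X_pos, X_unlabeled[idx]]), y)
        oob = np.setdiff1d(np.arange(n_unl), idx)              # left-out benign samples
        score_sum[oob] += clf.predict_proba(X_unlabeled[oob])[:, 1]
        score_cnt[oob] += 1
    return score_sum / np.maximum(score_cnt, 1)
```

Samples in the benign pool whose averaged score exceeds a chosen cutoff would then be queued for manual relabeling or removal.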
3. Feature and Dataset Modalities
The modality of data determines which behavioral and structural traits are available for analysis:
- Static Datasets: Extracted from APKs without execution; features include permissions, manifest fields, API usage, bytecode n-grams, intent filters, components, and more (Drebin, Malware Genome, AMD).
- Dynamic Datasets: Capture runtime behavior, including resource usage, network flows, API call traces, system logs, and process activity, often collected under emulation or real device execution (CCCS-CIC-AndMal2020, (Massarelli et al., 2017, Papadopoulos et al., 2023, Sharma et al., 3 Mar 2025)).
- Visual Datasets: Raw bytecode is rendered as images (e.g., via entropy and n-gram encoded RGB channels in MalVis), enabling the application of CNNs and visual interpretability (Makkawy et al., 17 May 2025); a minimal encoding sketch follows this list.
- Longitudinal Datasets: Chronologically capture app evolution, enabling concept drift studies; LAMDA spans 2013–2025 with over 1,380 families and 1M+ samples, supporting temporal analysis of concept and explanation drift (Haque et al., 24 May 2025).
- Multi-Label Behavioral Datasets: Map apps directly to MITRE ATT&CK Tactics, Techniques, and Procedures (TTPs), as in DroidTTP (Arikkat et al., 20 Mar 2025).
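To make the visual modality concrete, here is a minimal sketch of one possible byte-to-image encoding, in which raw bytes fill one channel and block-wise Shannon entropy fills another. This is an illustrative scheme under simple assumptions, not the exact MalVis pipeline.

```python
import numpy as np

def bytes_to_rgb(data: bytes, width: int = 256, block: int = 64) -> np.ndarray:
    """Render a DEX/bytecode stream as an RGB image (raw bytes, block entropy, zeros)."""
    raw = np.frombuffer(data, dtype=np.uint8).astype(np.float64)
    ent = np.zeros_like(raw)
    for i in range(0, len(raw), block):                        # block-wise Shannon entropy
        chunk = raw[i:i + block].astype(np.int64)
        p = np.bincount(chunk, minlength=256) / len(chunk)
        p = p[p > 0]
        ent[i:i + block] = -(p * np.log2(p)).sum() / 8.0 * 255  # scale 0..8 bits to 0..255
    pad = (-len(raw)) % width                                   # pad so every row is complete
    raw, ent = np.pad(raw, (0, pad)), np.pad(ent, (0, pad))
    h = len(raw) // width
    return np.stack([raw.reshape(h, width), ent.reshape(h, width),
                     np.zeros((h, width))], axis=-1).astype(np.uint8)
```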
Effective datasets explicitly enumerate extraction pipelines, feature schemas, label schemes, and provide raw metadata (hashes, timestamps, VirusTotal reports, etc.) for reproducibility.
4. Evaluation Protocols and Methodological Challenges
Variation in evaluation protocols has historically produced inconsistent results. Classic approaches use k-fold cross-validation or random holdout splits. Recent work emphasizes time-aware or sliding-window splits, where models are trained on past data and tested on future data to mimic operational deployment and capture concept drift (Alam et al., 11 Sep 2024, Haque et al., 24 May 2025).
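A minimal sketch of such a time-aware (sliding-window) split follows, assuming each app carries a first-seen year; the window lengths are illustrative.

```python
import numpy as np

def sliding_window_splits(years, train_span=2, test_span=1):
    """Yield (train_idx, test_idx) pairs in which training data strictly precedes test data.

    `years` is an array-like of first-seen years, one entry per sample.
    """
    years = np.asarray(years)
    for start in range(int(years.min()), int(years.max()) - train_span - test_span + 2):
        train_idx = np.where((years >= start) & (years < start + train_span))[0]
        test_idx = np.where((years >= start + train_span) &
                            (years < start + train_span + test_span))[0]
        if len(train_idx) and len(test_idx):
            yield train_idx, test_idx


# Example: train on two consecutive years, evaluate on the following year
for tr, te in sliding_window_splits([2015, 2016, 2016, 2017, 2018, 2019, 2020]):
    pass  # fit on tr, evaluate on te
```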
The presence of duplicates or semantically similar apps within and across dataset splits inflates reported performance. Studies have shown that the detection rate (true positive rate) of API-based classifiers drops from 0.95 to 0.91 once exact duplicates are removed (ε = 0) and falls further as the semantic-similarity threshold is raised. Clustering algorithms based on opcode subsequences and the Ochiai coefficient are applied to filter such cases, reducing this form of bias (Surendran, 2021).
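The cited pipeline clusters apps by opcode subsequences; the sketch below shows only the Ochiai coefficient itself plus a naive greedy filter over it, with ε assumed to mean "drop a sample whose similarity to a kept sample is at least 1 − ε".

```python
from math import sqrt

def ochiai(a: set, b: set) -> float:
    """Ochiai coefficient between two opcode-subsequence sets: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def greedy_dedup(samples: dict, eps: float = 0.0) -> list:
    """Keep a sample only if it is not a (near-)duplicate of one already kept.

    With eps = 0 only exact duplicates (similarity 1.0) are removed; larger eps
    also removes close variants, mirroring the thresholds discussed above.
    """
    kept = []
    for name, feats in samples.items():           # samples: app name -> opcode-subsequence set
        if all(ochiai(feats, samples[other]) < 1.0 - eps for other in kept):
            kept.append(name)
    return kept
```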
Dataset bias can also arise from differences in labeling standards, family imbalance, or adversarial manipulation, and recent studies call for “machine learning fairness frameworks” and detailed dataset reporting to ensure sound inter-paper comparisons and credible conclusions (Lin et al., 2022).
5. Diverse Application Scenarios and Benchmarking
Android malware datasets are utilized in a wide array of research settings:
- Binary and Multiclass Detection: Most studies focus on distinguishing benign/malicious apps (binary) (Li et al., 2018, Chavan et al., 2019), or classifying malware into families or categories (multiclass) (Fan et al., 2021, Fiky et al., 2021).
- Family Attribution & Clustering: Clustering on features such as API call relationships and malicious payload mining uncovers fine-grained family structure and variant evolution, overcoming the effect of repackaging and third-party libraries (Li et al., 2017).
- Feature Importance Analysis: Information Gain (IG), chi-square, and contemporary measures such as crRelevance and NMRS are used to rank feature salience (permissions, intents, opcodes, etc.), with notable temporal shifts: intents have grown in importance in newer datasets, whereas permissions dominated older ones (Kouliaridis et al., 2020, Sharma et al., 3 Mar 2025); a ranking sketch appears after this list.
- Concept Drift and Longitudinal Analysis: LAMDA enables the study of performance decay over years, with detection F1 scores declining from approximately 97.5% to 47.2% as the gap between training and test periods grows (Haque et al., 24 May 2025).
- Interpretation and Attribution: Large datasets mapped to ATT&CK TTPs enable not only detection but also insight into adversary tactics via machine learning (Problem Transformation Approach, Label Powerset with XGBoost, fine-tuned LLMs) (Arikkat et al., 20 Mar 2025).
- Dynamic Analysis and Confidence Guarantees: Datasets with device-state traces power semi-automated detection with conformal prediction and label-conditional confidence sets (Papadopoulos et al., 2023); a conformal sketch also follows this list.
- Benchmarking and Fair Comparison: Recent systematic reviews demonstrate that, when hyperparameter tuning and bias controls are properly executed, traditional models like Random Forest and XGBoost often outperform or match more complex deep learning architectures, challenging prior claims in the literature (Alam et al., 11 Sep 2024, Liu et al., 20 Feb 2025).
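For the feature-importance bullet above, here is a minimal ranking sketch using chi-square and mutual information (a standard proxy for Information Gain) from scikit-learn; the feature matrix and feature names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

# Hypothetical binary feature matrix: rows = apps, columns = permissions/intents
feature_names = ["SEND_SMS", "READ_CONTACTS", "INTERNET", "BOOT_COMPLETED"]
X = np.array([[1, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 1, 0]])
y = np.array([1, 0, 1, 0])                       # 1 = malware, 0 = benign

chi2_scores, _ = chi2(X, y)
ig_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

for name, c, ig in sorted(zip(feature_names, chi2_scores, ig_scores),
                          key=lambda t: -t[1]):   # rank by chi-square, descending
    print(f"{name:15s} chi2={c:.3f}  MI={ig:.3f}")
```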
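And for the conformal-prediction bullet, a minimal sketch of label-conditional (Mondrian) split conformal sets built on top of any probabilistic classifier; the calibration interface and the 1 − p nonconformity score are assumptions, not the exact procedure of the cited work.

```python
import numpy as np

def label_conditional_thresholds(cal_probs, cal_labels, alpha=0.1):
    """Per-class nonconformity thresholds from a held-out calibration set.

    cal_probs: (n, n_classes) predicted probabilities; cal_labels: true class indices.
    """
    cal_probs, cal_labels = np.asarray(cal_probs), np.asarray(cal_labels)
    thresholds = {}
    for c in np.unique(cal_labels):
        scores = 1.0 - cal_probs[cal_labels == c, c]      # nonconformity = 1 - P(true class)
        n = len(scores)
        k = min(int(np.ceil((n + 1) * (1 - alpha))), n)   # conformal quantile rank
        thresholds[int(c)] = np.sort(scores)[k - 1]
    return thresholds

def prediction_set(probs, thresholds):
    """Return every label whose nonconformity score is within its class-specific threshold."""
    return [c for c, t in thresholds.items() if 1.0 - probs[c] <= t]
```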
A summary table of recent dataset types and use cases follows:
Dataset | Main Modalities | Primary Usage | Notable Features |
---|---|---|---|
Drebin | Static, family-labeled | Binary/family detection | Eight feature sets, 2010–12 |
CCCS-CIC-AndMal2020 | Static/Dynamic | Category & family detection | API, process, battery, 141 features |
LAMDA | Static, longit., family | Concept drift, explainability | 1M+ apps, 12 years, 1,380 families |
MalVis | Visual (RGB images) | CNN-based malware recognition | 1.3M images, entropy/N-gram |
CorrNetDroid | Dynamic (network flows) | Efficient feature selection | NMRS, crRelevance, 2 features |
DroidTTP | Static, multi-label | TTP mapping (MITRE ATT&CK) | Label Powerset, LLMs, explain. |
6. Public Accessibility, Reproducibility, and Tooling
Researchers emphasize public release of datasets, with rich metadata—app IDs, hashes, labels, year, and sometimes feature matrices—to enable reproducibility (Liu et al., 20 Feb 2025). Open-source codebases support extensibility and retrospective analysis (examples include MalDozer, LAMDA, and benchmarking repositories in recent literature).
These frameworks are typically modular: one component handles dataset loading, deduplication, and transformation; another the ML/DL models; and a third task orchestration (offline, active, or continual learning), with configuration managed in a versionable format such as YAML (Alam et al., 11 Sep 2024). Feature matrices are stored in efficient (often sparse) formats, and full archival of code and artifacts is increasingly the norm for top-tier benchmarks.
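As a sketch of this sparse-storage convention, the snippet below round-trips a small binary feature matrix through SciPy's CSR format; the feature values and file name are illustrative.

```python
import numpy as np
from scipy import sparse

# Hypothetical binary static-feature matrix: rows = apps, columns = features
X = sparse.csr_matrix(np.array([[1, 0, 1, 0],
                                [0, 1, 0, 0],
                                [1, 1, 1, 1]], dtype=np.int8))

sparse.save_npz("features.npz", X)          # compact, versionable artifact alongside the code
X_restored = sparse.load_npz("features.npz")
assert (X_restored - X).nnz == 0            # round-trip check: no differing entries
```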
7. Future Directions and Active Challenges
The field continues to grapple with evolving challenges including:
- Label quality and updating: Ensuring that labels reflect the current threat landscape, especially as malware becomes more stealthy and the number of ambiguous or singleton samples grows (Haque et al., 24 May 2025).
- Family and behavioral annotation: Enriching datasets with high-fidelity multi-label or multi-view annotations, such as those based on the MITRE ATT&CK framework (Arikkat et al., 20 Mar 2025).
- Concept and explanation drift: Quantitatively tracking and adapting to month-on-month changes in feature distributions and explanatory attributions, leveraging metrics such as the Jeffreys divergence and the Jaccard and Kendall distances (Haque et al., 24 May 2025); a divergence sketch follows this list.
- Hybrid and cross-modal benchmarking: Combining static, dynamic, visual, and network-based datasets, and coordinating benchmarks across all modalities—a necessary step given the sophistication of contemporary malware.
- Cost-effective, timely, and explainable systems: Balancing the trade-off between high-complexity deep learning systems and more efficient traditional models, with a growing emphasis on providing actionable explanations and risk confidence at inference time (Papadopoulos et al., 2023, Liu et al., 20 Feb 2025).
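As an illustration of the drift-metric bullet above, here is a minimal sketch of the Jeffreys (symmetrized KL) divergence between two monthly feature-frequency distributions; the smoothing constant is an assumption to avoid division by zero.

```python
import numpy as np

def jeffreys_divergence(p, q, eps=1e-12):
    """Symmetrized KL divergence KL(P||Q) + KL(Q||P) between two frequency vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum((p - q) * np.log(p / q)))


# Example: feature-usage frequencies in two adjacent months drift slightly
print(jeffreys_divergence([0.50, 0.30, 0.20], [0.35, 0.40, 0.25]))
```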
In conclusion, Android malware datasets have evolved from small-scale, manual, static collections to extremely large, multi-modal, longitudinal resources supporting rigorous and reproducible research on detection, attribution, drift adaptation, and behavioral intelligence. Dataset construction, labeling standards, deduplication, feature extraction, and sharing practices are central determinants of research reliability, and are increasingly subject to principled scrutiny and methodological advancement. These datasets now serve as the backbone for both empirical benchmarking and the theoretical understanding of the adversarial dynamics in the Android malware ecosystem.