
Malware Source Code Dataset Research

Updated 7 December 2025
  • Malware source code datasets are curated corpora of annotated malicious code that enable empirical research into malware evolution and vulnerability propagation.
  • They incorporate detailed code metrics like SLOC, cyclomatic complexity, and maintainability indices to support rigorous quantitative software analysis.
  • Curated using a mix of manual review, automated classification, and LLM-assisted semantic validation, these datasets ensure robust and reproducible security research.

A malware source code dataset is a systematically curated, annotated, and analyzed corpus of source-level representations of malicious software. Such datasets are central to empirical investigations of malware development practices, complexity metrics, code reuse, genealogy, and the intersection of secure programming and automated analysis. Modern collections range from highly curated, manually validated repositories to large-scale, multimodal corpora supporting machine learning on code. They form the empirical backbone for research in software engineering, malware evolution, vulnerability inheritance, code summarization, and security analytics.

1. Dataset Structure and Composition

Malware source code datasets vary in scale, class coverage, language support, and representation. Exemplars include “MalSource” (456 samples, 428 families, 1975–2016, multi-language) (Calleja et al., 2018), “MASCOT” (6,032 manually curated Windows specimens, GitHub-only, 32 languages, 2000–2025) (Li et al., 30 Nov 2025), “SourceFinder” (7,504 GitHub repositories classified via supervised methods, multi-platform, 2010–2018) (Rokon et al., 2020), SBAN (676,151 malware files across binary, assembly, source code, and NL descriptions; ≈18.6% malware in 3.6M overall files) (Jelodar et al., 21 Oct 2025), and MALSIGHT (MalS: 89,609 C functions with LLM-generated summaries; MalP: 500 decompiled pseudocode functions with human NL annotations) (Lu et al., 26 Jun 2024).

Table: Summary Attributes of Key Malware Source Code Datasets

| Dataset | Scale | Languages/Formats | Temporal Coverage |
| --- | --- | --- | --- |
| MalSource | 456 samples, 428 families | C/C++, ASM, scripting | 1975–2016 |
| MASCOT | 6,032 specimens | 32 (Python, C++, etc.) | 2000–2025 |
| SourceFinder | 7,504 repositories | Platform-varied | 2010–2018 |
| SBAN | 676,151 files | C, C++, decompiled | Multiple feeds |
| MALSIGHT | 89,609 functions (MalS) / 500 pseudocode (MalP) | C, decompiled pseudocode | ~2022–2024 |

Most contemporary datasets provide per-sample metadata (specimen/repo ID, family, category, language, file count, SLOC, code quality metrics), and are organized as directory trees or indexed archives with machine-parseable description files.
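To make the per-sample metadata concrete, the following is a minimal sketch of a machine-parseable description record and loader. The field names (`specimen_id`, `family`, etc.) are hypothetical stand-ins mirroring the attributes listed above, not the actual schema of any one dataset.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical per-sample metadata record mirroring the attributes most
# datasets expose: specimen/repo ID, family, category, language, file
# count, SLOC, and a code quality metric.
@dataclass
class SampleMeta:
    specimen_id: str
    family: str
    category: str
    language: str
    file_count: int
    sloc: int
    avg_cyclomatic_complexity: float

def load_meta(json_text: str) -> SampleMeta:
    """Parse one machine-readable description file into a typed record."""
    return SampleMeta(**json.loads(json_text))

record = SampleMeta("S-0001", "Mydoom", "Worm", "C++", 14, 5120, 3.2)
roundtrip = load_meta(json.dumps(asdict(record)))
print(roundtrip.family, roundtrip.sloc)
```

A loader of this shape is what lets directory trees of samples be indexed and queried uniformly across families and languages.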

A plausible implication is that the expansion in dataset granularity (function-level, call-graph ordering, cross-modal alignment) supports increasingly sophisticated static and dynamic program analysis, pre-training of LLMs, and cross-layer security analytics.

2. Collection Methodologies and Curation Practices

Sourcing malware source code employs multi-pronged strategies: mining historical archives (VX Heaven, e-zines), targeted GitHub queries (malware keyword lists), snowball sampling via reference following, and filtering candidates with static program analysis and manual review.

The “MalSource” and “MASCOT” datasets employ manual verification—compilation tests, metadata validation, and elimination of trivial or non-malicious code blobs (Calleja et al., 2018, Li et al., 30 Nov 2025). “SourceFinder” demonstrates automated supervised classification of repos, extracting feature vectors from text fields (title, description, README, filenames) and applying Multinomial Naive Bayes to achieve 89% precision and 86% recall on labeled data (Rokon et al., 2020). Deduplication and pruning of cosmetic forks are standard in MASCOT.
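The SourceFinder-style classification step can be sketched with a minimal hand-rolled Multinomial Naive Bayes over bag-of-words features. The toy corpus and labels below are illustrative only; SourceFinder's actual feature extraction and training data are described in (Rokon et al., 2020).

```python
import math
from collections import Counter, defaultdict

# Minimal Multinomial Naive Bayes with Laplace smoothing, sketching the
# kind of repo-text classifier SourceFinder describes (features drawn from
# title, description, README, filenames). Corpus and labels are toy data.

def train(docs, labels):
    class_docs = defaultdict(int)       # per-class document counts
    word_counts = defaultdict(Counter)  # per-class token counts
    vocab = set()
    for doc, y in zip(docs, labels):
        class_docs[y] += 1
        toks = doc.split()
        word_counts[y].update(toks)
        vocab.update(toks)
    return class_docs, word_counts, vocab, len(docs)

def predict(model, doc):
    class_docs, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for y in class_docs:
        lp = math.log(class_docs[y] / n)  # log prior
        total = sum(word_counts[y].values())
        for tok in doc.split():
            # Laplace-smoothed per-token log likelihood
            lp += math.log((word_counts[y][tok] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

docs = [
    "keylogger windows hook capture keystrokes stealth",
    "ransomware encrypt files aes demand payment",
    "todo list app react frontend tutorial",
    "machine learning course notes python numpy",
]
labels = ["malware", "malware", "benign", "benign"]
model = train(docs, labels)
print(predict(model, "stealth keylogger capture windows"))
```

The production system adds filtering of forks and trivial blobs before classification; the sketch shows only the Naive Bayes core.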

SBAN and MALSIGHT ingest samples from diverse feeds (BODMAS, SOREL-20M, MalwareBazaar, DIKE, xLangKode) and apply NLP and LLM-based semantic verification on code and assembled samples (Jelodar et al., 21 Oct 2025, Lu et al., 26 Jun 2024).

Common schema features include tagging by malware class, platform, family, and behavioral labels. Provenance (commit hash or first-seen date), file-level and function-level attributes (SLOC, comment ratio, cyclomatic complexity), and compilation/analysis outcomes (VirusTotal or AVClass2 labels, Cppcheck vulnerability scans) are systematically recorded.

This suggests that blending manual review with automated classification and semantic labeling maximizes dataset veracity and metric depth, supporting lineage and vulnerability inheritance analysis.

3. Code Metrics, Software Engineering Quality, and Development Cost Estimation

Malware source code datasets provide detailed static and dynamic code metrics, enabling formal software engineering analysis of malicious development.

Key metrics include:

  • SLOC (Source Lines of Code): Physical, non-comment, non-blank lines. MalSource samples vary from <1K (single-file viruses) to ~180K (complex RATs/botnets) (Calleja et al., 2018).
  • Cyclomatic Complexity (CC): CC = E - N + 2P, calculated on each function’s CFG (E edges, N nodes, P connected components) (Calleja et al., 2018, Li et al., 30 Nov 2025).
  • Function Points (FP): SLOC-based estimates via backfiring (MalSource) or IFPUG guidelines (MASCOT); FP approximates functionality independent of language.
  • Maintainability Index (MI): Oman formula, MI = 100 (171 - 5.2 \ln \overline{V} - 0.23 \overline{CC} - 16.2 \ln \overline{SLOC}) / 171; high MI denotes ease of maintenance (Calleja et al., 2018).
  • Comment-to-Code Ratio (CR): CR = 100 \times (\#\text{comment lines} / SLOC) (Li et al., 30 Nov 2025).
  • Execution Paths (EP): Total acyclic execution paths per specimen.
  • Halstead Volume (V): V = N \cdot \log_2(n), where N is the total count of operators and operands and n the program vocabulary size, used for module vocabulary analysis.
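The metric formulas above can be collected into a short sketch. The example inputs (N = 200 tokens, n = 32 vocabulary, mean CC = 3.5, mean SLOC = 250) are hypothetical values chosen for illustration, not figures from any of the cited datasets.

```python
import math

def halstead_volume(total_ops_operands: int, vocabulary: int) -> float:
    """Halstead volume: V = N * log2(n)."""
    return total_ops_operands * math.log2(vocabulary)

def maintainability_index(mean_v: float, mean_cc: float, mean_sloc: float) -> float:
    """Oman MI, normalized to the 0-100 scale as in MalSource:
    MI = 100 * (171 - 5.2 ln V - 0.23 CC - 16.2 ln SLOC) / 171."""
    mi = 171 - 5.2 * math.log(mean_v) - 0.23 * mean_cc - 16.2 * math.log(mean_sloc)
    return 100 * mi / 171

def comment_ratio(comment_lines: int, sloc: int) -> float:
    """CR = 100 * comment_lines / SLOC."""
    return 100 * comment_lines / sloc

v = halstead_volume(200, 32)   # 200 * log2(32) = 1000
mi = maintainability_index(mean_v=v, mean_cc=3.5, mean_sloc=250)
print(round(v, 1), round(mi, 1))
```

Note that MI uses the natural logarithm while Halstead volume uses log base 2; mixing the two is a common implementation error.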

Development cost (COCOMO Organic Model):

  • Effort (person-months): E = a \cdot (KLOC)^{b}, with typical a = 2.4, b = 1.05 [Boehm 1981].
  • Time (months): D = 2.5 \cdot E^{0.38}
  • Team size: P = E / D
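The three COCOMO organic-model equations compose directly; a minimal sketch, with an illustrative 32 KLOC codebase standing in for a mid-size specimen:

```python
def cocomo_organic(kloc: float, a: float = 2.4, b: float = 1.05):
    """COCOMO organic-mode estimates (Boehm 1981): effort in person-months,
    schedule in months, and average team size."""
    effort = a * kloc ** b        # E = a * KLOC^b
    time = 2.5 * effort ** 0.38   # D = 2.5 * E^0.38
    team = effort / time          # P = E / D
    return effort, time, team

# Illustrative: a 32 KLOC specimen (the KLOC value is hypothetical)
e, d, p = cocomo_organic(32.0)
print(f"effort={e:.1f} PM, schedule={d:.1f} months, team={p:.1f}")
```

Because b > 1, effort grows slightly superlinearly with size, which is why the exponential SLOC growth reported below translates into even faster growth in estimated development effort.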

Datasets record trends in complexity, demonstrating exponential growth (doubling every 5–6.5 years) in SLOC, function points, and estimated effort through the 1980s–2010s (Calleja et al., 2018). MASCOT finds that while scale peaked around 2016, code complexity (average CC rising from ~2 to ~3.7, EP from ~312 to ~502) and standardization have increased, with development schedules and team sizes shrinking in step with mainstream engineering trends (Li et al., 30 Nov 2025).

A plausible implication is that malware projects reflect modern software practices, balancing compactness and high internal complexity to achieve rapid development cycles.

4. Code Reuse, Genealogy, and Vulnerability Propagation

Clone detection and lineage analysis are integral to understanding malware ecosystem dynamics. Techniques such as AST-based clustering (Deckard), normalized pairwise diff (Ratcliff-Obershelp), and function-level clone matching quantify code reuse and evolutionary relationships (Calleja et al., 2018, Li et al., 30 Nov 2025).

Clone Categories (MalSource):

  • Operational utilities/data structures
  • Core artifacts (infection, propagation)
  • Static data blocks (IP lists, passwords)
  • Anti-analysis modules (packers, AV-killers)

MASCOT applies multi-view genealogy: specimen–specimen connection strength S_{i,j} = \sum_{f \in F_i} \sum_{g \in F_j} w(f,g) (where w(f,g) = 1 for Deckard-cluster matches), and directionality D_{i \rightarrow j} by timestamp, mapping overall and detailed lineage DAGs as clusters of functional reuse (Li et al., 30 Nov 2025). Early viruses/worms seed numerous subsequent malware classes (Grayware, Keylogger, Ransomware), with function-level reuse and vulnerability inheritance prominent; e.g., 65% of Mydoom-derived specimens retain specific CWEs vs. a 7.1% baseline, indicating propagation of both functionality and flaws.
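The connection-strength sum is straightforward to compute once functions are mapped to clone clusters. In this sketch, hard-coded cluster IDs stand in for real Deckard output; the function and cluster names are invented for illustration.

```python
# Sketch of a MASCOT-style specimen-specimen connection strength:
# S_ij = sum over pairs (f in F_i, g in F_j) of w(f, g),
# with w(f, g) = 1 when f and g fall in the same clone cluster.

def connection_strength(funcs_i, funcs_j, cluster_of):
    """funcs_*: function IDs per specimen; cluster_of: func -> clone cluster ID."""
    return sum(
        1
        for f in funcs_i
        for g in funcs_j
        if cluster_of.get(f) is not None and cluster_of.get(f) == cluster_of.get(g)
    )

# Illustrative clusters: two shared routines between specimens, one unique each.
cluster_of = {"a1": "c-infect", "a2": "c-spread", "a3": "c-misc",
              "b1": "c-infect", "b2": "c-spread", "b3": "c-other"}
s = connection_strength(["a1", "a2", "a3"], ["b1", "b2", "b3"], cluster_of)
print(s)  # two matching cluster pairs -> S = 2
```

Directionality D_{i→j} would then be assigned by comparing specimen timestamps (edges point from the earlier specimen to the later one), yielding the lineage DAG.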

Most cloning occurs among samples released within 1–4 years—a pattern consistent with toolkit development, group-shared codebases, or rapid forking (Calleja et al., 2018).

This suggests evolutionary “core” malware serve as progenitors for broader code ecosystems, and vulnerabilities persist through code lineage unless actively refactored.

5. Multimodal, Machine Learning, and Summarization Extensions

Recent datasets address cross-representation learning, code summarization, and automated mining via multimodal structuring. SBAN provides binary, assembly, source code, and NL description for each sample, enabling simultaneous analysis and model training on aligned views (Jelodar et al., 21 Oct 2025). All layers are linked by a unique sample ID, supporting supervised, contrastive, or generative pre-training strategies.
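Cross-modal alignment of this kind reduces to keying every view on a shared sample ID; a minimal sketch follows. The field names and sample content are hypothetical, not SBAN's actual schema.

```python
# Sketch of SBAN-style cross-modal alignment: each sample ID keys binary,
# assembly, source, and NL-description views, so a loader can emit aligned
# tuples for supervised, contrastive, or generative pre-training.

samples = {
    "sha256:aa11": {
        "binary": b"\x4d\x5a\x90\x00",             # MZ header stub (toy bytes)
        "assembly": "push ebp\nmov ebp, esp",
        "source": "int main(void) { return 0; }",
        "description": "Minimal PE stub used as a dropper shell.",
    },
}

def aligned_views(db, modalities=("source", "assembly", "description")):
    """Yield (sample_id, view_tuple) only when all requested modalities exist."""
    for sid, views in db.items():
        if all(m in views for m in modalities):
            yield sid, tuple(views[m] for m in modalities)

pairs = list(aligned_views(samples))
print(pairs[0][0])
```

Filtering to samples where every modality is present is what makes contrastive objectives (matching a source view to its assembly or description) well-defined.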

MALSIGHT’s MalS and MalP enable benchmarking of binary malware summarization: C-sourced and decompiled pseudocode functions with LLM and human annotations, 20 functional categories, reverse call order, dynamic/static tagging, and BLEURT-sum metrics (Lu et al., 26 Jun 2024). MalS functions power end-to-end model fine-tuning for code→NL abstraction, while MalP supports robust evaluation on human-curated pseudocode.

Such datasets routinely apply:

  • NLP tokenization and normalization (BPE ~50K vocabulary in SBAN) (Jelodar et al., 21 Oct 2025)
  • LLM-assisted semantic verification (SBERT, GPT-3.5, CodeBERT, Qwen2.5)
  • Algorithmic annotation/deduplication across modalities

A plausible implication is that scale and alignment across binary, assembly, and source-layer data are now critical for training robust, context-aware code analysis models (e.g., transformer-based malware detection and explanation systems).

6. Accessibility, Licensing, and Ethical Considerations

Dataset availability ranges from research-only (MalSource by request; MASCOT public (Li et al., 30 Nov 2025)) to full open access (SBAN under CC-BY-NC-4.0, SourceFinder and MALSIGHT via GitHub) (Jelodar et al., 21 Oct 2025, Rokon et al., 2020, Lu et al., 26 Jun 2024). Data inherit original repository or dataset licenses (MIT, GPL, Apache, proprietary), and usage is typically restricted to academic or defensive research.

Ethical constraints include prohibition of republishing or deploying active malicious code, mandatory respect for source copyrights, and strict compliance with API or decompiler rate limits (IDA, Ghidra). Collection is confined to publicly visible sources, excluding darknet or honeypot material in the MASCOT schema (Li et al., 30 Nov 2025).

7. Applications and Limitations

Malware source code datasets underpin a spectrum of research and engineering tasks:

  • Large-scale malware classification, feature engineering, and code mining
  • Evolutionary and genealogy analyses (lineage tracing, code reuse mapping)
  • Code summarization and explanation via LLMs
  • Automatic vulnerability detection, security dependency mapping, and function-level risk assessment
  • Ground-truth corpora for benchmarking program analysis and decompilation
  • Threat intelligence correlation (tying source-level code to exploits/binaries)

Known limitations include:

  • Selection bias toward public, English-centric, and GitHub-visible samples (Rokon et al., 2020)
  • Incomplete lineage tracking due to missing forks/related samples
  • Error rates in supervised tagging (SourceFinder: ≈11% missed true malware; MALSIGHT: annotation extractor F1 ≈97%) (Rokon et al., 2020, Lu et al., 26 Jun 2024)
  • GitHub API and rate restriction/coverage limits

These constraints must be recognized when generalizing findings or applying insights to unseen malware classes or code bases.


Malware source code datasets represent a mature, highly structured foundation for empirical malware research—bridging software engineering rigor, evolutionary mapping, multimodal analytics, and secure programming insights across multiple decades, languages, and paradigms. Their continued expansion and refinement foster reproducible investigation into the malware ecosystem, vulnerability propagation, and the operationalization of security-aware machine learning.
