Malware Source Code Datasets
- Malware source code datasets are curated collections of malicious software code that reveal explicit logic, structure, and evolutionary relationships.
- They employ multi-stage filtering, labeling, and clone detection methodologies to ensure data integrity and representative coverage across various malware families.
- These datasets drive advanced research in static analysis, code summarization, and lineage mapping, enhancing detection, attribution, and cybersecurity strategies.
Malware source code datasets are curated collections of the underlying program code used in malicious software. These datasets serve as foundational resources for security research, enabling empirical studies of malware engineering, lineage, code reuse, summarization, and the training and evaluation of analysis tools, including LLMs for code intelligence. Unlike collections of binaries, source code datasets expose explicit logic, structure, and evolutionary relationships, thereby supporting sophisticated static and semantic analyses of adversarial programming.
1. Major Datasets: Composition and Properties
A number of large-scale, systematically curated malware source code datasets have emerged, each providing distinct coverage across operating systems, languages, taxonomy, and granularity.
- MASCOT: Contains 6,032 Windows-based malware specimens from November 2000 to February 2025, encompassing 32 programming languages (Python, C++, C, C#, Assembly, Go, HTML, PowerShell, Rust, and others). Malware is labeled by class (12 AVClass2/ClarAVy-based categories), family, behavior, vulnerabilities (CWEs), packer, and FUD (fully undetectable) status. Each sample’s scale and code quality are recorded using metrics such as SLOC, function points, cyclomatic complexity, and execution paths. Genealogical structure is established via function-level code clone analysis, quantifying directed code reuse among families and specimens (Li et al., 30 Nov 2025).
- MalSource: Spanning from 1975 to 2016 with 456 samples from 428 families, MalSource focuses on “classic” malware (viruses, worms, RATs, exploit kits) obtained from leaks, underground forums, and public archives. Assembly and C/C++ dominate (92% of samples), with additional coverage for Visual Basic, Delphi/Pascal, Python, and interpreted languages. Metadata includes detailed build instructions, type/family labels, and language breakdowns (Calleja et al., 2018).
- SourceFinder: Identifies 7,504 malware source code repositories from the October 2019 GitHub corpus (32M repositories), using a supervised bag-of-words model for repository metadata classification (89% precision, 86% recall). Encompasses C/C++, Python, Java, JavaScript, PHP, Go, and other languages, with broad platform coverage (Windows, Linux, MacOS, IoT, Android). Annotation includes type (keylogger, ransomware, etc.) and target platform; repository-level social metrics are provided (Rokon et al., 2020).
- CAMA (Android): Curates 118 Android APKs (13 families), decompiled to yield 7.54 million distinct Java methods. Samples are deduplicated at category and family levels to ensure representative coverage. Each method’s source body, line count, and metadata are included for function-level analysis or summarization (He et al., 1 Apr 2025).
- MalS/MalP (Summarization Benchmarks): MalS comprises 89,609 C functions with LLM-generated, expert-refined natural language summaries, derived from 2,289 GitHub malware repos found primarily via SourceFinder. MalP is a benchmark of 500 hand-annotated IDA-style pseudocode functions from 20 real malware families, with associated call graphs and expert-generated summaries. Both are designed for evaluation and training of malware summarization systems (Lu et al., 2024).
| Dataset | Specimens / Units | Span | Platforms/Types | Languages |
|---|---|---|---|---|
| MASCOT | 6,032 specimens | 2000–2025 | Win (12 AVClass2/ClarAVy) | 32 (Py, C++, C, C#, ASM, …) |
| MalSource | 456 samples (428 fam.) | 1975–2016 | Viruses, worms, RATs, etc. | 14+ (ASM, C/C++, VB, ...) |
| SourceFinder | 7,504 repos | 2008–2019 | Windows, Linux, Mac, IoT, etc. | C/C++, Py, Java, JS, … |
| CAMA | 118 APKs, 7.5M methods | – | Android (13 families) | Java (decompiled) |
| MalS/MalP | 89,609 fns / 500 fns | – | C-source, IDA pseudocode | C |
2. Curation and Validation Methodologies
Dataset integrity, ground truth, and representativeness are enforced through multi-stage pipelines:
- Repository and Keyword Mining: GitHub and public archives are queried via extensive, multi-lingual keyword lists (family names, malware classes, behaviors), with ranking and coverage maximized through query permutations and source cross-linking (Li et al., 30 Nov 2025, Rokon et al., 2020).
- Automated and Manual Filtering: Specimens are regularly verified for authentic source code presence (README, directory analysis), with trivial forks and stubs dropped. Manual review eliminates duplicates, non-code artifacts, and demo/sampleware with no observed malicious functionality (Li et al., 30 Nov 2025, Calleja et al., 2018).
- Labeling and Metadata Augmentation: Labels such as malware class, behavioral description, family, vulnerability (CWE), packer, and detection-resistant (FUD) status are propagated from VirusTotal/AVClass2, GitHub, or annotator analysis. Not every sample obtains full label coverage (e.g., only 9.6% of MASCOT samples acquire the family label) (Li et al., 30 Nov 2025).
- Source-Only Distribution: No binaries are distributed; datasets release only disarmed source code (payloads and sensitive data stripped), often requiring users to agree to research-only licensing with restrictions on redistribution, re-compilation, and live deployment (Calleja et al., 2018, Li et al., 30 Nov 2025, Rokon et al., 2020).
- Deduplication and Category Balancing: For Android (CAMA), near-duplicate APKs are eliminated based on size and method count similarity within families to achieve balanced, representative coverage (He et al., 1 Apr 2025).
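The filtering stages above can be sketched as a minimal pipeline. This is an illustrative sketch only: the keyword list, repository fields, and thresholds are invented for the example and are not taken from the cited curation pipelines, which use far larger multi-lingual keyword lists and supervised classifiers.

```python
from dataclasses import dataclass

# Illustrative keyword list; real pipelines use extensive multi-lingual
# lists of family names, malware classes, and behaviors.
MALWARE_KEYWORDS = {"keylogger", "ransomware", "botnet", "rat", "stealer"}

@dataclass
class Repo:
    name: str
    readme: str
    source_files: int   # number of recognized source files
    is_fork: bool

def keyword_hits(repo: Repo) -> int:
    """Count malware-related keywords in the repo name and README."""
    text = f"{repo.name} {repo.readme}".lower()
    return sum(1 for kw in MALWARE_KEYWORDS if kw in text)

def passes_filter(repo: Repo, min_hits: int = 1, min_files: int = 3) -> bool:
    """Drop trivial forks, stubs with too few source files, and keyword misses."""
    if repo.is_fork or repo.source_files < min_files:
        return False
    return keyword_hits(repo) >= min_hits

repos = [
    Repo("py-keylogger", "A simple keylogger in Python", 5, False),
    Repo("hello-world", "Demo repo", 1, False),
    Repo("ransomware-src", "Leaked ransomware source", 12, True),  # fork: dropped
]
kept = [r.name for r in repos if passes_filter(r)]
print(kept)  # ['py-keylogger']
```

In the real systems this keyword stage is only a candidate generator; manual review and metadata classification follow, as described above.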
3. Software Engineering, Analytical, and Security Metrics
Advanced metrics cover size, cost, quality, and reuse, with all formulas fully specified:
- Size:
- Source lines of code (SLOC): Computed excluding comments/blanks.
- Files: Number of source files per sample.
- Function points (FP): User-centric size measure derived from empirical SLOC-per-FP tables, FP = SLOC / L, where L is the language-specific SLOC-per-FP ratio (Calleja et al., 2018).
- Development Cost (COCOMO):
- Basic effort: E = a · KSLOC^b person-months (organic mode: a = 2.4, b = 1.05).
- Time: T = c · E^d months (organic mode: c = 2.5, d = 0.38).
- Team size: P = E / T (Li et al., 30 Nov 2025, Calleja et al., 2018).
- Quality:
- Comment-to-Code Ratio: CCR = comment lines / SLOC.
- Cyclomatic Complexity: V(G) = e − n + 2p, where e, n, and p are the edges, nodes, and connected components of the control-flow graph.
- Maintainability Index (MalSource): MI = 171 − 5.2·ln(HV) − 0.23·CC − 16.2·ln(SLOC), with HV the Halstead Volume and CC the mean cyclomatic complexity.
- Execution paths, call graph analytics, and system call/API enumeration (Li et al., 30 Nov 2025, Calleja et al., 2018).
- Vulnerability/Dependency:
- CWE detection (e.g., CWE-398, CWE-561, CWE-476) via Cppcheck.
- System and API call extraction using extensive Windows signatures and syscall lists, with epoch stratification for historical analysis (Li et al., 30 Nov 2025).
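As a concrete illustration, the size and cost metrics above can be computed in a few lines. The Basic COCOMO organic-mode coefficients (a = 2.4, b = 1.05, c = 2.5, d = 0.38) are the standard published values; the sample code fragment and the line-comment-only counter are simplifications for the example.

```python
def sloc_and_comments(source: str) -> tuple:
    """Count non-blank source lines and C-style // line comments (a simplification:
    block comments and inline comments are ignored here)."""
    sloc = comments = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("//"):
            comments += 1
        else:
            sloc += 1
    return sloc, comments

def cocomo_basic(sloc: int, a=2.4, b=1.05, c=2.5, d=0.38):
    """Basic COCOMO, organic mode: effort (person-months), time (months), team size."""
    ksloc = sloc / 1000.0
    effort = a * ksloc ** b
    time = c * effort ** d
    return effort, time, effort / time

sample = """\
// decrypt payload
int key = 42;
for (int i = 0; i < n; i++)
    buf[i] ^= key;
"""
sloc, comments = sloc_and_comments(sample)
print(sloc, comments, round(comments / sloc, 2))  # 3 1 0.33
effort, time, team = cocomo_basic(20_000)
print(round(effort, 1), round(time, 1), round(team, 1))
```

For a 20 KSLOC specimen this yields an effort of roughly 56 person-months over about 11.5 months, the kind of figure the cited studies use to argue that large malware projects approach mainstream software-engineering scale.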
4. Code Reuse, Genealogy, and Evolutionary Analysis
Rigorous clone detection and genealogy mapping underpin malware lineage studies:
- Clone Detection:
- Function-level: Deckard (AST-based clustering), LSH for high-volume clustering, and minimum clone size thresholds (100 AST tokens in MalSource) (Calleja et al., 2018, Li et al., 30 Nov 2025).
- Textual similarity: Ratcliff–Obershelp for language-agnostic matching, with SLOC-based cutoffs for clone reporting.
- Post-filtering for boilerplate, include guards, and macro-generated code (Calleja et al., 2018).
- Genealogy Construction (MASCOT):
- Per-sample code reuse weights aggregate clone counts into edges strictly directed by commit timestamp (the earlier-committed specimen is treated as the ancestor).
- Visualization as weighted, directed graphs at category and specimen level. Edges annotated with function tag-sets provide interpretable semantic inheritance (Li et al., 30 Nov 2025).
- Ancestor-descendant chains recover multi-decade lineages; e.g., X0R-USB’s routines reused through 2021, with FUD children inheriting unrefactored parent code.
- Empirical Findings:
- Reuse drives persistent code vulnerabilities: e.g., Mydoom-derived specimens share CWE-467 (65% vs. 7.1% overall), supporting the view of exploit inheritance (Li et al., 30 Nov 2025).
- Core logic and anti-analysis components are most cloned across families (MalSource), aligning with expectations for modular malware architecture (Calleja et al., 2018).
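The textual-similarity path above can be sketched with Python's difflib, whose SequenceMatcher implements a Ratcliff–Obershelp-style matching ratio. The similarity threshold, the toy function bodies, and the timestamps below are invented for illustration; the real pipelines additionally run AST-based detection (Deckard) with clone-size cutoffs at far larger scale.

```python
from difflib import SequenceMatcher
from collections import defaultdict

def is_clone(fn_a: str, fn_b: str, threshold: float = 0.8) -> bool:
    """Language-agnostic textual clone test (Ratcliff–Obershelp-style ratio)."""
    return SequenceMatcher(None, fn_a, fn_b).ratio() >= threshold

# (specimen, commit year, function body) -- invented toy corpus
corpus = [
    ("worm_v1", 2004, "xor_decrypt(buf, key); spread_via_smb(); sleep(100);"),
    ("worm_v2", 2009, "xor_decrypt(buf, key); spread_via_smb(); sleep(500);"),
    ("stealer", 2015, "read_browser_db(); upload(c2_url);"),
]

# Directed reuse edges: the earlier commit is treated as the ancestor,
# mirroring the timestamp-directed genealogy construction described above.
reuse = defaultdict(int)
for i, (name_a, year_a, fn_a) in enumerate(corpus):
    for name_b, year_b, fn_b in corpus[i + 1:]:
        if is_clone(fn_a, fn_b):
            src, dst = (name_a, name_b) if year_a <= year_b else (name_b, name_a)
            reuse[(src, dst)] += 1

print(dict(reuse))  # {('worm_v1', 'worm_v2'): 1}
```

Aggregating these edge weights per specimen pair yields the weighted, directed reuse graphs used for lineage visualization.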
5. Specialized Datasets for Summarization and LLM Benchmarking
Emerging focus on automated understanding and summarization of malware code has motivated new annotation paradigms:
- MalS/MalP:
- MalS provides large-scale, function-level C source paired with LLM-generated, human-refined English summaries focused on malicious semantics (89,609 instances).
- MalP delivers manually crafted summaries for decompiled IDA pseudocode, with call-graph context and summary style guidelines ensuring accurate behavioral coverage (Lu et al., 2024).
- CAMA:
- Extends the function granularity paradigm to Android: functions extracted from 118 decompiled APKs across 13 families (7.54M unique methods), with metadata scaffolding for future downstream tasks (summaries, name recovery, maliciousness scores) (He et al., 1 Apr 2025).
- Annotation and Evaluation:
- Automated summarization benchmarks: BLEURT-sum, fine-tuned for code-summary alignment, serves as the primary evaluation metric. Models trained on MalS and benign pseudocode demonstrate close human–metric correlation for usability and completeness (e.g., BLEURT-sum = 47.22 on the real-world test) (Lu et al., 2024).
6. Access, Ethics, and Limitations
Availability and usage constraints reflect ethical imperatives and practical constraints:
- Public Access: Most datasets are mirrored on platforms such as GitHub, Zenodo, or IEEE DataPort; usage is restricted to non-commercial, research, and educational applications, often enforced via license agreements (Calleja et al., 2018, Li et al., 30 Nov 2025, He et al., 1 Apr 2025).
- Disarmament: Published sets contain only source code, excluding all operational payloads, build artifacts, or binaries, to prevent weaponization and legal non-compliance (Li et al., 30 Nov 2025, Calleja et al., 2018).
- Curation Bias and Coverage: Coverage is non-uniform—differences in leak availability, platform focus (Windows bias in MASCOT, Android in CAMA, mixed in SourceFinder), and the absence of commercial and nation-state malware result in inherent sample bias. Most datasets comprise single-version specimens; intra-family evolution is incompletely captured outside genealogical mappings (Calleja et al., 2018).
- Clone Detection Limits: High false positive rates in AST-based (Deckard) methods; string-diff techniques miss refactored or obfuscated clones; cross-validating genealogy at the binary level is recommended for high confidence (Li et al., 30 Nov 2025, Calleja et al., 2018).
- Adherence to Ethics: Users must comply with local export-control, informatics ethics, and responsible research guidelines when handling malware-associated artifacts or performing live analyses (Li et al., 30 Nov 2025).
7. Research Impact and Applications
Malware source code datasets catalyze multifaceted research at the intersection of cybersecurity, software engineering, and AI:
- Longitudinal Engineering Studies: Systematic tracking of SLOC, file count, function points, and maintainability reveals exponential growth in malware complexity and convergence with mainstream engineering practices—yet with persistent deficits in documentation and modularity (Calleja et al., 2018, Li et al., 30 Nov 2025).
- Evolution and Attribution: Fine-grained mapping of code reuse drives advanced evolutionary studies and assists in malware provenance, variant clustering, and attribution (Li et al., 30 Nov 2025).
- Detection and Summarization: Data support the design and benchmarking of supervised detectors, code-LM summarizers, and function name predictors under realistic, noisy code conditions (Lu et al., 2024, He et al., 1 Apr 2025).
- Education, Benchmarking, and Tooling: Datasets underpin reproducible experiments, curriculum development, and the objective evaluation of reverse engineering, code deobfuscation, and static/dynamic analysis pipelines (Rokon et al., 2020).
A plausible implication is that ongoing expansion and refinement of these datasets—including contribution of multiple versions per family, integration of richer metadata, and adoption of standardized annotation and evaluation frameworks—will remain essential for the evolution of malware analysis research.