MalCVE: Malware Detection & CVE Attribution
- MalCVE is a modern cybersecurity framework that fuses reverse engineering, machine learning, and LLM-based summarization to detect malware and associate binaries with specific CVEs.
- It employs an eight-stage modular pipeline—including decompilation, semantic similarity retrieval, and BM25 re-ranking—to achieve 97% malware detection accuracy and 53–65% CVE recall.
- The tool delivers explainable outputs with structured JSON reporting, enabling both forensic analysis and proactive defense at a low per-file cost.
MalCVE refers to a modern approach to both malware detection and the attribution of binaries to specific Common Vulnerabilities and Exposures (CVEs), combining reverse engineering techniques, machine learning, and LLMs. The term also names the concrete tool and methodology presented in "MalCVE: Malware Detection and CVE Association Using LLMs" (Cristea et al., 17 Oct 2025), which demonstrates how these techniques can be fused to automate the dual tasks of identifying malware and the vulnerabilities it may exploit. The framework operates directly on compiled binaries (specifically Java JAR files in the referenced implementation) and is notable for its modular pipeline, explainable outputs, and ability to associate observed malware behaviors with well-defined vulnerability identifiers.
1. Overview and Motivation
MalCVE addresses two chronic challenges in cybersecurity: (1) scalable, accurate malware detection without the need for extensive proprietary or fine-tuned models, and (2) the association of discovered binaries with the specific software vulnerabilities (CVEs) they are likely to exploit. Traditional solutions either focus on source code or require commercial engines with opaque procedures and high per-file costs. MalCVE leverages advances in LLMs and retrieval-augmented generation to provide precise, explanation-rich results for both malware detection and vulnerability mapping in binary artifacts (Cristea et al., 17 Oct 2025).
This dual capability is crucial for both incident forensics (understanding how malware relates to historical vulnerabilities) and proactive defense (identifying emerging exploits and patch priorities). The approach is exemplified by an open-source, cost-efficient pipeline capable of operating on large sets of binaries.
2. Pipeline Architecture
The MalCVE pipeline comprises eight sequential stages, each designed to transform, interpret, or augment the binary under analysis:
- Decompilation: Both CFR and Procyon decompilers are used. Fallback from one to the other increases robustness against obfuscation and non-standard bytecode arrangements.
- Deobfuscation: Decompiled Java code is sanitized by a custom tool built with JavaParser, reversing simple obfuscation (e.g., trivial string transformations) and normalizing identifiers to improve downstream code understanding.
- LLM-Based Code Summarization: The cleaned code is presented to the LLM with a structured zero-shot prompt. The model acts as an analyst, returning a JSON summary containing a malware verdict, rationale, top activities, relevant indicators, and libraries.
- CVE Query Generation: With the previous summary as context, the LLM generates succinct keyword lists describing attack techniques, implicated APIs, and salient code features, filtering out noise from obfuscated artifacts.
- Semantic Similarity Retrieval: These queries are embedded using a dense vector model (OpenAI text-embedding-3-small, 1536 dimensions) and used in an Approximate Nearest Neighbor (ANN) search over a Milvus vector database containing CVE/NVD descriptions. Up to 100 candidate CVEs are retrieved per query.
- BM25-Based Re-ranking: The candidate CVEs are re-ranked by computing BM25 similarity between tokenized library names from the code and the CVE descriptions. The final score is a weighted sum of the semantic similarity score and the BM25 score.
- LLM-Based CVE Classification: The top candidate CVEs, summary, and code snippets are fed to the LLM, which outputs a structured JSON specifying the matched CVE, a rationale, and behavioral summary.
- Reporting: All outputs—including intermediate summaries, verdicts, queries, candidate CVEs, scoring metadata, and prompt logs—are saved for audit, integration, or manual review.
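The decompiler fallback in the first stage can be sketched as follows. This is a minimal illustration, not the tool's implementation: each decompiler is modelled as a hypothetical callable that writes sources into an output directory, and the names `decompile_with_fallback` and `DecompilationError` are assumptions introduced here. In practice the callables would wrap CFR and Procyon invocations.

```python
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical interface: (jar_path, out_dir) -> list of produced .java files.
Decompiler = Callable[[Path, Path], list[Path]]


class DecompilationError(RuntimeError):
    """Raised when every configured decompiler fails or yields no source."""


def decompile_with_fallback(
    jar: Path,
    out_root: Path,
    decompilers: Sequence[tuple[str, Decompiler]],
) -> tuple[str, list[Path]]:
    """Try each decompiler in order and return the first usable result."""
    errors = []
    for name, run in decompilers:
        out_dir = out_root / name
        out_dir.mkdir(parents=True, exist_ok=True)
        try:
            sources = run(jar, out_dir)
        except Exception as exc:  # a crash in one tool must not stop the pipeline
            errors.append(f"{name}: {exc}")
            continue
        if sources:
            return name, sources
        errors.append(f"{name}: no source files produced")
    raise DecompilationError("; ".join(errors))
```

Returning the name of the decompiler that succeeded keeps the provenance of the source available for the final report.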
This architecture follows a coarse-to-fine pattern: computationally intensive LLM reasoning is interleaved with targeted vector similarity retrieval and IR-based re-ranking, so that the final classification stage only has to consider a short list of candidate CVEs. This improves both speed and interpretability.
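The re-ranking stage can be illustrated with a small pure-Python sketch. The weighting of the two scores is not given in the text above, so the mixing weight `alpha`, the min-max normalisation, the BM25 parameters `k1` and `b`, and the placeholder CVE identifiers are all assumptions made for illustration. The semantic similarity values stand in for scores returned by the vector search.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query):
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(xs: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]


def rerank(candidates, libraries, alpha=0.5):
    """candidates: (cve_id, description, semantic_similarity) triples.

    Returns (cve_id, fused_score) pairs, best first, where
    fused = alpha * semantic + (1 - alpha) * bm25 (both min-max scaled).
    """
    if not candidates:
        return []
    docs = [tokenize(desc) for _, desc, _ in candidates]
    bm25 = min_max(bm25_scores(tokenize(" ".join(libraries)), docs))
    sem = min_max([s for _, _, s in candidates])
    fused = [(cid, alpha * s + (1 - alpha) * m)
             for (cid, _, _), s, m in zip(candidates, sem, bm25)]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


# Toy data with placeholder identifiers, not real CVE entries.
candidates = [
    ("CVE-TOY-1", "Deserialization flaw in the acme parser library", 0.80),
    ("CVE-TOY-2", "Remote code execution via JNDI lookup in log4j logging library", 0.78),
    ("CVE-TOY-3", "Buffer overflow in an image decoder", 0.60),
]
ranked = rerank(candidates, ["org.apache.logging.log4j"])
print([cid for cid, _ in ranked])  # ['CVE-TOY-2', 'CVE-TOY-1', 'CVE-TOY-3']
```

In the toy run the second candidate has a slightly lower semantic score but is the only one whose description mentions the observed library, so the lexical signal moves it to the top.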
3. LLM Integration and Prompt Engineering
LLMs are central at three stages:
- Structured Summarization: The model extracts verdicts, explanations, activities, and indicators in a rigid JSON schema. The prompt instructs the model as a “malware analyst,” ensuring outputs are amenable to both automation and review.
- Search Optimization: The keywords for retrieving CVEs are not just standard code tokens but are synthesized by the LLM to focus on relevant APIs, attack primitives, and TTPs, sidestepping misleading elements from packed or obfuscated code.
- CVE Association and Explanation: Final association and explanation are again LLM-driven, with the prompt directing the model to select a CVE, justify the mapping, and produce a semantically coherent summary.
Crucially, the process is zero-shot throughout—no fine-tuning is required—demonstrating that current LLMs generalize sufficiently to produce high-quality cybersecurity diagnostics from prompt engineering alone.
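A structured zero-shot summarization step of this kind might be wired up as in the sketch below. The prompt wording and the JSON key names (`verdict`, `rationale`, `top_activities`, `indicators`, `libraries`) are hypothetical, chosen to mirror the fields listed above; they are not the prompts or schema used by the tool. The LLM call itself is omitted, and only prompt construction and response validation are shown.

```python
import json
from dataclasses import dataclass

# Hypothetical key names mirroring the fields described in the text.
REQUIRED_KEYS = {"verdict", "rationale", "top_activities", "indicators", "libraries"}
LIST_KEYS = {"top_activities", "indicators", "libraries"}


@dataclass
class Summary:
    verdict: str
    rationale: str
    top_activities: list[str]
    indicators: list[str]
    libraries: list[str]


def build_prompt(code: str) -> str:
    """Build an illustrative zero-shot prompt for the summarization stage."""
    keys = ", ".join(sorted(REQUIRED_KEYS))
    return (
        "You are a malware analyst. Analyse the decompiled Java code below and "
        f"reply with a single JSON object containing exactly these keys: {keys}.\n\n"
        f"{code}"
    )


def parse_summary(raw: str) -> Summary:
    """Validate the model's reply before it is passed to later stages."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["verdict"] not in {"malicious", "benign"}:
        raise ValueError(f"unexpected verdict: {data['verdict']!r}")
    for key in LIST_KEYS:
        if not isinstance(data[key], list):
            raise ValueError(f"{key} must be a list")
    return Summary(**{k: data[k] for k in REQUIRED_KEYS})
```

Rejecting malformed replies early keeps downstream stages (query generation, retrieval) from operating on partial or free-form text.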
4. Performance and Benchmarking
MalCVE was evaluated on 3,839 decompilable JAR executables from the MalDICT dataset (Joyce et al., 2023), measuring both detection and attribution performance:
- Malware Detection: Achieved a mean accuracy of 97%, matching or exceeding fine-tuned deep learning baselines such as MalBERT, but at far lower operational cost: approximately $0.03 per file, roughly 1/66th the cost of CrowdStrike Falcon and 1/80th that of ANY.RUN.
- CVE Association: Top-1 accuracy averaged about 39%, peaking at 44%. Recall@10 (the correct CVE appearing among the top 10 candidates) averaged 53% and peaked at 65%. This recall is on par with source-code-based methods, despite operating solely on binaries.
These results indicate that—while top-1 attribution remains an open challenge—MalCVE achieves practical recall for forensic triage and automated pipeline use.
| Metric | Value / Range | Notes |
|---|---|---|
| Detection Acc. | 97% | Mean over JAR binaries |
| CVE recall@10 | 53–65% | Mean 53%, peak 65%; correct CVE found in top 10 candidates |
| Top-1 CVE Acc. | 39–44% | Mean ~39%, peak 44% |
| Per-file cost | $0.03 | Zero-shot LLM, compared to commercial solutions |
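For reference, the attribution metrics in the table can be computed as in the following sketch. It assumes a single ground-truth CVE per sample, which is a simplification introduced here; top-1 accuracy is then recall@k with k = 1. The data is a toy example, not the evaluation set.

```python
def recall_at_k(ranked_lists: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of samples whose true CVE appears among the top-k candidates."""
    if not truths:
        return 0.0
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)


# Toy example: four samples, each with a ranked candidate list and one true label.
ranked_lists = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"], ["D", "E", "F"]]
truths = ["A", "A", "A", "Z"]

print(recall_at_k(ranked_lists, truths, 1))  # 0.25 (top-1 accuracy)
print(recall_at_k(ranked_lists, truths, 3))  # 0.75
```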
5. Explainability and Reporting
The tool’s design ensures that every verdict and CVE recommendation is accompanied by a human-interpretable rationale, indicating:
- Detected malicious activities with rationale (file access, network exfiltration, use of system APIs, etc.)
- Concrete technical indicators (URLs, command-and-control servers, suspicious temp files)
- The logic behind CVE association, including the evidence tracing back to specific libraries and code patterns.
By emitting all outcomes in structured JSON and persisting model prompts and outputs, MalCVE fosters traceability—critical for both incident response and academic reproducibility.
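One way to persist such an audit record is sketched below. The field names and the choice of the file's SHA-256 digest as the report filename are assumptions for illustration, not the tool's actual report layout.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_report(jar_bytes: bytes, verdict: str, rationale: str,
                 candidates: list[tuple[str, float]], prompts: list[dict]) -> dict:
    """Assemble a self-describing record for one analysed file."""
    return {
        "sha256": hashlib.sha256(jar_bytes).hexdigest(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "verdict": verdict,
        "rationale": rationale,
        "candidate_cves": [{"id": cid, "score": score} for cid, score in candidates],
        "prompt_log": prompts,  # prompts and raw model replies, kept for audit
    }


def write_report(report: dict, out_dir: Path) -> Path:
    """Write the report as pretty-printed JSON, named after the file hash."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{report['sha256']}.json"
    path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

Keying reports by content hash makes repeated analyses of the same binary easy to locate and compare.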
6. Limitations and Future Directions
While MalCVE demonstrates competitive performance, several avenues for improvement are noted:
- Language and Binary Format Support: The prototype focuses on Java JAR files using CFR and Procyon; extension to native formats (PE, ELF), or other high-level binaries (DEX, APK) will require additional decompilers and adaptation of prompt design.
- LLM Security and Privacy: Reliance on commercial LLMs can raise privacy concerns when handling proprietary or confidential code. Integration of open-source LLMs is suggested as a mitigation.
- CVE Ranking and Retrieval: Enhancements in retrieval (e.g., non-linear aggregation, alternate ranking schemes, or integration of YARA rules) could further improve attribution accuracy.
- Context Window Limitations: Processing very large binaries may strain LLM context limits. Research into hierarchical summarization and segment-level analysis is proposed.
- Real-time and Scalable Workflows: Given the low cost and high speed, deployment as a front-line filter in continuous integration or incident response pipelines is feasible, but adaptation to higher throughput and streaming analysis will benefit from additional engineering.
Anticipated work also includes addressing multi-language support and improving handling of highly obfuscated samples.
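The hierarchical summarization idea raised under context window limitations could take roughly the following shape. This is a speculative sketch, not a described feature of the tool: `summarize` is a hypothetical callable standing in for an LLM call, the budget is measured in characters as a crude proxy for tokens, and `max_rounds` is a safety limit added here in case summaries fail to shrink.

```python
from typing import Callable


def pack_segments(units: list[str], budget: int) -> list[str]:
    """Greedily pack units (e.g. one decompiled class each) into segments
    of at most `budget` characters, splitting any oversized unit."""
    segments, current = [], ""
    for unit in units:
        if not unit:
            continue
        for i in range(0, len(unit), budget):
            piece = unit[i:i + budget]
            if current and len(current) + len(piece) + 1 > budget:
                segments.append(current)
                current = piece
            else:
                current = f"{current}\n{piece}" if current else piece
    if current:
        segments.append(current)
    return segments


def hierarchical_summary(units: list[str], summarize: Callable[[str], str],
                         budget: int, max_rounds: int = 10) -> str:
    """Summarize segment by segment, then summarize the summaries,
    until everything fits into a single segment."""
    level = units
    for _ in range(max_rounds):
        segments = pack_segments(level, budget)
        if not segments:
            return ""
        if len(segments) == 1:
            return summarize(segments[0])
        level = [summarize(segment) for segment in segments]
    raise RuntimeError("summaries did not shrink enough to fit the budget")
```

With a real model, `summarize` would send each segment with the analyst prompt and return the compact summary text, so each round reduces the amount of material carried forward.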
7. Context and Impact in Cybersecurity
MalCVE’s approach and concrete toolset occupy a unique position in the malware analysis landscape:
- Bridging the Gap: Unlike purely signature-based or AV-centric tools, MalCVE builds a bridge between low-level binary analysis and high-level vulnerability management, facilitating both detection and cause attribution.
- Promoting Open and Explainable Security: By exposing all stages of reasoning and by relying on explainable, structured outputs, MalCVE supports both automated workflows and human-in-the-loop review, reducing over-reliance on “black box” commercial solutions.
- Scalability and Accessibility: The low per-file cost, open architecture, and independence from fine-tuning make the approach particularly well-suited to academic and non-profit defenders, as well as for integration into larger threat intelligence ecosystems.
A plausible implication is that methods exemplified by MalCVE foreshadow a shift towards highly automated, explainable, and vulnerability-aware malware analysis pipelines that draw on advances in language modeling and IR to move beyond detection to actionable remediation and threat prioritization. Such pipelines can reduce the economic and operational barriers to advanced malware forensics, democratizing access to state-of-the-art analysis capabilities.