
Code Vulnerability Detection

Updated 13 December 2025
  • Code vulnerability detection is the automated identification of security flaws in code through static and dynamic analyses, leveraging representations like source code and binaries.
  • It employs advanced learning techniques such as graph neural networks and transformer models to precisely identify and localize vulnerabilities.
  • Privacy-preserving and language-agnostic approaches ensure secure analysis while addressing challenges like class imbalance, scalability, and interpretability.

Code vulnerability detection encompasses the automated identification of security weaknesses in program source or binary code that could be exploited by attackers. It forms a critical pillar of software assurance, aiming to prevent attacks such as code injection, privilege escalation, or denial of service. Research in this area spans the extraction of program semantics, the specification of vulnerability patterns, advanced learning techniques (including LLMs and graph neural networks), program-specific static and dynamic analyses, and, more recently, approaches that maintain code confidentiality during analysis. The field is characterized by the interplay between static and dynamic vulnerability analysis, machine-learned code semantics, language- and representation-agnostic detection, and secure, privacy-preserving analysis protocols.

1. Core Problem Formulation and Taxonomy

Code vulnerability detection tasks can be classified by prediction granularity (file, function, line, or token), representation (source, IR, binary), vulnerability type (binary flag, multi-class CWE, or multi-label prediction), and operational scenario (e.g., pre-merge commit screening, post-release binary vetting, or privacy-preserving analysis).

Problem Models:

  • Binary/multiclass classification: Given a code snippet X (function, file, or code gadget), learn a classifier f_θ(X) → y, where y ∈ {0, 1} (vulnerable/safe) or y ∈ {CWE_1, …, CWE_t} for type-aware analysis (Ouchebara et al., 9 Dec 2025, Jiang et al., 24 Dec 2024, Hanif et al., 2022).
  • Fine-grained localization: Learning p(y_t | x_t, C_t), where each line or token x_t is classified as (non)vulnerable, possibly conditioned on learned program context C_t (Mahyari, 2022, Tanwar et al., 2021).
  • Commit-level (“Just-In-Time”) detection: Discriminating “dangerous” from “benign” code commits just before merging, using semantic representations of the code changes (Nguyen et al., 2023).
  • Binary-level detection/patch presence: Determining whether compiled binaries (including stripped or obfuscated code) still manifest a known vulnerability, i.e. “1-day” scenario (Dong et al., 29 Jan 2025, Chukkol et al., 13 Aug 2024).
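The classification framing above can be sketched as follows. A trivial keyword heuristic stands in for the learned classifier f_θ; the risky-call list and snippets are illustrative only, whereas real systems would use a GNN or fine-tuned LLM as described below.

```python
# Minimal sketch of the binary classification framing f_theta(X) -> y.
# A keyword heuristic stands in for a learned model; RISKY_CALLS and the
# sample gadget are illustrative, not from any cited paper.

RISKY_CALLS = {"strcpy", "gets", "sprintf", "system"}

def f_theta(snippet: str) -> int:
    """Return 1 (vulnerable) if any risky call appears, else 0 (safe)."""
    tokens = snippet.replace("(", " ").replace(")", " ").split()
    return 1 if RISKY_CALLS & set(tokens) else 0

gadget = "void copy(char *dst, char *src) { strcpy(dst, src); }"
print(f_theta(gadget))  # -> 1
```

The same interface extends to multi-class output by returning a CWE label instead of a binary flag.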

Data Modalities:

Vulnerability Type Coverage:

2. Detection Methodologies: Representations and Learning

Static and Graph-based Analysis:

  • AST/PDG/CPG-based GNNs: Many methods encode code as graph structures, with message passing capturing both lexical and semantic features. FGVulDet constructs code property graphs with multi-typed edges, and edge-aware GGNNs capture the semantics of distinct vulnerabilities (Liu et al., 15 Apr 2024). Hierarchical approaches first detect at the whole-file or function level, then refine to the lines responsible for vulnerabilities (Mahyari, 2022).
  • Slicing methodologies: Extraction of vulnerability-centric program slices—via PDG, backward/forward slicing, or taint propagation—is recurrent. For instance, focused slices preserve key statements and dependencies but remove unrelated code, boosting learning generalization (Huang et al., 20 May 2024, Lekssays et al., 22 Jul 2025).
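Backward slicing over a dependence graph can be sketched in a few lines. The toy PDG, statement ids, and dependency edges below are illustrative, not drawn from any cited system.

```python
from collections import deque

# Sketch of a backward program slice over a toy program dependence graph.
# Edges point from a statement to the statements it depends on (data or
# control). Statement ids and dependencies are made up for illustration.

pdg_deps = {
    4: [2, 3],   # sink: memcpy(buf, src, n) depends on buf (2) and n (3)
    3: [1],      # n = atoi(arg) depends on arg (1)
    2: [],       # char buf[64]
    1: [],       # arg = argv[1]
    5: [],       # unrelated logging statement, pruned by the slice
}

def backward_slice(pdg, criterion):
    """Collect every statement the criterion transitively depends on."""
    seen, work = {criterion}, deque([criterion])
    while work:
        node = work.popleft()
        for dep in pdg.get(node, []):
            if dep not in seen:
                seen.add(dep)
                work.append(dep)
    return sorted(seen)

print(backward_slice(pdg_deps, 4))  # -> [1, 2, 3, 4]; statement 5 is dropped
```

Dropping statement 5 is the point: unrelated code is removed so the learner sees only vulnerability-relevant context.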

Machine Learning Approaches:

  • Token/sequential models: LSTM, BLSTM, GRU, and BGRU are employed on token, word2vec, or bag-of-words embeddings of code gadgets or function slices (Wartschinski et al., 2022, Huang et al., 20 May 2024).
  • Attention Fusion and Transformer Architectures: Architectures fusing self-attention, convolutional, and recurrent branches yield high precision while preserving explainability via decoder attention (Tanwar et al., 2021).
  • LLMs: Fine-tuned LLMs (Llama-2/3, Qwen2.5, StarCoder) are adopted for their contextual understanding and scaling properties. LoRA is frequently used for parameter-efficient tuning (Jiang et al., 24 Dec 2024, Ouchebara et al., 9 Dec 2025, Du et al., 6 Jun 2024). Multi-task instruction fine-tuning, as in VulLLM, simultaneously addresses classification, line-level localization, and explanation generation, yielding superior robustness (Du et al., 6 Jun 2024).
  • Collaborative and Hybrid Models: M2CVD leverages high-level vulnerability explanations from LLMs to augment code model inputs, showing improved accuracy by aligning textual and code representations (Wang et al., 10 Jun 2024).
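The parameter-efficiency of LoRA mentioned above comes from a low-rank additive update, W_eff = W + (α/r)·B·A, where only A and B are trained. A minimal NumPy sketch (toy dimensions, not from any cited paper):

```python
import numpy as np

# Sketch of LoRA's low-rank update: the frozen weight W is adapted as
# W_eff = W + (alpha / r) * B @ A, training only A (r x d_in) and
# B (d_out x r). Dimensions here are toy values.

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero init

def lora_forward(x):
    # Base path plus low-rank correction; at init B = 0, so outputs match W @ x.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # zero-init leaves the model unchanged
print("trainable params:", A.size + B.size, "vs full:", W.size)
```

Even in this toy case the trainable parameter count is halved; at LLM scale the ratio is orders of magnitude larger.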

Binary-Focused and Quantum Approaches:

  • Decompilation-based pipelines (e.g., VulCatch) combine neural decompilation with classical analysis and Kolmogorov–Arnold Network layers to process semantic pseudocode extracted from binaries (Chukkol et al., 13 Aug 2024).
  • Quantum neural circuits (QLSTM) are explored for code vulnerability detection, showing performance and runtime improvements under simulation for certain representations (Akter et al., 2023).

3. Privacy-Preserving and Confidential Analysis

Confidential Code Analysis (CCA) addresses code privacy concerns by enabling static taint and vulnerability analysis to occur entirely on encrypted code (Martins et al., 15 Jan 2025). This involves:

  • Converting code into an intermediate token language (ITL), representing code logic as a Data and Control Flow Graph (DCFG), then linearizing and encrypting with a mix of deterministic (DET), randomized (RND), and order-revealing (ORE) encryption schemes.
  • Static vulnerability analysis proceeds as encrypted searches in an inverted index, where the analyzer, given a set of task-related trapdoors, reconstructs encrypted vulnerability paths without code exposure. Decryption occurs only for validated paths and locations.
  • This paradigm ensures code privacy with limited leakage, but faces overhead (average 42.7% in storage and runtime) and limitations in handling inter-procedural flows and rich language features.
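The encrypted-search step can be sketched with deterministic HMAC labels standing in for the DET scheme. Everything here is illustrative (the key, the tokenized "code", the single-token trapdoor); the real protocol in CCA operates over a linearized DCFG with multiple encryption schemes.

```python
import hashlib
import hmac

# Sketch of encrypted inverted-index search: code tokens are replaced by
# deterministic HMAC labels, and the analyzer looks up trapdoors for tokens
# of interest without seeing plaintext. Key and code are illustrative.

KEY = b"owner-secret-key"

def det_label(token: str) -> str:
    return hmac.new(KEY, token.encode(), hashlib.sha256).hexdigest()

# Code owner builds the encrypted index from tokenized code lines.
code = {1: ["query", "=", "request"], 2: ["db", "exec", "query"]}
index = {}
for line_no, tokens in code.items():
    for tok in tokens:
        index.setdefault(det_label(tok), set()).add(line_no)

# In the real protocol the owner issues the trapdoor; here we compute it in
# place for illustration. The analyzer sees only opaque labels.
trapdoor = det_label("exec")
hits = index.get(trapdoor, set())
print(sorted(hits))  # -> [2]
```

Deterministic labels are what make index lookup possible, and also the main source of the "limited leakage" noted above: equal tokens yield equal labels.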

4. Language and Representation Agnosticism

Multi-language and Cross-domain Generalization:

  • MalCodeAI demonstrates practical LLM-based vulnerability detection across 14 programming languages via a two-phase decomposition–analysis pipeline, leveraging large pre-trained models and LoRA adapters to scale cross-syntax reasoning (Gajjar et al., 15 Jul 2025).
  • Empirical studies have established that LM-based approaches achieve the highest F1 scores in high-level languages (JavaScript F1≈70%), while results in C/C++ (F1<45%) lag, with limited correlation to code complexity metrics (Atiiq et al., 20 Dec 2024).
  • Training on diverse fine-grained vulnerabilities and across multiple libraries/datasets, as in FGVulDet, increases recall and resilience to code clones and adversarial examples (Liu et al., 15 Apr 2024, Du et al., 6 Jun 2024).

Handling Code Structure:

  • Methods such as LLMxCPG integrate code property graphs with LLMs, using CPG-guided slicing to produce concise, context-preserving inputs that generalize across function and project scopes, providing up to 40% higher F1 than previous baselines (Lekssays et al., 22 Jul 2025).

5. Empirical Findings, Robustness, and Deployment

Performance Metrics and Leading Results

| Model/Approach | Domain/Granularity | Notable Results | Reference |
| --- | --- | --- | --- |
| FGVulDet (edge-aware GGNN) | C / function / CWE | F1 mid-60% (5 types) | (Liu et al., 15 Apr 2024) |
| Multi-context Attention Fusion | C / AST-path / line | F1 98% (CWE-252, Juliet) | (Tanwar et al., 2021) |
| VulBERTa (RoBERTa variant) | C/C++ / function | F1 up to 99%, >64% on SOTA | (Hanif et al., 2022) |
| VulLLM (multi-task LLM) | C/C++ / function | F1 66.5% avg, >64% OOD | (Du et al., 6 Jun 2024) |
| LLMxCPG (CPG + LLM + slice) | C / function / project | +15–40% F1 over baselines | (Lekssays et al., 22 Jul 2025) |
| CoCoA (encrypted) | PHP / function | F1 = 0.70, precision = 0.93 | (Martins et al., 15 Jan 2025) |
| PLocator (binary, 1-day) | C / binary / function | TPR 88.2%, FPR 12.9% | (Dong et al., 29 Jan 2025) |
| VulCatch (binary) | x86 / binary / gadget | F1 up to 98.9%, FPR 1.5% | (Chukkol et al., 13 Aug 2024) |
| M2CVD (LLM + code model) | C / function | F1 = 61% (Devign) | (Wang et al., 10 Jun 2024) |

  • Evidence indicates that data- and control-dependent slicing, balanced vulnerability data, fine-grained classifiers, and multi-task LLM tuning substantially improve recall, robustness, and cross-dataset generalization.
  • Binary-level approaches, notably PLocator and VulCatch, achieve high accuracy even under aggressive code transformations and across multiple compilers and optimization levels. This suggests that advances in extracting program semantics and stable anchors directly from binaries are critical for realistic software supply-chain security.
  • Privacy-preserving detection, as in CoCoA, now approaches standard static analysis performance under specific conditions, indicating viability for sensitive source code auditing workflows.
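The metrics quoted in the table (F1, TPR, FPR) all derive from confusion-matrix counts; a quick sketch with made-up counts:

```python
# Sketch of the detection metrics quoted above, computed from confusion
# matrix counts (tp/fp/fn/tn). The counts are made up for illustration.

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall is the TPR reported for binary tools
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return precision, recall, f1, fpr

p, r, f1, fpr = metrics(tp=88, fp=13, fn=12, tn=87)
print(f"precision={p:.3f} recall/TPR={r:.3f} F1={f1:.3f} FPR={fpr:.3f}")
```

Note that under heavy class imbalance accuracy is uninformative, which is why the literature above reports F1 and TPR/FPR instead.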

Key Limitations and Open Challenges

  • Class imbalance is the single most significant barrier to high recall and F1: most models exhibit dramatic recall/F1 degradation when the vulnerable-to-safe function ratio drops below 20% (Jiang et al., 24 Dec 2024).
  • Scalability and interpretability: While LLM-based detectors scale to many languages and long contexts, efficient fine-tuning, model compression, and explainability remain active areas of work (Ouchebara et al., 9 Dec 2025, Du et al., 6 Jun 2024).
  • Inter-procedural analysis and context preservation: Many methods are still limited at the function or file level, with partial solutions introduced via program slicing and CPG-guided integration (Lekssays et al., 22 Jul 2025).
  • Code privacy and encrypted analysis: Fully confidential analysis is still hampered by control-flow granularity, language/domain support, and cryptographic overheads (Martins et al., 15 Jan 2025).

6. Future Directions and Extensions

  • Expansion to additional languages and vulnerability classes remains a priority, as does the construction of higher-fidelity, balanced datasets and code clone/cross-project detection benchmarks (Atiiq et al., 20 Dec 2024, Bui et al., 22 Jul 2025, Liu et al., 15 Apr 2024).
  • Hybrid LLM–graph architectures and multi-task training are indicated as promising avenues to further enhance generalization, robustness to adversarial modification, and contextual understanding (Lekssays et al., 22 Jul 2025, Du et al., 6 Jun 2024).
  • Integration with CI/CD and binary supply chain monitoring facilitates deployment in real-world development and production pipelines (Gajjar et al., 15 Jul 2025, Dong et al., 29 Jan 2025).
  • Program privacy and confidential analysis are forecast to incorporate leakage-resistant encrypted search and to interface with privacy-preserving hardware (e.g., secure enclaves), while advancing toward efficient Oblivious RAM solutions (Martins et al., 15 Jan 2025).
  • Quantum acceleration of feature extraction and classification may become practical as hybrid classical-quantum interfaces mature (Akter et al., 2023).

7. Practical Guidelines and Comparative Summary

  • For short code snippets with high class imbalance, fine-tuned medium-size models (CodeBERT, UniXcoder) remain highly competitive; for long contexts (>512 tokens), LoRA- or QLoRA-tuned large LLMs are currently superior (Jiang et al., 24 Dec 2024, Ouchebara et al., 9 Dec 2025).
  • Slice or embed code with full data/control context and balance vulnerable/non-vulnerable classes through over/undersampling or augmentation to maximize real-world detection performance (Liu et al., 15 Apr 2024, Huang et al., 20 May 2024).
  • Where code privacy is paramount, encrypted static analysis can now offer near-parity with plaintext analysis on selected languages and vulnerability classes (Martins et al., 15 Jan 2025).
  • At the binary level, semantic-preserving decompilation and path/context signatures significantly enhance the resilience and accuracy of detection (Dong et al., 29 Jan 2025, Chukkol et al., 13 Aug 2024).
  • Multi-model collaboration approaches leveraging LLM-generated explanations demonstrably enhance the semantic alignment and practical accuracy of code models (Wang et al., 10 Jun 2024).
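The class-balancing step recommended above can be sketched with random oversampling of the minority class (snippet names and the 90/10 split are illustrative):

```python
import random

# Sketch of random oversampling to balance vulnerable vs. safe examples
# before training. Snippet names and class ratio are made up.

random.seed(0)
safe = [(f"snippet_s{i}", 0) for i in range(90)]
vuln = [(f"snippet_v{i}", 1) for i in range(10)]   # 10% positive class

def oversample(minority, target_size):
    """Pad the minority class with resampled duplicates up to target_size."""
    return minority + random.choices(minority, k=target_size - len(minority))

balanced = safe + oversample(vuln, len(safe))
counts = {0: 0, 1: 0}
for _, y in balanced:
    counts[y] += 1
print(counts)  # -> {0: 90, 1: 90}
```

Undersampling the majority class or augmenting minority snippets (e.g., semantics-preserving rewrites) are the complementary strategies cited above.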

In conclusion, code vulnerability detection now synthesizes program analysis, multi-modal learning, adversarial robustness, language-agnosticism, and privacy to meet the demands of both evolving attacker sophistication and contemporary software development practices.
