
Autonomous Vulnerability Discovery

Updated 1 January 2026
  • Autonomous vulnerability discovery is the automated identification and validation of software and AI system security weaknesses using static, dynamic, and ML techniques.
  • The approach integrates static analysis, fuzzing, symbolic execution, and ML to achieve scalable and precise vulnerability detection.
  • It underpins Cyber Reasoning Systems and agentic frameworks that enable continuous, real-world security assessments and exploit validation.

Autonomous vulnerability discovery denotes the automated, human-out-of-the-loop identification and (often) validation of security flaws in software, firmware, or AI-driven systems. This capability forms the foundation of modern Cyber Reasoning Systems (CRSs), end-to-end security assessment pipelines, and AI-powered penetration testing agents. Autonomous discovery is characterized by the orchestration of static and dynamic analysis, symbolic and concolic execution, ML/LLM-based pattern recognition, and feedback-guided search, enabling continuous security analysis at a scale previously infeasible. While traditional efforts focused on program binaries and source code, recent advances extend across domains such as industrial control logic, deep learning frameworks, and safety-critical autonomy stacks.

1. Formal Definitions and Problem Scope

Autonomous vulnerability discovery integrates multiple core components within the vulnerability management pipeline, including detection, exploitation (proof-of-concept triggering), and, often, automated remediation. The detection objective is typically formulated as a function f: C \to \mathcal{Y}, where C is a space of code artifacts (e.g., functions, binaries, code slices) and \mathcal{Y} is a label set: binary ("vulnerable"/"clean"), multi-class (per-CWE), or localization (pointing to precise regions or instructions) (Shereen et al., 2024).
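The detection function f: C → 𝒴 can be sketched as a toy binary classifier over code artifacts. The `Label` names and the single `strcpy` pattern below are illustrative assumptions for the sketch, not the interface of any cited system:

```python
from enum import Enum

class Label(Enum):
    """Binary label set Y = {clean, vulnerable}; real systems may use
    per-CWE multi-class labels or fine-grained localization instead."""
    CLEAN = 0
    VULNERABLE = 1

def detect(artifact: str) -> Label:
    """Toy instance of f: C -> Y that flags a known-dangerous C idiom.
    Real detectors derive f from graph features, fuzzing feedback,
    or learned (ML/LLM) representations rather than one string match."""
    return Label.VULNERABLE if "strcpy(" in artifact else Label.CLEAN

print(detect("strcpy(buf, user_input);"))            # Label.VULNERABLE
print(detect("strncpy(buf, user_input, sizeof buf);"))  # Label.CLEAN
```

The same formulation extends to other granularities by changing the artifact space C (functions, slices, basic blocks) and the label set 𝒴.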

Contemporary systems operate at various granularities:

  • Function-, line-, or slice-level for source code;
  • Instruction or basic block for binaries;
  • API-level or signal graph for black-box DL frameworks;
  • Scenario or behavior-level for autonomous driving.

The challenge is to systematically explore the input and execution space of a target, identify candidate vulnerabilities with minimal false positives, and, in advanced settings, auto-generate exploit or validation artifacts that establish exploitability (Brooks, 2017).

2. Core Methodologies

Autonomous systems implement diverse—and often hybrid—approaches:

2.1 Static Analysis

Classical methods employ graph-based static analysis: control/data-flow graphs, program dependence graphs (PDG), code property graphs (CPG), and domain-specific query DSLs (Joern, CodeQL). Recent neuro-symbolic systems such as MoCQ (Li et al., 22 Apr 2025) utilize LLMs to generate candidate vulnerability patterns in DSL form, iteratively refine them using symbolic validators, and merge/optimize queries for scale and precision.
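The core of such graph-based queries is reachability over flow edges. A minimal sketch, with a hypothetical hand-built data-flow graph standing in for a real CPG (the node names and the source/sink pair are illustrative, not Joern or CodeQL syntax):

```python
# Hypothetical miniature data-flow graph: nodes are statements, edges are
# data-flow dependencies. A vulnerability pattern is then a graph query:
# "is there a flow path from an untrusted source to a dangerous sink?"
DATA_FLOW = {
    "read_input":  ["sanitize", "build_query"],  # tainted source flows onward
    "sanitize":    ["log"],                      # sanitized branch is benign
    "build_query": ["exec_sql"],                 # reaches the sink unsanitized
}

def reaches(graph, source, sink):
    """Iterative depth-first search: does any path connect source to sink?"""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        if node == sink:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

print(reaches(DATA_FLOW, "read_input", "exec_sql"))  # True -> candidate SQLi
print(reaches(DATA_FLOW, "sanitize", "exec_sql"))    # False
```

Systems like MoCQ automate the authoring of such queries: the LLM proposes the pattern, and a symbolic validator checks it against known-vulnerable and known-clean code before it is merged into the query set.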

2.2 Dynamic Analysis and Fuzzing

Coverage-guided and mutation-based fuzzing (e.g., AFL, libFuzzer, Atheris) remain fundamental, often interleaved with static reachability or taint analysis for maximal coverage. In LLM-augmented CRSs such as FuzzingBrain (Sheng et al., 8 Sep 2025), fuzzers are orchestrated with LLM-driven input synthesis, crash stack deduplication, and coverage-based fitness metrics. Type-aware fuzzing targeting native APIs, as in IvySyn (Christou et al., 2022), leverages static type signatures to craft effective mutation pools and immediately synthesize high-level PoC scripts from concrete exploits.
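The coverage-guided loop these fuzzers share can be reduced to a few lines: mutate a seed, run the target, keep inputs that reach new branches. The instrumented `target` below is a hypothetical stand-in for a real harness; AFL/libFuzzer obtain the coverage set via compile-time instrumentation rather than an explicit return value:

```python
import random

def target(data: bytes) -> set:
    """Toy instrumented target: returns covered branch IDs and 'crashes'
    on the byte sequence F-U-Z (a stand-in for a real vulnerability)."""
    cov = {0}
    if data[:1] == b"F":
        cov.add(1)
        if data[1:2] == b"U":
            cov.add(2)
            if data[2:3] == b"Z":
                raise RuntimeError("crash")
    return cov

def mutate(seed: bytes) -> bytes:
    """Single random byte flip (real fuzzers use many mutation operators)."""
    data = bytearray(seed)
    data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

def fuzz(rounds=50_000):
    random.seed(1)
    corpus, global_cov, crashes = [b"AAA"], set(), []
    for _ in range(rounds):
        candidate = mutate(random.choice(corpus))
        try:
            cov = target(candidate)
        except RuntimeError:
            crashes.append(candidate)      # crashing input = PoC candidate
            continue
        if not cov <= global_cov:          # new coverage -> keep as a seed
            global_cov |= cov
            corpus.append(candidate)
    return crashes

print(len(fuzz()) > 0)
```

The coverage feedback is what makes the search tractable: each prefix of the magic sequence is rewarded independently, turning an exponential search into a staged one.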

2.3 Symbolic and Concolic Execution

Symbolic execution replaces selected inputs with symbolic variables and uses SMT solvers to enumerate feasible execution paths that violate safety properties. Hybrid engines (Mayhem, Driller) combine this with fuzzing, triggering the symbolic engine when fuzzers stall on hard-to-reach code (Brooks, 2017).
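A sketch of the underlying idea, with a toy two-branch program expressed directly as path constraints; a bounded brute-force search stands in for the SMT solver, and the program and constraint names are illustrative assumptions:

```python
from itertools import product

# Path constraints for a toy program over two int inputs (x, y):
#   if x > 10:           # branch A
#       if x + y == 20:  # branch B -> reaches the "bug"
# A symbolic engine would hand each conjunction to an SMT solver;
# here bounded enumeration stands in for the solver.
PATHS = {
    "bug":  [lambda x, y: x > 10, lambda x, y: x + y == 20],
    "safe": [lambda x, y: x > 10, lambda x, y: x + y != 20],
}

def solve(constraints, lo=-32, hi=32):
    """Return a model (x, y) satisfying every constraint, or None."""
    for x, y in product(range(lo, hi), repeat=2):
        if all(c(x, y) for c in constraints):
            return x, y
    return None

model = solve(PATHS["bug"])
print(model)  # a concrete input, e.g. (11, 9), that drives execution to the bug
```

Hybrid engines exploit exactly this division of labor: the fuzzer cheaply explores easy paths, and the (real) solver is invoked only for branches like `x + y == 20` that random mutation rarely satisfies.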

2.4 Machine Learning and Neuro-Symbolic Approaches

ML/LLM-based approaches analyze code semantics, extract exploitable patterns, and, when combined with symbolic analysis, drive tight feedback loops enhancing detection. These systems demonstrate empirically that LLMs can discover and formalize vulnerability patterns missed by human experts, as in the case of MoCQ producing 12 previously unknown static query patterns and uncovering multiple real-world 0-days (Li et al., 22 Apr 2025). Hybrid GNN+transformer or fine-tuned LLM architectures currently set the benchmark on major datasets (Shereen et al., 2024).

2.5 Agentic and Evolutionary Methods

Agentic frameworks (e.g., A2 for Android (Wang et al., 29 Aug 2025)) formalize vulnerability discovery and validation as multi-agent coordination tasks, in which LLM-based agents orchestrate input generation, tool integration, multi-modal exploration, and self-validating PoC production. Evolutionary algorithms, as in Beagle (Costa et al., 2020), employ co-evolving species: tests (GUI event sequences) and contracts (input parameter vectors), scoring candidate exploits by their proximity to triggering vulnerable procedures under specified contracts.
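Beagle's co-evolution of tests and contracts is more elaborate, but the fitness-guided search it rests on can be sketched as a simple single-population genetic algorithm. The `TRIGGER` sequence and proximity fitness below are hypothetical stand-ins for "event sequence that reaches a vulnerable procedure" and "distance to triggering it":

```python
import random
import string

TRIGGER = "DROP"  # hypothetical event sequence reaching a vulnerable procedure

def fitness(candidate: str) -> int:
    """Proximity score: positions matching the trigger sequence."""
    return sum(a == b for a, b in zip(candidate, TRIGGER))

def evolve(pop_size=40, generations=500):
    random.seed(0)
    pop = ["".join(random.choice(string.ascii_uppercase) for _ in TRIGGER)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == len(TRIGGER):
            return pop[0]                       # exploit candidate found
        parents = pop[: pop_size // 2]          # truncation selection (elitist)
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TRIGGER))
            child = list(a[:cut] + b[cut:])     # one-point crossover
            if random.random() < 0.3:           # point mutation
                child[random.randrange(len(child))] = random.choice(string.ascii_uppercase)
            children.append("".join(child))
        pop = parents + children
    return None

print(evolve())
```

Elitist selection makes the best fitness monotone non-decreasing, so the search converges quickly on short triggers; Beagle additionally co-evolves the contracts under which each test is scored.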

2.6 Adversarial RL/Preference-Guided Discovery

In complex, highly dynamic environments (notably autonomous vehicles), automated discovery uses adversarial multi-agent RL and LLM-designed reward functions to mine diverse, AV-responsible hazardous behaviors (Liu et al., 2021, Qiu et al., 24 Mar 2025). The AED system (Qiu et al., 24 Mar 2025) leverages LLMs for automatic reward shaping and preference-learning-based reward refinement to maximize both diversity and effectiveness of discovered policy-level vulnerabilities.
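The reward structure these systems optimize, hazard effectiveness plus a diversity bonus, can be sketched with random search in place of RL. Everything below is a hypothetical stand-in: `hazard_score` mocks simulator feedback for a cut-in scenario, and the archive rule is a simplification of preference-refined reward shaping, not the AED algorithm itself:

```python
import random

def hazard_score(speed, offset):
    """Mock simulator feedback: higher when the adversarial cut-in scenario
    (speed in m/s, lateral offset in m) forces the AV into a near-miss."""
    return max(0.0, 1.0 - abs(speed - 12.0) / 12.0 - abs(offset - 1.5) / 3.0)

def novelty(candidate, archive):
    """Diversity bonus: capped L1 distance to the nearest archived scenario."""
    if not archive:
        return 1.0
    dist = min(abs(candidate[0] - s) + abs(candidate[1] - o) for s, o in archive)
    return min(1.0, dist)

def discover(iterations=2000):
    random.seed(0)
    archive = []
    for _ in range(iterations):
        cand = (random.uniform(0, 25), random.uniform(-3, 3))
        reward = hazard_score(*cand) + 0.2 * novelty(cand, archive)
        if reward > 1.0:          # hazardous AND sufficiently novel
            archive.append(cand)
    return archive

print(len(discover()), "diverse hazardous scenarios archived")
```

The novelty term is what keeps the search from collapsing onto one failure mode, mirroring the diversity objective that AED's preference-learned rewards encode.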

3. System Architectures and Feedback Loops

Scalable CRSs decompose vulnerability discovery into modular services: static analysis, fuzzing/execution, LLM-based analyzers, PoC/patch generators, and orchestration layers for task distribution (Sheng et al., 8 Sep 2025). Key architectural patterns include:

  • Feedback loops: LLM or evolutionary queries are iteratively refined in response to symbolic validator feedback (syntax/semantic errors, intermediate program state), enabling trace-driven repair and optimization as in MoCQ (Li et al., 22 Apr 2025).
  • Agent orchestration: Agentic systems coordinate LLM-based planners, executors, and validation agents via graph-based workflows to transform speculative findings into validated, exploit-confirmed vulnerabilities (Wang et al., 29 Aug 2025).
  • Corpus management: Shared, decaying corpora maintain diversity and prioritize high-fitness seeds (coverage-increasing inputs) in LLM-augmented fuzzing pipelines (Sheng et al., 8 Sep 2025).
  • Isolation and safety: Containerization, syscall whitelists, and resource cgroups are standard for safely executing pen-testing commands and in-memory patching (Abdulzada, 14 Jul 2025, Rajput et al., 2022).
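The corpus-management pattern above can be sketched as a shared seed pool with decaying fitness. The class and field names are illustrative assumptions, not FuzzingBrain's actual interface:

```python
class SeedCorpus:
    """Sketch of a shared, decaying seed corpus: seeds carry a fitness
    score (e.g. coverage gained); each time a seed is scheduled its
    fitness decays, so stale high scorers gradually yield the scheduler
    to fresh coverage-increasing inputs."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.seeds = {}                      # seed bytes -> current fitness

    def add(self, seed: bytes, fitness: float):
        # Keep the best fitness ever observed for a seed.
        self.seeds[seed] = max(fitness, self.seeds.get(seed, 0.0))

    def next_seed(self) -> bytes:
        best = max(self.seeds, key=self.seeds.get)
        self.seeds[best] *= self.decay       # decay only the scheduled seed
        return best

corpus = SeedCorpus()
corpus.add(b"high-coverage-seed", 10.0)
corpus.add(b"rare-path-seed", 1.0)
picks = [corpus.next_seed() for _ in range(25)]
print(picks[0])                    # b'high-coverage-seed' dominates early...
print(b"rare-path-seed" in picks)  # ...but decay lets the rare seed run: True
```

Decaying only the scheduled seed gives a soft round-robin: high-fitness seeds get proportionally more fuzzing time without starving rare-path seeds forever.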

4. Domains, Data Types, and Application Scope

While initial focus was on C/C++ binary analysis, autonomous discovery now spans:

| Domain | Key Representation/Artifact | Notable Methods/Systems |
|---|---|---|
| Program binaries (C/C++) | IR, CFG/DFG, memory snapshots | Mayhem, Mechanical Phish, MoCQ |
| Source code (multi-language) | AST, code slices, DSL queries | MoCQ, Beagle, VulDeePecker |
| Deep learning frameworks | Native kernel APIs, Python PoC | IvySyn (Christou et al., 2022) |
| Industrial control systems | Data dependence graphs, binaries | ICSPatch (Rajput et al., 2022) |
| Android applications | APK code+metadata, UI/ICC flows | A2 (Wang et al., 29 Aug 2025) |
| Autonomous driving (behavioral/physical) | Simulated state/action, reward signals | STARS, AED, PlanFuzz |

Notable results include discovery and responsible disclosure of zero-day vulnerabilities in production systems (e.g., 39 unique CVEs in DL frameworks (Christou et al., 2022), seven new static code vulnerabilities in PHP/JS codebases (Li et al., 22 Apr 2025), and 104 validated zero-days in production Android apps (Wang et al., 29 Aug 2025)).

5. Evaluation Frameworks, Benchmarks, and Empirical Results

Autonomous vulnerability discovery is empirically evaluated along several axes: detection accuracy (precision, recall, and false-positive rate) on benchmark datasets, the number and severity of validated real-world vulnerabilities and responsibly disclosed CVEs, coverage achieved, and end-to-end runtime cost.

6. Limitations, Challenges, and Future Directions

Despite recent breakthroughs, several fundamental challenges persist:

  • False positives and scalability: High false positive rates (e.g., 60% for neuro-symbolic static analysis in dynamic PHP/JS (Li et al., 22 Apr 2025)) stress human response pipelines; precision enhancement via negative samples and fine-tuning is ongoing.
  • Generalization and coverage gaps: Limited language support, data/model leakage between training and evaluation, and the dominance of C/C++ in benchmarks restrict broader applicability (Shereen et al., 2024).
  • Complex input and action spaces: Path explosion in symbolic/concolic execution, and sparse high-value vulnerabilities in deep systems (e.g., planning invariants in AD) challenge search efficiency (Brooks, 2017, Wan et al., 2022).
  • Explainability and root cause analysis: Standardized, interpretable attribution of discovered vulnerabilities remains underdeveloped (Shereen et al., 2024).
  • Domain specialization and feedback integration: Extension to highly dynamic and sensor-rich domains (e.g., autonomous driving, ICS) requires new sensing abstractions, domain-specific feedback, and reward shaping (Qiu et al., 24 Mar 2025, Rajput et al., 2022).
  • Open science and benchmarking: Only half of research code artifacts are open; meta-research calls for standardized, time-based dataset splits and hidden test sets (Shereen et al., 2024).

Future research is converging on diversified CWE-classification, fine-grained and cross-language discovery, open-source benchmarks and code, explainable vulnerability detection, hybrid static/dynamic/ML pipelines, and increasingly agentic, self-improving neuro-symbolic frameworks.

7. Impact and Recent Advancements

Autonomous vulnerability discovery has rapidly transitioned from theoretical research to practical impact. LLM-powered and neuro-symbolic static analysis tools (MoCQ) and LLM-augmented CRSs (FuzzingBrain) have identified previously unknown real-world security flaws overlooked by domain experts and traditional static/dynamic analyzers (Li et al., 22 Apr 2025, Sheng et al., 8 Sep 2025). Agentic discovery systems validate a significant fraction of newly hypothesized vulnerabilities with self-confirming PoCs, reducing analyst burden and integrating directly into CI/CD security pipelines (Wang et al., 29 Aug 2025). In safety-critical domains, evolutionary and RL-based autonomous testers now achieve discovery rates and scenario diversity far beyond handcrafted reward or oracle designs (Liu et al., 2021, Qiu et al., 24 Mar 2025, Wan et al., 2022). As automated and hybrid systems proliferate, and as open benchmarks and reproducible evaluation frameworks become standard, autonomous vulnerability discovery is poised to define the next phase of software and AI system assurance.
