- The paper demonstrates that integrating LLMs with traditional program analysis enables autonomous vulnerability discovery and patching, achieving 61% vulnerability coverage and a 72.1% patch success rate in the AIxCC final.
- The system employs a modular, Kubernetes-based architecture with dynamic resource allocation to support multi-language, ensemble-based fuzzing and static analysis.
- The study highlights that ensemble LLM-driven patching and multi-turn context retrieval enhance patch correctness and robustness across diverse open-source projects.
ATLANTIS: An Expert Analysis of an AI-Driven Autonomous Cyber Reasoning System
Introduction and Context
ATLANTIS is a comprehensive, modular cyber reasoning system (CRS) designed for the DARPA AIxCC competition, which required fully autonomous vulnerability discovery and patching in real-world open-source software. The system integrates state-of-the-art program analysis (fuzzing, symbolic/concolic execution, static analysis) with deep LLM integration for both vulnerability discovery and automated program repair. ATLANTIS was architected for high concurrency, fail-safety, and maximal resource utilization under strict compute and LLM budget constraints, and achieved first place in the AIxCC final competition.
System Architecture and Resource Management
ATLANTIS is deployed as a Kubernetes-based distributed system on Azure, orchestrated via Terraform. The architecture is two-tiered: CRS-level nodes manage global services (web server, logging, LLM proxy), while CP-level nodes are dynamically spawned per challenge project (CP). Each CP-level node runs a CP-MANAGER, which allocates compute and LLM budgets, builds the target, and launches the appropriate bug-finding and patching modules.
Resource allocation is dynamic and proportional to the number of concurrent CPs and harnesses, with careful rate-limiting and budget enforcement for LLM usage. The system supports both commercial LLM APIs (OpenAI, Anthropic, Gemini) and custom fine-tuned LLMs for patching, with internal rate-limiting to prevent resource starvation.
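The paper summary above does not spell out the exact allocation formula; as a minimal sketch of the proportional-split idea, with illustrative numbers and a hypothetical `split_budgets` helper (not ATLANTIS' actual policy), it might look like this:

```python
from dataclasses import dataclass

@dataclass
class ChallengeProject:
    name: str
    harnesses: int  # number of fuzzing harnesses exposed by this CP

def split_budgets(total_llm_budget_usd: float, total_cores: int,
                  projects: list[ChallengeProject]) -> dict[str, dict]:
    """Divide global LLM and compute budgets proportionally to harness count.

    Illustrative policy only: each CP receives a share proportional to its
    number of harnesses, so larger projects get more fuzzing cores and more
    LLM spend than smaller ones.
    """
    total_harnesses = sum(p.harnesses for p in projects) or 1
    allocation = {}
    for p in projects:
        share = p.harnesses / total_harnesses
        allocation[p.name] = {
            "llm_budget_usd": round(total_llm_budget_usd * share, 2),
            "cores": max(1, int(total_cores * share)),
        }
    return allocation

# Example: three concurrent CPs sharing a $100 LLM budget and 64 cores.
cps = [ChallengeProject("libpng", 2), ChallengeProject("jenkins", 6),
       ChallengeProject("sqlite3", 4)]
print(split_budgets(100.0, 64, cps))
```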
Vulnerability Discovery: Multi-Engine, Multi-Language, LLM-Augmented Fuzzing
ATLANTIS-C: C/C++ Vulnerability Discovery
ATLANTIS-C is a Kafka-based microservice system that runs a multi-fuzzer ensemble of LibAFL, AFL++, and libFuzzer, with containerized services for harness building, task scheduling, corpus management, and crash triage. Seven instrumentation modes are supported, with parallelized builds to minimize startup latency. Time-based task scheduling (epochs) enables dynamic harness prioritization and fuzzer fallback, with real-time harness deprioritization based on static and dynamic reachability analysis.
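As a rough illustration of epoch-style scheduling with reachability-driven deprioritization, the sketch below splits one epoch among harnesses by a priority score; the scoring rule and field names are assumptions, not ATLANTIS-C's actual policy:

```python
def schedule_epoch(harnesses: dict[str, dict], epoch_seconds: int = 600) -> list[tuple[str, int]]:
    """Split one fuzzing epoch among harnesses, weighted by a priority score.

    `harnesses` maps name -> {"reachable_sinks": int, "new_coverage_rate": float}.
    The score and field names are illustrative; harnesses that look unreachable
    and have stalled coverage get a minimal slice instead of being dropped.
    """
    scores = {
        name: max(0.1, s["reachable_sinks"] + 10.0 * s["new_coverage_rate"])
        for name, s in harnesses.items()
    }
    total = sum(scores.values())
    plan = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, int(epoch_seconds * score / total)) for name, score in plan]

# A productive harness dominates the epoch; a stagnant one is deprioritized.
print(schedule_epoch({
    "png_decode": {"reachable_sinks": 5, "new_coverage_rate": 0.3},
    "zip_parse": {"reachable_sinks": 0, "new_coverage_rate": 0.0},
}))
```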
Corpus management is distributed, with LLM-assisted initial corpus selection from a seed dataset spanning more than 400 projects across 90 categories. LLM components (DEEPGENERATOR, LLM-Augmented Mutator) generate high-quality seeds and perform semantic mutations when fuzzers stagnate. Directed fuzzing is implemented via BULLSEYE, which combines static closeness centrality and runtime discovery metrics for power scheduling and input prioritization, outperforming AFL++ in time-to-exposure (TTE) and unique crash discovery on most targets.
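A hedged sketch of the power-scheduling idea follows, combining a static closeness signal with a runtime discovery signal; the weighting is invented for illustration and is not BULLSEYE's actual schedule:

```python
import math

def seed_energy(static_closeness: float, runtime_hits: int, base_energy: int = 64) -> int:
    """Assign fuzzing energy to a seed from static and dynamic signals.

    `static_closeness` in [0, 1]: closeness of the code this seed reaches to
    the target sites (1.0 = directly adjacent). `runtime_hits`: how many
    previously unseen target-relevant blocks the seed has discovered at run
    time. The weighting below is illustrative only.
    """
    dynamic_bonus = math.log2(1 + runtime_hits)
    return int(base_energy * (0.5 + static_closeness) * (1.0 + dynamic_bonus))

# A seed close to the target that keeps discovering new blocks receives far
# more mutations per cycle than a distant, stagnant one.
print(seed_energy(0.9, 12), seed_energy(0.1, 0))
```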
ATLANTIS-Java: Sink-Centered Java Vulnerability Discovery
ATLANTIS-Java is designed around the observation that Java vulnerabilities are predominantly sink-centered (e.g., unsafe API usage). The system statically identifies sinkpoints (via CodeQL and custom YAML-configured lists), then orchestrates ensemble fuzzing, sinkpoint-aware exploration, and exploitation. Beep seeds (inputs reaching sinks) are tracked and exploited via LLM-based agents (ExpKit), which achieved an 81.3% success rate in converting reached-but-unexploited sinks into PoVs.
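A minimal sketch of sinkpoint classification driven by a configured sink list is shown below; the schema is an assumption standing in for the YAML configuration, though the listed Java APIs are genuine unsafe sinks:

```python
# Illustrative sink catalogue in the spirit of a YAML-configured sink list;
# the class/method pairs are real unsafe Java APIs, but the schema is assumed.
SINKPOINTS = {
    "deserialization": [("java.io.ObjectInputStream", "readObject")],
    "command-injection": [("java.lang.Runtime", "exec"),
                          ("java.lang.ProcessBuilder", "start")],
    "path-traversal": [("java.io.FileInputStream", "<init>")],
}

def classify_call(owner: str, method: str) -> str | None:
    """Return the vulnerability category if (owner, method) is a known sink."""
    for category, sinks in SINKPOINTS.items():
        if (owner, method) in sinks:
            return category
    return None

# A coverage event at Runtime.exec would mark the triggering input as a
# "beep seed" for the command-injection category, queued for exploitation.
print(classify_call("java.lang.Runtime", "exec"))
```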
A custom concolic executor built on GraalVM Espresso provides symbolic execution for deep path exploration and exploit synthesis, with JVM-level value wrapping and Z3-based constraint solving. Directed fuzzing is implemented via a modified Jazzer, with function-level and basic-block-level distance computation, and dynamic scheduling of up to 15 concurrent sink targets.
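Function-level distance computation of this kind is typically a reverse breadth-first search from the sink targets over the call graph; here is a small sketch under that assumption, with a toy graph and illustrative names:

```python
from collections import deque

def function_distances(call_graph: dict[str, list[str]], targets: set[str]) -> dict[str, int]:
    """For every function, compute the minimum call-graph distance to any target sink.

    `call_graph` maps caller -> list of callees. We BFS backwards from the
    targets over reversed edges, which is a standard way directed fuzzers
    derive function-level distances; basic-block-level distances refine this
    inside each function. The graph below is a toy example.
    """
    reverse: dict[str, list[str]] = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)
    dist = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        fn = queue.popleft()
        for caller in reverse.get(fn, []):
            if caller not in dist:
                dist[caller] = dist[fn] + 1
                queue.append(caller)
    return dist

graph = {"fuzzerTestOneInput": ["parse"], "parse": ["handleEntry"],
         "handleEntry": ["Runtime.exec"]}
print(function_distances(graph, {"Runtime.exec"}))
```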
ATLANTIS-Multilang: Multi-Language, Multi-Strategy Fuzzing
ATLANTIS-Multilang (UNIAFL) is a microservice-based, language-agnostic fuzzer supporting C, C++, and Java, with modular input generators at varying LLM-dependence levels. These include:
- Hybrid Fuzzer: Concolic execution with LLM-modeled external functions, decoupled executor/solver architecture, and a novel fusing mutator for cross-seed solution reuse.
- Function-level Dictionary-based Fuzzing: LLM-generated, context-aware dictionaries for targeted mutation.
- Testlang-based Generation: LLM-reversed harness analysis to produce JSON-schema-based input grammars, with Python generators for complex formats (see the sketch after this list).
- MLLA (Multilang-LLM-Agent): Multi-agent LLM pipeline for call graph parsing, bug candidate detection, and script-based payload generation, with domain knowledge integration and sanitizer-aware exploit guides.
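To make the Testlang idea concrete, here is a hedged sketch of schema-driven input generation; the schema shape, field names, and binary format are invented stand-ins for an LLM-produced spec, not the actual Testlang grammar:

```python
import random
import struct

# Hypothetical, hand-written stand-in for an LLM-produced input spec:
# a schema describing the record format a harness expects.
TESTLANG = {
    "type": "record",
    "fields": [
        {"name": "magic",   "kind": "const_bytes", "value": b"IMG1"},
        {"name": "width",   "kind": "u32",         "max": 4096},
        {"name": "height",  "kind": "u32",         "max": 4096},
        {"name": "payload", "kind": "bytes",       "max_len": 64},
    ],
}

def generate(spec: dict, rng: random.Random) -> bytes:
    """Produce one structured input from the schema above."""
    out = b""
    for field in spec["fields"]:
        if field["kind"] == "const_bytes":
            out += field["value"]
        elif field["kind"] == "u32":
            out += struct.pack("<I", rng.randint(0, field["max"]))
        elif field["kind"] == "bytes":
            out += rng.randbytes(rng.randint(0, field["max_len"]))
    return out

print(generate(TESTLANG, random.Random(0)).hex())
```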
Directed fuzzing is implemented via line-coverage-based seed scoring, with a mixed random/score-based seed selection strategy to avoid overfitting and maintain exploration diversity.
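A minimal sketch of the mixed random/score-based selection described above follows; the epsilon split and scoring fields are illustrative assumptions:

```python
import random

def select_seed(corpus: list[dict], rng: random.Random, epsilon: float = 0.3) -> dict:
    """Pick the next seed to mutate using a mixed random/score-based policy.

    Each corpus entry is assumed to carry a line-coverage-derived score, e.g.
    how many target-relevant lines it covers. With probability `epsilon` we
    pick uniformly at random (exploration); otherwise we pick proportionally
    to the score (exploitation). The 0.3 split is illustrative.
    """
    if rng.random() < epsilon:
        return rng.choice(corpus)
    total = sum(s["score"] for s in corpus) or 1.0
    pick = rng.uniform(0, total)
    acc = 0.0
    for seed in corpus:
        acc += seed["score"]
        if pick <= acc:
            return seed
    return corpus[-1]

corpus = [{"id": "seed-a", "score": 12.0}, {"id": "seed-b", "score": 3.0},
          {"id": "seed-c", "score": 0.5}]
print(select_seed(corpus, random.Random(1))["id"])
```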
Automated Program Repair: Ensemble LLM-Driven Patching
ATLANTIS-Patching is a web-service-based, ensemble agent system for automated patch generation. The CRETE framework provides a unified environment for agent development, with reusable components for build management, fault localization, and patch validation. Agents include:
- MARTIAN: ReAct-style, function-level patching with code search and editing tools.
- MULTIRETRIEVAL: Iterative, multi-turn code retrieval and patching with AST/text/file-based context acquisition.
- PRISM: Hierarchical, team-based multi-agent system with specialized analysis, patch, and evaluation teams.
- VINCENT: Property-guided patching with LLM-inferred program properties and code-embedding-based retrieval.
- CLAUDELIKE: Claude Code-inspired patching built around file-editor tools, with sub-agent delegation.
- Open-source agents: AIDER and SWE-AGENT for coverage of trivial and multi-step repair cases.
A two-level policy enforcement mechanism (agent-level prompt engineering and system-level rule-based checks) ensures patches are plausible, compilable, and do not modify harnesses.
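A hedged sketch of what the system-level, rule-based half of that enforcement could look like is given below; the protected path prefixes and the specific rule set are assumptions for illustration:

```python
# Hypothetical protected prefixes; harness and infrastructure files must not
# be modified by a candidate patch.
FORBIDDEN_PREFIXES = ("fuzz/", "test/harness/", ".aixcc/")

def system_level_check(diff_files: list[str], build_ok: bool, pov_still_crashes: bool,
                       functional_tests_pass: bool) -> tuple[bool, str]:
    """Rule-based gate applied after the agent-level prompt constraints.

    Rejects a candidate patch if it touches protected files, fails to build,
    does not stop the proof-of-vulnerability (PoV) from crashing, or breaks
    functionality tests. The rules and prefixes here are illustrative.
    """
    for path in diff_files:
        if path.startswith(FORBIDDEN_PREFIXES):
            return False, f"patch modifies protected path: {path}"
    if not build_ok:
        return False, "patched project fails to build"
    if pov_still_crashes:
        return False, "PoV still reproduces after patch"
    if not functional_tests_pass:
        return False, "functionality tests fail"
    return True, "patch accepted"

print(system_level_check(["src/png_decode.c"], True, False, True))
```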
Custom LLMs are fine-tuned for code context retrieval, using multi-turn GRPO reinforcement learning to optimize retrieval policies for patching success. Empirical results show that multi-turn retrieval of missing type/function definitions is critical for patch correctness, and RL-fine-tuned retrievers outperform base models in context selection.
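To illustrate the multi-turn retrieval loop itself (not the GRPO training), here is a sketch with a placeholder `ask_llm` callable and an invented "NEED <symbol>" convention; none of these names come from the paper:

```python
def retrieve_then_patch(bug_report: str, codebase: dict[str, str],
                        ask_llm, max_turns: int = 4) -> str:
    """Multi-turn retrieval loop in the spirit of the RL-trained retriever.

    `ask_llm(prompt)` stands in for a model call and returns either
    'NEED <symbol>' when a type/function definition is missing from context,
    or a unified diff once context is sufficient. `codebase` maps symbol
    names to their source text. All names here are illustrative.
    """
    context = [bug_report]
    for _ in range(max_turns):
        reply = ask_llm("\n\n".join(context))
        if reply.startswith("NEED "):
            symbol = reply.removeprefix("NEED ").strip()
            context.append(codebase.get(symbol, f"// definition of {symbol} not found"))
        else:
            return reply  # candidate patch
    return ""  # retrieval budget exhausted without a patch

# With GRPO-style training, the reward would be whether the final patch
# builds and passes validation, credited back across the retrieval turns.
fake_replies = iter(["NEED png_row_t",
                     "--- a/src/decode.c\n+++ b/src/decode.c\n@@ bounds check added @@"])
print(retrieve_then_patch("heap overflow in decode_row()",
                          {"png_row_t": "typedef struct { int width; } png_row_t;"},
                          lambda prompt: next(fake_replies)))
```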
Static Analysis and SARIF Assessment
ATLANTIS-SARIF integrates static and dynamic call graph construction (CodeQL, SVF, SootUp, DynamoRIO, Jazzer) for reachability analysis, with three-level confidence annotation. SARIF report validation is conservative, requiring concrete evidence (PoV or patch) for correctness, and employs an LLM-based matcher for semantic correlation of SARIF reports with internal artifacts. This approach maximizes scoring precision and avoids penalties from false positives.
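A minimal sketch of the conservative assessment rule follows, with a placeholder `llm_match` standing in for the LLM-based semantic matcher; the field names, labels, and matching logic are illustrative:

```python
def assess_sarif(report: dict, internal_povs: list[dict], llm_match) -> str:
    """Conservative SARIF verdict: only claim 'correct' with concrete evidence.

    `report` is a parsed SARIF result; `internal_povs` are vulnerabilities
    ATLANTIS itself found and reproduced (or patched); `llm_match(report, pov)`
    is a placeholder for the LLM-based semantic matcher, returning True when
    both describe the same bug. Field names and labels are illustrative.
    """
    for pov in internal_povs:
        if report["file"] == pov["file"] and llm_match(report, pov):
            return "correct"   # backed by a reproducing PoV or validated patch
    return "abstain"           # no concrete evidence, so no risky claim is made

demo_report = {"file": "src/png_decode.c", "message": "out-of-bounds write in decode_row"}
demo_povs = [{"file": "src/png_decode.c", "crash": "heap-buffer-overflow"}]
print(assess_sarif(demo_report, demo_povs, lambda r, p: "write" in r["message"]))
```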
Evaluation and Results
ATLANTIS achieved the highest score in the AIxCC final, with 61% vulnerability coverage and a 72.1% patch success rate, outperforming all other teams. Module-level analysis shows that ATLANTIS-Multilang contributed 71.2% of verified PoVs, with LLM-powered modules providing significant incremental coverage, especially for complex, structured-input targets. Ensemble patching was essential: no single agent solved all cases, and diversity in agent design (context retrieval, reasoning, tool integration) was critical for robustness.
Custom benchmarks (56 C/C++ and 40 Java projects, 282 vulnerabilities) were used for systematic evaluation, with detailed breakdowns of vulnerability types, patch sizes, and harness coverage. ATLANTIS modules were stress-tested on edge cases (deep call chains, complex input formats, misleading documentation), and LLM-powered components were evaluated for data leakage, context window handling, and robustness to incomplete information.
Implementation Considerations and Trade-offs
- Scalability: The system is engineered for high concurrency, with dynamic resource allocation and fail-safe design. Parallelized builds, microservice isolation, and shared-memory protocols minimize overhead.
- LLM Integration: LLM usage is rate-limited and budgeted per module, with fallback strategies and model selection based on cost-performance trade-offs. Custom LLMs are used where commercial models are insufficient (e.g., code context retrieval).
- Coverage vs. Precision: Directed fuzzing and sink-centered analysis improve precision for security-relevant bugs, but require accurate static/dynamic analysis and domain knowledge integration. Ensemble approaches mitigate the risk of overfitting or missing edge cases.
- Patch Validation: Two-level policy enforcement and ensemble agent design maximize patch correctness and robustness, but increase system complexity and resource requirements.
- Context Engineering: Empirical results highlight the importance of precise code context retrieval for both vulnerability discovery and patching. Multi-turn, RL-optimized retrievers are essential for scaling to large codebases under context window constraints.
Implications and Future Directions
ATLANTIS demonstrates that deep integration of LLMs with traditional program analysis can achieve high-precision, high-coverage autonomous vulnerability discovery and patching in real-world software. The modular, ensemble-based architecture is robust to individual component failures and adaptable to diverse codebases and vulnerability types.
Future work should address:
- Generalization to more languages and frameworks: Extending concolic execution, static analysis, and LLM-based input generation to additional languages (e.g., Rust, Go, Python).
- Improved RL-based context retrieval: Scaling multi-turn retrieval to deeper call graphs, integrating semantic diffing, and reducing catastrophic forgetting.
- End-to-end learning: Joint optimization of vulnerability discovery and patching pipelines, possibly with self-play or co-training.
- Human-in-the-loop integration: Leveraging human feedback for ambiguous cases, especially in patch correctness and SARIF assessment.
- Deployment and commercialization: Transitioning from competition settings to production environments, with attention to scalability, security, and maintainability.
Conclusion
ATLANTIS sets a new standard for autonomous, AI-driven vulnerability discovery and patching, combining advanced program analysis, LLM-powered reasoning, and robust system engineering. Its success in the AIxCC competition validates the feasibility of deploying such systems at scale, and its open-source release provides a foundation for further research and practical adoption in automated cybersecurity. The system's modularity, ensemble strategies, and deep LLM integration offer a blueprint for future autonomous security tools capable of operating across heterogeneous, real-world software ecosystems.