- The paper demonstrates that a multi-agent framework improves bug detection by 39.7 percentage points compared to single-agent systems.
- The methodology leverages parallel execution and mutual information theory to combine low-correlation outputs from agents focused on correctness, security, performance, and style.
- Experimental validation on 99 code samples achieved a 76.1% true positive rate without executing any code, underscoring its practical efficiency.
Multi-Agent Code Verification with Compound Vulnerability Detection
Introduction
The paper introduces CodeX-Verify, a multi-agent system designed to address the common problem of LLMs generating buggy code. Prior benchmarks such as SWE-bench and BaxBench reveal high failure and vulnerability rates in LLM-generated code, with 29.6% of patches failing production suitability and 62% containing significant vulnerabilities. Existing verification tools, while somewhat effective, typically focus on a single bug class and therefore miss compound bug patterns.
CodeX-Verify employs four specialized agents—Correctness, Security, Performance, and Style—to concurrently analyze code from distinct problem-centric perspectives, with results demonstrating substantial improvements in bug detection rates.
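The four-agent architecture can be sketched in a few lines. The agents below are hypothetical heuristic stand-ins (the paper's agents are far more sophisticated analyzers); the sketch only illustrates the concurrent fan-out/merge structure:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the paper's four agents; each inspects the
# code from its own perspective and returns a list of findings.
def correctness_agent(code):
    return ["bare except swallows errors"] if "except:" in code else []

def security_agent(code):
    return ["possible SQL injection via string formatting"] \
        if "execute(" in code and "%" in code else []

def performance_agent(code):
    return ["nested loops may be quadratic"] if code.count("for ") >= 2 else []

def style_agent(code):
    return ["missing docstring"] if '"""' not in code else []

AGENTS = {
    "correctness": correctness_agent,
    "security": security_agent,
    "performance": performance_agent,
    "style": style_agent,
}

def verify(code):
    """Run all four agents concurrently and merge their findings."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {name: pool.submit(agent, code) for name, agent in AGENTS.items()}
        return {name: fut.result() for name, fut in futures.items()}

snippet = 'cursor.execute("SELECT * FROM users WHERE id = %s" % uid)'
report = verify(snippet)
```

Because each agent only reads the code, the four analyses are independent and can run fully in parallel; the merge step is a simple dictionary of per-agent findings.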
Methodology and Theoretical Framework
CodeX-Verify adopts a multi-agent strategy, leveraging parallel execution to cover varied bug types. The paper argues formally that combining agents observing different bug patterns increases overall detection relative to any single agent: by a mutual-information argument, the combined agent outputs carry strictly more information about bug presence than any individual output does.
Agents exhibit low pairwise correlation (ρ = 0.05–0.25) across bug categories, confirming that their outputs are complementary rather than redundant. A diminishing-returns theorem predicts how accuracy improves as agents are added, and empirical testing confirms progressive gains that shrink with each additional agent.
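Under a simplifying independence assumption (reasonable given the near-zero correlations above), the diminishing-returns pattern can be reproduced in a few lines. The per-agent detection probabilities below are illustrative, not the paper's numbers:

```python
# Illustrative sketch: assume each agent i independently catches a given
# bug with probability p_i, so k agents together detect it with
# probability 1 - prod(1 - p_i). The p_i values are invented.
def combined_detection(probs):
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss

probs = [0.50, 0.45, 0.40, 0.35]
rates = [combined_detection(probs[:k]) for k in range(1, len(probs) + 1)]
gains = [rates[0]] + [b - a for a, b in zip(rates, rates[1:])]
# Each added agent raises the detection rate, but by progressively less.
```

The `gains` list shrinks monotonically, matching the diminishing-returns prediction: the first agent contributes the most, and each subsequent agent adds less because earlier agents have already caught an overlapping share of bugs.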
Compound Vulnerabilities
An essential aspect of CodeX-Verify is its formalization of compound vulnerability risk. The paper shows that multiple bugs in the same code combine multiplicatively in risk, unlike traditional models that assume additive risk. This insight stems from attack graph theory and is exemplified by the sharply heightened risk when SQL injection co-occurs with exposed credentials.
Mathematically, compound risk is modeled as Risk(v₁ ∪ v₂) = Risk(v₁) × Risk(v₂) × α(v₁, v₂), where α(v₁, v₂) is a context-dependent amplification factor. Under this model, co-occurring vulnerabilities can amplify risk by up to 15×.
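A minimal sketch of the multiplicative model follows. The amplification table and base risk scores are invented for illustration; only the 15× upper bound and the SQL-injection-plus-exposed-credentials pairing come from the paper:

```python
# Risk(v1 ∪ v2) = Risk(v1) × Risk(v2) × α(v1, v2), per the paper's model.
# Hypothetical amplification table; 15.0 reflects the paper's worst case.
AMPLIFICATION = {
    frozenset({"sql_injection", "exposed_credentials"}): 15.0,
}

def compound_risk(v1, v2, base_risk):
    """Multiplicative compound risk with a pair-specific amplification factor."""
    alpha = AMPLIFICATION.get(frozenset({v1, v2}), 1.0)  # default: no amplification
    return base_risk[v1] * base_risk[v2] * alpha

base = {"sql_injection": 0.7, "exposed_credentials": 0.6}
risk = compound_risk("sql_injection", "exposed_credentials", base)  # ≈ 6.3
```

Note how this differs from an additive model: individually, the two scores sum to 1.3, but the multiplicative model with α = 15 yields roughly 6.3, capturing the attack-graph insight that one vulnerability can turn another into a working exploit chain.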
Experimental Validation
Testing on a dataset of 99 code samples with verified labels demonstrates the effectiveness of CodeX-Verify. It achieves a true positive rate of 76.1%, matching the best-known test-based method while running faster and without executing any code. The system significantly outperforms existing static analyzers and single-agent systems, validating the multi-agent theory with real-world results.
Tests across 15 agent combinations further underscore the advantage of multi-agent approaches, yielding a 39.7 percentage point improvement over single-agent configurations. Notably, a two-agent configuration (Correctness and Performance) achieves the highest accuracy, suggesting that simplified deployments can be effective when security is not the primary concern.
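The 15 configurations are exactly the non-empty subsets of the four agents, which the ablation can enumerate directly (agent names as in the paper; the enumeration itself is standard):

```python
from itertools import combinations

AGENTS = ("correctness", "security", "performance", "style")

# The 15 tested configurations are the non-empty subsets of four agents:
# C(4,1) + C(4,2) + C(4,3) + C(4,4) = 4 + 6 + 4 + 1 = 15.
configs = [c for r in range(1, len(AGENTS) + 1) for c in combinations(AGENTS, r)]
# The paper's best-performing pair appears among them:
has_best_pair = ("correctness", "performance") in configs
```

In the paper's ablation, each such subset is scored on the labeled benchmark; the counting above simply shows why exactly 15 runs are needed.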
Implications and Future Directions
CodeX-Verify applies wherever LLM-generated code must be deployed reliably, with particular relevance to enterprise-scale systems where security and correctness are paramount. The model pushes the boundaries of current static analysis methodologies, indicating strong potential for integration into CI/CD pipelines, real-time code review systems, and automated bug-fixing tools.
Future work could hybridize static and dynamic testing to further reduce false positives and broaden bug detection coverage. Extending the system to more programming languages and attack patterns would likewise widen its utility across diverse software ecosystems.
Conclusion
The paper presents CodeX-Verify as both a practical and a theoretical advance in code verification. By leveraging a multi-agent framework, CodeX-Verify addresses the inadequacies of traditional verification tools and sets a precedent for holistic vulnerability detection. The results indicate that such systems can more effectively guard against errors in LLM-generated code, improving security and reliability in AI-assisted software engineering. The amplification model for compound vulnerabilities marks a significant shift in risk assessment methodology, framing new standards for multi-agent systems in real-world applications.