CodeX-Verify Multi-Agent Verification
- CodeX-Verify is a multi-agent system that integrates four specialized static analyzers running asynchronously to detect flaws in LLM-generated code.
- It combines an information-theoretic approach with empirical validation to achieve high detection recall and near-real-time performance in CI/CD pipelines.
- The system introduces a novel compound vulnerability risk model in which co-occurring vulnerabilities multiply risk rather than add it, offering significant improvements over traditional additive scoring.
CodeX-Verify is a multi-agent code verification system designed to detect logic, security, performance, and maintainability flaws in code—particularly for evaluating LLM-generated patches. CodeX-Verify integrates four specialized static analyzers running asynchronously, combining their outputs through a weighted aggregation and structured decision logic. The system is notable for its information-theoretic proof of detection improvement via specialization, empirical validation on verified datasets, and novel compound-risk modeling for multiple vulnerabilities. Its architecture enables near–real-time verification suitable for high-throughput continuous integration (CI), while maintaining high detection recall, including compound vulnerabilities that amplify security risk multiplicatively rather than additively (Rajan, 20 Nov 2025).
1. Multi-Agent Architecture and Workflow
CodeX-Verify consists of four parallel, specialized agents controlled by an AsyncOrchestrator. Each agent targets a non-redundant dimension of code analysis:
- Correctness Critic: Detects logic errors, missed edge cases, and exception-handling gaps.
- Security Auditor: Recognizes patterns from the OWASP Top 10 and CWE (including SQL injection, command injection, unsafe deserialization, and hardcoded secrets) and applies entropy-based secret detection; a minimal entropy sketch follows this list.
- Performance Profiler: Assesses algorithmic complexity (O(1)–O(2ⁿ)), resource leaks, and performance anti-patterns.
- Style Linter: Evaluates maintainability based on cyclomatic and Halstead complexity, naming conventions, and docstring density.
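The entropy-based secret check mentioned for the Security Auditor can be illustrated with a short sketch. The regular expression, length cutoff, and 4.0-bit entropy threshold below are assumptions for illustration, not the paper's actual detector.

```python
import math
import re

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    if not s:
        return 0.0
    counts = {c: s.count(c) for c in set(s)}
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

# Hypothetical heuristic: long, high-entropy literals assigned to
# credential-like names are flagged as potential hardcoded secrets.
ASSIGNMENT = re.compile(r'(?i)(secret|token|password|api_key)\s*=\s*["\']([^"\']+)["\']')

def find_candidate_secrets(source: str, threshold: float = 4.0):
    for match in ASSIGNMENT.finditer(source):
        name, value = match.group(1), match.group(2)
        if len(value) >= 16 and shannon_entropy(value) >= threshold:
            yield name, value

# Example: the high-entropy literal is flagged; the plain word is not.
sample = 'api_key = "9f8A7x2Qz0LmN4bVt6Rw1cYs"\npassword = "hello"\n'
print(list(find_candidate_secrets(sample)))
```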
Workflow:
- AsyncOrchestrator simultaneously runs all four agent evaluations on input code.
- Agents output issue lists and normalized scores.
- Aggregator merges issues, identifies compound vulnerability pairs, and computes a weighted system score.
- Decision Logic issues FAIL, WARNING, or PASS verdicts based on a three-level rule set:
  - Any CRITICAL or compound vulnerability: FAIL.
  - One SECURITY HIGH finding or two CORRECTNESS HIGH findings: FAIL.
  - Borderline scores: WARNING.
  - Otherwise: PASS.
This parallel agent design supports high throughput and enables post-hoc configuration of agent involvement and aggregation logic.
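The following sketch illustrates this fan-out/aggregate/decide flow under assumed agent interfaces, score scales, and thresholds; it is a minimal approximation of the described orchestration, not the CodeX-Verify implementation (compound-pair handling is sketched separately in Section 4).

```python
import asyncio
from dataclasses import dataclass

@dataclass
class AgentResult:
    agent: str
    score: float      # normalized agent score in [0, 1] (assumed scale)
    issues: list      # list of (severity, description) tuples

# Weights mirror the reference values cited later in this article.
WEIGHTS = {"security": 0.45, "correctness": 0.35, "performance": 0.15, "style": 0.05}

async def run_agent(name: str, code: str) -> AgentResult:
    # Placeholder analyzer: a real agent would parse `code` and emit findings.
    await asyncio.sleep(0)
    return AgentResult(agent=name, score=1.0, issues=[])

async def verify(code: str) -> str:
    # Fan out: all four agents run concurrently on the same input.
    results = await asyncio.gather(*(run_agent(name, code) for name in WEIGHTS))
    issues = [(r.agent, sev, desc) for r in results for sev, desc in r.issues]
    weighted = sum(WEIGHTS[r.agent] * r.score for r in results)

    # Decision logic with illustrative thresholds.
    sec_high = sum(1 for a, sev, _ in issues if a == "security" and sev == "HIGH")
    cor_high = sum(1 for a, sev, _ in issues if a == "correctness" and sev == "HIGH")
    if any(sev == "CRITICAL" for _, sev, _ in issues):
        return "FAIL"
    if sec_high >= 1 or cor_high >= 2:
        return "FAIL"
    if weighted < 0.7:            # borderline-score band -> WARNING
        return "WARNING"
    return "PASS"

print(asyncio.run(verify("def add(a, b):\n    return a + b\n")))
```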
2. Theoretical Basis for Multi-Agent Verification
The system’s foundation is an information-theoretic proof (Theorem 1) establishing that combining specialized agents with non-redundant detection patterns increases bug-identification capacity beyond any single analyzer. Formally, letting $O_i$ denote the observation from agent $i$ and $B$ the ground-truth bug labeling, the chain rule of mutual information gives

$$I(B; O_1, \ldots, O_n) = I(B; O_1) + \sum_{i=2}^{n} I(B; O_i \mid O_1, \ldots, O_{i-1}),$$

where $I(\cdot\,;\cdot)$ denotes mutual information. If each conditional term is strictly positive due to non-overlapping detection competencies, the system’s total information on $B$ strictly exceeds that of the best individual agent.
Under low agent decision correlation ($\rho$ small), the probability of combined detection closely approximates independent detection:

$$P(\text{detect}) \approx 1 - \prod_{i=1}^{n} (1 - p_i),$$

where $p_i$ is the solo detection accuracy of agent $i$; substituting the four agents' individual accuracies gives the reported upper bound on combined detection.
Theoretical analysis (Theorem 2) confirms diminishing marginal returns: Adding agents in order of their individual performance produces successively smaller improvements, due to conditioning on already-detected information.
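The independence bound and the diminishing-returns behavior can be reproduced numerically. The solo accuracies below are hypothetical placeholders, not the paper's measured per-agent values.

```python
# Illustrative solo detection accuracies (placeholders, not the paper's values).
solo = {"correctness": 0.55, "security": 0.45, "performance": 0.40, "style": 0.30}

# Under (approximate) independence, P(detect) ~= 1 - prod_i (1 - p_i).
# Adding agents best-first shows the diminishing marginal gains of Theorem 2.
miss_all, prev_detect = 1.0, 0.0
for name, p in sorted(solo.items(), key=lambda kv: kv[1], reverse=True):
    miss_all *= (1 - p)        # probability that every agent added so far misses
    detect = 1 - miss_all      # combined detection upper bound
    print(f"+{name:<12} combined={detect:.3f}  marginal gain={detect - prev_detect:+.3f}")
    prev_detect = detect
```

Each successive line prints a smaller marginal gain, mirroring the conditioning argument in the theorem.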
3. Agent Correlation and Detection Dynamics
Pairwise Pearson correlation analysis over 99 labeled samples revealed low inter-agent correlation coefficients ($\rho \in [0.05, 0.25]$):
| Agent Pair | $\rho$ |
|---|---|
| Correctness, Security | 0.15 |
| Correctness, Perf. | 0.25 |
| Correctness, Style | 0.20 |
| Security, Perf. | 0.10 |
| Security, Style | 0.05 |
| Perf., Style | 0.15 |
Low $\rho$ implies error independence across agents, supporting substantial recall improvement with modest impact on precision when aggregating detections. This underpins the empirically observed system gains in multi-agent configurations.
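A correlation matrix of this kind can be computed directly from per-sample binary agent verdicts; the sketch below uses randomly generated stand-in verdicts, since the labeled dataset itself is not reproduced here.

```python
import numpy as np

# Hypothetical per-sample binary verdicts (1 = agent flagged the sample),
# standing in for the 99 labeled samples used in the paper.
rng = np.random.default_rng(0)
verdicts = {
    "correctness": rng.integers(0, 2, size=99),
    "security": rng.integers(0, 2, size=99),
    "performance": rng.integers(0, 2, size=99),
    "style": rng.integers(0, 2, size=99),
}

names = list(verdicts)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        rho = np.corrcoef(verdicts[names[i]], verdicts[names[j]])[0, 1]
        print(f"{names[i]:<12} vs {names[j]:<12} rho = {rho:+.2f}")
```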
4. Compound Vulnerability Risk Modeling
Traditional risk assessment models sum individual vulnerability risks (e.g., $R_{\text{total}} = \sum_i R_i$), greatly underestimating the danger of synergistic (compound) attack vectors. The CodeX-Verify model (Theorem 3) multiplies risks and introduces a synergy factor $\alpha$:

$$R_{\text{compound}} = \alpha \cdot R_1 \cdot R_2, \qquad \alpha > 1.$$

For instance, an SQL injection ($R_1 = 10$) and a credential leak ($R_2 = 10$) combine to a compound risk of $300$, fifteen times the traditional additive sum of $20$. The Security Auditor programmatically enumerates all exploitable vulnerability pairs, computes compound risk, and classifies any pair above a CRITICAL threshold as a system-level blocking issue.
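A minimal sketch of this pairing logic follows, assuming a 0–10 severity scale, a synergy factor of 3, and a CRITICAL threshold of 100; all are illustrative choices consistent with the 15× example above rather than the paper's exact parameters.

```python
from itertools import combinations

# Illustrative per-vulnerability severity scores (assumed 0-10 scale).
vulns = {"sql_injection": 10, "credential_leak": 10, "resource_leak": 4}

SYNERGY = 3.0             # synergy factor alpha; value chosen to match the 15x example
CRITICAL_THRESHOLD = 100.0

def compound_risk(r1: float, r2: float, alpha: float = SYNERGY) -> float:
    """Multiplicative compound risk: alpha * r1 * r2 (vs. additive r1 + r2)."""
    return alpha * r1 * r2

for (a, ra), (b, rb) in combinations(vulns.items(), 2):
    risk = compound_risk(ra, rb)
    additive = ra + rb
    verdict = "CRITICAL -> FAIL" if risk >= CRITICAL_THRESHOLD else "below threshold"
    print(f"{a} + {b}: compound={risk:.0f} (additive={additive})  {verdict}")
```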
5. Empirical Results and Ablation Studies
Datasets and Baselines
- Verified corpus: 99 Python samples (71 buggy, 28 correct), spanning 16 bug categories.
- Throughput testing: 300 Claude Sonnet 4.5-generated patches, no labels.
- Baselines: Codex without verification (40% accuracy), static analyzers (65% accuracy, 35% FPR), Meta Prompt Testing (75% TPR, 8.6% FPR).
System-Level Performance
| System | Accuracy | TPR | FPR | Precision | F1 |
|---|---|---|---|---|---|
| CodeX-Verify | 68.7% | 76.1% | 50.0% | 79.4% | 0.777 |
Statistical analysis demonstrated a +28.7 pp accuracy gain over Codex and a +3.7 pp gain over static analyzers.
Multi-Agent Ablation
| # Agents | Mean Accuracy | Marginal Gain (pp) |
|---|---|---|
| 1 | 32.8% | — |
| 2 | 47.7% | +14.9 |
| 3 | 61.2% | +13.5 |
| 4 | 72.4% | +11.2 |
- Best 2-agent pair (Correctness+Performance): 79.3% accuracy, 83.3% TPR, 40.0% FPR.
- Marginal gains shrink for additional agents, supporting the diminishing returns model.
Throughput
- Latency: Mean 148 ms; all samples under 200 ms.
- Verdicts (out of 300): FAIL 72%, WARNING 23%, PASS 2%, ERROR 3%.
- Compound vulnerabilities: All detected (4 pairs, 100% hit rate).
6. Application Scenarios and Deployment
CodeX-Verify is designed for drop-in integration in CI/CD pipelines, enabling sub-200 ms gating of code submissions and merge requests. Real-time feedback can be provided to developers via IDE plugins, with multi-agent–sourced diagnostics enabling prompt issue localization. System trade-offs can be tailored by agent selection and weighting:
- Security-critical contexts (finance, healthcare): Prioritize recall with all four agents (76% TPR, 50% FPR).
- Developer-friendly workflows: Use Correctness + Performance (79.3% accuracy, 40% FPR).
Weighted aggregation can be tuned via per-agent weights; reference values are Security 0.45, Correctness 0.35, Performance 0.15, Style 0.05.
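A minimal sketch of the weighted aggregation, using the reference weights above with placeholder agent scores (the developer-profile re-weighting at the end is an illustrative assumption):

```python
# Reference agent weights from the paper; per-agent scores are placeholders.
weights = {"security": 0.45, "correctness": 0.35, "performance": 0.15, "style": 0.05}
scores = {"security": 0.80, "correctness": 0.90, "performance": 0.70, "style": 0.95}

assert abs(sum(weights.values()) - 1.0) < 1e-9   # weights form a convex combination

system_score = sum(weights[a] * scores[a] for a in weights)
print(f"weighted system score: {system_score:.3f}")

# Illustrative developer-friendly profile: emphasize correctness and
# performance, matching the Correctness + Performance pairing above.
dev_weights = {"correctness": 0.6, "performance": 0.4}
dev_score = sum(dev_weights[a] * scores[a] for a in dev_weights)
print(f"developer-profile score: {dev_score:.3f}")
```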
7. Technical and Scientific Contributions
- First formal mutual information proof that agent specialization and low-correlation aggregation strictly improve bug detection, validated by ablation across all agent combinations.
- Compound vulnerability risk model establishes 15× risk amplification for certain attack chains, a substantial revision over additive scoring.
- Static analysis matching execution-based TPR (76% vs. 75%) at sub-200 ms latency—unprecedented among static analyzers.
- Empirical +39.7 percentage point accuracy gain over the best single agent, exceeding previously reported multi-agent F1 improvements in the literature (e.g., +18.7 pp for AutoReview).
- Systematically advances both the theoretical and practical frontiers of multi-agent automated code verification for LLM-generated code (Rajan, 20 Nov 2025).