
CodeX-Verify Multi-Agent Verification

Updated 25 November 2025
  • CodeX-Verify is a multi-agent system that integrates four specialized static analyzers running asynchronously to detect flaws in LLM-generated code.
  • It combines an information-theoretic approach with empirical validation to achieve high detection recall and near-real-time performance in CI/CD pipelines.
  • The system introduces a novel compound vulnerability risk model in which co-occurring vulnerabilities amplify risk multiplicatively, a significant improvement over traditional additive scoring.

CodeX-Verify is a multi-agent code verification system designed to detect logic, security, performance, and maintainability flaws in code—particularly for evaluating LLM-generated patches. CodeX-Verify integrates four specialized static analyzers running asynchronously, combining their outputs through a weighted aggregation and structured decision logic. The system is notable for its information-theoretic proof of detection improvement via specialization, empirical validation on verified datasets, and novel compound-risk modeling for multiple vulnerabilities. Its architecture enables near–real-time verification suitable for high-throughput continuous integration (CI), while maintaining high detection recall, including compound vulnerabilities that amplify security risk multiplicatively rather than additively (Rajan, 20 Nov 2025).

1. Multi-Agent Architecture and Workflow

CodeX-Verify consists of four parallel, specialized agents controlled by an AsyncOrchestrator. Each agent targets a non-redundant dimension of code analysis:

  • Correctness Critic: Detects logic errors, missed edge cases, and exception-handling gaps.
  • Security Auditor: Recognizes patterns from OWASP Top-10 and CWE—including SQL injection, command injection, unsafe deserialization, hardcoded secrets, and applies entropy-based secret detection.
  • Performance Profiler: Assesses algorithmic complexity (O(1)–O(2ⁿ)), resource leaks, and performance anti-patterns.
  • Style Linter: Evaluates maintainability based on cyclomatic and Halstead complexity, naming conventions, and docstring density.
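The entropy-based secret detection attributed to the Security Auditor can be sketched as a Shannon-entropy filter over string tokens. The threshold and minimum length below are illustrative assumptions, not the paper's calibrated values:

```python
import math

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical symbol distribution."""
    if not s:
        return 0.0
    counts: dict[str, int] = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, threshold: float = 4.0, min_len: int = 16) -> bool:
    """Flag long, high-entropy tokens as candidate hardcoded secrets.

    `threshold` and `min_len` are illustrative tuning knobs."""
    return len(token) >= min_len and shannon_entropy(token) >= threshold

print(looks_like_secret("aK9#mQ2$xV7pLw4Z8rT1"))  # True: 20 distinct chars, high entropy
print(looks_like_secret("configuration_value"))   # False: ordinary identifier, low entropy
```

Natural-language identifiers repeat characters heavily and stay well below the entropy of random API-key-like material, which is what makes this simple filter workable.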

Workflow:

  1. AsyncOrchestrator simultaneously runs all four agent evaluations on input code.
  2. Agents output issue lists and normalized scores ($S_1$–$S_4$).
  3. Aggregator merges issues, identifies compound vulnerability pairs, and computes a weighted system score.
  4. Decision Logic issues FAIL, WARNING, or PASS verdicts based on a three-level rule set:
    • Any CRITICAL or compound vulnerability: FAIL.
    • One SECURITY HIGH or two CORRECTNESS HIGH: FAIL.
    • Borderline scores: WARNING.
    • Otherwise: PASS.
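The three-level rule set can be sketched as a small decision function. The issue representation and the WARNING score threshold below are illustrative assumptions, not the system's calibrated values:

```python
def decide(issues, system_score, warn_threshold=0.75):
    """Three-level verdict over aggregated agent findings.

    `issues` is a list of (agent, severity, is_compound) triples;
    `warn_threshold` is an illustrative cutoff for 'borderline' scores."""
    # Rule 1: any CRITICAL or compound vulnerability blocks outright.
    if any(sev == "CRITICAL" or compound for _, sev, compound in issues):
        return "FAIL"
    # Rule 2: one SECURITY HIGH or two CORRECTNESS HIGH also fail.
    sec_high = sum(1 for a, s, _ in issues if a == "security" and s == "HIGH")
    corr_high = sum(1 for a, s, _ in issues if a == "correctness" and s == "HIGH")
    if sec_high >= 1 or corr_high >= 2:
        return "FAIL"
    # Rule 3: borderline aggregate scores warn; everything else passes.
    if system_score < warn_threshold:
        return "WARNING"
    return "PASS"

print(decide([("security", "HIGH", False)], 0.9))  # FAIL
print(decide([], 0.70))                            # WARNING
print(decide([("style", "LOW", False)], 0.9))      # PASS
```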

This parallel agent design supports high throughput and enables post-hoc configuration of agent involvement and aggregation logic.
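The fan-out/aggregate pattern above can be sketched with asyncio. The agent bodies below are stubs standing in for the real analyzers, and the weights are the Section 6 reference values:

```python
import asyncio

# Stub analyzers: each returns (issue_list, normalized_score). The real
# agents perform static analysis; these placeholders only model the shape.
async def correctness_critic(code): return ([], 0.9)
async def security_auditor(code):   return ([("security", "HIGH")], 0.4)
async def performance_profiler(code): return ([], 0.8)
async def style_linter(code):       return ([], 0.7)

AGENTS = [correctness_critic, security_auditor, performance_profiler, style_linter]
WEIGHTS = [0.35, 0.45, 0.15, 0.05]  # reference weights from Section 6

async def orchestrate(code: str):
    """Run all four agents concurrently and aggregate their outputs."""
    results = await asyncio.gather(*(agent(code) for agent in AGENTS))
    issues = [i for issue_list, _ in results for i in issue_list]
    system_score = sum(w * s for w, (_, s) in zip(WEIGHTS, results))
    return issues, system_score

issues, score = asyncio.run(orchestrate("def f(x): return x + 1"))
print(issues, round(score, 2))  # [('security', 'HIGH')] 0.65
```

Because the agents share no state, `asyncio.gather` bounds end-to-end latency by the slowest single agent rather than the sum of all four.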

2. Theoretical Basis for Multi-Agent Verification

The system’s foundation is an information-theoretic proof (Theorem 1) establishing that combining specialized agents with non-redundant detection patterns increases bug-identification capacity beyond any single analyzer. Formally, letting $A_i$ denote the observation from agent $i$ and $B$ the ground-truth bug labeling:

$$I(A_1,\ldots,A_4;\,B) = \sum_{i=1}^{4} I(A_i;\,B \mid A_1,\ldots,A_{i-1}),$$

where $I(\cdot\,;\,\cdot)$ denotes mutual information. If each conditional term is strictly positive due to non-overlapping detection competencies, the system’s total information on $B$ strictly exceeds that of the best individual agent.

Under low agent decision correlation (small $\rho_{ij}$), the probability of combined detection closely approximates independent detection:

$$P_\text{combined} \approx 1 - \prod_{i=1}^{4} (1 - p_i),$$

where $p_i$ is the solo accuracy of agent $i$. For $p_1 = 0.759$, $p_2 = 0.207$, $p_3 = 0.172$, $p_4 = 0.172$, the upper bound is $P_\text{combined} \approx 0.96$.

Theoretical analysis (Theorem 2) confirms diminishing marginal returns: Adding agents in order of their individual performance produces successively smaller improvements, due to conditioning on already-detected information.

3. Agent Correlation and Detection Dynamics

Pairwise Pearson correlation analysis over 99 labeled samples revealed low inter-agent correlation coefficients ($\rho_{ij} \in [0.05,\,0.25]$):

Agent Pair                $\rho_{ij}$
Correctness, Security     0.15
Correctness, Performance  0.25
Correctness, Style        0.20
Security, Performance     0.10
Security, Style           0.05
Performance, Style        0.15

Low $\rho$ implies near-independent errors across agents, supporting substantial recall improvement at modest precision cost when aggregating detections. This underpins the empirically observed gains of multi-agent configurations.
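The coefficients above are plain Pearson correlations over binary per-sample detection vectors. A minimal sketch, with sample vectors invented purely for illustration:

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy 0/1 detection vectors for two agents over eight labeled samples
# (1 = agent flagged the sample as buggy).
correctness = [1, 1, 1, 1, 0, 0, 1, 0]
security    = [1, 0, 1, 0, 0, 1, 0, 0]
rho = pearson(correctness, security)
print(round(rho, 3))  # 0.067: nearly uncorrelated detections
```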

4. Compound Vulnerability Risk Modeling

Traditional risk assessment models sum individual vulnerability risks (e.g., $\text{Risk}_\text{total} = \text{Risk}(v_1) + \text{Risk}(v_2)$), greatly underestimating the danger of synergistic (compound) attack vectors. The CodeX-Verify model (Theorem 3) multiplies risks and introduces a synergy factor $\alpha(v_1, v_2) > 1$:

$$\text{Risk}(v_1 \cup v_2) = \text{Risk}(v_1) \times \text{Risk}(v_2) \times \alpha(v_1, v_2).$$

For instance, an SQL injection (risk 10) and a credential leak (risk 10) combine to yield $10 \times 10 \times 3.0 = 300$ risk, fifteen times the traditional additive sum of 20. The Security Auditor programmatically enumerates all exploitable vulnerability pairs, computes compound risk, and classifies any pair above a CRITICAL threshold as a system-level blocking issue.
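The multiplicative model reduces to a few lines; the helper name below is ours, not the system's API:

```python
def compound_risk(risk_a: float, risk_b: float, synergy: float) -> float:
    """Multiplicative compound-risk model with synergy factor alpha > 1."""
    return risk_a * risk_b * synergy

sql_injection, credential_leak, alpha = 10, 10, 3.0
compound = compound_risk(sql_injection, credential_leak, alpha)  # 300.0
additive = sql_injection + credential_leak                       # 20
print(compound / additive)  # 15.0x amplification over additive scoring
```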

5. Empirical Results and Ablation Studies

Datasets and Baselines

  • Verified corpus: 99 Python samples (71 buggy, 28 correct), spanning 16 bug categories.
  • Throughput testing: 300 Claude Sonnet 4.5-generated patches, no labels.
  • Baselines: Codex (no verify, 40% accuracy), static analyzers (65% acc., 35% FPR), Meta Prompt Testing (75% TPR, 8.6% FPR).

System-Level Performance

System        Accuracy  TPR    FPR    Precision  F1
CodeX-Verify  68.7%     76.1%  50.0%  79.4%      0.777

Statistical analysis demonstrated a +28.7 pp accuracy gain over Codex ($p < 0.001$) and +3.7 pp over static analyzers ($p < 0.05$).

Multi-Agent Ablation

# Agents  Mean Accuracy  Marginal Gain (pp)
1         32.8%          –
2         47.7%          +14.9
3         61.2%          +13.5
4         72.4%          +11.2
  • Best 2-agent pair (Correctness+Performance): 79.3% accuracy, 83.3% TPR, 40.0% FPR.
  • Marginal gains shrink for additional agents, supporting the diminishing returns model.

Throughput

  • Latency: mean 148 ms, all samples < 200 ms.
  • Verdicts (out of 300): FAIL 72%, WARNING 23%, PASS 2%, ERROR 3%.
  • Compound vulnerabilities: All detected (4 pairs, 100% hit rate).

6. Application Scenarios and Deployment

CodeX-Verify is designed for drop-in integration in CI/CD pipelines, enabling sub-200 ms gating of code submissions and merge requests. Real-time feedback can be provided to developers via IDE plugins, with multi-agent–sourced diagnostics enabling prompt issue localization. System trade-offs can be tailored by agent selection and weighting:

  • Security-critical contexts (finance, healthcare): Prioritize recall with all agents (76% TPR, 50% FPR).
  • Developer-friendly workflows: Use Correctness + Performance (79.3% acc, 40% FPR).

Weighted aggregation can be tuned via $w_i \propto p_i \cdot (1 - \bar{\rho}_i) \cdot \gamma_i$; reference values are: Security 0.45, Correctness 0.35, Performance 0.15, Style 0.05.
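A sketch of deriving weights from this proportionality, using the solo accuracies from Section 2 and each agent's mean pairwise correlation from Section 3. The severity multipliers gamma are hypothetical (the source does not specify them), so the resulting weights only roughly approach the reference values:

```python
# p: solo accuracies (Section 2); rho_bar: mean pairwise correlation per
# agent (Section 3 table); gamma: HYPOTHETICAL severity multipliers chosen
# here to favor security findings.
p       = {"correctness": 0.759, "security": 0.207, "performance": 0.172, "style": 0.172}
rho_bar = {"correctness": 0.200, "security": 0.100, "performance": 0.167, "style": 0.133}
gamma   = {"correctness": 1.0,   "security": 5.0,   "performance": 1.0,   "style": 0.5}

raw = {a: p[a] * (1 - rho_bar[a]) * gamma[a] for a in p}
total = sum(raw.values())
weights = {a: raw[a] / total for a in raw}  # normalize so weights sum to 1

for agent, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{agent:12s} {w:.3f}")
```

With these multipliers, security receives the largest weight and style the smallest, mirroring the ordering of the reference values.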

7. Technical and Scientific Contributions

  • First formal mutual information proof that agent specialization and low-correlation aggregation strictly improve bug detection, validated by ablation across all agent combinations.
  • Compound vulnerability risk model establishes 15× risk amplification for certain attack chains, a substantial revision over additive scoring.
  • Static analysis matching execution-based TPR (76% vs. 75%) at sub-200 ms latency—unprecedented among static analyzers.
  • Empirical +39.7 percentage point accuracy gain over the best single agent, exceeding previously reported multi-agent F1 improvements in the literature (e.g., +18.7 pp for AutoReview).
  • Systematically advances both the theoretical and practical frontiers of multi-agent automated code verification for LLM-generated code (Rajan, 20 Nov 2025).