CodeX-Verify Multi-Agent Verification
- CodeX-Verify is a multi-agent system that integrates four specialized static analyzers running asynchronously to detect flaws in LLM-generated code.
- It combines an information-theoretic approach with empirical validation to achieve high detection recall and near-real-time performance in CI/CD pipelines.
- The system introduces a novel compound vulnerability risk model in which co-occurring vulnerabilities multiply risk rather than add it, offering significant improvements over traditional additive scoring.
CodeX-Verify is a multi-agent code verification system designed to detect logic, security, performance, and maintainability flaws in code—particularly for evaluating LLM-generated patches. CodeX-Verify integrates four specialized static analyzers running asynchronously, combining their outputs through a weighted aggregation and structured decision logic. The system is notable for its information-theoretic proof of detection improvement via specialization, empirical validation on verified datasets, and novel compound-risk modeling for multiple vulnerabilities. Its architecture enables near–real-time verification suitable for high-throughput continuous integration (CI), while maintaining high detection recall, including compound vulnerabilities that amplify security risk multiplicatively rather than additively (Rajan, 20 Nov 2025).
1. Multi-Agent Architecture and Workflow
CodeX-Verify consists of four parallel, specialized agents controlled by an AsyncOrchestrator. Each agent targets a non-redundant dimension of code analysis:
- Correctness Critic: Detects logic errors, missed edge cases, and exception-handling gaps.
- Security Auditor: Recognizes patterns from the OWASP Top 10 and CWE (including SQL injection, command injection, unsafe deserialization, and hardcoded secrets) and applies entropy-based secret detection; a minimal entropy sketch follows this list.
- Performance Profiler: Assesses algorithmic complexity (O(1)–O(2ⁿ)), resource leaks, and performance anti-patterns.
- Style Linter: Evaluates maintainability based on cyclomatic and Halstead complexity, naming conventions, and docstring density.
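The entropy-based secret check mentioned for the Security Auditor can be illustrated with a short sketch. The regular expression, length cutoff, and 4.0-bit entropy threshold below are assumptions for illustration, not the paper's actual detector.

```python
import math
import re

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    if not s:
        return 0.0
    counts = {c: s.count(c) for c in set(s)}
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

# Hypothetical heuristic: long, high-entropy literals assigned to
# credential-like names are flagged as potential hardcoded secrets.
ASSIGNMENT = re.compile(r'(?i)(secret|token|password|api_key)\s*=\s*["\']([^"\']+)["\']')

def find_candidate_secrets(source: str, threshold: float = 4.0):
    for match in ASSIGNMENT.finditer(source):
        name, value = match.group(1), match.group(2)
        if len(value) >= 16 and shannon_entropy(value) >= threshold:
            yield name, value

# Example: the high-entropy literal is flagged; the plain word is not.
sample = 'api_key = "9f8A7x2Qz0LmN4bVt6Rw1cYs"\npassword = "hello"\n'
print(list(find_candidate_secrets(sample)))
```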
Workflow:
- AsyncOrchestrator simultaneously runs all four agent evaluations on input code.
- Agents output issue lists and normalized scores.
- Aggregator merges issues, identifies compound vulnerability pairs, and computes a weighted system score.
- Decision Logic issues FAIL, WARNING, or PASS verdicts based on a three-level rule set:
  - Any CRITICAL or compound vulnerability: FAIL.
  - One SECURITY HIGH finding or two CORRECTNESS HIGH findings: FAIL.
  - Borderline scores: WARNING.
  - Otherwise: PASS.
This parallel agent design supports high throughput and enables post-hoc configuration of agent involvement and aggregation logic.
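The following sketch illustrates this fan-out/aggregate/decide flow under assumed agent interfaces, score scales, and thresholds; it is a minimal approximation of the described orchestration, not the CodeX-Verify implementation (compound-pair handling is sketched separately in Section 4).

```python
import asyncio
from dataclasses import dataclass

@dataclass
class AgentResult:
    agent: str
    score: float      # normalized agent score in [0, 1] (assumed scale)
    issues: list      # list of (severity, description) tuples

# Weights mirror the reference values cited later in this article.
WEIGHTS = {"security": 0.45, "correctness": 0.35, "performance": 0.15, "style": 0.05}

async def run_agent(name: str, code: str) -> AgentResult:
    # Placeholder analyzer: a real agent would parse `code` and emit findings.
    await asyncio.sleep(0)
    return AgentResult(agent=name, score=1.0, issues=[])

async def verify(code: str) -> str:
    # Fan out: all four agents run concurrently on the same input.
    results = await asyncio.gather(*(run_agent(name, code) for name in WEIGHTS))
    issues = [(r.agent, sev, desc) for r in results for sev, desc in r.issues]
    weighted = sum(WEIGHTS[r.agent] * r.score for r in results)

    # Decision logic with illustrative thresholds.
    sec_high = sum(1 for a, sev, _ in issues if a == "security" and sev == "HIGH")
    cor_high = sum(1 for a, sev, _ in issues if a == "correctness" and sev == "HIGH")
    if any(sev == "CRITICAL" for _, sev, _ in issues):
        return "FAIL"
    if sec_high >= 1 or cor_high >= 2:
        return "FAIL"
    if weighted < 0.7:            # borderline-score band -> WARNING
        return "WARNING"
    return "PASS"

print(asyncio.run(verify("def add(a, b):\n    return a + b\n")))
```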
2. Theoretical Basis for Multi-Agent Verification
The system’s foundation is an information-theoretic proof (Theorem 1) establishing that combining specialized agents with non-redundant detection patterns increases bug-identification capacity beyond any single analyzer. Formally, letting $O_i$ denote the observation from agent $i$ and $B$ the ground-truth bug labeling, the chain rule of mutual information gives

$$I(B; O_1, \ldots, O_n) = I(B; O_1) + \sum_{i=2}^{n} I(B; O_i \mid O_1, \ldots, O_{i-1}),$$

where $I(\cdot\,;\cdot)$ denotes mutual information. If each conditional term is strictly positive due to non-overlapping detection competencies, the system’s total information on $B$ strictly exceeds that of the best individual agent.
Under low agent decision correlation ($\rho$ small), the probability of combined detection closely approximates independent detection:

$$P(\text{detect}) \approx 1 - \prod_{i=1}^{n} (1 - p_i),$$

where $p_i$ is the solo detection accuracy of agent $i$; substituting the four agents' individual accuracies gives the reported upper bound on combined detection.
Theoretical analysis (Theorem 2) confirms diminishing marginal returns: Adding agents in order of their individual performance produces successively smaller improvements, due to conditioning on already-detected information.
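The independence bound and the diminishing-returns behavior can be reproduced numerically. The solo accuracies below are hypothetical placeholders, not the paper's measured per-agent values.

```python
# Illustrative solo detection accuracies (placeholders, not the paper's values).
solo = {"correctness": 0.55, "security": 0.45, "performance": 0.40, "style": 0.30}

# Under (approximate) independence, P(detect) ~= 1 - prod_i (1 - p_i).
# Adding agents best-first shows the diminishing marginal gains of Theorem 2.
miss_all, prev_detect = 1.0, 0.0
for name, p in sorted(solo.items(), key=lambda kv: kv[1], reverse=True):
    miss_all *= (1 - p)        # probability that every agent added so far misses
    detect = 1 - miss_all      # combined detection upper bound
    print(f"+{name:<12} combined={detect:.3f}  marginal gain={detect - prev_detect:+.3f}")
    prev_detect = detect
```

Each successive line prints a smaller marginal gain, mirroring the conditioning argument in the theorem.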
3. Agent Correlation and Detection Dynamics
Pairwise Pearson correlation analysis over 99 labeled samples revealed low inter-agent correlation coefficients ($\rho \in [0.05, 0.25]$):
| Agent Pair | $\rho$ |
|---|---|
| Correctness, Security | 0.15 |
| Correctness, Perf. | 0.25 |
| Correctness, Style | 0.20 |
| Security, Perf. | 0.10 |
| Security, Style | 0.05 |
| Perf., Style | 0.15 |
Low $\rho$ implies error independence across agents, supporting substantial recall improvement with modest impact on precision when aggregating detections. This underpins the empirically observed system gains in multi-agent configurations.
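A correlation matrix of this kind can be computed directly from per-sample binary agent verdicts; the sketch below uses randomly generated stand-in verdicts, since the labeled dataset itself is not reproduced here.

```python
import numpy as np

# Hypothetical per-sample binary verdicts (1 = agent flagged the sample),
# standing in for the 99 labeled samples used in the paper.
rng = np.random.default_rng(0)
verdicts = {
    "correctness": rng.integers(0, 2, size=99),
    "security": rng.integers(0, 2, size=99),
    "performance": rng.integers(0, 2, size=99),
    "style": rng.integers(0, 2, size=99),
}

names = list(verdicts)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        rho = np.corrcoef(verdicts[names[i]], verdicts[names[j]])[0, 1]
        print(f"{names[i]:<12} vs {names[j]:<12} rho = {rho:+.2f}")
```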
4. Compound Vulnerability Risk Modeling
Traditional risk assessment models sum individual vulnerability risks (e.g., $R_{\text{total}} = \sum_i R_i$), greatly underestimating the danger of synergistic (compound) attack vectors. The CodeX-Verify model (Theorem 3) multiplies risks and introduces a synergy factor $\alpha$:

$$R_{\text{compound}} = \alpha \cdot R_1 \cdot R_2, \qquad \alpha > 1.$$

For instance, an SQL injection ($R_1 = 10$) and a credential leak ($R_2 = 10$) combine to a compound risk of $300$, fifteen times the traditional additive sum of $20$. The Security Auditor programmatically enumerates all exploitable vulnerability pairs, computes compound risk, and classifies any pair above a CRITICAL threshold as a system-level blocking issue.
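A minimal sketch of this pairing logic follows, assuming a 0–10 severity scale, a synergy factor of 3, and a CRITICAL threshold of 100; all are illustrative choices consistent with the 15× example above rather than the paper's exact parameters.

```python
from itertools import combinations

# Illustrative per-vulnerability severity scores (assumed 0-10 scale).
vulns = {"sql_injection": 10, "credential_leak": 10, "resource_leak": 4}

SYNERGY = 3.0             # synergy factor alpha; value chosen to match the 15x example
CRITICAL_THRESHOLD = 100.0

def compound_risk(r1: float, r2: float, alpha: float = SYNERGY) -> float:
    """Multiplicative compound risk: alpha * r1 * r2 (vs. additive r1 + r2)."""
    return alpha * r1 * r2

for (a, ra), (b, rb) in combinations(vulns.items(), 2):
    risk = compound_risk(ra, rb)
    additive = ra + rb
    verdict = "CRITICAL -> FAIL" if risk >= CRITICAL_THRESHOLD else "below threshold"
    print(f"{a} + {b}: compound={risk:.0f} (additive={additive})  {verdict}")
```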
5. Empirical Results and Ablation Studies
Datasets and Baselines
- Verified corpus: 99 Python samples (71 buggy, 28 correct), spanning 16 bug categories.
- Throughput testing: 300 Claude Sonnet 4.5-generated patches, no labels.
- Baselines: Codex without verification (40% accuracy), static analyzers (65% accuracy, 35% FPR), Meta Prompt Testing (75% TPR, 8.6% FPR).
System-Level Performance
| System | Accuracy | TPR | FPR | Precision | F1 |
|---|---|---|---|---|---|
| CodeX-Verify | 68.7% | 76.1% | 50.0% | 79.4% | 0.777 |
Statistical analysis demonstrated a +28.7 pp accuracy gain over Codex and a +3.7 pp gain over static analyzers.
Multi-Agent Ablation
| # Agents | Mean Accuracy | Marginal Gain (pp) |
|---|---|---|
| 1 | 32.8% | — |
| 2 | 47.7% | +14.9 |
| 3 | 61.2% | +13.5 |
| 4 | 72.4% | +11.2 |
- Best 2-agent pair (Correctness+Performance): 79.3% accuracy, 83.3% TPR, 40.0% FPR.
- Marginal gains shrink for additional agents, supporting the diminishing returns model.
Throughput
- Latency: Mean 148 ms; all samples under 200 ms.
- Verdicts (out of 300): FAIL 72%, WARNING 23%, PASS 2%, ERROR 3%.
- Compound vulnerabilities: All detected (4 pairs, 100% hit rate).
6. Application Scenarios and Deployment
CodeX-Verify is designed for drop-in integration in CI/CD pipelines, enabling sub-200 ms gating of code submissions and merge requests. Real-time feedback can be provided to developers via IDE plugins, with multi-agent–sourced diagnostics enabling prompt issue localization. System trade-offs can be tailored by agent selection and weighting:
- Security-critical contexts (finance, healthcare): Prioritize recall with all four agents (76% TPR, 50% FPR).
- Developer-friendly workflows: Use Correctness + Performance (79.3% accuracy, 40% FPR).
Weighted aggregation can be tuned via per-agent weights; reference values are Security 0.45, Correctness 0.35, Performance 0.15, Style 0.05.
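A minimal sketch of the weighted aggregation, using the reference weights above with placeholder agent scores (the developer-profile re-weighting at the end is an illustrative assumption):

```python
# Reference agent weights from the paper; per-agent scores are placeholders.
weights = {"security": 0.45, "correctness": 0.35, "performance": 0.15, "style": 0.05}
scores = {"security": 0.80, "correctness": 0.90, "performance": 0.70, "style": 0.95}

assert abs(sum(weights.values()) - 1.0) < 1e-9   # weights form a convex combination

system_score = sum(weights[a] * scores[a] for a in weights)
print(f"weighted system score: {system_score:.3f}")

# Illustrative developer-friendly profile: emphasize correctness and
# performance, matching the Correctness + Performance pairing above.
dev_weights = {"correctness": 0.6, "performance": 0.4}
dev_score = sum(dev_weights[a] * scores[a] for a in dev_weights)
print(f"developer-profile score: {dev_score:.3f}")
```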
7. Technical and Scientific Contributions
- First formal mutual information proof that agent specialization and low-correlation aggregation strictly improve bug detection, validated by ablation across all agent combinations.
- Compound vulnerability risk model establishes 15× risk amplification for certain attack chains, a substantial revision over additive scoring.
- Static analysis matching execution-based TPR (76% vs. 75%) at sub-200 ms latency—unprecedented among static analyzers.
- Empirical +39.7 percentage point accuracy gain over the best single agent, exceeding previously reported multi-agent F1 improvements in the literature (e.g., +18.7 pp for AutoReview).
- Systematically advances both the theoretical and practical frontiers of multi-agent automated code verification for LLM-generated code (Rajan, 20 Nov 2025).