Multi-Agent Judging Framework

Updated 16 November 2025

Multi-Agent Judging Framework is a system where specialized SLM/LLM agents (Critic, Defender, Judge) collaborate through structured rounds to evaluate content safety.
The framework uses iterative debate with fixed safety aspects and formal aggregation rules to refine evaluations and improve semantic alignment.
Empirical results show that this approach enhances reliability and reduces inference costs compared to traditional large-model judges.

A Multi-Agent Judging Framework comprises a set of autonomous agents—each typically instantiated as a Small LLM (SLM) or LLM and assigned distinct roles—that collectively evaluate responses to adversarial prompts or other tasks through structured debate and consensus-building. This paradigm is particularly prominent in scalable safety assessment of LLMs, where cost reduction and semantic fidelity are crucial. The approach leverages adversarial role conditioning, inter-agent debate, and formal aggregation rules to capture nuanced violations and achieve reliability comparable to high-cost frontier models, but at an order-of-magnitude lower inference cost.

1. Architecture: Critic, Defender, and Judge Agents

The framework (Lin et al., 9 Nov 2025) instantiates three core agents:

Critic Agent (C): Given a prompt–response pair $(x, y)$ , the Critic scores the response across $K = 5$ fixed safety aspects (e.g., toxicity, privacy, illegal advice), producing for each aspect $a_k$ a sub-score $u_k \in [1,10]$ , a natural-language critique, an aggregated risk level (1–5), and an overall ten-point score.
Defender Agent (D): Receives Critic outputs $(u_k, m_k)$ and for each aspect rebutts with counter-score $v_k \in [1,10]$ and a defense explanation, aiming to lower perceived risk and simulate adversarial robustness.
Judge Agent (J): After $R$ rounds of Critic–Defender interaction, the Judge integrates all arguments and outputs final attack success (binary), five-level risk, and ten-point continuous risk, with a rationale.

Agent specialization is achieved by prompt engineering and context conditioning; agents share the same base model, but system-level prompts encode distinct roles and evaluation rubrics.

2. Debate Protocol and Value Alignment

Each evaluation proceeds through structured rounds:

Pre-Debate Alignment: All agents are initialized with a shared, explicit taxonomy of $K$ safety aspects to constrain scope and prevent topic drift.
Round Structure: In each round $r$ , Critic first produces per-aspect scores and messages; Defender rebuts with counter-scores and responses. This is strictly turn-taking; agents do not directly alter one another's scores from previous rounds.
Stopping Criterion: $R = 3$ rounds empirically yields optimal accuracy–cost tradeoff; further rounds introduce error accumulation and diminish returns. Optionally, early exit occurs if scores stabilize without substantive changes.

Three rounds are found to be the “sweet spot”: maximal improvement in semantic reconciliation with token cost growing linearly in $R$ .

3. Formal Scoring and Aggregation Mechanics

Judging is formalized by aggregation equations:

Critic per-aspect utility:

$u_{r, k} = U_c(x, y; a_k), \quad U_c: (x, y, a) \mapsto [1,10]$

Aggregate per round:

$U_c^r(x, y) = \frac{1}{K}\sum_{k=1}^K u_{r, k}$

Defender counter-argument:

$v_{r, k} = S_d(x, y, u_{r, k}, a_k), \quad S_d: (x, y, u, a) \mapsto [1,10]$

Defender aggregate:

$U_d^r(x, y) = \frac{1}{K}\sum_{k=1}^K v_{r, k}$

Judge decision rule:

$D_j(c_1, d_1, \ldots, c_R, d_R) = \lambda \frac{1}{R} \sum_{r=1}^R U_c^r + (1-\lambda) \frac{1}{R} \sum_{r=1}^R U_d^r$

Continuous final risk score $D_j \in [1, 10]$ ; attack success is declared if $D_j > \tau$ (threshold $\tau$ selected by maximizing Cohen's $\kappa$ on validation). Mixing weight $\lambda$ and threshold $\tau$ are tuned for maximum ground-truth agreement.

4. Calibration, Prompt Engineering, and Training

No gradient-based fine-tuning is applied. Calibration relies on:

Prompt Engineering: All agents utilize rigorously engineered prompts encoding evaluation taxonomy and roles.
Pre-Debate Alignment: Ensures agents evaluate on the same $K$ aspects.
Noise Filter: Optional pre-processing to eliminate adversarial artifacts from inputs with additional LLM-based filtering.
Threshold/Weight Tuning: Only $\tau$ (success boundary) and $\lambda$ (critic–defender mixing) are learned, aligning agent consensus with HAJailBench human labels.

5. Inference Cost Reduction and Resource Efficiency

Each inference comprises $2R + 1$ SLM calls ( $R=3$ implies 7 calls); SLMs such as Qwen3-14B incur $\sim$ 5x less unit cost than GPT-4o judges. Overall, the multi-agent protocol requires only $0.46\times$ the cost of a single GPT-4o call, even when accounting for extra rounds. Lower per-call context size and parallelization further enhance token efficiency.

6. Empirical Evaluation: HAJailBench Benchmark

Dataset: HAJailBench collects 12,000 adversarial instances (100 harmful goals, 12 attack methods, multiple target LLMs) with expert-annotated ground truth (binary success/fail, 5-level risk, 10-point risk).
Metrics: Cohen’s $\kappa$ (agreement with ground truth), unit cost/query, cost ratio with GPT-4o baseline.
Results:
- GPT-4o judge: $\kappa=0.7627$ , cost = \$8.36×10⁻⁴/query
- Qwen3-14B multi-agent: $\kappa=0.7352$ , cost = \$3.85×10⁻⁴/query (54% lower)
- Multi-agent improves over single SLM JailJudge by 25–32% relative $\kappa$ gain, reduces cost by 54–82%.

Debate Round Ablation

Rounds	$\kappa$	Cost Ratio vs Optimal (R=3)
0	0.5709	0.30
1	0.6955	0.69
2	0.7143	0.87
3	0.7352	1.00 (optimal)
4	0.7260	1.05
5	0.7221	1.12

Accuracy sharply increases up to three rounds; further rounds introduce error drift, plateau, or degrade consensus.

7. Insights, Limitations, and Extensions

Key Insights: Structured debate with value-aligned roles enables cost-effective SLM ensembles to match semantic richness and reliability of large judge models for safety evaluation; fixed safety aspects anchor reasoning; maximal benefit realized after three rounds.
Limitations: HAJailBench covers 100 goals × 12 attacks; unseen or emergent attacks may challenge reliability. SLM-based agents are susceptible to multi-turn hallucinations and bias stacking. Human annotation in HAJailBench references a large-model judge in the second round, potentially introducing bias.
Potential Extensions: Dynamic topic reallocation, uncertainty-aware adaptive stopping, adversarial robustness mechanisms in debaters, human-in-the-loop review, continual learning, and cross-cultural calibration are identified as promising future directions.

A Multi-Agent Judging Framework, as exemplified by the Critic–Defender–Judge protocol (Lin et al., 9 Nov 2025), organizes refinement and consensus-building in safety evaluation via structured, prompt-conditioned SLM agents, with provable gains in semantic fidelity and cost efficiency on large-scale adversarial datasets. This modular debate architecture is effective in capturing and aligning nuanced jailbreak risks, setting a precedent for scalable, interpretable LLM safety assessment.

PDF Markdown Chat (Pro)

References (1)

Efficient LLM Safety Evaluation through Multi-Agent Debate (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to Multi-Agent Judging Framework.