Multi-Agent Judging Framework
- Multi-Agent Judging Framework is a system where specialized SLM/LLM agents (Critic, Defender, Judge) collaborate through structured rounds to evaluate content safety.
- The framework uses iterative debate with fixed safety aspects and formal aggregation rules to refine evaluations and improve semantic alignment.
- Empirical results show that this approach enhances reliability and reduces inference costs compared to traditional large-model judges.
A Multi-Agent Judging Framework comprises a set of autonomous agents—each typically instantiated as a Small LLM (SLM) or LLM and assigned distinct roles—that collectively evaluate responses to adversarial prompts or other tasks through structured debate and consensus-building. This paradigm is particularly prominent in scalable safety assessment of LLMs, where cost reduction and semantic fidelity are crucial. The approach leverages adversarial role conditioning, inter-agent debate, and formal aggregation rules to capture nuanced violations and achieve reliability comparable to high-cost frontier models, but at an order-of-magnitude lower inference cost.
1. Architecture: Critic, Defender, and Judge Agents
The framework (Lin et al., 9 Nov 2025) instantiates three core agents:
- Critic Agent (C): Given a prompt–response pair , the Critic scores the response across fixed safety aspects (e.g., toxicity, privacy, illegal advice), producing for each aspect a sub-score , a natural-language critique, an aggregated risk level (1–5), and an overall ten-point score.
- Defender Agent (D): Receives Critic outputs and for each aspect rebutts with counter-score and a defense explanation, aiming to lower perceived risk and simulate adversarial robustness.
- Judge Agent (J): After rounds of Critic–Defender interaction, the Judge integrates all arguments and outputs final attack success (binary), five-level risk, and ten-point continuous risk, with a rationale.
Agent specialization is achieved by prompt engineering and context conditioning; agents share the same base model, but system-level prompts encode distinct roles and evaluation rubrics.
2. Debate Protocol and Value Alignment
Each evaluation proceeds through structured rounds:
- Pre-Debate Alignment: All agents are initialized with a shared, explicit taxonomy of safety aspects to constrain scope and prevent topic drift.
- Round Structure: In each round , Critic first produces per-aspect scores and messages; Defender rebuts with counter-scores and responses. This is strictly turn-taking; agents do not directly alter one another's scores from previous rounds.
- Stopping Criterion: rounds empirically yields optimal accuracy–cost tradeoff; further rounds introduce error accumulation and diminish returns. Optionally, early exit occurs if scores stabilize without substantive changes.
Three rounds are found to be the “sweet spot”: maximal improvement in semantic reconciliation with token cost growing linearly in .
3. Formal Scoring and Aggregation Mechanics
Judging is formalized by aggregation equations:
- Critic per-aspect utility:
- Aggregate per round:
- Defender counter-argument:
- Defender aggregate:
- Judge decision rule:
Continuous final risk score ; attack success is declared if (threshold selected by maximizing Cohen's on validation). Mixing weight and threshold are tuned for maximum ground-truth agreement.
4. Calibration, Prompt Engineering, and Training
No gradient-based fine-tuning is applied. Calibration relies on:
- Prompt Engineering: All agents utilize rigorously engineered prompts encoding evaluation taxonomy and roles.
- Pre-Debate Alignment: Ensures agents evaluate on the same aspects.
- Noise Filter: Optional pre-processing to eliminate adversarial artifacts from inputs with additional LLM-based filtering.
- Threshold/Weight Tuning: Only (success boundary) and (critic–defender mixing) are learned, aligning agent consensus with HAJailBench human labels.
5. Inference Cost Reduction and Resource Efficiency
Each inference comprises $2R + 1$ SLM calls ( implies 7 calls); SLMs such as Qwen3-14B incur 5x less unit cost than GPT-4o judges. Overall, the multi-agent protocol requires only the cost of a single GPT-4o call, even when accounting for extra rounds. Lower per-call context size and parallelization further enhance token efficiency.
6. Empirical Evaluation: HAJailBench Benchmark
- Dataset: HAJailBench collects 12,000 adversarial instances (100 harmful goals, 12 attack methods, multiple target LLMs) with expert-annotated ground truth (binary success/fail, 5-level risk, 10-point risk).
- Metrics: Cohen’s (agreement with ground truth), unit cost/query, cost ratio with GPT-4o baseline.
- Results:
- GPT-4o judge: , cost = \$8.36×10⁻⁴/query
- Qwen3-14B multi-agent: , cost = \$3.85×10⁻⁴/query (54% lower)
- Multi-agent improves over single SLM JailJudge by 25–32% relative gain, reduces cost by 54–82%.
Debate Round Ablation
| Rounds | Cost Ratio vs Optimal (R=3) | |
|---|---|---|
| 0 | 0.5709 | 0.30 |
| 1 | 0.6955 | 0.69 |
| 2 | 0.7143 | 0.87 |
| 3 | 0.7352 | 1.00 (optimal) |
| 4 | 0.7260 | 1.05 |
| 5 | 0.7221 | 1.12 |
Accuracy sharply increases up to three rounds; further rounds introduce error drift, plateau, or degrade consensus.
7. Insights, Limitations, and Extensions
- Key Insights: Structured debate with value-aligned roles enables cost-effective SLM ensembles to match semantic richness and reliability of large judge models for safety evaluation; fixed safety aspects anchor reasoning; maximal benefit realized after three rounds.
- Limitations: HAJailBench covers 100 goals × 12 attacks; unseen or emergent attacks may challenge reliability. SLM-based agents are susceptible to multi-turn hallucinations and bias stacking. Human annotation in HAJailBench references a large-model judge in the second round, potentially introducing bias.
- Potential Extensions: Dynamic topic reallocation, uncertainty-aware adaptive stopping, adversarial robustness mechanisms in debaters, human-in-the-loop review, continual learning, and cross-cultural calibration are identified as promising future directions.
A Multi-Agent Judging Framework, as exemplified by the Critic–Defender–Judge protocol (Lin et al., 9 Nov 2025), organizes refinement and consensus-building in safety evaluation via structured, prompt-conditioned SLM agents, with provable gains in semantic fidelity and cost efficiency on large-scale adversarial datasets. This modular debate architecture is effective in capturing and aligning nuanced jailbreak risks, setting a precedent for scalable, interpretable LLM safety assessment.
Sponsored by Paperpile, the PDF & BibTeX manager trusted by top AI labs.
Get 30 days free