Papers
Topics
Authors
Recent
2000 character limit reached

Multi-Agent Judging Framework

Updated 16 November 2025
  • Multi-Agent Judging Framework is a system where specialized SLM/LLM agents (Critic, Defender, Judge) collaborate through structured rounds to evaluate content safety.
  • The framework uses iterative debate with fixed safety aspects and formal aggregation rules to refine evaluations and improve semantic alignment.
  • Empirical results show that this approach enhances reliability and reduces inference costs compared to traditional large-model judges.

A Multi-Agent Judging Framework comprises a set of autonomous agents—each typically instantiated as a Small LLM (SLM) or LLM and assigned distinct roles—that collectively evaluate responses to adversarial prompts or other tasks through structured debate and consensus-building. This paradigm is particularly prominent in scalable safety assessment of LLMs, where cost reduction and semantic fidelity are crucial. The approach leverages adversarial role conditioning, inter-agent debate, and formal aggregation rules to capture nuanced violations and achieve reliability comparable to high-cost frontier models, but at an order-of-magnitude lower inference cost.

1. Architecture: Critic, Defender, and Judge Agents

The framework (Lin et al., 9 Nov 2025) instantiates three core agents:

  • Critic Agent (C): Given a prompt–response pair (x,y)(x, y), the Critic scores the response across K=5K = 5 fixed safety aspects (e.g., toxicity, privacy, illegal advice), producing for each aspect aka_k a sub-score uk[1,10]u_k \in [1,10], a natural-language critique, an aggregated risk level (1–5), and an overall ten-point score.
  • Defender Agent (D): Receives Critic outputs (uk,mk)(u_k, m_k) and for each aspect rebutts with counter-score vk[1,10]v_k \in [1,10] and a defense explanation, aiming to lower perceived risk and simulate adversarial robustness.
  • Judge Agent (J): After RR rounds of Critic–Defender interaction, the Judge integrates all arguments and outputs final attack success (binary), five-level risk, and ten-point continuous risk, with a rationale.

Agent specialization is achieved by prompt engineering and context conditioning; agents share the same base model, but system-level prompts encode distinct roles and evaluation rubrics.

2. Debate Protocol and Value Alignment

Each evaluation proceeds through structured rounds:

  • Pre-Debate Alignment: All agents are initialized with a shared, explicit taxonomy of KK safety aspects to constrain scope and prevent topic drift.
  • Round Structure: In each round rr, Critic first produces per-aspect scores and messages; Defender rebuts with counter-scores and responses. This is strictly turn-taking; agents do not directly alter one another's scores from previous rounds.
  • Stopping Criterion: R=3R = 3 rounds empirically yields optimal accuracy–cost tradeoff; further rounds introduce error accumulation and diminish returns. Optionally, early exit occurs if scores stabilize without substantive changes.

Three rounds are found to be the “sweet spot”: maximal improvement in semantic reconciliation with token cost growing linearly in RR.

3. Formal Scoring and Aggregation Mechanics

Judging is formalized by aggregation equations:

  • Critic per-aspect utility:

ur,k=Uc(x,y;ak),Uc:(x,y,a)[1,10]u_{r, k} = U_c(x, y; a_k), \quad U_c: (x, y, a) \mapsto [1,10]

  • Aggregate per round:

Ucr(x,y)=1Kk=1Kur,kU_c^r(x, y) = \frac{1}{K}\sum_{k=1}^K u_{r, k}

  • Defender counter-argument:

vr,k=Sd(x,y,ur,k,ak),Sd:(x,y,u,a)[1,10]v_{r, k} = S_d(x, y, u_{r, k}, a_k), \quad S_d: (x, y, u, a) \mapsto [1,10]

  • Defender aggregate:

Udr(x,y)=1Kk=1Kvr,kU_d^r(x, y) = \frac{1}{K}\sum_{k=1}^K v_{r, k}

  • Judge decision rule:

Dj(c1,d1,,cR,dR)=λ1Rr=1RUcr+(1λ)1Rr=1RUdrD_j(c_1, d_1, \ldots, c_R, d_R) = \lambda \frac{1}{R} \sum_{r=1}^R U_c^r + (1-\lambda) \frac{1}{R} \sum_{r=1}^R U_d^r

Continuous final risk score Dj[1,10]D_j \in [1, 10]; attack success is declared if Dj>τD_j > \tau (threshold τ\tau selected by maximizing Cohen's κ\kappa on validation). Mixing weight λ\lambda and threshold τ\tau are tuned for maximum ground-truth agreement.

4. Calibration, Prompt Engineering, and Training

No gradient-based fine-tuning is applied. Calibration relies on:

  • Prompt Engineering: All agents utilize rigorously engineered prompts encoding evaluation taxonomy and roles.
  • Pre-Debate Alignment: Ensures agents evaluate on the same KK aspects.
  • Noise Filter: Optional pre-processing to eliminate adversarial artifacts from inputs with additional LLM-based filtering.
  • Threshold/Weight Tuning: Only τ\tau (success boundary) and λ\lambda (critic–defender mixing) are learned, aligning agent consensus with HAJailBench human labels.

5. Inference Cost Reduction and Resource Efficiency

Each inference comprises $2R + 1$ SLM calls (R=3R=3 implies 7 calls); SLMs such as Qwen3-14B incur \sim5x less unit cost than GPT-4o judges. Overall, the multi-agent protocol requires only 0.46×0.46\times the cost of a single GPT-4o call, even when accounting for extra rounds. Lower per-call context size and parallelization further enhance token efficiency.

6. Empirical Evaluation: HAJailBench Benchmark

  • Dataset: HAJailBench collects 12,000 adversarial instances (100 harmful goals, 12 attack methods, multiple target LLMs) with expert-annotated ground truth (binary success/fail, 5-level risk, 10-point risk).
  • Metrics: Cohen’s κ\kappa (agreement with ground truth), unit cost/query, cost ratio with GPT-4o baseline.
  • Results:
    • GPT-4o judge: κ=0.7627\kappa=0.7627, cost = \$8.36×10⁻⁴/query
    • Qwen3-14B multi-agent: κ=0.7352\kappa=0.7352, cost = \$3.85×10⁻⁴/query (54% lower)
    • Multi-agent improves over single SLM JailJudge by 25–32% relative κ\kappa gain, reduces cost by 54–82%.

Debate Round Ablation

Rounds κ\kappa Cost Ratio vs Optimal (R=3)
0 0.5709 0.30
1 0.6955 0.69
2 0.7143 0.87
3 0.7352 1.00 (optimal)
4 0.7260 1.05
5 0.7221 1.12

Accuracy sharply increases up to three rounds; further rounds introduce error drift, plateau, or degrade consensus.

7. Insights, Limitations, and Extensions

  • Key Insights: Structured debate with value-aligned roles enables cost-effective SLM ensembles to match semantic richness and reliability of large judge models for safety evaluation; fixed safety aspects anchor reasoning; maximal benefit realized after three rounds.
  • Limitations: HAJailBench covers 100 goals × 12 attacks; unseen or emergent attacks may challenge reliability. SLM-based agents are susceptible to multi-turn hallucinations and bias stacking. Human annotation in HAJailBench references a large-model judge in the second round, potentially introducing bias.
  • Potential Extensions: Dynamic topic reallocation, uncertainty-aware adaptive stopping, adversarial robustness mechanisms in debaters, human-in-the-loop review, continual learning, and cross-cultural calibration are identified as promising future directions.

A Multi-Agent Judging Framework, as exemplified by the Critic–Defender–Judge protocol (Lin et al., 9 Nov 2025), organizes refinement and consensus-building in safety evaluation via structured, prompt-conditioned SLM agents, with provable gains in semantic fidelity and cost efficiency on large-scale adversarial datasets. This modular debate architecture is effective in capturing and aligning nuanced jailbreak risks, setting a precedent for scalable, interpretable LLM safety assessment.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)
Forward Email Streamline Icon: https://streamlinehq.com

Follow Topic

Get notified by email when new papers are published related to Multi-Agent Judging Framework.