
LLM Jury-on-Demand Framework

Updated 8 December 2025
  • LLM Jury-on-Demand is a dynamic evaluation framework that assembles multiple, context-sensitive LLM jurors to render verdicts on model outputs.
  • It leverages modular pipelines (featurization, dynamic juror selection, and reliability-weighted aggregation) to maximize fidelity with human judgment.
  • Empirical benchmarks show improved accuracy and reduced bias, making it suitable for legal, factual, and peer review applications in real-world settings.

The LLM Jury-on-Demand paradigm defines a class of scalable, flexible, and context-sensitive evaluation frameworks that assemble multiple LLM "jurors"—selected, orchestrated, and aggregated on demand—to render verdicts on model outputs, system behaviors, or real-world decision points. Unlike static panels or single LLM judges, Jury-on-Demand systems dynamically tailor jury composition, voting algorithms, and reliability weighting to the specifics of each evaluation instance. They draw on architectural, statistical, and learning-theoretic tools to maximize fidelity with human judgment, minimize bias, and support high-stakes or domain-sensitive applications.

1. Foundations and Motivations

The emergence of LLM Jury-on-Demand systems addresses crucial limitations of both human-only and single-model evaluation in contexts such as law, factuality verification, summarization, code review, and judicial process transparency. Manual human evaluation, though reliable, is prohibitively slow and costly for real-time, large-scale deployments. Single LLM "judges" and static LLM juries suffer from systematic bias, limited adaptability, and suboptimal agreement with domain experts or legal practitioners (Li et al., 1 Dec 2025, Juvekar et al., 19 Oct 2025). Jury-on-Demand frameworks aim to overcome these issues by introducing mechanisms for context-aware juror selection, dynamic reliability prediction, and statistically principled aggregation, enabling both greater scalability and higher agreement with ground-truth or domain-expert standards.

2. Architectural Building Blocks and Jury Formation

LLM Jury-on-Demand systems instantiate modular pipelines composed of three core components:

  1. Juror Pool: A heterogeneous set of LLM judge agents, potentially spanning different model families (e.g., GPT-4o, Qwen, Llama), training paradigms, or specialization domains (e.g., legal, RAG, conversational) (Kalra et al., 25 Feb 2025, Nguyen et al., 20 May 2025).
  2. Dynamic Jury Selection: At inference, each evaluation instance is featurized (e.g., via text complexity, domain attributes, task metadata), and reliability predictors—learned per-judge, per-task binary classifiers—score each potential juror on the likelihood of human agreement. Examples include the Jury-on-Demand reliability model (Li et al., 1 Dec 2025):

$r_j(x) = f^j_\theta(\varphi(x)) \in [0, 1]$

The top-$K$ most reliable jurors for each $x$ are selected to comprise $J_*(x)$.

  3. Aggregation Algorithms: Juror votes or scores are combined using majority vote, reliability-weighted averaging, or hierarchical pipelining (e.g., debate → self-verification → mean pooling) (Kalra et al., 25 Feb 2025, Ramnath et al., 26 May 2025). Aggregation weights are typically calibrated either uniformly or via judge- or instance-specific reliability:

$S(x) = \frac{\sum_{j \in J_*(x)} r_j(x)\, s_j(x)}{\sum_{j \in J_*(x)} r_j(x)}$

Critically, both the constitution of the jury and the aggregation function can be tuned to maximize agreement with a downstream gold standard (e.g., human annotators), subject to latency or budget constraints.
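
The following sketch illustrates this selection-and-aggregation loop under simplifying assumptions: the featurizer φ(x), the per-judge reliability models, the juror names, and the example scores are all stand-ins for whatever a concrete deployment would use, not the published implementation.

```python
from typing import Callable, Dict, Sequence

def select_jury(features: Sequence[float],
                reliability_models: Dict[str, Callable[[Sequence[float]], float]],
                k: int) -> Dict[str, float]:
    """Score each candidate juror's predicted human agreement r_j(x) and keep the top-K."""
    scored = {name: model(features) for name, model in reliability_models.items()}
    top = sorted(scored, key=scored.get, reverse=True)[:k]
    return {name: scored[name] for name in top}

def reliability_weighted_mean(reliability: Dict[str, float],
                              scores: Dict[str, float]) -> float:
    """S(x): average of the selected jurors' scores s_j(x), weighted by r_j(x)."""
    num = sum(reliability[j] * scores[j] for j in reliability)
    den = sum(reliability.values())
    return num / den if den else float("nan")

# Toy run: three hypothetical jurors; keep the two predicted to be most reliable.
reliability_models = {
    "judge_a": lambda f: 0.9,   # stand-ins for learned per-judge classifiers f^j_theta(phi(x))
    "judge_b": lambda f: 0.6,
    "judge_c": lambda f: 0.8,
}
features = [0.3, 1.2]                                  # phi(x), however it is computed upstream
jury = select_jury(features, reliability_models, k=2)  # {"judge_a": 0.9, "judge_c": 0.8}
verdicts = {"judge_a": 4.0, "judge_c": 3.0}            # s_j(x) from prompting each selected judge
print(reliability_weighted_mean(jury, verdicts))       # ~3.53
```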

3. Statistical Theory and Empirical Benchmarks

Jury-on-Demand frameworks leverage formal ensemble theory to justify and bound aggregation performance:

  • Condorcet Jury Theorem: For independent jurors each with accuracy $p_i > 1/k$, majority voting ensures, in theory, that the error probability declines with ensemble size $n$ (Lefort et al., 26 Aug 2024). In practice, high pairwise correlation among LLMs limits these gains, establishing the necessity of model diversity and calibration.
  • Reliability Testing: Modern pipelines implement not only accuracy and correlation benchmarks (Pearson’s $r$, Cohen’s $\kappa$), but also nuanced agreement analyses such as the “Turing Test for judges” ($|z| < 1$ for human-likeness, $z > 1$ for super-consistency) (Han et al., 10 Oct 2025). This enables systematic ranking and filtering of candidate jurors according to their empirical agreement patterns with humans.
| Framework | Jury Selection | Aggregation | Context Adaptivity |
| --- | --- | --- | --- |
| Verdict (Kalra et al., 25 Feb 2025) | Pre-specified models | MeanPool, debate, self-check | Fixed |
| AutoLaw (Nguyen et al., 20 May 2025) | Per-case expertise + diversity | Weighted/majority vote | High (adversarial, per-case) |
| Jury-on-Demand (Li et al., 1 Dec 2025) | Learned reliability predictors (instance-wise) | Reliability-weighted mean | Maximal |
| Judge's Verdict (Han et al., 10 Oct 2025) | Two-step (correlation ≥ 0.80; Turing check) | Weighted/unweighted mean | Tiered |

This empirical and theoretical framework ensures the selection of jurors and aggregation of their outputs are systematically justified and auditable.
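
As a worked illustration of the checks above, the sketch below computes the exact majority-vote error for independent binary jurors (the two-class case of the Condorcet setting) and a human-likeness z-score of the kind used in the Turing Test for judges. The agreement values, and the reading of z as a judge's human-agreement rate standardized against the human-human agreement distribution, are illustrative assumptions rather than details taken from the cited papers.

```python
from math import comb
from statistics import mean, stdev
from typing import Sequence

def majority_error(p: float, n: int) -> float:
    """P(majority verdict is wrong) for n independent binary jurors, each correct with prob. p."""
    assert n % 2 == 1, "use an odd jury size to avoid ties"
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1))

for n in (1, 5, 11):                              # error shrinks with jury size when p > 1/2
    print(n, round(majority_error(0.7, n), 4))    # 0.3, 0.1631, 0.0782

def turing_z(judge_agreement: float, human_agreements: Sequence[float]) -> float:
    """Standardize a judge's human-agreement rate against the human-human agreement distribution."""
    return (judge_agreement - mean(human_agreements)) / stdev(human_agreements)

z = turing_z(0.84, [0.78, 0.81, 0.75, 0.80])
print(round(z, 2))   # ~2.08: z > 1, i.e. "super-consistent"; |z| < 1 would count as human-like
```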

4. Voting, Weighting, and Bias Minimization

Jury-on-Demand systems utilize a suite of voting and aggregation strategies, including:

  • Majority vote: $\hat{y} = \arg\max_{k \in \mathcal{K}} \sum_{i=1}^{n} 1[y_i = k]$
  • Weighted vote by reliability: $\hat{y} = \arg\max_{k \in \mathcal{K}} \sum_{i=1}^{n} r_i(x)\, 1[y_i = k]$
  • Pipelined voting: Decision cascades (e.g., dialog-act judge → maxim judge → reward model (Ramnath et al., 26 May 2025))
  • Bias reduction: Diversity-penalty re-ranking, adversarial instance generation, and cross-family ensembles (Nguyen et al., 20 May 2025, Lefort et al., 26 Aug 2024)

These procedures are augmented by empirical measurement of inter-judge variance, reliability-aware thresholding, and dynamic adversarial data synthesis to further reduce susceptibility to groupthink, model collusion, or domain drift.
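
The first two rules translate directly into code; in the minimal sketch below the judge names, labels, and reliability weights are invented for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence

def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """argmax_k sum_i 1[y_i = k]: the most frequent juror label."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels: Dict[str, Hashable], weights: Dict[str, float]) -> Hashable:
    """argmax_k sum_i r_i(x) * 1[y_i = k]: label tallies weighted by per-judge reliability."""
    tally: Dict[Hashable, float] = {}
    for judge, label in labels.items():
        tally[label] = tally.get(label, 0.0) + weights[judge]
    return max(tally, key=tally.get)

votes = {"judge_a": "violation", "judge_b": "no_violation", "judge_c": "no_violation"}
weights = {"judge_a": 0.95, "judge_b": 0.40, "judge_c": 0.40}
print(majority_vote(list(votes.values())))   # no_violation (2 of 3 jurors)
print(weighted_vote(votes, weights))         # violation (0.95 outweighs 0.40 + 0.40)
```

As the toy run shows, the two rules can disagree: the reliability-weighted vote lets a single highly trusted juror overrule an unweighted majority, which is exactly the lever reliability predictors are meant to provide.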

5. Application Domains and Representative Workflows

LLM Jury-on-Demand architectures feature in numerous specialized use-cases, including law, factuality verification, summarization, code review, and judicial process transparency.

Typical workflows orchestrate real-time ingestion, featurization, juror selection, multi-agent LLM prompt execution, aggregation, and audit-log production, often under explicit procedural or compliance constraints.
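
A minimal sketch of what one entry of such an audit log might look like; the record fields and example values are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Dict, List
import json

@dataclass
class JuryAuditRecord:
    """One auditable evaluation: who judged, how reliable they were predicted to be, and the verdict."""
    instance_id: str
    features: List[float]                 # phi(x) as computed at ingestion
    selected_jurors: Dict[str, float]     # juror -> predicted reliability r_j(x)
    juror_scores: Dict[str, float]        # juror -> raw verdict score s_j(x)
    aggregate_score: float
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = JuryAuditRecord(
    instance_id="case-0042",
    features=[0.3, 1.2],
    selected_jurors={"judge_a": 0.9, "judge_c": 0.8},
    juror_scores={"judge_a": 4.0, "judge_c": 3.0},
    aggregate_score=3.53,
)
print(json.dumps(asdict(record), indent=2))   # one entry of an append-only audit log
```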

6. Performance Benchmarks and System Limitations

Empirical studies demonstrate that LLM Jury-on-Demand systems can achieve:

  • Consistent improvements in detection rates and benchmark correlation over naive majority voting and over single-best judges (e.g., +6.7% accuracy gain in crowd-comparative evaluation (Zhang et al., 18 Feb 2025), +11–22 points in violation detection (Nguyen et al., 20 May 2025), human-level $\kappa$ in SE-Jury (Zhou et al., 27 May 2025)).
  • Near-human or "indistinguishable" judgment quality under human-likeness z-score thresholds (|z| < 1) (Han et al., 10 Oct 2025), with super-consistent panels available for compliance-critical contexts.
  • Task-specific reliability conditioning (requiring, e.g., $F1_{\text{model}} \geq 0.95 \cdot F1_{\text{human}}$, citation precision $\geq 0.98$, procedural compliance $= 1$) as operationalized for legal workflows (Juvekar et al., 19 Oct 2025).
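
As an illustration, the legal-workflow acceptance criteria quoted above can be encoded as a single gating check; the function and argument names are invented, and only the thresholds come from the cited operationalization.

```python
def meets_legal_reliability_bar(f1_model: float,
                                f1_human: float,
                                citation_precision: float,
                                procedural_compliance: float) -> bool:
    """Accept the jury's output only if every task-specific threshold holds."""
    return (
        f1_model >= 0.95 * f1_human          # within 5% of the human F1 baseline
        and citation_precision >= 0.98       # near-perfect citation precision
        and procedural_compliance == 1.0     # zero tolerance for procedural violations
    )

print(meets_legal_reliability_bar(0.91, 0.94, 0.99, 1.0))   # True  (0.91 >= 0.893)
print(meets_legal_reliability_bar(0.85, 0.94, 0.99, 1.0))   # False (F1 falls below 0.95 * 0.94)
```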

Limitations center on the high pairwise correlation among LLM jurors noted above, which caps ensemble gains; residual susceptibility to groupthink, model collusion, and domain drift; and the latency and budget cost of running many judges per instance.

7. Future Directions and Open Challenges

Several directions for continued research and engineering remain open, notably around juror diversity and calibration, bias minimization, and robustness to domain drift. The theoretical and empirical frameworks developed in recent literature already enable systematic, reproducible, and context-aware construction of LLM Jury-on-Demand systems, supporting scalable, high-fidelity evaluation and decision support across domains where human-like, reliable, and nuanced judgment is required.
