LLM-Jury Filtering: Ensemble Decision-Making

Updated 10 September 2025
  • LLM-Jury Filtering is a framework that integrates outputs from multiple LLMs and annotator surrogates, preserving diverse perspectives in a structured jury for decision-making.
  • It employs techniques like dynamic juror sampling, weighted aggregation, and quadratic programming for counterfactual exploration to enhance fairness and transparency.
  • The approach is applied in high-stakes domains such as content moderation, legal reasoning, and data curation, ensuring robust, interpretable, and auditable outcomes.

LLM-Jury Filtering is a framework and technical paradigm that orchestrates the outputs, judgments, or recommendations of multiple LLMs, model representations, or annotator surrogates. It integrates them in a structured ensemble in which a "jury" of individually modeled agents or annotators explicitly guides decision-making, filtering, and evaluation. Unlike traditional aggregation methods such as majority vote, LLM-Jury Filtering operates on the premise that individual contributors' biases, dissent, demographic backgrounds, and idiosyncratic perspectives should be preserved and made tractable for dynamic reweighting, adjudication, and auditing. Such systems enable explicit emulation, weighting, and counterfactual manipulation of jurors' input, supporting interpretability, fairness, robust counterfactual analysis, and enhanced domain sensitivity across subjective and complex labeling tasks.

1. Core Principles of Jury Learning and Explicit Model Aggregation

Jury learning is foundational to LLM-Jury Filtering; it is instantiated by constructing models that capture individual annotator behavior, socio-demographic group structure, and the interaction between annotator and input features (Gordon et al., 2022). Each annotator is endowed with a unique embedding, and group-level information aggregates over shared characteristics. In practice, jury learning enables inference by sampling a population of jurors according to a user-defined specification (demographic balance, prevalence, or targeted inclusion) and performing verdict aggregation over individual predictions. The architecture leverages pretrained language models (such as BERTweet) for content embedding and a recommender-style Deep & Cross Network (DCN) for modeling multi-field interactions. The optimization for counterfactual analysis is formalized as a quadratic program:

$$\text{minimize} \quad \sum_k (p_k - p_k^*)^2$$

subject to

$$\sum_k p_k^* = n_{\text{jurors}}, \qquad v_{p^*} > 1, \qquad p_k^* \geq 0,$$

with

$$v_{p^*} = \frac{\sum_k p_k^* s_k}{n_{\text{jurors}}},$$

where $p$ is the current jury allocation, $p^*$ is the counterfactual allocation, $s_k$ are the individual juror scores, and $v_{p^*}$ is the aggregated verdict.
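
This program can be handed to any off-the-shelf QP solver. Below is a minimal sketch using cvxpy, not drawn from any released implementation of the cited work; the juror scores $s_k$ (on a hypothetical 0-4 rating scale) and the current allocation $p$ are illustrative, and the strict inequality $v_{p^*} > 1$ is approximated with a small margin.

```python
# Minimal sketch of the counterfactual jury-composition QP using cvxpy.
# Scores and allocations are illustrative, not taken from any paper's code.
import cvxpy as cp
import numpy as np

s = np.array([0.5, 2.5, 1.0, 3.0, 0.5])  # per-group juror scores s_k (hypothetical 0-4 scale)
p = np.array([5.0, 1.0, 4.0, 0.0, 2.0])  # current jury allocation p_k
n_jurors = p.sum()                        # jury size is held fixed

p_star = cp.Variable(len(s), nonneg=True)            # counterfactual allocation p_k* >= 0
verdict = cp.sum(cp.multiply(p_star, s)) / n_jurors  # aggregated verdict v_{p*}

problem = cp.Problem(
    cp.Minimize(cp.sum_squares(p - p_star)),  # smallest change to the jury
    [cp.sum(p_star) == n_jurors,              # same total number of jurors
     verdict >= 1.0 + 1e-6],                  # flip the verdict: v_{p*} > 1
)
problem.solve()
print("counterfactual allocation:", np.round(p_star.value, 2))
print("aggregated verdict:", round(float(verdict.value), 3))
```

The squared-error objective keeps the counterfactual jury as close as possible to the current one, so the solution directly reports the smallest compositional change that would flip the verdict.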

This explicit modeling contrasts with traditional label aggregation, which implicitly suppresses minority or dissenting voices, often resulting in a single classifier trained on majority-voted labels. LLM-Jury Filtering instead seeks to transparently enumerate, preserve, and manipulate dissent and minority perspectives.

2. Technical Approaches for Jury Filtering in LLM-Based Systems

LLM-Jury Filtering draws on several architectures and ensemble principles:

  • Jury Sampling and Verdict Aggregation: At inference, the system dynamically selects juror surrogates via annotator embeddings, demographic criteria, or model agents, aggregating verdicts through majority, weighted, or programmable schemes (see the sketch after this list).
  • Counterfactual Exploration: The system computes the smallest change to jury composition (via quadratic programming) necessary to alter the verdict, explicitly quantifying the influence of each juror or group.
  • Visualization Interfaces: Decision-makers inspect the full juror vote distribution, demographic breakdowns, and historical annotation bias, facilitating transparent uncertainty estimation and dissent visualization.
  • LLM Ensemble Analogues: Recent advances extend jury learning to ensembles of LLMs, generating multiple outputs or chain-of-thought explanations and allowing weighted or model-informed aggregation (Sun et al., 26 Mar 2024). Transformer-based modules ingest retrieved in-context reasoning exemplars and fuse them for downstream prediction, analogously serving as "jury" opinion integration.
  • Quantitative Judge Alignment: Post-hoc regression models (GLMs) calibrate LLM-as-judge outputs against human ratings, optimizing for empirical alignment rather than raw LLM scoring (Sahoo et al., 3 Jun 2025).
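
As an illustration of the sampling-and-aggregation pattern in the first bullet, the sketch below draws juror surrogates to match a target group composition and combines their predictions under a programmable weighting scheme. All juror identities, groups, predictions, and weights here are hypothetical.

```python
# Hypothetical sketch of jury sampling and weighted verdict aggregation.
import random
from collections import defaultdict

def sample_jury(jurors, composition, n_jurors, seed=0):
    """Sample juror surrogates to match a target group composition.

    jurors:      list of (juror_id, group) pairs
    composition: dict mapping group -> fraction of jury seats (sums to 1)
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for juror_id, group in jurors:
        by_group[group].append(juror_id)
    jury = []
    for group, fraction in composition.items():
        seats = round(fraction * n_jurors)             # rounding, so seats are approximate
        jury += rng.choices(by_group[group], k=seats)  # sample with replacement
    return jury

def weighted_verdict(predictions, weights, threshold=0.5):
    """Aggregate per-juror predictions under per-juror weights."""
    total = sum(weights[j] for j in predictions)
    score = sum(weights[j] * predictions[j] for j in predictions) / total
    return score, score > threshold

# Usage with made-up jurors, predictions, and uniform weights:
jurors = [(f"j{i}", "A" if i % 3 else "B") for i in range(30)]
jury = sample_jury(jurors, {"A": 0.5, "B": 0.5}, n_jurors=12)
preds = {j: random.random() for j in jury}  # stand-in for per-juror model outputs
score, verdict = weighted_verdict(preds, {j: 1.0 for j in jury})
print(round(score, 2), verdict)
```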

3. Applications in Subjective, High-Stakes, and Multi-Agent Domains

LLM-Jury Filtering has major implications for domains characterized by subjectivity, value conflict, or stakeholder diversity:

  • Online Toxicity and Content Moderation: Juror selection can over-represent targeted minorities, leading to more community-reflective labeling of harassing speech (Gordon et al., 2022). Interactive jury composition allows adaptation to context or stakeholder shifts.
  • Legal Reasoning and Judicial Systems: Annotator modeling is directly translatable to legal diagnostics, where LLM-agents simulate jurors and deliberate on compliance or violation scenarios. Systems such as AutoLaw combine adversarial scenario generation, ranked LLM-based jurors, and structured voting to assess legal violation likelihood (Nguyen et al., 20 May 2025). Toolkit-enabled frameworks regularly audit fairness and bias in judicial LLM "jury" settings (Hu et al., 14 Jul 2025).
  • Collaborative Filtering and Recommendations: Ensembles of LLM-generated chain-of-thought outputs, retrieved for in-context similarity, are aggregated by transformer decoders to emulate jury reasoning in recommendation predictions (Sun et al., 26 Mar 2024), supporting nuanced, context-sensitive filtering.
  • Data Quality and Selection: LLM-jury style mechanisms filter training data for large models at line-level and document-level granularity, relying on individual agent or annotator surrogates to rate and categorize data for downstream curation (Henriksson et al., 13 Jan 2025, Wang et al., 8 May 2025).

4. Counterfactuals, Dissent, and Robustness to Jury Composition

The architecture's ability to compute and visualize counterfactual changes in jury composition is technically significant (Gordon et al., 2022). Using quadratic programming, the system identifies how many and which jurors' votes must change—or which demographic groups should be reweighted—to flip the classifier's verdict. This not only provides rigorous robustness estimates but also operationalizes dissent as first-class output. Visualizations of full juror vote distributions, demographic overlays, and dissent snapshots are standard.

In ensemble LLM studies, the Condorcet Jury Theorem is investigated for theoretical and empirical performance limits (Lefort et al., 26 Aug 2024). The theorem predicts accuracy scaling with jury size under independence, yet findings demonstrate limited gains due to correlated errors among even advanced LLMs—implicating the need for maximizing model error independence in jury filtering.
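
The theorem's prediction can be made concrete: for $n$ independent jurors, each correct with probability $p > 0.5$, the majority is correct with probability $\sum_{k > n/2} \binom{n}{k} p^k (1-p)^{n-k}$, which tends to 1 as $n$ grows. The self-contained sketch below (an illustration, not code from the cited study) contrasts this with a toy correlated-error model in which accuracy plateaus:

```python
# Condorcet jury accuracy: independent vs. correlated jurors (illustrative).
import math
import random

def majority_accuracy_independent(n, p):
    """P(majority correct) for n independent jurors, each correct with probability p."""
    k_min = n // 2 + 1
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def majority_accuracy_correlated(n, p, rho, trials=20_000, seed=0):
    """Monte Carlo with a common-mode error: with probability rho all jurors
    copy a single shared draw (fully correlated); otherwise they vote independently."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        if rng.random() < rho:
            votes = [rng.random() < p] * n  # everyone repeats one shared draw
        else:
            votes = [rng.random() < p for _ in range(n)]
        correct += sum(votes) > n / 2
    return correct / trials

for n in (1, 5, 15, 51):
    print(n,
          round(majority_accuracy_independent(n, 0.6), 3),
          round(majority_accuracy_correlated(n, 0.6, rho=0.5), 3))
```

Under independence, accuracy climbs toward 1 with jury size; under the correlated mode it saturates, mirroring the limited gains reported for ensembles of similar LLMs.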

5. Multi-Agent Reputation Filtering and Dynamic Arbitration

In collaborative multi-LLM-agent systems, dynamic reputation filtering expands the notion of jury arbitration (Lou et al., 6 Sep 2025). Interactive rating networks are constructed, where agents or jurors evaluate peers and update reputations via mathematically formalized increment and decay rules:

$$r_i^t = r_i^{t-1} + w_i^t \,(1 - r_i^{t-1})\,\alpha \qquad \text{(increment, if } w_i^t \geq w_0\text{)}$$

$$r_i^t = r_i^{t-1} - w_i^t \, r_i^{t-1}\,\beta \qquad \text{(decay, if } w_0 > w_i^t\text{)}$$

Agent selection is optimized using an Upper Confidence Bound (UCB)-based strategy:

$$S_i^t = \delta \, r_i^{t-1} + (1-\delta)\, c_i^t + x_i^{t-1}$$

where the exploration term $x_i^{t-1}$ is weighted by prior selection frequency. This architecture enables robust, adaptive filtering of unreliable or low-performing jurors, with demonstrable improvements in code generation and reasoning-task efficiency.
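
A minimal sketch of these rules follows. The exploration bonus $x_i^{t-1}$ is assumed here to take a standard UCB form, since the source states only that exploration is weighted by prior selection frequency; the initial reputation value and all hyperparameters are likewise assumptions.

```python
# Sketch of dynamic reputation filtering with UCB-style agent selection.
# The concrete exploration bonus and hyperparameter values are assumptions.
import math

class ReputationFilter:
    def __init__(self, alpha=0.1, beta=0.2, w0=0.5, delta=0.7):
        self.alpha, self.beta, self.w0, self.delta = alpha, beta, w0, delta
        self.r = {}       # reputation r_i per agent
        self.counts = {}  # how often each agent has been selected
        self.t = 0        # total selection rounds

    def update(self, agent, w):
        """Apply the increment/decay rules for a peer rating w in [0, 1]."""
        r = self.r.get(agent, 0.5)            # assumed initial reputation
        if w >= self.w0:
            r = r + w * (1 - r) * self.alpha  # increment rule
        else:
            r = r - w * r * self.beta         # decay rule
        self.r[agent] = r

    def score(self, agent, capability):
        """S_i = delta * r_i + (1 - delta) * c_i + x_i (assumed UCB bonus)."""
        n_i = self.counts.get(agent, 0)
        x = math.sqrt(2 * math.log(self.t + 1) / (n_i + 1))  # shrinks with selections
        return self.delta * self.r.get(agent, 0.5) + (1 - self.delta) * capability + x

    def select(self, capabilities):
        """Pick the agent with the highest score and record the selection."""
        self.t += 1
        best = max(capabilities, key=lambda a: self.score(a, capabilities[a]))
        self.counts[best] = self.counts.get(best, 0) + 1
        return best

# Usage with made-up peer ratings and capability scores:
rep = ReputationFilter()
rep.update("agent_a", w=0.8)  # good peer rating -> reputation increases
rep.update("agent_b", w=0.2)  # poor peer rating -> reputation decays
print(rep.select({"agent_a": 0.6, "agent_b": 0.4}))
```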

6. Fairness, Bias, and Auditability in Jury-Inspired LLM Filtering

Evaluation frameworks for LLM judicial fairness highlight systematic biases, inconsistency, and imbalanced inaccuracy in jury-mimetic LLM decision-making (Hu et al., 14 Jul 2025). Key metrics include the following; a sketch of the first appears after the list.

  • Inconsistency: Weighted proportion of verdict changes under counterfactual label perturbation.
  • Bias: Regression-based detection of systematic directional prediction shifts across label values.
  • Imbalanced Inaccuracy: Differential error rates distributed over demographic or procedural factors.
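
As a concrete reading of the inconsistency metric, here is a minimal sketch with uniform case weights and a toy rule-based judge; the cited framework defines its own weighting and perturbation scheme.

```python
# Sketch of the inconsistency metric: weighted proportion of verdict changes
# under a counterfactual perturbation. Uniform weights are an assumption.
def inconsistency(judge, cases, perturb, weights=None):
    """judge:   callable mapping a case dict to a verdict
       cases:   list of case dicts
       perturb: callable returning the counterfactual variant of a case"""
    weights = weights or [1.0] * len(cases)
    changed = sum(w for case, w in zip(cases, weights)
                  if judge(case) != judge(perturb(case)))
    return changed / sum(weights)

# Usage with a deliberately biased toy judge and a demographic perturbation:
cases = [{"severity": s, "group": "A"} for s in (1, 2, 3, 4)]
judge = lambda c: c["severity"] + (1 if c["group"] == "B" else 0) >= 3
perturb = lambda c: {**c, "group": "B"}
print(inconsistency(judge, cases, perturb))  # 0.25: one of four verdicts flips
```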

Parameter studies reveal that raising the temperature can reduce measured bias (by increasing inconsistency), while neither model size nor model recency correlates with fairness improvements. The existence of accuracy-equity trade-offs suggests that integrating fairness auditing and multi-dimensional jury-composition monitoring remains critical to trustworthy jury filtering in high-stakes applications.

7. Practical Impact and Future Directions

LLM-Jury Filtering is central to the advancement of interpretable, adaptive, and auditable model-based decision-making:

  • It enables participatory classifier voice selection, supporting stakeholder governance and ethical iteration.
  • It provides technical tools for transparency, robust dissent visualization, counterfactual and sensitivity analysis, and real-time recalibration based on stakeholder or jurisdictional requirements.
  • It underpins scalable data curation pipelines, fairness-aware judicial assistance, and multi-modal ensemble reasoning in recommender and legal systems.

Open-source codebases, toolkits, and benchmark datasets (e.g., FinerWeb-10BT, JudiFair, JustEva) accelerate research into jury-mimetic architectures, ensemble calibration, fairness measurement, and hyperparameter evaluation. The approach fundamentally reimagines the aggregation of divergent voices—from annotators to LLM agents—and offers a rigorous foundation for filtering, evaluating, and auditing complex model outputs in domains where accuracy, equity, and stakeholder legitimacy are paramount.
