PeerReview-Weighted DS-EM Model
- The paper introduces a PeerReview-Weighted DS-EM model that integrates reviewer reliability weights into the classic Dawid–Skene framework.
- It employs an EM algorithm to estimate latent ground truth and annotator confusion matrices by incorporating both direct annotations and peer review feedback.
- Extensions such as iterative weighted majority voting and soft-label variants improve convergence speed and accuracy in multi-agent debates and deep ensemble calibration.
A PeerReview-Weighted Dawid–Skene EM model extends the classic Dawid–Skene (DS) approach for crowdsourced label aggregation to settings where reviewer accuracy varies, and where reviewer feedback can itself be weighted, whether by peer review, multi-agent verification, or empirical performance. DS-EM interprets observed labels as arising from latent ground truth corrupted by individual annotator confusion matrices, which are estimated from the data via an Expectation-Maximization (EM) algorithm. Modern extensions include iterative weighted majority voting (IWMV), soft-label variants for deep ensembles, and peer-review-integrated aggregation for LLM multi-agent debate and scientific peer review contexts (Li et al., 2014, Kuzin et al., 10 Mar 2025, Cherian et al., 2 Dec 2025, Li et al., 2013, Borovac et al., 2022).
1. Dawid–Skene Generative Model and EM Formulation
The DS model posits $W$ annotators, $N$ items, and $L$ possible labels. Each item $j$ has a latent true label $y_j$ with class prior $p_\ell = P(y_j = \ell)$, and annotator $i$ provides an observed score $z_{ij}$, governed by each worker's confusion matrix $\pi^{(i)}$, where $\pi^{(i)}_{\ell k} = P(z_{ij} = k \mid y_j = \ell)$ (Li et al., 2014, Li et al., 2013). The observed-data log-likelihood is

$$\log P(Z) = \sum_{j=1}^{N} \log \sum_{\ell=1}^{L} p_\ell \prod_{i=1}^{W} \pi^{(i)}_{\ell,\, z_{ij}}.$$
EM alternates:
- E-step: Compute the responsibility $q_{j\ell} \propto p_\ell \prod_i \pi^{(i)}_{\ell,\, z_{ij}}$ for each item $j$ and label $\ell$, normalized so that $\sum_\ell q_{j\ell} = 1$.
- M-step: Update the priors and confusion matrices, $p_\ell = \frac{1}{N}\sum_j q_{j\ell}$ and $\pi^{(i)}_{\ell k} = \sum_j q_{j\ell}\,\mathbf{1}[z_{ij} = k] \big/ \sum_j q_{j\ell}$.
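The alternating E- and M-steps can be sketched in a few lines of NumPy. This is a minimal illustration, not any paper's reference implementation; it assumes complete annotations (every worker labels every item) and initializes responsibilities from a soft majority vote.

```python
import numpy as np

def dawid_skene_em(Z, L, n_iter=50):
    """EM for the Dawid-Skene model.

    Z: (N, W) int array of observed labels in {0..L-1}, complete
       (no missing annotations). Returns class priors p, confusion
       matrices pi (W, L, L), and label posteriors q (N, L).
    """
    N, W = Z.shape
    # Initialize responsibilities from a soft per-item majority vote.
    q = np.zeros((N, L))
    for j in range(N):
        counts = np.bincount(Z[j], minlength=L)
        q[j] = counts / counts.sum()
    for _ in range(n_iter):
        # M-step: priors and confusion matrices from responsibilities.
        p = q.mean(axis=0)
        pi = np.full((W, L, L), 1e-6)          # small additive smoothing
        for i in range(W):
            for j in range(N):
                pi[i, :, Z[j, i]] += q[j]      # accumulate q_{jl} 1[z_ij = k]
            pi[i] /= pi[i].sum(axis=1, keepdims=True)
        # E-step: posterior over each item's true label.
        logq = np.tile(np.log(p), (N, 1))
        for i in range(W):
            logq += np.log(pi[i][:, Z[:, i]]).T   # (N, L) lookup of pi[i, l, z_ij]
        q = np.exp(logq - logq.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return p, pi, q
```

On synthetic data with moderately reliable workers, the posterior argmax typically recovers the ground truth at least as well as plain majority vote.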
Model extensions accommodate multi-class, ordinal, and peer-graded tasks (Cherian et al., 2 Dec 2025).
2. Peer-Review Integration and Weight Learning
PeerReview-weighted EM incorporates meta-reviewer feedback into the likelihood, with reflectors grading solver responses (e.g., through grades for correct/incorrect/abstain) (Cherian et al., 2 Dec 2025). The generative model is thus over two confusion matrices:
- Solver matrix $\pi^{(i)}$ for solution labels versus truth.
- Reflector matrix $\rho^{(r)}$ for review grades versus latent correctness.
EM steps estimate the class priors, update $\pi^{(i)}$ and $\rho^{(r)}$, and aggregate to produce final consensus predictions. This paradigm generalizes to multi-agent LLM debates, where the reflectors may themselves be agents and reliability grading is explicit (Cherian et al., 2 Dec 2025). The empirical benefit is more accurate identification of correct answers, as reflectors push the posterior toward answers they judge correct.
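As a hedged sketch of how reflector grades can enter the label posterior, the following computes one E-step for a single item under a simplified two-confusion-matrix model. The grade encoding (0 = judged incorrect, 1 = judged correct, 2 = abstain) and the function name are illustrative assumptions, not the exact formulation of Cherian et al.

```python
import numpy as np

def peer_weighted_posterior(answers, grades, pi_s, rho, p):
    """Simplified E-step of a peer-review-weighted DS model (one item).

    answers: (W,) solver labels, in {0..L-1}.
    grades:  (R, W) reflector grades of each solver answer:
             0 = judged incorrect, 1 = judged correct, 2 = abstain.
    pi_s:    (W, L, L) solver confusion matrices,
             pi_s[i, l, k] = P(solver i answers k | truth l).
    rho:     (R, 2, 3) reflector confusion matrices,
             rho[r, c, g] = P(reflector r gives grade g | correctness c).
    p:       (L,) class prior.
    Returns the posterior over the item's true label.
    """
    W = answers.shape[0]
    L = p.shape[0]
    logq = np.log(p).copy()
    for l in range(L):
        for i in range(W):
            logq[l] += np.log(pi_s[i, l, answers[i]])
            correct = int(answers[i] == l)   # correctness if truth were l
            for r in range(grades.shape[0]):
                logq[l] += np.log(rho[r, correct, grades[r, i]])
    q = np.exp(logq - logq.max())
    return q / q.sum()
```

Reflector grades enter exactly like extra annotations: a label graded "correct" by a reliable reflector pulls the posterior toward that label, as described above.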
3. Weighted Majority Voting and Iterative Approximations
If each annotator (or solver) $i$ is characterized by a single reliability parameter $w_i = P(z_{ij} = y_j)$, MAP inference under the "one-coin" DS model reduces to weighted majority voting (WMV):

$$\hat{y}_j = \arg\max_{k} \sum_{i} v_i\,\mathbf{1}[z_{ij} = k], \qquad v_i = \log\frac{(L-1)\,w_i}{1 - w_i}.$$
For $w_i$ near $1/L$, this linearizes to $v_i \approx L w_i - 1$. The iterative approach (IWMV) updates weights according to per-worker agreement with the current aggregated labels, $\hat{w}_i = \frac{1}{N}\sum_j \mathbf{1}[z_{ij} = \hat{y}_j]$, with repeated WMV aggregation and weight recalibration until convergence (Li et al., 2014). One-step WMV (osWMV) executes a single update-and-vote cycle (Li et al., 2013).
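The vote-then-reweight loop can be sketched as a short NumPy routine; this is a minimal illustration assuming complete label matrices and linearized weights $v_i = L\hat{w}_i - 1$.

```python
import numpy as np

def iwmv(Z, L, n_iter=20):
    """Iterative weighted majority voting (one-coin approximation).

    Z: (N, W) observed labels in {0..L-1}. Starts from unweighted
    majority vote, then alternates WMV aggregation with the weight
    update v_i = L * w_i - 1, where w_i is worker i's agreement rate
    with the current aggregate.
    """
    N, W = Z.shape
    v = np.ones(W)                               # first pass: plain MV
    for _ in range(n_iter):
        scores = np.zeros((N, L))
        for i in range(W):
            scores[np.arange(N), Z[:, i]] += v[i]
        y_hat = scores.argmax(axis=1)
        w = (Z == y_hat[:, None]).mean(axis=0)   # per-worker agreement
        v = L * w - 1                            # linearized log-odds weight
    return y_hat, v
```

Each iteration costs one pass over the label matrix, which is the source of the large speedup over full EM noted below.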
4. Soft Dawid–Skene: Aggregation for Deep Ensembles
Soft Dawid–Skene (SDS) extends DS by replacing categorical labels with Dirichlet-modeled softmax outputs $s_{ij} \in \Delta^{L-1}$ for each classifier $i$ (Kuzin et al., 10 Mar 2025). The generative process is

$$y_j \sim \mathrm{Cat}(p), \qquad s_{ij} \mid y_j = \ell \sim \mathrm{Dir}\big(\alpha^{(i)}_\ell\big).$$
E-step computes the posterior over the true label,

$$q_{j\ell} \propto p_\ell \prod_i \mathrm{Dir}\big(s_{ij} \mid \alpha^{(i)}_\ell\big).$$
Polyak averaging stabilizes the parameter updates. The final consensus for instance $j$ in class $\ell$ is the posterior $q_{j\ell}$, leveraging calibrated network confidences, which is critical for robust ensemble learning.
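The Dirichlet-likelihood E-step can be sketched as follows for a single instance, assuming the output model $s_{ij} \mid y_j = \ell \sim \mathrm{Dir}(\alpha^{(i)}_\ell)$; the helper names are illustrative, not the authors' code.

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(s, alpha):
    """Log density of Dirichlet(alpha) at a probability vector s (s > 0)."""
    return (lgamma(float(alpha.sum())) - sum(lgamma(float(a)) for a in alpha)
            + float(((alpha - 1.0) * np.log(s)).sum()))

def sds_posterior(S, alpha, p):
    """E-step posterior of a Soft Dawid-Skene model for one instance.

    S:     (W, L) softmax outputs of W ensemble members.
    alpha: (W, L, L) Dirichlet parameters, alpha[i, l] governing member
           i's output distribution when the true class is l.
    p:     (L,) class prior.
    """
    W, L = S.shape
    logq = np.log(p).copy()
    for l in range(L):
        for i in range(W):
            logq[l] += dirichlet_logpdf(S[i], alpha[i, l])
    q = np.exp(logq - logq.max())
    return q / q.sum()
```

Because the Dirichlet density rewards softmax vectors concentrated where $\alpha^{(i)}_\ell$ is large, a confident, well-calibrated member moves the posterior much more than a near-uniform one.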
Experimentally, SDS outperforms ensemble averaging for accuracy, calibration (ECE, Brier score), and OOD detection on common vision benchmarks (Kuzin et al., 10 Mar 2025).
5. Theoretical Error Rate Bounds
Finite-sample exponential error rate bounds under the DS framework apply to WMV-aggregated predictions. If the vote aggregation score is $A(j,k) = \sum_i v_i\,\mathbf{1}[z_{ij} = k]$ and the minimum expected margin $\Delta = \min_{k \ne y_j} \mathbb{E}\big[A(j, y_j) - A(j, k)\big]$ is positive, then the error probability obeys a Hoeffding-type bound

$$P(\hat{y}_j \ne y_j) \le (L-1)\exp\!\left(-\frac{\Delta^2}{2\sum_i v_i^2}\right).$$
Maximizing $\Delta$ subject to $\sum_i v_i^2 = 1$ yields the "oracle" WMV. Bernstein-type bounds further refine risk estimation (Li et al., 2014, Li et al., 2013). Empirically, EM-MAP and IWMV match or outperform full EM, with IWMV typically 50–100× faster (Li et al., 2014). Stability requires sufficient data per annotator; under-sampled confusion matrices can degrade aggregation (Borovac et al., 2022).
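A Hoeffding-type bound of the form $(L-1)\exp(-\Delta^2 / (2\sum_i v_i^2))$ can be checked by Monte Carlo under the one-coin model; this sketch assumes errors uniform over the $L-1$ wrong labels and is illustrative only.

```python
import numpy as np

def wmv_error_bound(v, w, L):
    """Hoeffding-type bound (L-1) * exp(-Delta^2 / (2*sum(v^2))) for WMV
    under the one-coin model with accuracies w and weights v. With
    uniform errors, the expected margin is sum(v_i*(L*w_i - 1))/(L-1)."""
    v, w = np.asarray(v, float), np.asarray(w, float)
    delta = (v * (L * w - 1)).sum() / (L - 1)
    return (L - 1) * np.exp(-delta**2 / (2 * (v**2).sum()))

def wmv_empirical_error(v, w, L, n=20000, seed=0):
    """Monte Carlo error rate of WMV on simulated one-coin workers."""
    rng = np.random.default_rng(seed)
    W = len(w)
    truth = rng.integers(0, L, n)
    correct = rng.random((n, W)) < np.asarray(w)
    wrong = (truth[:, None] + rng.integers(1, L, (n, W))) % L
    Z = np.where(correct, truth[:, None], wrong)
    scores = np.zeros((n, L))
    for i in range(W):
        scores[np.arange(n), Z[:, i]] += v[i]
    return (scores.argmax(axis=1) != truth).mean()
```

The bound is loose but valid: for, say, ten workers at 60% accuracy on a 3-class task, the simulated WMV error rate sits well below the analytic bound.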
6. Practical Implementation and Applications
PeerReview-weighted DS-EM is applied in reviewer aggregation for scientific peer review, LLM multi-agent debate, EEG signal detection, and deep ensemble calibration (Li et al., 2014, Cherian et al., 2 Dec 2025, Borovac et al., 2022, Kuzin et al., 10 Mar 2025). Key workflow steps:
- Collect reviewer scores, possibly with peer weighting of review reliability.
- Fit DS by EM: estimate confusion matrices and class priors.
- Aggregate per-item predictions by WMV or post-EM MAP.
- For multi-stage protocols (e.g., debate), aggregate responses and meta-reviews, then run EM on pooled data.
Typical EM complexity is $O(NWL)$ per iteration for full DS; IWMV reduces this to $O(NW)$ (Li et al., 2014). Convergence monitoring and initialization protocols are well described in EEG seizure detection applications, with practical limits on the data volume required for robust confusion parameter estimation (Borovac et al., 2022).
7. Extensions, Limitations, and Cross-Method Comparison
- Extensions: Multi-class extension via confusion matrices, hierarchical DS, temporal and adversarial annotator modeling (Cherian et al., 2 Dec 2025, Li et al., 2013).
- Limitations: Sensitivity to reviewer/detector quality variance; poor initializations may lead to suboptimal maxima (Borovac et al., 2022, Cherian et al., 2 Dec 2025).
- Cross-method comparison: DS-EM is generative, in contrast to discriminative weighted-mean stacking, which requires labeled calibration data. WMV stacking is more robust when annotator quality varies widely; DS best exploits strong individual detectors (Borovac et al., 2022). SDS is preferred when soft-label information is available (Kuzin et al., 10 Mar 2025).
The Dawid–Skene EM family and its peer-review-weighted extensions provide a unified framework for reliable aggregation in settings with noisy, potentially weighted annotator or agent feedback, supported by finite-sample theoretical guarantees and empirical evidence (Li et al., 2014, Li et al., 2013, Cherian et al., 2 Dec 2025, Borovac et al., 2022, Kuzin et al., 10 Mar 2025).