Probability-Based Score Aggregation

Updated 29 May 2026

Probability-based score aggregation is a set of methods that integrate diverse expert and model predictions into a single calibrated forecast using formal probabilistic models.
It employs Bayesian and statistical frameworks to model uncertainty, account for information diversity, and adjust for source reliability and calibration biases.
Techniques such as proper scoring rules, online algorithms, and confidence-weight adjustments are used to improve forecast accuracy and to support decision-making in complex forecasting scenarios.

Probability-based score aggregation is a class of methodologies for combining heterogeneous probabilistic or real-valued predictions, judgments, or estimations from multiple sources—such as experts, models, or human annotators—into an optimal or principled single score or forecast. Such aggregation seeks to integrate information, account for uncertainty, and calibrate according to the reliability and dependence structure of the sources, often via explicit statistical, Bayesian, or scoring-rule frameworks.

1. Foundations: Statistical and Bayesian Frameworks

Probability-based score aggregation is rooted in formal probabilistic modeling. Classic Bayesian settings posit that each agent observes independent or partially overlapping signals about an underlying variable or event, leading to heterogeneity due to information diversity rather than exclusively measurement noise. This perspective is essential in contexts such as prediction polling or crowdsourced forecasting, where an agent’s belief $p_j = P(Y=1\,|\,I_j)$ encodes the probability of a binary event $Y$ conditional on their information set $I_j$. The aggregation task is to compute an optimal posterior $P(Y=1\,|\,p_1,\ldots,p_N)$ under a model capturing either measurement noise, information diversity, or both (Satopää et al., 2015).
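As a minimal worked case, assume (purely for illustration, and not as the model of the cited work) that all forecasters share a common prior $\pi = P(Y=1)$ and that their information sets are conditionally independent given $Y$. Bayes' rule then gives the posterior in odds form:

$\frac{P(Y=1\,|\,p_1,\ldots,p_N)}{P(Y=0\,|\,p_1,\ldots,p_N)} = \frac{\pi}{1-\pi}\prod_{j=1}^{N}\frac{p_j/(1-p_j)}{\pi/(1-\pi)}$

For example, with $\pi = 0.5$ and two forecasts $p_1 = p_2 = 0.7$, the posterior odds are $(7/3)^2 \approx 5.44$, i.e. a probability of about $0.845$, which is more extreme than either input. When information sets overlap, this product double-counts shared evidence, which is why the dependence structure has to be modeled explicitly.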

Generalizations include hierarchical models in which agents’ confidence or expertise is represented via conjugate prior hyperparameters encoding equivalent sample size (Frongillo et al., 2014), generative models of expert log-odds scores under correlated noise (Kahn, 2012), or signal-passing in Bayesian networks with cross-panel aggregation for group decision-making (Leonelli et al., 2017).

2. Aggregation Principles: Information Diversity and Dependence

A central advance is recognizing that heterogeneity among forecasts is often due to diverse information access rather than just noise. The Partial Information Framework (PIF) explicitly models each forecaster as receiving (possibly overlapping) Gaussian information signals $Z_j$, linked to calibrated probabilities $p_j$ via a probit transformation. The covariance structure $\Sigma$ encodes private and shared signals between forecasters, and the coherent aggregated forecast exploits multivariate normal conditioning:

$p_{\mathrm{agg}} = \Phi\left(\frac{\Sigma_{0,Z}\Sigma_{Z,Z}^{-1}P - t}{\sqrt{1-\Sigma_{0,Z}\Sigma_{Z,Z}^{-1}\Sigma_{Z,0}}}\right)$

where $P = (\Phi^{-1}(p_j))_{j=1}^N$ (Satopää et al., 2015). This framework fundamentally differs from classical measurement-error pooling by yielding "extremization" beyond the range of individual forecasts and requiring high-dimensional Bayesian regression in probit space, not linear averaging in logit space.
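The aggregator above can be evaluated numerically once the covariance blocks are given. The following is a minimal sketch using only the Python standard library; the function names `solve` and `pif_aggregate`, and all covariance values in the example, are hypothetical choices for illustration. The offset `t` is passed through exactly as it appears in the formula, and its default of 0 is an assumption.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def pif_aggregate(probs, sigma_0z, sigma_zz, t=0.0):
    """Evaluate the displayed aggregator for given covariance blocks.

    probs    : individual forecasts p_j
    sigma_0z : covariances between the target signal and each Z_j
    sigma_zz : covariance matrix of the forecasters' signals (symmetric)
    t        : offset from the formula (default 0 is an assumption)
    """
    probit = [_STD_NORMAL.inv_cdf(p) for p in probs]        # P = Phi^{-1}(p_j)
    coef = solve(sigma_zz, sigma_0z)                         # Sigma_ZZ^{-1} Sigma_Z0
    mean = sum(c * z for c, z in zip(coef, probit))          # Sigma_0Z Sigma_ZZ^{-1} P
    explained = sum(c * s for c, s in zip(coef, sigma_0z))   # Sigma_0Z Sigma_ZZ^{-1} Sigma_Z0
    return _STD_NORMAL.cdf((mean - t) / (1.0 - explained) ** 0.5)


# Hypothetical covariance blocks for two forecasters (illustrative values only).
v = 0.3 / 0.7          # Var(Z_j)
c = 0.3 / 0.7 ** 0.5   # Cov(Z_0, Z_j)
s = 0.1 / 0.7          # Cov(Z_1, Z_2), partial overlap
p_agg = pif_aggregate([0.7, 0.7], [c, c], [[v, s], [s, v]])
print(p_agg)
```

With these illustrative values, two forecasts of 0.7 aggregate to a probability above 0.7, which is the extremization effect described above. A single forecaster with the same variance and covariance is returned unchanged.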

Generative Bayesian aggregation similarly models each expert's score as a noisy observation, allowing explicit incorporation of expert-specific bias, calibration (under- or over-confidence), accuracy, and dependence via the covariance or correlation matrix. The optimal LogOps (logarithmic opinion pool) weights are computed analytically from these parameters (Kahn, 2012).
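For a binary event, a logarithmic opinion pool reduces to a weighted sum of log-odds. The sketch below shows that pooling step only. The analytic weights derived in the cited work are not reproduced here, so `weights` and `biases` are hypothetical inputs that a fitted model would supply.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def log_opinion_pool(probs, weights, biases=None):
    """Binary logarithmic opinion pool.

    Each expert's log-odds is shifted by an (assumed known) bias term and
    multiplied by its weight; the pooled log-odds is mapped back to a probability.
    """
    if biases is None:
        biases = [0.0] * len(probs)
    pooled = sum(w * (logit(p) - b) for p, w, b in zip(probs, weights, biases))
    return 1.0 / (1.0 + math.exp(-pooled))


# Weights summing to 1 give a geometric-mean style compromise;
# weights summing to more than 1 push the pooled forecast away from 0.5.
compromise = log_opinion_pool([0.6, 0.8], [0.5, 0.5])
extremized = log_opinion_pool([0.6, 0.8], [1.0, 1.0])
print(compromise, extremized)
```

The sum of the weights controls how strongly the pool extremizes. Weights larger than those of a simple average are appropriate only when the experts' errors are sufficiently independent.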

3. Proper Scoring Rules and Online Aggregation Algorithms

Proper scoring rules, particularly the Continuous Ranked Probability Score (CRPS), are deeply entwined with aggregation tasks where forecasts are distributions or ensembles. The CRPS quantifies the $L^2$ distance between a forecast CDF and the step function at the observed outcome and, importantly, is mixable: for any nonnegative weights $w_1,\ldots,w_N$ summing to one and expert CDFs $F_1,\ldots,F_N$, there exists a forecast $F$ such that, for a suitable learning rate $\eta > 0$ and every outcome $y$,

$\mathrm{CRPS}(F, y) \le -\frac{1}{\eta}\ln\sum_{i=1}^{N} w_i\, e^{-\eta\,\mathrm{CRPS}(F_i, y)}$

with explicit aggregation rule (V'yugin et al., 2019). This property enables the use of Vovk's Aggregating Algorithm (AA) for online combination, guaranteeing that over any time horizon the cumulative CRPS of the learner does not exceed that of the best expert by more than $\frac{\ln N}{\eta}$ when the $N$ experts start with uniform weights. The weights evolve via exponential updates driven by each expert's cumulative past CRPS, and the forecast at each round is a non-linear mixture (not a simple average) of expert distributions (V'yugin et al., 2021). Extensions, including "smooth" specialized experts with context-dependent confidences, maintain regret guarantees and adaptability (Zamo et al., 2020).
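One round of the aggregating algorithm with CRPS can be sketched on a discretized outcome range. The sketch rests on several assumptions that should be checked against the cited papers: outcomes lie in a known interval `[a, b]`, the learning rate is `eta = 2 / (b - a)`, and the mixture is obtained by applying the standard square-loss substitution function pointwise to the expert CDF values. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def make_grid(a, b, n):
    """Midpoint grid on [a, b]; returns the grid points and the spacing."""
    du = (b - a) / n
    return [a + (k + 0.5) * du for k in range(n)], du


def crps(cdf_vals, us, du, y):
    """Riemann-sum CRPS: integral of (F(u) - 1[u >= y])^2 over the grid."""
    return sum((f - (1.0 if u >= y else 0.0)) ** 2 for f, u in zip(cdf_vals, us)) * du


def aa_mixture(expert_cdfs, weights):
    """Pointwise non-linear mixture of expert CDF values (assumed substitution rule)."""
    mixed = []
    for k in range(len(expert_cdfs[0])):
        a0 = sum(w * math.exp(-2.0 * f[k] ** 2) for w, f in zip(weights, expert_cdfs))
        a1 = sum(w * math.exp(-2.0 * (1.0 - f[k]) ** 2) for w, f in zip(weights, expert_cdfs))
        mixed.append(0.5 - 0.25 * math.log(a0 / a1))
    return mixed


def update_weights(weights, losses, eta):
    """Exponential weight update: w_i proportional to w_i * exp(-eta * loss_i)."""
    raw = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(raw)
    return [r / total for r in raw]


# Two hypothetical Gaussian experts on an assumed outcome range [-6, 8].
a, b, n = -6.0, 8.0, 1400
us, du = make_grid(a, b, n)
eta = 2.0 / (b - a)
experts = [NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)]
cdfs = [[e.cdf(u) for u in us] for e in experts]
prior = [0.5, 0.5]

mixed = aa_mixture(cdfs, prior)          # learner's forecast for this round
y = 0.3                                  # observed outcome
losses = [crps(f, us, du, y) for f in cdfs]
new_weights = update_weights(prior, losses, eta)
print(crps(mixed, us, du, y), losses, new_weights)
```

After the outcome is revealed, the expert whose CDF was closer to it gains weight. The learner's own CRPS stays below the mixability bound computed from the experts' losses.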

Fair and almost-fair CRPS modifications, as used in AIFS-CRPS for weather ensemble training, correct for bias due to finite ensemble size, enabling differentiable, proper loss functions for stochastic neural forecasting models and improving calibration and sharpness versus deterministic alternatives (Lang et al., 2024).
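The finite-ensemble correction can be illustrated with the standard ensemble estimators of the CRPS. In the sketch below, the "almost fair" score is taken to be a convex combination of the fair and standard estimators controlled by `alpha`. This form and the function name are assumptions for illustration and should be checked against the cited work.

```python
def ensemble_crps(members, y, alpha=0.0):
    """Ensemble CRPS estimate for a scalar observation y.

    alpha = 0 gives the standard estimator (spread term divided by 2*M^2),
    alpha = 1 gives the fair estimator (spread term divided by 2*M*(M-1)),
    and values in between give an assumed "almost fair" blend of the two.
    """
    m = len(members)
    if m < 2:
        raise ValueError("need at least two ensemble members")
    accuracy = sum(abs(x - y) for x in members) / m
    spread = sum(abs(xi - xj) for xi in members for xj in members)
    standard = accuracy - spread / (2.0 * m * m)
    fair = accuracy - spread / (2.0 * m * (m - 1))
    return alpha * fair + (1.0 - alpha) * standard


members = [0.0, 1.0]
print(ensemble_crps(members, 0.5))             # standard estimator
print(ensemble_crps(members, 0.5, alpha=1.0))  # fair estimator
```

For small ensembles the two estimators differ noticeably, because the standard estimator under-weights the spread term. The gap shrinks as the ensemble size grows.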

4. Aggregation with Confidence, Expertise, and Structured Signals

Aggregation decisions frequently benefit from explicit modeling of source confidence, expertise, or problem-relevant structure. One approach elicits not only each agent's predictive distribution but also their "confidence," operationalized as equivalent sample size in a conjugate prior. Under suitable uniqueness conditions, the principal can invert the reported distributions to infer hyperparameters and aggregate as if all agents' data were pooled (Frongillo et al., 2014).
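A Beta-Bernoulli model gives a small concrete instance of this idea. The sketch assumes that all agents start from the same known Beta prior, that each agent has observed a separate (non-overlapping) sample, and that each reports a posterior mean together with an equivalent sample size equal to the sum of the posterior's two parameters. The function name and the numbers are hypothetical.

```python
def pool_beta_reports(reports, prior=(1.0, 1.0)):
    """Pool Beta posteriors as if all agents' observations were combined.

    reports : list of (mean, equivalent_sample_size) pairs, one per agent,
              where equivalent_sample_size = alpha_j + beta_j of the agent's posterior
    prior   : shared Beta prior (alpha_0, beta_0), assumed known to the principal
    """
    a0, b0 = prior
    alpha, beta = a0, b0
    for mean, ess in reports:
        a_j, b_j = mean * ess, (1.0 - mean) * ess  # invert the report to hyperparameters
        alpha += a_j - a0                          # successes observed by agent j
        beta += b_j - b0                           # failures observed by agent j
    return alpha, beta


# Agent 1 is more confident (larger equivalent sample size) than agent 2.
alpha, beta = pool_beta_reports([(0.75, 8.0), (0.4, 5.0)])
pooled_mean = alpha / (alpha + beta)
print(alpha, beta, pooled_mean)
```

The pooled mean lies closer to the more confident agent's report, because that agent contributes more pseudo-observations.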

Further, Bayesian models incorporate peer predictions—i.e., agents' beliefs about others' responses—into the aggregation process, as in the Possible Worlds Model, which infers the most probable world state even if it is a minority view (McCoy et al., 2017). These models allow for estimation of both agent-level expertise/competence and group-level parameters, with posterior inference typically performed by MCMC.

5. Aggregation of Rankings, Preferences, and Multimodal Signals

Rank and preference aggregation often moves beyond direct probability pooling to structured statistical estimation under comparability or transitivity assumptions. In pairwise comparison data, the Maximum Score Estimator under Weak Stochastic Transitivity (WST) seeks a ranking that maximally agrees with the observed outcomes, requiring only the minimal WST property and obviating strong monotonicity requirements of classical models. The estimator is proven consistent and (nearly) minimax optimal in Kendall's tau (Zhang et al., 8 Oct 2025).
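The idea of choosing the ranking that agrees with as many observed comparisons as possible can be sketched by brute force for a handful of items. This is an illustration only: the exact objective and any efficient algorithm in the cited work may differ, and the function name and data are hypothetical.

```python
from itertools import permutations


def max_score_ranking(wins, items):
    """Return the ordering that agrees with the most observed pairwise outcomes.

    wins[(i, j)] is the number of times item i beat item j. The search
    enumerates every permutation, so it is only feasible for small item sets.
    """
    def agreement(order):
        return sum(
            wins.get((order[hi], order[lo]), 0)
            for hi in range(len(order))
            for lo in range(hi + 1, len(order))
        )

    return max(permutations(items), key=agreement)


wins = {
    ("A", "B"): 3, ("B", "A"): 1,
    ("B", "C"): 2, ("C", "B"): 1,
    ("A", "C"): 4, ("C", "A"): 0,
}
print(max_score_ranking(wins, ["A", "B", "C"]))  # ('A', 'B', 'C')
```

No parametric model of the win probabilities is fitted here; only the direction of each observed outcome enters the score.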

When both ratings and comparisons are available, as in SCoRa, a probabilistic MAP estimator unifies both forms using a convex loss. The combined use of direct and comparative judgments is strictly superior for accurate identification of top-ranked entities under realistic cost constraints (Fageot et al., 8 Feb 2026).

Correlation-aware aggregation becomes essential when sources are algorithmic, with potentially complex dependence. In such cases, aggregation via maximum likelihood using the full covariance structure achieves optimality, while Embedded Voting methods provide near-optimal performance even with limited or no training data by uncovering low-dimensional structure and groupings without explicit noise estimation (Delemazure et al., 2023).
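Under the simplifying assumption that every source reports the true value plus zero-mean Gaussian noise with a known covariance matrix, the maximum likelihood aggregate is the generalized least squares estimate $\hat\theta = (\mathbf{1}^\top\Sigma^{-1}x)/(\mathbf{1}^\top\Sigma^{-1}\mathbf{1})$. The sketch below implements that special case. It does not implement Embedded Voting, and the function names and covariance values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ml_aggregate(scores, cov):
    """Gaussian maximum likelihood estimate of a common value from correlated scores."""
    w = solve(cov, [1.0] * len(scores))  # unnormalized weights Sigma^{-1} 1
    return sum(wi * xi for wi, xi in zip(w, scores)) / sum(w)


# Sources 1 and 2 are strongly correlated; source 3 is independent.
cov = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
estimate = ml_aggregate([1.0, 1.0, 0.0], cov)
print(estimate)
```

In this example the two correlated sources together receive roughly the weight of one independent source, so the estimate sits near 0.51 rather than at the plain average of 0.67.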

6. Aggregation in Multivariate and Group Decision-Making Contexts

In high-dimensional or spatially structured settings—meteorology, spatial statistics, decision support—aggregation must accommodate complex dependence structures and targeted evaluation criteria. Aggregation-and-Transformation frameworks generalize proper scoring rules by transforming (projecting, patching, or recombining) multivariate CDFs, applying base univariate/multivariate scoring rules (e.g., CRPS, variogram, anisotropy scores), and aggregating to form composite scores tailored to specific forecast features. The framework's algebraic properties ensure preservation of propriety under transformation and aggregation, and allow construction of scores sensitive to different aspects (marginals, tails, spatial coherence, etc.) (Pic et al., 2024).
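The transform-then-aggregate pattern can be sketched for an ensemble forecast. Each transformation maps a multivariate vector to a scalar, a univariate base score (here the standard ensemble CRPS) is applied to the transformed ensemble and observation, and the results are combined with nonnegative weights. The transformations, weights, and function names below are hypothetical choices, not those of the cited framework.

```python
def ensemble_crps(members, y):
    """Standard ensemble CRPS estimate for a scalar observation."""
    m = len(members)
    accuracy = sum(abs(x - y) for x in members) / m
    spread = sum(abs(xi - xj) for xi in members for xj in members)
    return accuracy - spread / (2.0 * m * m)


def marginal(d):
    """Transformation that extracts component d of a vector."""
    return lambda v: v[d]


def spatial_mean(v):
    """Transformation that averages all components of a vector."""
    return sum(v) / len(v)


def aggregated_score(ensemble, obs, transforms, weights):
    """Weighted sum of base scores computed on transformed forecasts.

    ensemble   : list of multivariate members (each a list of floats)
    obs        : observed vector
    transforms : functions mapping a vector to a scalar
    weights    : nonnegative weight for each transformation
    """
    total = 0.0
    for transform, w in zip(transforms, weights):
        projected = [transform(member) for member in ensemble]
        total += w * ensemble_crps(projected, transform(obs))
    return total


ensemble = [[0.0, 0.0], [1.0, 1.0]]
obs = [0.5, 0.5]
transforms = [marginal(0), marginal(1), spatial_mean]
score = aggregated_score(ensemble, obs, transforms, [1.0, 1.0, 1.0])
print(score)
```

Changing the list of transformations or their weights changes which forecast features the composite score is sensitive to. Each term is still a proper score applied to a transformed forecast.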

In group decision-making, coherent Bayesian aggregation is possible even when expert panels oversee only disjoint subvectors of a high-dimensional outcome, provided a polynomial (algebraic) conditional expected utility can be expressed in terms of panel-specific summaries, and a quasi-independence (moment-separability) condition holds. Only a small vector of moment summaries per panel is required for global optimality, with closed-form formulas for aggregated utility scores available in terms of these moments (Leonelli et al., 2017).

7. Calibration, Abstention, and Finite-Sample Guarantees

Aggregation uncertainty is a paramount concern, particularly for high-stakes predictions such as chain-of-thought (CoT) reasoning in LLMs. Score-weighted aggregation across sampled reasoning paths, with abstention calibrated by conformal risk control, yields finite-sample guarantees that bound the confident-error rate, that is, the probability of answering when wrong. The abstention threshold is computed from held-out calibration data so that, when calibration and test examples are exchangeable, the expected confident-error rate on new examples does not exceed the target level. Analysis reveals that "score separability" is necessary and sufficient for abstention to strictly improve selective accuracy, with closed-form predictors for post-abstention accuracy derived from calibration distributions (Gu et al., 13 May 2026).
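The two steps, score-weighted aggregation and threshold calibration, can be sketched as follows. This is a generic conformal-risk-control recipe written under stated assumptions, not the procedure of the cited paper: the loss is 1 when the system answers and is wrong and 0 otherwise, the system answers only when the aggregated score is strictly above the threshold, and calibration and test examples are exchangeable. All names and data are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_paths(paths):
    """Score-weighted vote over sampled reasoning paths.

    paths : list of (answer, score) pairs
    Returns the answer with the largest total score and its share of the total.
    """
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


def calibrate_threshold(cal_scores, cal_correct, alpha):
    """Smallest threshold whose corrected calibration risk is at most alpha.

    The corrected risk is (confident errors + 1) / (n + 1), where a confident
    error is a calibration item with score > threshold that is wrong.
    Returns math.inf (always abstain) if no candidate threshold qualifies.
    """
    n = len(cal_scores)
    for lam in [-math.inf] + sorted(set(cal_scores)):
        errors = sum(1 for s, ok in zip(cal_scores, cal_correct) if s > lam and not ok)
        if (errors + 1) / (n + 1) <= alpha:
            return lam
    return math.inf


answer, confidence = aggregate_paths([("A", 0.5), ("B", 0.3), ("A", 0.4)])

cal_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
cal_correct = [True, True, True, False, True, False, False, True, False]
threshold = calibrate_threshold(cal_scores, cal_correct, alpha=0.25)
decision = answer if confidence > threshold else None  # None means abstain
print(answer, confidence, threshold, decision)
```

A lower target level `alpha` pushes the threshold up and makes the system abstain more often. With very little calibration data, the correction term `1 / (n + 1)` alone can exceed the target, in which case the sketch always abstains.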

In summary, probability-based score aggregation, as formalized in the recent literature, encompasses a spectrum of advanced statistical, Bayesian, decision-theoretic, and online-learning methodologies for synthesizing uncertain or subjective information. Methods are diverse—ranging from generative Bayesian pooling, scoring-rule minimization, and spectral embedding, to algebraic group-based synthesis and calibration-aware abstention—but are unified by a reliance on probabilistic principles, proper scoring, and explicit modeling of heterogeneity, dependence, and uncertainty. These advances have delivered theoretical guarantees, improved empirical performance, and broad applicability across forecasting, preference learning, group decision-making, and selective machine reasoning.