Paradox Severity Indexing
- Paradox severity indexing is a quantitative framework that measures the magnitude and impact of paradoxes in multivariate evaluation systems.
- In LLM evaluation, the Paradox Severity Index quantifies divergence between judge-based conceptual accuracy and binary scoring, highlighting cases of measurement failure.
- Extended to weighted multi-issue voting, the indexing framework exposes methodological incoherence and enables transparent cross-domain comparisons.
Paradox severity indexing is a set of quantitative frameworks designed to measure the magnitude and impact of paradoxes arising in multivariate evaluation systems, particularly when different scoring modalities or aggregative schemes yield conflicting notions of “consensus” or correctness. This approach is especially salient in domains where classical test theory, judgment-based scoring, and exact-match metrics interact nontrivially, such as AI benchmark evaluation and multi-issue social choice. Paradox severity indices provide tight, interpretable numeric bounds on the worst-case discrepancy between intuitive, composite, or majority-based outcomes and those preferred under alternate or theoretically “natural” regimes, highlighting regions of methodological incoherence and enabling transparent cross-model and cross-domain comparisons.
1. Paradox Severity Index in LLM Evaluation
The Paradox Severity Index (PSI), as introduced in "The Catastrophic Paradox of Human Cognitive Frameworks in LLM Evaluation" (Reddy, 23 Nov 2025), quantifies the extent to which traditional binary scoring regimes diverge from judge-based conceptual accuracy in the evaluation of frontier LLMs. Specifically, PSI up-weights this divergence by the model’s Classical Test Theory (CTT)–scaled IQ, exposing cases where higher measured intelligence coincides with catastrophic measurement failure.
Formally, for model $m$: $\mathrm{PSI}_m = \left(\bar{A}^{\mathrm{judge}}_m - \bar{A}^{\mathrm{binary}}_m\right) \cdot \widetilde{IQ}_m$, where
- $\bar{A}^{\mathrm{judge}}_m$ is the mean LLM-as-judge conceptual accuracy,
- $\bar{A}^{\mathrm{binary}}_m$ is the mean exact-match binary accuracy across the same items,
- $\widetilde{IQ}_m$ is the model's scaled IQ score.
A worked example (with illustrative values) demonstrates the index:
- If $\bar{A}^{\mathrm{judge}}_m = 0.85$, $\bar{A}^{\mathrm{binary}}_m = 0.45$, $\widetilde{IQ}_m = 1.2$,
- the raw gap is $0.85 - 0.45 = 0.40$,
- $\mathrm{PSI}_m = 0.40 \times 1.2 = 0.48$.
Empirical values (Table 7, (Reddy, 23 Nov 2025)) range from 0.39 to 0.60 across nine state-of-the-art LLMs; higher PSI indicates greater paradoxical misalignment between individually plausible metrics.
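As a concrete illustration, the sketch below computes the index from the three quantities defined above. The function name is hypothetical, the multiplicative treatment of the scaled IQ (expressed as a factor near 1) is an assumption about the source's normalization, and the numeric inputs mirror the illustrative worked example rather than values from Table 7.

```python
def paradox_severity_index(judge_acc: float, binary_acc: float, scaled_iq: float) -> float:
    """Sketch of a PSI-style index: divergence between judge-based and
    exact-match accuracy, up-weighted by the model's scaled IQ.

    Assumption: scaled_iq is supplied as a multiplicative factor; the exact
    normalization used in the source may differ.
    """
    raw_gap = judge_acc - binary_acc   # conceptual vs. exact-match accuracy gap
    return raw_gap * scaled_iq         # up-weight the gap by measured ability


# Illustrative inputs (not values from Table 7):
print(paradox_severity_index(judge_acc=0.85, binary_acc=0.45, scaled_iq=1.2))  # ~0.48
```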
2. Severity Indexing in Weighted Multi-Issue Voting
In multi-issue collective decision making, paradox severity indexing addresses the systematic divergence between issue-wise majority and majority-supported proposals in the context of weighted binary voting, including Anscombe's and Ostrogorski's paradoxes. The critical parameter is the maximum average topic weight, denoted $\ell$, representing the highest concentration of voter weight across topics.
For $n$ voters over $m$ binary issues, each voter $i$ equipped with a unit-sum weight vector $w_i$:
The worst-case distance from the issue-wise majority over all instances with maximum average topic weight $\ell$ is bounded piecewise: $g_{\ell} \leq \begin{cases} \frac{1}{2} + \frac{\ell}{2}, & 0 < \ell < \frac{1}{3} \\[6pt] 1 - \ell, & \frac{1}{3} \leq \ell \leq \frac{1}{2} \\[6pt] \ell, & \frac{1}{2} < \ell < 1 \end{cases}$ This bound is tight for a dense set of values of $\ell$ and all $n$ (Baharav et al., 20 Feb 2025).
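A minimal sketch of how $\ell$ and the piecewise bound could be evaluated for a given weight profile, assuming $\ell$ is the largest per-topic weight averaged across voters, as described above; function names and the example profile are illustrative.

```python
import numpy as np

def max_average_topic_weight(weights: np.ndarray) -> float:
    """weights: (n_voters, m_topics) array of unit-sum weight vectors w_i.
    Returns ell, the largest topic weight averaged across voters."""
    return float(weights.mean(axis=0).max())

def worst_case_deviation_bound(ell: float) -> float:
    """Piecewise upper bound g_ell on the worst-case distance from the
    issue-wise majority, as stated in the bound above."""
    if 0 < ell < 1 / 3:
        return 0.5 + ell / 2
    if 1 / 3 <= ell <= 0.5:
        return 1 - ell
    if 0.5 < ell < 1:
        return ell
    raise ValueError("ell must lie strictly between 0 and 1")

# Illustrative profile: three voters, three topics, each row sums to 1.
W = np.array([[0.6, 0.2, 0.2],
              [0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2]])
ell = max_average_topic_weight(W)            # 0.5
print(ell, worst_case_deviation_bound(ell))  # 0.5 0.5
```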
3. Interpretation Bands and Heuristics
In LLM evaluation (Reddy, 23 Nov 2025), PSI bands are heuristically interpreted:
- Low PSI: mild paradox; judge and binary scores largely agree.
- Intermediate PSI: moderate paradox; substantial misalignment between the two scoring modes.
- High PSI: severe paradox; catastrophic measurement failure.
In weighted voting (Baharav et al., 20 Feb 2025), $\ell$ quantifies severity: small $\ell$ corresponds to mild paradoxes (per-issue consensus aligns closely with proposal consensus), while large $\ell$ allows paradox-induced deviations up to full disagreement.
4. Complementarity with Item Response Theory and Judge Validation
PSI supplements latent ability modeling (2PL IRT, with $\theta$ for ability, $b$ for difficulty, and $a$ for discrimination), enabling a dual-axis analysis of ability and paradoxical misalignment. Judge-vendor validation, using rubric-based, cross-vendor LLM-as-Judge protocols, guarantees that $\bar{A}^{\mathrm{judge}}$ robustly isolates conceptual correctness. PSI's interpretability is conditional on rigorous conceptual scoring; without it, the index loses reliability.
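For reference, a minimal sketch of the standard 2PL item response function that this dual-axis analysis presupposes, using the conventional symbols $\theta$ (ability), $b$ (difficulty), and $a$ (discrimination); this is generic IRT code, not the source's estimation pipeline.

```python
import math

def two_pl_probability(theta: float, b: float, a: float) -> float:
    """2PL IRT: probability that a respondent with ability theta answers an
    item with difficulty b and discrimination a correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the success probability is 0.5,
# regardless of discrimination:
print(two_pl_probability(theta=0.0, b=0.0, a=1.2))  # 0.5
```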
In voting, $\ell$ is computed directly from the weight profile in time linear in its size and flags "dangerous" distributions: high concentrations of topic weight induce severe paradoxes, while evenly distributed weights align with classical consensus bounds.
5. Generalization to Other AI and Social Choice Evaluation Domains
Paradox severity indexing is extensible wherever two qualitatively distinct scoring methods (one conceptually focused and validated) exhibit systematic divergence, and where a normative scaling factor (IQ, ELO, percentile) is available. Example applications:
- Computer vision: pixel-level matches vs. human annotation, weighted by validated gold-standard accuracy.
- Dialogue systems: automated BLEU vs. human perception, possibly multiplied by fluency or coherence norms.
- Robotics: sensor-driven success rates vs. expert evaluations, up-weighted by capability scores.
Necessary conditions for generalization: (a) Two complementary, domain-relevant scoring methods, (b) Reliable conceptual (often human or judge-based) scoring, (c) Anchoring by a cross-system scale where larger model-level gaps induce higher index values.
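Under these conditions the PSI pattern can be written generically. The sketch below is a hypothetical template, not an implementation from any of the cited works: two scoring streams and a cross-system scale factor combine into an index, with all names and inputs illustrative.

```python
from typing import Sequence

def generic_severity_index(
    conceptual_scores: Sequence[float],
    strict_scores: Sequence[float],
    scale_factor: float,
) -> float:
    """Hypothetical generalization of the PSI pattern: the mean gap between a
    validated conceptual metric and a strict automated metric, anchored by a
    cross-system scale factor (IQ, ELO, percentile, ...)."""
    gap = (sum(conceptual_scores) / len(conceptual_scores)
           - sum(strict_scores) / len(strict_scores))
    return gap * scale_factor

# E.g., human-annotation scores vs. pixel-level matches for a vision model:
print(generic_severity_index([0.9, 0.8, 0.85], [0.5, 0.4, 0.45], scale_factor=1.1))  # ~0.44
```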
This suggests paradox severity indices can systematically expose architecture-specific failures in metric validity, facilitating domain-aware, paradigm-shifting evaluation frameworks.
6. Wagner’s Rule and Sufficient Paradox-Preclusion Conditions
A consequence in weighted multi-issue voting (Baharav et al., 20 Feb 2025) is a sufficient condition for paradox avoidance: if the average majority (the sum of issue-wise majority sizes, weighted by the topic weights and normalized) is at least 3/4, then Anscombe's paradox cannot occur; the issue-wise majority is protected against defeat. This complements the severity index by establishing consensus thresholds above which paradoxes are impossible, irrespective of the topic weight distribution.
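A hedged sketch of such a preclusion check, assuming the weighted average majority is computed against a single unit-sum topic-weight vector (e.g., the voter-average weights) and using the classical three-fourths threshold of Wagner's rule; the exact weighted threshold in the cited work may differ, and the function names are illustrative.

```python
import numpy as np

def weighted_average_majority(votes: np.ndarray, topic_weights: np.ndarray) -> float:
    """votes: (n_voters, m_topics) 0/1 matrix; topic_weights: unit-sum vector of
    length m_topics. Returns the weighted, normalized average majority size."""
    support = votes.mean(axis=0)                      # fraction voting 1 on each topic
    majority_size = np.maximum(support, 1 - support)  # size of the winning side per topic
    return float(np.dot(topic_weights, majority_size))

def anscombe_paradox_precluded(votes, topic_weights, threshold=0.75) -> bool:
    """Sufficient (not necessary) check in the spirit of Wagner's rule; the
    3/4 threshold is the classical value and is assumed here."""
    return weighted_average_majority(np.asarray(votes), np.asarray(topic_weights)) >= threshold

# Every topic carries a 3/4 majority, so the weighted average majority is 0.75.
votes = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [0, 1, 1],
                  [1, 1, 1]])
weights = np.array([0.5, 0.25, 0.25])
print(weighted_average_majority(votes, weights), anscombe_paradox_precluded(votes, weights))  # 0.75 True
```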
7. Computational and Practical Considerations
Both PSI and $\ell$ are simple to compute, scale linearly in the number of tasks or issues, and can be incorporated into automated benchmark reporting. In practice, high index values flag domains, regimes, or configurations where standard evaluation collapses, motivating further methodological refinement and the development of substrate-sensitive assessment protocols. This framework supports transparent, numeric quantification of measurement collapse and sharpens the boundary between biologically-grounded and architecture-native testing in AI and voting systems.