
Paradox Severity Indexing

Updated 30 November 2025
  • Paradox severity indexing is a quantitative framework that measures the magnitude and impact of paradoxes in multivariate evaluation systems.
  • In LLM evaluation, the Paradox Severity Index quantifies divergence between judge-based conceptual accuracy and binary scoring, highlighting cases of measurement failure.
  • By extending to weighted multi-issue voting, the index exposes methodological incoherence and enables transparent cross-domain comparisons.

Paradox severity indexing is a set of quantitative frameworks designed to measure the magnitude and impact of paradoxes arising in multivariate evaluation systems, particularly when different scoring modalities or aggregative schemes yield conflicting notions of “consensus” or correctness. This approach is especially salient in domains where classical test theory, judgment-based scoring, and exact-match metrics interact nontrivially, such as AI benchmark evaluation and multi-issue social choice. Paradox severity indices provide tight, interpretable numeric bounds on the worst-case discrepancy between intuitive, composite, or majority-based outcomes and those preferred under alternate or theoretically “natural” regimes, highlighting regions of methodological incoherence and enabling transparent cross-model and cross-domain comparisons.

1. Paradox Severity Index in LLM Evaluation

The Paradox Severity Index (PSI), as introduced in "The Catastrophic Paradox of Human Cognitive Frameworks in LLM Evaluation" (Reddy, 23 Nov 2025), quantifies the extent to which traditional binary scoring regimes diverge from judge-based conceptual accuracy in the evaluation of frontier LLMs. Specifically, PSI up-weights this divergence by the model’s Classical Test Theory (CTT)–scaled IQ, exposing cases where higher measured intelligence coincides with catastrophic measurement failure.

Formally, for model $i$:

$$\mathrm{PSI}_i = \left|\mathrm{JudgeAcc}_i - \mathrm{BinaryAcc}_i\right| \times \frac{\mathrm{IQ}_{\mathrm{CTT},i}}{100}$$

where

  • $\mathrm{JudgeAcc}_i$ is the mean LLM-as-judge conceptual accuracy,
  • $\mathrm{BinaryAcc}_i$ is the mean exact-match binary accuracy across the same items,
  • $\mathrm{IQ}_{\mathrm{CTT},i}$ is the model's CTT-scaled IQ score.

A worked example demonstrates the index:

  • If $\mathrm{JudgeAcc}_X = 0.48$, $\mathrm{BinaryAcc}_X = 1.00$, $\mathrm{IQ}_X = 100$,
  • the raw gap is $|0.48 - 1.00| = 0.52$,
  • $\mathrm{PSI}_X = 0.52 \times 1.0 = 0.52$.

Empirical values (Table 7; Reddy, 23 Nov 2025) range from 0.39 to 0.60 across nine state-of-the-art LLMs; higher PSI indicates greater paradoxical misalignment between individually plausible metrics.
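The PSI computation can be sketched in a few lines of Python. The function name and signature are illustrative choices, not the paper's reference implementation; the input values reproduce the worked example above.

```python
def paradox_severity_index(judge_acc: float, binary_acc: float, iq_ctt: float) -> float:
    """PSI = |JudgeAcc - BinaryAcc| * (IQ_CTT / 100)."""
    return abs(judge_acc - binary_acc) * (iq_ctt / 100.0)

# Worked example from the text: JudgeAcc = 0.48, BinaryAcc = 1.00, IQ = 100.
psi_x = paradox_severity_index(0.48, 1.00, 100.0)
print(round(psi_x, 2))  # 0.52
```

Note that an IQ above 100 amplifies the raw accuracy gap, which is exactly the intended behavior: the same divergence is scored as more severe in a nominally more capable model.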

2. Severity Indexing in Weighted Multi-Issue Voting

In multi-issue collective decision making, paradox severity indexing addresses the systematic divergence between issue-wise majority outcomes and majority-supported proposals in weighted binary voting, covering Anscombe's and Ostrogorski's paradoxes. The critical parameter is the maximum average topic weight, denoted here $\alpha$, representing the highest concentration of voter weight on any single topic.

For $n$ voters over $m$ binary issues, each voter $v$ with a unit-sum weight vector $w_v = (w_{v,1}, \dots, w_{v,m})$:

$$\alpha = \max_{t \in \{1, \dots, m\}} \frac{1}{n} \sum_{v=1}^{n} w_{v,t}$$

The worst-case distance between the issue-wise majority outcome and a majority-supported proposal, taken over all instances with maximum average topic weight $\alpha$, admits a piecewise bound in $\alpha$; this bound is tight for a dense set of $\alpha$ values (Baharav et al., 20 Feb 2025).
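The maximum average topic weight follows directly from its definition: average each topic's weight over voters, then take the largest such average. A minimal sketch, with illustrative names:

```python
def max_average_topic_weight(weights: list[list[float]]) -> float:
    """weights[v][t] is voter v's unit-sum weight on topic t.
    Returns the largest per-topic weight averaged over voters, in O(n*m)."""
    n = len(weights)
    m = len(weights[0])
    return max(sum(w[t] for w in weights) / n for t in range(m))

# Three voters, two topics; each voter's row sums to 1.
w = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
print(max_average_topic_weight(w))  # ≈ 0.8
```

Here voter weight is heavily concentrated on the first topic (average 0.8), which by the bound above is the regime where severe paradoxes become possible.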

3. Interpretation Bands and Heuristics

In LLM evaluation (Reddy, 23 Nov 2025), PSI bands are heuristically interpreted:

  • Low PSI: mild paradox; judge and binary scores largely agree.
  • Intermediate PSI: moderate paradox; substantial misalignment between the two metrics.
  • High PSI: severe paradox; catastrophic measurement failure.

In weighted voting (Baharav et al., 20 Feb 2025), the maximum average topic weight quantifies severity: small values correspond to mild paradoxes (per-issue consensus aligns closely with proposal consensus), while large values allow paradox-induced deviations up to full disagreement.

4. Complementarity with Item Response Theory and Judge Validation

PSI supplements latent ability modeling (2PL IRT, with $\theta$ for ability, $b$ for difficulty, and $a$ for discrimination), enabling a dual-axis analysis: ability and paradoxical misalignment. Judge-vendor validation, using rubric-based, cross-vendor LLM-as-Judge protocols, helps ensure that judge-based conceptual accuracy robustly isolates conceptual correctness. PSI's interpretability is conditional on rigorous conceptual scoring; without it, the index is unreliable.
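For reference, the 2PL item response function is the standard logistic form; this is a generic sketch of the textbook model, not the paper's fitted parameters.

```python
import math

def two_pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability of a correct response given ability theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An item whose difficulty matches the examinee's ability is answered
# correctly with probability exactly 0.5, regardless of discrimination.
print(two_pl(theta=0.0, a=1.5, b=0.0))  # 0.5
```

The dual-axis reading is then: $\theta$ locates a model on the ability axis, while PSI locates it on the misalignment axis; a high-$\theta$, high-PSI model is precisely the catastrophic case the index is designed to expose.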

In voting, the maximum average topic weight is computed in $O(nm)$ time and flags "dangerous" weight distributions: high concentrations of topic weight permit severe paradoxes, while evenly distributed weights align with classical consensus bounds.

5. Generalization to Other AI and Social Choice Evaluation Domains

Paradox severity indexing is extensible wherever two qualitatively distinct scoring methods (one conceptually focused and validated) exhibit systematic divergence, and where a normative scaling factor (IQ, ELO, percentile) is available. Example applications:

  • Computer vision: pixel-level matches vs. human annotation, weighted by validated gold-standard accuracy.
  • Dialogue systems: automated BLEU vs. human perception, possibly multiplied by fluency or coherence norms.
  • Robotics: sensor-driven success rates vs. expert evaluations, up-weighted by capability scores.

Necessary conditions for generalization: (a) Two complementary, domain-relevant scoring methods, (b) Reliable conceptual (often human or judge-based) scoring, (c) Anchoring by a cross-system scale where larger model-level gaps induce higher index values.
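These conditions translate into a domain-agnostic variant of the index. The sketch below carries over PSI's convention of dividing the anchor score by 100; that scaling, and all names, are assumptions for illustration.

```python
def generic_severity_index(conceptual_score: float,
                           exact_score: float,
                           anchor_scale: float) -> float:
    """Domain-agnostic paradox severity: absolute gap between a validated
    conceptual metric and an exact-match metric, up-weighted by a normative
    cross-system scale (e.g. IQ, Elo, percentile) centered at 100."""
    return abs(conceptual_score - exact_score) * (anchor_scale / 100.0)

# Hypothetical dialogue-system example: human rating 0.7, BLEU-based 0.4,
# fluency norm of 120 on a 100-centered scale.
print(generic_severity_index(0.7, 0.4, 120.0))  # ≈ 0.36
```

Condition (c) is what makes the anchor division meaningful: the scale must be comparable across systems, so that the same raw metric gap is penalized more in systems rated as more capable.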

This suggests paradox severity indices can systematically expose architecture-specific failures in metric validity, facilitating domain-aware, paradigm-shifting evaluation frameworks.

6. Wagner’s Rule and Sufficient Paradox-Preclusion Conditions

A consequence in weighted multi-issue voting (Baharav et al., 20 Feb 2025) is a sufficient condition for paradox avoidance, echoing Wagner's rule of three-fourths: if the average majority share (issue-wise majority shares weighted by the topic weights and normalized) is at least $3/4$, then Anscombe's paradox cannot occur, and the issue-wise majority is protected against defeat. This complements the severity index by establishing consensus thresholds above which paradoxes are impossible, irrespective of the topic weight distribution.
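A minimal check of the three-fourths condition, assuming issue-wise majority shares and topic weights are given as inputs; the helper name is illustrative.

```python
def wagner_precludes_paradox(majority_shares: list[float],
                             topic_weights: list[float]) -> bool:
    """Sufficient condition (Wagner's rule of three-fourths): if the
    weighted average issue-wise majority share is at least 3/4,
    Anscombe's paradox cannot occur."""
    total = sum(topic_weights)
    avg_majority = sum(s * w for s, w in zip(majority_shares, topic_weights)) / total
    return avg_majority >= 0.75

# Two issues with 80% and 75% majorities, equal topic weights:
# weighted average is 0.775 >= 0.75, so the paradox is precluded.
print(wagner_precludes_paradox([0.80, 0.75], [0.5, 0.5]))  # True
```

Note this is one-directional: a `False` result does not mean a paradox occurs, only that this sufficient condition fails to rule it out.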

7. Computational and Practical Considerations

Both PSI and the maximum average topic weight are simple to compute, scale linearly in the number of tasks or issues, and can be incorporated into automated benchmark reporting. In practice, high index values flag domains, regimes, or configurations where standard evaluation collapses, motivating further methodological refinement and the development of substrate-sensitive assessment protocols. The framework supports transparent, numeric quantification of measurement collapse and sharpens the boundary between biologically grounded and architecture-native testing in AI and voting systems.
