Score-Comparison Inconsistency
- Score-Comparison Inconsistency is the phenomenon where comparing scores leads to unreliable conclusions due to deviations from theoretical assumptions and aggregation effects.
- Methodologies based on pairwise and ordinal comparisons reveal that inconsistency can be quantitatively bounded using axiomatic indices and adjustments in aggregation and scale.
- Practical implications span automated model evaluations, including LLM benchmarking, where best practices involve order-sensitive scoring, normalized metrics, and rigorous consistency testing.
Score-Comparison Inconsistency is a broad class of phenomena wherein the comparison of scores—whether model outputs, algorithm evaluations, or subjective assessments—leads to unreliable, ill-posed, or contradictory conclusions due to underlying deviations from theoretical assumptions, information loss, or context-dependent ambiguities. This inconsistency manifests in domains ranging from pairwise comparison matrices in decision analysis to modern automated model evaluation with LLMs, and is deeply tied to the properties of the scoring functions, aggregation mechanisms, and evaluation protocols employed.
1. Foundations in Pairwise Comparison Matrices
Early formalizations of score-comparison inconsistency arise in the context of pairwise comparison (PC) matrices, where alternatives are rated relative to each other via multiplicative or ordinal comparisons. The archetypal source is the violation of consistency in triads: for a 3×3 submatrix with entries (x, y, z), true consistency requires y = xz. The axiomatization of inconsistency indicators in triads (Koczkodaj et al., 2015) imposes three core requirements:
- Zero Inconsistency at Consistency: the indicator equals 0 if and only if y = xz.
- Boundedness: the indicator is bounded (e.g., takes values in [0, 1)) for all positive x, y, z.
- Monotonicity: deviation from the consistency condition y = xz monotonically increases the indicator.
A mathematically induced “triad deviation” function, e.g., the distance of y/(xz) from 1 under a metric d, provides the theoretical basis for inconsistency measurement, with examples including Koczkodaj’s index KI(x, y, z) = min(|1 − y/(xz)|, |1 − xz/y|).
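As a concrete illustration, here is a minimal Python sketch of the triad-level deviation above, using the Koczkodaj-style formula; the numeric values are purely illustrative.

```python
def triad_inconsistency(x: float, y: float, z: float) -> float:
    """Koczkodaj-style inconsistency of a triad (x, y, z).

    Returns 0 exactly when the multiplicative consistency
    condition y == x * z holds, and grows toward 1 as the
    triad deviates from it.
    """
    return min(abs(1 - y / (x * z)), abs(1 - (x * z) / y))

# A consistent triad (y = x * z) scores 0; perturbing y raises the index.
print(triad_inconsistency(2.0, 6.0, 3.0))   # 0.0
print(triad_inconsistency(2.0, 9.0, 3.0))   # ~0.333
```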
A key result is that, even on this simplest non-trivial structure, acceptable inconsistency indices are sharply constrained: extensions of the axiomatic systems (adding homogeneity and scale invariance (Csató, 2018)) uniquely determine the ranking of inconsistency among all possible indices, essentially reducing any reasonable measure to a function of the deviation from multiplicative transitivity. This characterization underpins much of modern decision analytic consistency checking.
2. Quantitative Manifestations and Aggregation Effects
Score-comparison inconsistency is exacerbated in the aggregation of local inconsistency (e.g., among triads or cycles) to global rankings or priority vectors. Work comparing principal eigenvector and geometric mean methods (Kułakowski et al., 2020) reveals that, even when global indices such as Saaty’s Consistency Index (CI) or Koczkodaj’s Index (KI) are low, aggregation methods can produce nontrivial differences in output rankings; both the element-wise discrepancy between the two priority vectors and the total deviation of the induced rankings are provably bounded by quantities that grow with the measured inconsistency. As inconsistency increases, method-dependent artifacts and divergence in output rankings become pronounced.
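A small numerical sketch, assuming NumPy and an illustrative 4×4 reciprocal matrix, of how the eigenvector and geometric-mean priorities can be computed and compared alongside Saaty's CI:

```python
import numpy as np

def priority_vectors(A: np.ndarray):
    """Eigenvector (EV) and geometric-mean (GM) priorities of a PC matrix A."""
    # EV method: principal right eigenvector, normalized to sum to 1.
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    ev = np.abs(eigvecs[:, k].real)
    ev /= ev.sum()
    # GM method: row-wise geometric means, normalized to sum to 1.
    gm = np.exp(np.log(A).mean(axis=1))
    gm /= gm.sum()
    # Saaty's Consistency Index: CI = (lambda_max - n) / (n - 1).
    n = A.shape[0]
    ci = (eigvals.real.max() - n) / (n - 1)
    return ev, gm, ci

# A mildly inconsistent 4x4 reciprocal matrix (illustrative values).
A = np.array([[1.0, 2.0, 4.0, 9.0],
              [1/2, 1.0, 2.0, 3.0],
              [1/4, 1/2, 1.0, 2.0],
              [1/9, 1/3, 1/2, 1.0]])
ev, gm, ci = priority_vectors(A)
# For inconsistent matrices with n > 3 the two priority vectors generally differ.
print(ev, gm, ci)
```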
The impact of scale is also critical: restricting PC entries to a 1–3 scale (rather than 1–5 or 1–9) renders the feasible region of the least-squares approximation problem convex, guaranteeing unique, stable weights and minimizing the magnitude of inconsistency (Fueloep et al., 2015), thus directly mitigating score-comparison inconsistencies.
3. Inconsistency in Ordinal and Incomplete Comparisons
Ordinal pairwise comparisons further highlight the complexity of score-comparison inconsistency, especially when ties are permitted (Kułakowski, 2017). The combinatorial structure of possible triads broadens, and the generalization of the Kendall-Babington Smith index reveals that maximal inconsistency is achieved in specific graph constructions (double tournaments), linking the construction of worst-case maximally inconsistent arrangements to the set cover problem, which is NP-complete.
For incomplete PC matrices, direct computation of global inconsistency indices is undermined by missing data. Adaptations extend local inconsistency to cycles of arbitrary length in the underlying graph (Kułakowski et al., 2019), with matrix-based average cycle-based indices exhibiting superior robustness compared to ranking-based alternatives. Attempts to generalize global acceptability thresholds, such as Saaty’s 10% consistency-ratio rule (CR < 0.1), require recalibration of the random index as a function of both matrix size and the number of missing entries (Ágoston et al., 2021). Without such adjustment, naive interpretations of score-based rankings in incomplete contexts are misleading.
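As an illustration of the cycle-based idea on incomplete data (a simplified sketch restricted to triads, not the exact indices from the cited papers), local inconsistency can be averaged over all triads whose comparisons are all observed:

```python
import itertools
import math

def average_triad_inconsistency(A):
    """Average Koczkodaj-style triad inconsistency over the triads of an
    incomplete PC matrix, with missing entries marked as None. Illustrative only."""
    n = len(A)
    values = []
    for i, j, k in itertools.combinations(range(n), 3):
        x, y, z = A[i][j], A[i][k], A[j][k]
        if None in (x, y, z):
            continue  # skip triads touching a missing comparison
        values.append(min(abs(1 - y / (x * z)), abs(1 - (x * z) / y)))
    return sum(values) / len(values) if values else math.nan

# Incomplete 4x4 matrix with one missing comparison pair (illustrative values).
A = [[1.0,  2.0,  None, 4.0],
     [0.5,  1.0,  3.0,  2.0],
     [None, 1/3,  1.0,  1.0],
     [0.25, 0.5,  1.0,  1.0]]
print(average_triad_inconsistency(A))
```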
4. The Role of Scoring Functions: Order-Sensitivity and Equivariance
The mathematical structure of scoring functions fundamentally dictates when score-comparison inconsistency can be avoided. In forecast evaluation, strict consistency alone is insufficient; a scoring rule must be order-sensitive (smaller errors always lead to strictly better scores) and equivariant (invariant to affine transformations matching the behavior of the elicited functional) (Fissler et al., 2017):
- Order-Sensitivity: For a norm ||·|| and true functional value t, if ||x1 − t|| ≤ ||x2 − t||, then the expected score of forecast x1 is at most that of x2 (lower scores being better).
- Equivariance: For translation-equivariant functionals such as the mean or quantiles, the scoring function should satisfy S(x + c, y + c) = S(x, y) for every shift c, so that rankings do not depend on the origin of the measurement scale.
Without these properties, scoring functions may reward forecasts that are further from the true functional, or generate rankings sensitive to the units of measurement—both pathways to inconsistency.
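A quick Monte Carlo check of order-sensitivity, assuming squared error as the scoring function for the mean functional; the distribution and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=3.0, size=200_000)  # outcomes from an illustrative distribution
t = y.mean()                                       # the elicited functional: the mean

def expected_score(x, outcomes):
    """Monte Carlo estimate of the expected squared-error score for forecast x."""
    return np.mean((x - outcomes) ** 2)

x_near, x_far = t + 0.5, t + 2.0   # |x_near - t| < |x_far - t|
# Order-sensitivity: the forecast closer to t receives the strictly better expected score.
print(expected_score(x_near, y) < expected_score(x_far, y))  # True
```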
5. Inconsistency in Automated Model and LLM Evaluation
In present-day model benchmarking, especially with LLMs and complex generation tasks, score-comparison inconsistency arises both from ill-posed extraction and from aggregation:
- LLM-as-a-Judge Paradigms: TrustJudge (Wang et al., 25 Sep 2025) exposes that distributional compression in discrete ratings and tie ambiguities produce significant inconsistencies, e.g., situations where a lower-rated response can outperform a higher-rated one in direct comparison. TrustJudge’s distribution-sensitive scoring replaces modal aggregation with a continuous expectation over fine-grained scores, S = Σ_i P(s_i) · s_i, where P(s_i) is the judge’s probability mass on fine-grained score level s_i, thereby reflecting the entropy of the judge’s belief distribution and preserving discriminative information lost in single-score mappings. Pairwise aggregation via bidirectional probabilities or perplexity-based selection further suppresses cycle transitivity violations (see the sketch after this list).
- Multiple-Choice QA and Extraction Ambiguity: In MCQA for LLMs (Molfese et al., 19 Mar 2025), evaluation scores are highly sensitive to answer extraction protocols—RegEx, logprobs, and even LLM-based extractors may fail with free-form or chain-of-thought outputs, leading to systematic misalignments relative to human judgment. There is a fundamental trade-off: prompt constraints that aid extraction erode model reasoning flexibility, whereas unconstrained outputs exacerbate extractor failures, heightening score-comparison inconsistency at scale.
- Global Versus Pairwise Scoring in Leaderboards: In NLP leaderboard settings, global scores often mask rare but significant errors, and pairwise models such as Bradley-Terry (Levtsov et al., 2 Jul 2025) may surface “hidden contenders,” but can themselves become unstable under tie-heavy or highly similar outputs. Qualitative differences between global and pairwise scores are nontrivial: pairwise ordering can diverge sharply from global rankings under edge-case distributions of decision values or under manipulation of confidence.
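A minimal sketch of the expectation-based scoring idea from the first item above; the probability values are invented for illustration, and this is not the cited system's implementation.

```python
# Judge's probability mass over discrete score levels 1..5 for two responses.
# Values are invented for illustration.
levels = [1, 2, 3, 4, 5]
p_a = [0.00, 0.05, 0.50, 0.40, 0.05]   # response A: mode at 3, mass leaning toward 4
p_b = [0.00, 0.10, 0.45, 0.30, 0.15]   # response B: mode at 3 as well

def modal_score(probs):
    """Discrete (mode) rating: collapses the distribution to its argmax."""
    return levels[max(range(len(levels)), key=lambda i: probs[i])]

def expected_score(probs):
    """Distribution-sensitive rating: expectation over fine-grained scores."""
    return sum(s * p for s, p in zip(levels, probs))

print(modal_score(p_a), modal_score(p_b))        # 3 3   -> tie under modal scoring
print(expected_score(p_a), expected_score(p_b))  # 3.45 3.50 -> the tie is broken
```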
6. Best Practices, Quantitative Tests, and Practical Impact
Across domains, rigorous approaches have emerged to mitigate or at least identify score-comparison inconsistency:
- Axiomatic Indices: Indices satisfying boundedness, monotonicity, permutation invariance, scale invariance, and homogeneous treatment—such as Koczkodaj’s index, RIC, and others—provide interpretable, robust quantitative frameworks (Koczkodaj et al., 2015, Mazurek, 2017).
- Numerical Consistency Testing: In binary classification score reporting, direct feasibility checks of metric tuples against underlying confusion matrix constraints (via deterministic inversion, interval arithmetic, and linear programming (Fazekas et al., 2023)) allow impossible score configurations to be detected, averting the inclusion of numerically inconsistent results in benchmarks, meta-analyses, or medical research (see the sketch after this list).
- Adjusted and Normalized Metrics: For entity alignment and link prediction, mean rank and Hits@k scores are rendered cross-dataset comparable only through normalization that accounts for candidate set size, e.g., dividing the observed mean rank MR by its expectation under random scoring, roughly (N + 1)/2 for N candidates, to obtain an adjusted mean rank; direct MR values are otherwise uninformative across varying contexts (Berrendorf et al., 2020).
- Aggregation over Score Distributions: In stochastic model training, only methods that compare distributions of scores (e.g., via mean difference significance tests or probability of outperforming frameworks) yield statistically reliable superiority conclusions (Reimers et al., 2018). Single-score or “best-run” comparisons exhibit high type-I errors due to random and selection-induced variance.
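An illustrative feasibility check in the spirit of the numerical consistency testing item above: a simplified sketch rather than the cited toolkit, with deliberately crude tolerance handling. Given the positive/negative class sizes, reported accuracy, sensitivity, and specificity must be jointly reproducible from some integer confusion matrix.

```python
def scores_feasible(p, n, acc, sens, spec, eps=5e-4):
    """Check whether reported (accuracy, sensitivity, specificity), with
    tolerance eps, are jointly achievable for p positives and n negatives
    by inverting them to an integer confusion matrix. Illustrative only."""
    tp = round(sens * p)          # sensitivity = tp / p
    tn = round(spec * n)          # specificity = tn / n
    if abs(tp / p - sens) > eps or abs(tn / n - spec) > eps:
        return False              # no integer tp/tn reproduces the reported values
    return abs((tp + tn) / (p + n) - acc) <= eps  # accuracy must match as well

# A consistent metric triple versus an impossible one (same class sizes).
print(scores_feasible(100, 200, acc=0.8000, sens=0.7000, spec=0.8500))  # True
print(scores_feasible(100, 200, acc=0.9000, sens=0.7000, spec=0.8500))  # False
```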
7. Implications and Future Directions
Score-comparison inconsistency is a ubiquitous threat to the interpretability, reliability, and fairness of data-driven decision-making, ranking, and model evaluation. Its root causes span mathematical, algorithmic, and structural elements of evaluation design: scale choice, aggregation, information compression, tie-handling, and the geometry of the metric space underlying the scoring system. Best practices universally stress the use of bounded, monotonic, scale-invariant indices, transparent normalization and aggregation, deterministic consistency checks, and careful alignment of scoring function properties with the theoretical and practical needs of the domain.
Emergent approaches, such as distribution-sensitive and likelihood-aware aggregation in automated evaluation, and continued theoretical analysis of ordinal, incomplete, and probabilistic scoring systems, are vital for maintaining trustworthy assessment pipelines in increasingly complex and black-box settings.