
Cross-Model Disagreement Measurement

Updated 15 October 2025
  • The paper demonstrates that pairwise model disagreement, quantified as the fraction of differing predictions, serves as a practical proxy for hidden model error.
  • It leverages divergence and agreement metrics such as Hellinger distance, Jensen-Shannon divergence, and Feature Agreement to relate calibration, generalization, and uncertainty.
  • Empirical results show a linear relation between disagreement and test error, enabling label-free performance estimation under distribution shifts.

Cross-model disagreement measurement refers to the formal quantification and analysis of prediction divergence among independently trained models, whether that divergence arises from stochastic optimization (e.g., SGD), distinct architectures, varied initializations, or the application of different interpretability techniques. Across diverse lines of recent research, measuring such disagreement has emerged both as a practical tool for estimating hidden model error, especially under distribution shift or data scarcity, and as a conceptual bridge connecting generalization, calibration, uncertainty quantification, and interpretability.

1. Mathematical Definition and Foundational Principles

Disagreement is most commonly defined for a pair of independently sampled models $h, h'$ on a data distribution $D$ as the expected fraction of points with differing predictions:

$$\mathrm{Dis}_D(h, h') = \mathbb{E}_{X \sim D}\left[\mathbf{1}\{h(X) \neq h'(X)\}\right].$$

For classification with probabilistic outputs $f, f'$, distributional disagreement can further be quantified using divergence scores:

  • Hellinger distance: $D^{HD}(f, f') = \frac{1}{\sqrt{2}} \sqrt{\sum_k (\sqrt{f_k} - \sqrt{f'_k})^2}$
  • Jensen-Shannon divergence: $D^{JSD}(f, f') = \frac{1}{2}\left[\mathrm{KL}(f \parallel m) + \mathrm{KL}(f' \parallel m)\right]$, with $m = \frac{1}{2}(f + f')$
  • Symmetrized KL: $D^{KLD}(f, f') = \frac{1}{2}\left[\mathrm{KL}(f \parallel f') + \mathrm{KL}(f' \parallel f)\right]$

For explanation methods, disagreement can be formalized via metrics including Feature Agreement, Rank Agreement, Sign Agreement, and Signed Rank Agreement, e.g.:

$$\mathrm{FeatureAgreement}(E_a, E_b, k) = \frac{|\mathrm{top}_k(E_a) \cap \mathrm{top}_k(E_b)|}{k}.$$
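
The following NumPy sketch illustrates these quantities; it is not taken from any of the cited papers' released code, and the function names are illustrative. It computes the pairwise disagreement rate for hard predictions, the Hellinger, Jensen-Shannon, and symmetrized KL divergences for probabilistic outputs, and top-$k$ Feature Agreement for attribution vectors.

```python
import numpy as np

def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two models' hard predictions differ."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

def hellinger(f, g):
    """Hellinger distance between two categorical distributions."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return float(np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2)) / np.sqrt(2))

def kl(f, g, eps=1e-12):
    """KL divergence KL(f || g), with clipping for numerical stability."""
    f, g = np.clip(f, eps, 1.0), np.clip(g, eps, 1.0)
    return float(np.sum(f * np.log(f / g)))

def jensen_shannon(f, g):
    """Jensen-Shannon divergence: average KL to the mixture m = (f + g) / 2."""
    m = 0.5 * (np.asarray(f, float) + np.asarray(g, float))
    return 0.5 * (kl(f, m) + kl(g, m))

def symmetrized_kl(f, g):
    """Symmetrized KL divergence: (KL(f||g) + KL(g||f)) / 2."""
    return 0.5 * (kl(f, g) + kl(g, f))

def feature_agreement(expl_a, expl_b, k):
    """Fraction of shared features among the top-k of two attribution vectors."""
    top_a = set(np.argsort(-np.abs(expl_a))[:k])
    top_b = set(np.argsort(-np.abs(expl_b))[:k])
    return len(top_a & top_b) / k

# Toy usage with two softmax outputs and two attribution vectors.
f = np.array([0.7, 0.2, 0.1])
g = np.array([0.5, 0.3, 0.2])
print(hellinger(f, g), jensen_shannon(f, g), symmetrized_kl(f, g))
print(feature_agreement(np.array([0.9, 0.1, -0.4, 0.0]),
                        np.array([0.8, -0.3, 0.2, 0.1]), k=2))
```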

Underlying these approaches is the principle that the diversity in predictions—caused by training stochasticity, model architecture, or post-hoc explanation methods—can be harnessed as a tractable, predictive, and often unsupervised signal of model uncertainty and generalization error (Jiang et al., 2021, Schirmer et al., 2023).

"Assessing Generalization of SGD via Disagreement" (Jiang et al., 2021) establishes the Generalization Disagreement Equality (GDE): for well-calibrated ensembles of SGD-trained neural nets, the expected test error equals the expected pairwise disagreement on unlabeled data,

$$\mathbb{E}_{h, h' \sim H}[\mathrm{Dis}_D(h, h')] = \mathbb{E}_{h \sim H}[\mathrm{TestErr}_D(h)].$$

In binary classification with ensemble confidence $q$, both disagreement and expected error are $2q(1-q)$ per instance. Extensive experiments confirm this correspondence across architectures, datasets, and stochastic training regimes. The theory shows calibration is the critical assumption: if the ensemble's confidence matches the true conditional probability (i.e., is well-calibrated), the empirical disagreement rate robustly tracks the generalization error even without test labels.
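
As a concrete illustration of how the GDE is used in practice, the sketch below estimates test error purely from ensemble predictions on unlabeled data and numerically checks the $2q(1-q)$ identity in the binary case. It is a toy construction with synthetic "models", not the experimental pipeline of Jiang et al. (2021).

```python
import numpy as np

def pairwise_disagreement(predictions):
    """Mean pairwise disagreement across an ensemble.

    predictions: array of shape (n_models, n_examples) with hard class labels.
    """
    preds = np.asarray(predictions)
    n_models = preds.shape[0]
    rates = [np.mean(preds[i] != preds[j])
             for i in range(n_models) for j in range(i + 1, n_models)]
    return float(np.mean(rates))

def gde_error_estimate(predictions):
    """Under the GDE, the expected test error of a well-calibrated ensemble
    equals its expected pairwise disagreement, so the disagreement rate on
    unlabeled data is itself the (label-free) error estimate."""
    return pairwise_disagreement(predictions)

# Binary sanity check: with ensemble confidence q per example, both the
# expected disagreement and the expected error are 2q(1-q).
rng = np.random.default_rng(0)
q = 0.8                                              # ensemble confidence in class 1
preds = (rng.random((10, 100_000)) < q).astype(int)  # independent synthetic "models"
print(pairwise_disagreement(preds), 2 * q * (1 - q))  # both close to 0.32
```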

The "disagreement-on-the-line" phenomenon (Lee et al., 2023) reveals that in high-dimensional random features regression under domain shift, target-domain disagreement and in-distribution disagreement (or error) lie on a tight linear relation: measuring source-domain disagreement enables data-free forecasts of target error through explicit formulas involving source and target domain statistics.

3. Applications in Uncertainty Estimation, OOD, and Explanation Consistency

Disagreement measurement is now standard in several unsupervised performance estimation schemes:

  • Forecasting under Distribution Shift: By aligning in-distribution and out-of-distribution (OOD) disagreement via linear or divergence-based models, practitioners can estimate OOD accuracy without labels (Schirmer et al., 2023, Lee et al., 2023, Mishra et al., 17 Jun 2025).
  • Detection of Ambiguous or Challenging Instances: In multimodal systems, cases where unimodal and fused predictions disagree indicate underlying ambiguity, a signal validated against human annotator uncertainty (e.g., a drop in inter-annotator Cohen's kappa on those cases) (Srikanth et al., 20 May 2025).
  • Explainability and Model Interpretation: Quantitative frameworks for explanation disagreement (feature, rank, sign, and their aggregation) show pervasive inconsistency between post-hoc methods, signaling the necessity of multi-method interpretability pipelines (Krishna et al., 2022).

Disagreement between models can also drive active sample selection in semi-supervised learning and domain adaptation, as in classifier-disagreement-based self-training, which improves cross-domain adaptation and class-level alignment by focusing learning on samples where domain-specific predictors disagree (Sun et al., 2023).
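
A minimal sketch of disagreement-based sample selection is shown below, using two scikit-learn classifiers trained on different labeled subsets as stand-ins for the domain-specific predictors; the method of Sun et al. (2023) additionally handles pseudo-labeling and class-level alignment, which are omitted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_disagreement_samples(clf_a, clf_b, X_unlabeled):
    """Return indices of unlabeled samples where the two classifiers disagree.

    These are the candidates a disagreement-based self-training loop would
    prioritize for pseudo-labeling, confidence filtering, or annotation.
    """
    pred_a = clf_a.predict(X_unlabeled)
    pred_b = clf_b.predict(X_unlabeled)
    return np.flatnonzero(pred_a != pred_b)

# Toy setup: two classifiers trained on disjoint labeled subsets stand in for
# the domain-specific predictors described above.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
clf_a = LogisticRegression().fit(X[:150], y[:150])
clf_b = LogisticRegression().fit(X[150:300], y[150:300])
X_unlabeled = rng.normal(size=(200, 5))
print(select_disagreement_samples(clf_a, clf_b, X_unlabeled)[:10])
```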

4. Socio-Technical and Human-Centered Extensions

Disagreement is increasingly analyzed not only between models but also among human raters/annotators. The GRASP framework (Prabhakaran et al., 2023) quantifies in-group (CI) and cross-group (CX) cohesion among annotated ratings, combining these into a Group Association Index (GAI) to detect systematic demographic differences in subjective annotation. This paradigm is directly translatable to cross-model assessment: "models as annotators" approaches treat ensembled neural outputs as virtual human responses, with the spread (variance or standard deviation) of model scores capturing uncertainty that aligns with human annotator disagreement (Liu et al., 19 Nov 2024).
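
The sketch below mirrors the CI/CX/GAI structure using simple exact-match pairwise agreement as a stand-in cohesion measure; GRASP itself defines cohesion more carefully, so treat this only as an illustration of the ratio's logic, with models optionally substituted for human raters.

```python
import numpy as np
from itertools import combinations, product

def mean_pairwise_agreement(ratings_a, ratings_b=None):
    """Mean fraction of items on which two raters give the same rating.

    With one argument, averages over rater pairs within the group; with two,
    averages over pairs drawn across the two groups.
    """
    if ratings_b is None:
        pairs = combinations(range(len(ratings_a)), 2)
        return float(np.mean([np.mean(ratings_a[i] == ratings_a[j])
                              for i, j in pairs]))
    pairs = product(range(len(ratings_a)), range(len(ratings_b)))
    return float(np.mean([np.mean(ratings_a[i] == ratings_b[j])
                          for i, j in pairs]))

def group_association_index(ratings_group, ratings_rest):
    """GAI-style ratio: in-group cohesion divided by cross-group cohesion."""
    ci = mean_pairwise_agreement(ratings_group)
    cx = mean_pairwise_agreement(ratings_group, ratings_rest)
    return ci / cx

# Rows are raters (or models treated as "virtual annotators"), columns are items.
group = np.array([[1, 0, 1, 1], [1, 0, 1, 0], [1, 0, 1, 1]])
rest = np.array([[0, 1, 1, 0], [0, 1, 0, 0]])
print(group_association_index(group, rest))  # > 1 suggests systematic group divergence
```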

Instance-level calibration measures—entropy, ranking, TV distance (DistCE)—allow fine-grained alignment between model predictions and the full distribution of human judgments, overcoming the limitations of majority-vote-based calibration in inherently ambiguous settings (Baan et al., 2022).
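
A small sketch of these instance-level measures follows, under the convention that $f(x)$ is the model's predictive distribution and $\bar{\pi}(x)$ the empirical distribution of human judgments; function names are assumed, not taken from the cited paper's code.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution (natural log)."""
    p = np.clip(np.asarray(p, float), eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def ent_ce(model_probs, human_dist):
    """Entropy calibration error: H(f(x)) - H(pi_bar(x)).
    Positive values mean the model is less certain than the annotators;
    negative values mean it is more certain."""
    return entropy(model_probs) - entropy(human_dist)

def dist_ce(model_probs, human_dist):
    """Distributional calibration error: total variation distance between the
    model's predictive distribution and the human label distribution."""
    model_probs = np.asarray(model_probs, float)
    human_dist = np.asarray(human_dist, float)
    return 0.5 * float(np.sum(np.abs(model_probs - human_dist)))

# One ambiguous instance: 10 annotators split 6/3/1 over three labels.
human = np.array([6, 3, 1]) / 10.0
model = np.array([0.85, 0.10, 0.05])
print(ent_ce(model, human), dist_ce(model, human))
```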

5. Formal Models for Measuring and Comparing Disagreement

Several core formulas and statistical models systematize disagreement analysis:

| Metric/Formula | Definition | Context |
| --- | --- | --- |
| $\mathrm{Dis}_D(h, h')$ | $\mathbb{E}_D[\mathbf{1}\{h(X)\ne h'(X)\}]$ | Classification disagreement |
| $\mathrm{TestErr}_D(h)$ | $\mathbb{E}_D[\mathbf{1}\{h(X)\ne Y\}]$ | Prediction error |
| $2q(1-q)$ | Expected disagreement/error given ensemble confidence $q$ (binary case) | Calibration/generalization |
| $\mathrm{FeatureAgreement}(E_a, E_b, k)$ | $\lvert\mathrm{top}_k(E_a) \cap \mathrm{top}_k(E_b)\rvert / k$ | Explainability |
| $\mathrm{EntCE}(x)$ | $H(f(x)) - H(\bar{\pi}(x))$ | Instance-level calibration |
| $\mathrm{DistCE}(x)$ | $\mathrm{TVD}(f(x), \bar{\pi}(x)) = \frac{1}{2}\Vert f(x) - \bar{\pi}(x)\Vert_1$ | Instance-level calibration |
| $\mathrm{GAI}(\Pi)$ | $\mathrm{CI}(Y_Z[\Pi]) / \mathrm{CX}(Y_Z[\Pi], Y_Z[\neg\Pi])$ | Group-level disagreement |

These metrics enable rigorous quantification of disagreement both across models and between models and human annotators.

6. Structural and Systematic Nature of Disagreement

Recent empirical work (Ingram et al., 2 Jul 2025) demonstrates that cross-model disagreement in LLM relevance filtering exhibits non-random structure: disagreement cases can be classified with high AUC using simple lexical features, are associated with statistically significant term co-occurrences, and propagate to divergent ranked retrieval results even under identical scoring logic. This systematic divergence, rather than random noise, signals model-specific biases and structured variability in downstream applications.
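
A hedged sketch of this kind of analysis is given below: it asks whether a simple lexical classifier can predict, from the input text alone, the cases where two models disagree, with cross-validated AUC as the measure of structure. The data and model outputs here are hypothetical, and the feature set is far simpler than in the cited study.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def disagreement_predictability(texts, labels_model_a, labels_model_b):
    """Check whether cross-model disagreement is predictable from surface
    lexical features alone. A cross-validated AUC well above 0.5 indicates
    structured (non-random) disagreement."""
    disagree = (np.asarray(labels_model_a) != np.asarray(labels_model_b)).astype(int)
    features = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, disagree, cv=3, scoring="roc_auc").mean()

# Hypothetical relevance judgments from two LLMs over the same documents.
texts = ["deep learning for protein folding", "stock market weekly recap",
         "transformer attention mechanisms", "celebrity gossip roundup",
         "graph neural networks survey", "local election results"] * 10
model_a = [1, 0, 1, 0, 1, 0] * 10
model_b = [1, 0, 0, 0, 1, 1] * 10
print(disagreement_predictability(texts, model_a, model_b))
```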

Similarly, in meta-evaluation of LLM misgendering (Subramonian et al., 23 Apr 2025), generation-based and probability-based evaluation methods disagree on up to 20% of instances, with disagreement especially pronounced for rare pronouns and aligned with human annotator ambiguity. These discrepancies call into question the convergent validity of automatic metrics and recommend context-specific evaluation protocols.

7. Practical Implications, Limitations, and Future Directions

Disagreement measurement is a practical, label-free estimator of model error, calibration quality, and uncertainty. Its reliability depends on ensemble calibration, the stochasticity of training, the degree of domain divergence, and, for explanations, the choice of interpretability method. Limitations arise when the calibration assumption fails: uncalibrated ensembles can invalidate the equality between test error and disagreement. High-dimensional and multimodal settings further challenge conventional metrics, making divergence-based and variance-based methodologies preferable.

The probabilistic and spectral frameworks underlying disagreement analysis enable its extension to group dynamics, structured retrieval, and social annotation, with increasing use for active learning, robust model selection, and targeted intervention design. Future research is directed at unifying disagreement metrics for interpretable, multimodal, and human-in-the-loop systems; formally analyzing universal properties across architectures and domains; and integrating cross-model disagreement into risk-sensitive and fairness-aware machine learning pipelines.

Cross-model disagreement measurement thus serves both as a critical diagnostic of machine learning system reliability and as a foundation for robust, transparent model evaluation, interpretability, and deployment in complex environments.
