Falsifiability Score

Updated 26 January 2026

A falsifiability score is a real value, often normalized, that measures a hypothesis's vulnerability to empirical refutation, combining logical and quantitative criteria.
Methodologies such as optimal transport, Bayesian evidence, and risk-based tests provide diverse tools to compute and compare these scores.
Applications span economics, machine learning, and philosophy, offering actionable insights into model adequacy, risk assessment, and testability.

A falsifiability score is a quantitative measure assigning a real value, often normalized to [0,1], that reflects the susceptibility of a hypothesis, model, or theory to empirical refutation. Modern treatments of falsifiability, originating from Popper's demarcation criterion, seek continuous indices that are operationally tractable and compatible with statistical, information-theoretic, combinatorial, and learning-theoretic frameworks. Multiple formalisms have been proposed across economics, statistics, machine learning, and philosophy of science, incorporating both short-run and long-run notions of severity, model risk, and testability.

1. Theoretical Foundations and Motivation

Falsifiability, in the Popperian sense, refers to the vulnerability of a theory to being empirically contradicted. Classical treatments considered falsifiability a binary, logical property. However, real scientific inference requires grades of falsifiability, especially in the context of statistical models, incomplete identification, and noisy or limited data. The move toward falsifiability scores aims to:

Quantify the "riskiness" or logical boldness of statements (i.e., informativeness, content).
Provide operational measures to compare models or hypotheses, including under model misspecification.
Enable statistical or combinatorial analysis of the testability of complex, probabilistic, or high-dimensional hypotheses (Nemenman, 2015; Vignero et al., 2021; René et al., 2024; Dale, 2022).

2. Key Formalisms of the Falsifiability Score

Distinct but interrelated approaches have emerged, including:

A. Optimal Transport-Based Score for Economic Models

For incompletely specified structural models, the falsifiability score S(P,ν) is defined as:

$S(P,ν) = \inf_{\pi\in\mathcal{M}(P,ν)} \int 1_{y\notin G(u)}\,d\pi(y,u) = \sup_{A\subset{\mathcal{Y}}} \big[ P(A) - ν(G^{-1}(A)) \big]$

$P$: empirical law of the observables $Y$; $ν$: prescribed or partially identified latent distribution; $G$: structural correspondence mapping each latent value $u$ to the set $G(u)$ of outcomes the model allows; $\mathcal{M}(P,ν)$: couplings with marginals $P$ and $ν$; $G^{-1}(A) = \{u : G(u)\cap A \neq \emptyset\}$.
$S(P,ν)=0$ if and only if the model is not falsified; $S(P,ν)>0$ quantifies the minimal fraction of $P$-mass left unexplained by any admissible joint distribution respecting $G$ (Ekeland et al., 2021).
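For finite outcome and latent spaces, the dual form of the score can be evaluated by brute force over subsets $A$. The following is a minimal sketch under that assumption; the function name `transport_falsifiability` and the toy model are illustrative and not taken from Ekeland et al., and the enumeration is exponential in the number of outcomes, so it is only suitable for very small examples.

```python
from itertools import combinations

def transport_falsifiability(P, nu, G):
    """Return sup_A [P(A) - nu(G^{-1}(A))] over subsets A of the outcome set.

    P  : dict outcome -> probability (law of the observables)
    nu : dict latent value -> probability (latent law)
    G  : dict latent value -> set of outcomes the model allows for it
    """
    outcomes = list(P)
    best = 0.0  # A = empty set contributes 0, so the score is never negative
    for r in range(1, len(outcomes) + 1):
        for subset in combinations(outcomes, r):
            A = set(subset)
            # G^{-1}(A): latent values that can produce at least one outcome in A
            preimage_mass = sum(q for u, q in nu.items() if G[u] & A)
            best = max(best, sum(P[y] for y in A) - preimage_mass)
    return best

# Hypothetical toy model: latent "a" forces outcome 0, latent "b" allows 0 or 1.
P = {0: 0.3, 1: 0.7}
nu = {"a": 0.5, "b": 0.5}
G = {"a": {0}, "b": {0, 1}}
score = transport_falsifiability(P, nu, G)
print(round(score, 6))
```

In this toy case only latent value "b" can explain outcome 1, and it carries mass 0.5 while outcome 1 has probability 0.7, so a mass of about 0.2 cannot be explained by any coupling.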

B. Model-Credibility Index as Falsifiability Score

Lindsay–Liu’s model-credibility index, $N^*$, operationalizes falsifiability as the smallest sample size at which a level-$\alpha$ test rejects the postulated model $\mathcal{M}$, when the data are drawn from the true distribution $\tau$, with probability at least 50%. Writing $\beta_\tau(m)$ for the power of that test at sample size $m$:

$N^*(\tau,\mathcal{M}) = \min\{ m \mid \beta_\tau(m) \geq 0.5 \}$

$N^*$ is inversely related to falsifiability: a lower $N^*$ means higher falsifiability (the model is easily refuted), and $N^*=\infty$ if no finite sample can reveal a discrepancy (Lindsay et al., 2010).
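To make the definition concrete, the sketch below computes $N^*$ in a deliberately simple setting that is an assumption of this example, not the general Lindsay–Liu construction: the postulated model says the mean of a Gaussian with known standard deviation `sigma` is zero, the true mean is `delta`, and the test is a two-sided z-test, whose power has a closed form.

```python
import math
from statistics import NormalDist

def power_z_test(m, delta, sigma=1.0, alpha=0.05):
    """Power of the two-sided level-alpha z-test of mean 0 when the true mean is delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta * math.sqrt(m) / sigma
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

def credibility_index(delta, sigma=1.0, alpha=0.05, max_m=10**6):
    """Smallest m with power >= 0.5; math.inf if none is found up to max_m."""
    if delta == 0:
        return math.inf  # the model is true, so no sample size reveals a discrepancy
    for m in range(1, max_m + 1):
        if power_z_test(m, delta, sigma, alpha) >= 0.5:
            return m
    return math.inf

n_star_small_gap = credibility_index(delta=0.1)
n_star_large_gap = credibility_index(delta=0.5)
print(n_star_small_gap, n_star_large_gap)
```

A larger gap between model and truth yields a smaller index, which matches the reading that easily refuted models have low $N^*$.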

C. Bayesian Evidence as Falsifiability Score

Within Bayesian model selection, the log marginal likelihood (model evidence) embodies a graded falsifiability score:

$\text{Falsifiability}(M;D) = \log P(D|M)$

where $P(D|M)$ is the likelihood integrated over the model parameters. Highly falsifiable models assign vanishing probability to most possible datasets, so they earn high evidence on the data they do predict and very low evidence otherwise; flexible models that spread probability over many datasets are penalized by the built-in Occam factor (Nemenman, 2015).
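A coin-flip comparison illustrates the Occam effect. The sketch below assumes two hypothetical models for a specific sequence of flips: a sharp model with the heads probability fixed at 0.5, and a flexible model with a uniform prior on the heads probability, whose evidence is the Beta integral $\int_0^1 \theta^h (1-\theta)^t \, d\theta = B(h+1, t+1)$.

```python
import math

def log_evidence_fair(n_heads, n_tails):
    """log P(D | M) for the sharp model: heads probability fixed at 0.5."""
    return (n_heads + n_tails) * math.log(0.5)

def log_evidence_uniform(n_heads, n_tails):
    """log P(D | M) for the flexible model: uniform prior on the heads probability.

    The integrated likelihood of a specific sequence is B(h + 1, t + 1).
    """
    return (math.lgamma(n_heads + 1) + math.lgamma(n_tails + 1)
            - math.lgamma(n_heads + n_tails + 2))

# Balanced data favour the sharp model; lopsided data favour the flexible one.
balanced = (log_evidence_fair(5, 5), log_evidence_uniform(5, 5))
lopsided = (log_evidence_fair(10, 0), log_evidence_uniform(10, 0))
print(balanced, lopsided)
```

The sharp model risks more: it wins when the data look fair and loses badly when they do not, which is the graded sense of falsifiability the log-evidence captures.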

D. Nonparametric Risk-Based and Reproducibility Scores

Recent work ties falsification to the distribution of empirical risk, with explicit uncertainty quantification. The falsifiability score of a model $m$ is defined as:

$S(m) = \sup_{n\ne m} P(R(n) < R(m))$

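where $R(m)$ is the risk of model $m$, treated as a random variable whose distribution over hypothetical datasets is estimated nonparametrically, and the supremum ranges over the competing models $n$. A high $S(m)$ implies that $m$ is reliably outperformed (i.e., is empirically falsified) (René et al., 2024).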
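Given draws from each model's risk distribution, the score can be approximated by pairwise comparison. The sketch below assumes the draws are already available as plain lists and treats draws from different models as independent; it does not reproduce the nonparametric estimation of the risk distributions themselves, and the function names are illustrative.

```python
def prob_lower_risk(risk_n, risk_m):
    """Estimate P(R(n) < R(m)) from all pairs of draws, assuming independence."""
    wins = sum(1 for rn in risk_n for rm in risk_m if rn < rm)
    return wins / (len(risk_n) * len(risk_m))

def risk_falsifiability(risks, m):
    """S(m): the largest probability with which any competitor has lower risk than m."""
    return max(prob_lower_risk(risks[n], risks[m]) for n in risks if n != m)

# Hypothetical risk draws for three models (lower risk is better).
risks = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 3.0, 4.0],
    "c": [10.0, 11.0, 12.0],
}
scores = {name: risk_falsifiability(risks, name) for name in risks}
print(scores)
```

Model "c" is beaten by a competitor in every pair of draws and receives a score of 1, while model "a" is rarely beaten and receives a low score.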

E. Kullback-Leibler Severity-Based Score

The expected disconfirmation of an experiment $E$ on a probabilistic theory $\tau$ is:

$\langle d_E(\tau)\rangle = D_{KL}(P_E \parallel P_E(\cdot \mid \tau))$

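Taking the supremum over admissible experiments yields a falsifiability score $F(\tau) = \sup_E D_{KL}(P_E \parallel P_E(\cdot \mid \tau))$. An infinite score corresponds to a theory that some test can refute outright, while positive finite values indicate a graded potential for refutation (Vignero et al., 2021).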
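For experiments with finitely many outcomes, the divergence and its supremum over a finite list of experiments can be computed directly. This is a minimal sketch under those assumptions; each experiment is represented as a hypothetical pair of probability vectors standing in for $P_E$ and $P_E(\cdot \mid \tau)$.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length sequences."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with zero mass under p contribute nothing
        if qi == 0:
            return math.inf  # p allows an outcome that q forbids
        total += pi * math.log(pi / qi)
    return total

def severity_score(experiments):
    """Largest expected disconfirmation over a finite list of (p, q) pairs."""
    return max(kl_divergence(p, q) for p, q in experiments)

# Hypothetical experiments: the theory agrees with, mildly departs from,
# or forbids an outcome that the reference distribution allows.
agree = ([0.5, 0.5], [0.5, 0.5])
mild = ([0.5, 0.5], [0.9, 0.1])
forbidding = ([0.5, 0.5], [1.0, 0.0])
print(severity_score([agree, mild]), severity_score([agree, mild, forbidding]))
```

The infinite value arises exactly when the theory assigns zero probability to an outcome that can occur, which is the outright-refutation case.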

F. Unified Topological-Combinatorial Score

A combined score $F(H)$ for a hypothesis class $H \subset 2^X$:

$F(H) = L(H) \cdot g(VC(H))$

$L(H)=1$ if $H$ is nowhere dense (always falsifiable) and $0$ otherwise.
$g$ is a decaying function of the VC dimension, e.g., $g(d) = e^{-d}$, which gives a finer grading based on "surprise"/shattering properties (Dale, 2022).
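The combinatorial factor can be illustrated on a finite domain, where the VC dimension is computable by exhaustive shattering checks. The sketch below makes two assumptions: the long-run indicator $L(H)$ is supplied by hand, since nowhere denseness is a topological property that this finite example does not compute, and $g(d) = e^{-d}$ is used as the decay function. The helper names are illustrative.

```python
import math
from itertools import combinations

def vc_dimension(domain, hypotheses):
    """Brute-force VC dimension of a finite class of subsets of a finite domain."""
    hyps = [frozenset(h) for h in hypotheses]
    d = 0
    for k in range(1, len(domain) + 1):
        # A k-point set is shattered if the class realises all 2**k labelings on it.
        shattered = any(
            len({h & frozenset(points) for h in hyps}) == 2 ** k
            for points in combinations(domain, k)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so larger k fails too
        d = k
    return d

def unified_score(long_run_flag, vc_dim):
    """F(H) = L(H) * g(VC(H)) with the assumed decay g(d) = exp(-d)."""
    return long_run_flag * math.exp(-vc_dim)

domain = list(range(5))
thresholds = [{x for x in domain if x >= t} for t in range(6)]
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
vc_thresholds = vc_dimension(domain, thresholds)
vc_intervals = vc_dimension(domain, intervals)
print(vc_thresholds, vc_intervals, unified_score(1, vc_thresholds))
```

Thresholds shatter fewer points than intervals, so with the same long-run flag they receive the higher score.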

3. Practical Computation and Statistical Testing

Computation strategies are tailored to the formalism and data regime:

Optimal transport scores are computed via linear programming or convex duality (network simplex, Hungarian algorithm) in discrete cases; in continuous domains, empirical processes and sample-splitting with bootstrapped quantile estimation are employed (Ekeland et al., 2021).
Model-credibility indices are estimated by fixed-level hypothesis testing and subsampling or bootstrap procedures to approximate power at various sample sizes (Lindsay et al., 2010).
Nonparametric risk-based scores employ empirical quantile functions, hierarchical-beta processes for uncertainty distributions, and pairwise cross-model risk comparisons. Empirical reproducibility criteria mirror experimental practice, demanding that falsification results persist under data perturbations (René et al., 2024).
Combinatorial scores require calculation of VC dimension (via shattering analysis) and topological analysis (nowhere denseness) for the hypothesis class, leveraging Sauer–Shelah bounds for finite-sample properties (Dale, 2022).
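The simulation route to the model-credibility index mentioned above can be sketched as follows. This is an illustrative Monte Carlo version, not the subsampling or bootstrap procedure of Lindsay et al.: it assumes the true process can be sampled directly, and the names `estimated_power`, `z_test_rejects`, and `truth` are hypothetical.

```python
import math
import random
from statistics import NormalDist

def estimated_power(m, sample_truth, reject, n_rep=200, seed=0):
    """Fraction of simulated datasets of size m on which the test rejects the model."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_rep)
        if reject([sample_truth(rng) for _ in range(m)])
    )
    return hits / n_rep

def z_test_rejects(data, sigma=1.0, alpha=0.05):
    """Two-sided level-alpha z-test of the postulated model 'mean = 0'."""
    z = (sum(data) / len(data)) * math.sqrt(len(data)) / sigma
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def truth(rng):
    """Assumed true process: Gaussian with mean 0.5 and unit variance."""
    return rng.gauss(0.5, 1.0)

# Scan sample sizes until the estimated power first reaches 0.5.
n_star_hat = next(
    m for m in range(1, 200)
    if estimated_power(m, truth, z_test_rejects) >= 0.5
)
print(n_star_hat)
```

Because the power is estimated from a finite number of replications, the resulting index is itself noisy; increasing `n_rep` tightens it at higher computational cost.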

4. Interpretation, Examples, and Properties

Falsifiability scores serve as nuanced metrics bridging logical, statistical, and experimental dimensions. Key interpretive summaries include:

Fraction-of-mass explanation: In optimal transport, $S(P,ν)$ gives the minimal $P$-mass that cannot be reproduced by any joint distribution respecting the structural restrictions (Ekeland et al., 2021).
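Sample size to detection: $N^*$ is the smallest sample size $m$ at which the discrepancy between the model and the true process is detected with probability at least 0.5; below $N^*$ the model remains hard to distinguish from the truth, so a lower $N^*$ implies greater falsifiability (Lindsay et al., 2010).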
Log-evidence and Occam's Razor: Bayesian falsifiability is inherently tied to the extent a model allocates measure to possible data, penalizing excessive flexibility and rewarding sharp predictions (Nemenman, 2015).
Risk-based screening: High $S(m)$ indicates empirical rejection in favor of some competitor across experimental contexts; values near unity signal effective falsification (René et al., 2024).
Kullback-Leibler divergence: Infinite expected disconfirmation (severity) equates to outright test-based falsification; positive finite values reflect graded potential for refutation (Vignero et al., 2021).
Unification of short- and long-run testability: The score $F(H)$ synthesizes nowhere denseness (long-run) and shatterability/VC-dimension (short-run) into a single index of testability (Dale, 2022).

Illustrative examples include model selection via marginal likelihoods in coin-flip problems (Bayesian score), combinatorial analysis of halfspaces or thresholds (topological-combinatorial), and optimal transport violations in economic datasets.

5. Connections to Model Adequacy, Riskiness, and Verisimilitude

Falsifiability scores provide a fundamental quantitative tool for model comparison, assessment, and selection, grounding Popperian "riskiness" and informativeness in measurable quantities. Informativeness typically correlates with higher falsifiability scores, as more informative models exclude more potential outcomes or data. There is a direct link between high falsifiability and high expected truth content: within a true model class, those models that are more falsifiable (hence riskier) are also more truthlike, ensuring rapid elimination of false alternatives and convergence toward the true process (Vignero et al., 2021; Dale, 2022).

6. Limitations, Parameters, and Operational Considerations

The interpretation of scores depends on the admissible experiments or data-generating variations (class $\mathcal{E}$ ).
Bayesian falsifiability and evidence are sensitive to prior specification and parameterization.
Empirical falsifiability can be overconfident if statistical noise or model misspecification is not addressed.
Topological and combinatorial scores require careful definition of hypothesis classes and their measurable structure.
In some frameworks, normalized or bounded versions of the scores are preferred for comparability (scaling to $[0,1]$).

Practical implementations are available in standard statistical and computational languages, and growing toolkits (e.g., Python packages) target nonparametric versions for fitted models (René et al., 2024).

7. Unified Table of Falsifiability Score Formalisms

Formalism	Definition / Formula	Reference
Optimal transport	$S(P,ν)=\inf_{\pi\in\mathcal{M}(P,ν)}\int 1_{y\notin G(u)}d\pi$	(Ekeland et al., 2021)
Model-credibility index	$N^*(\tau,\mathcal{M})=\min\{m:\beta_\tau(m)\geq0.5\}$	(Lindsay et al., 2010)
Bayesian log-evidence	$\log P(D \mid M)$ or $\log B_{M,U}(D)$	(Nemenman, 2015)
Nonparametric risk score	$S(m)=\sup_{n\ne m}P(R(n)<R(m))$	(René et al., 2024)
KL severity score	$F(\tau)=\sup_E D_{KL}(P_E \parallel P_E(\cdot \mid \tau))$	(Vignero et al., 2021)
Topo-VC unified score	$F(H)=L(H)g(VC(H))$ ( $L$ =long-run, $g$ =decay fn)	(Dale, 2022)