Conditional Hyvärinen Score Differences

Updated 13 November 2025
  • Conditional Hyvärinen score differences are statistics derived from proper scoring rules that compare the local sharpness of two conditional models without relying on normalizing constants.
  • They leverage gradients and Laplacians of log-conditional densities to enable robust analysis in scenarios where likelihoods are inaccessible, such as high-dimensional and unnormalized data.
  • These score differences underpin advanced methods like score-based CUSUM for quickest change detection and Bayesian model selection, providing an effective alternative to traditional likelihood-based approaches.

Conditional Hyvärinen score differences are a class of statistics derived from proper scoring rules that compare the local sharpness of two conditional (or transition) models for possibly high-dimensional or unnormalized data-generating processes. They generalize the paradigm of likelihood-based comparison to contexts where likelihoods are inaccessible, focusing on gradients and Laplacians of log-conditional densities, and form the analytic core of several recent methods for estimation, model comparison, and quickest change detection in both independent and time-series or Markovian settings. Their key property is invariance to unknown normalizing constants, enabling application to unnormalized statistical models and implicit generative mechanisms.

1. Definitions and Mathematical Structure

The conditional Hyvärinen score for a model with one-step predictive (or transition) density $p_\theta(y_t \mid y_{1:t-1})$ is defined as

$$S_H(p_\theta, y_t \mid y_{1:t-1}) = 2\, \Delta_{y_t} \log p_\theta(y_t \mid y_{1:t-1}) + \left\| \nabla_{y_t} \log p_\theta(y_t \mid y_{1:t-1}) \right\|^2,$$

where $\nabla_{y_t}$ and $\Delta_{y_t}$ denote the gradient and Laplacian with respect to the current observation $y_t$. For conditional models $p_1$ and $p_2$, the conditional Hyvärinen score difference at time $t$ is

$$\Delta S_H(t) = S_H(p_1, y_t \mid y_{1:t-1}) - S_H(p_2, y_t \mid y_{1:t-1}) = 2\, \Delta_{y_t}\left[h_1(y_t) - h_2(y_t)\right] + \left(\nabla_{y_t} h_1(y_t) + \nabla_{y_t} h_2(y_t)\right) \cdot \nabla_{y_t}\left[h_1(y_t) - h_2(y_t)\right],$$

where $h_j(y_t) = \log p_j(y_t \mid y_{1:t-1})$. For unnormalized models, where $p(y \mid x)$ is available only up to an intractable normalizing constant, the score remains computable because it depends solely on derivatives of the log-density (Wu et al., 2023; Chen et al., 6 Nov 2025; Shao et al., 2017).

In the Markovian context, one compares a candidate pre-change transition kernel $p(y \mid x)$ and post-change kernel $q(y \mid x)$ via the score difference

$$s(B_n) = S_H(y, x; p) - S_H(y, x; q), \qquad B_n = (x, y),$$

with the conditional Hyvärinen scores taken with respect to $y$ given $x$. These differences play a role analogous to log-likelihood ratio increments in classical sequential analysis, but apply to energy-based and other unnormalized models (Chen et al., 6 Nov 2025; Wu et al., 2023).
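
For concreteness, the following Python sketch (our illustration, not code from the cited papers) evaluates $S_H$ and the increment $s(B_n)$ for one-dimensional Gaussian AR(1) transition kernels, where the gradient and Laplacian of the log-density are available in closed form; the parameter names rho and sigma are illustrative.

```python
# Minimal sketch: conditional Hyvarinen score and score difference s(B_n)
# for 1-D Gaussian transition kernels p(y|x) = N(rho * x, sigma^2), where
# the gradient and Laplacian of log p are available in closed form.
import numpy as np

def hyvarinen_score_gaussian(y, x, rho, sigma):
    """S_H = 2 * Laplacian(log p) + ||grad(log p)||^2 for p(y|x) = N(rho*x, sigma^2)."""
    grad = -(y - rho * x) / sigma**2   # d/dy log p(y|x)
    lap = -1.0 / sigma**2              # d^2/dy^2 log p(y|x)
    return 2.0 * lap + grad**2

def score_difference(y, x, pre, post):
    """s(B_n) = S_H(y, x; p) - S_H(y, x; q) for pre- and post-change kernels."""
    return (hyvarinen_score_gaussian(y, x, *pre)
            - hyvarinen_score_gaussian(y, x, *post))

# Evaluate one increment for B_n = (x, y); (rho, sigma) pairs are illustrative.
print(score_difference(y=0.7, x=1.0, pre=(0.5, 1.0), post=(0.9, 1.0)))
```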

2. Statistical Interpretation and Information Divergence

The conditional Hyvärinen score is a strictly proper, scale-invariant scoring rule. Its expected value connects to the Fisher divergence (for marginal or joint models) or to the conditional Fisher divergence in the Markov setting:

$$D_F(p \,\|\, q \mid x) = \mathbb{E}_{y \sim p(\cdot \mid x)} \left\| \nabla_y \log p(y \mid x) - \nabla_y \log q(y \mid x) \right\|^2.$$

Under the true pre-change kernel $p$, the expected increment satisfies $\mathbb{E}[s(B_n)] = -\mathbb{E}_{x \sim \pi_p}\left[ D_F(p \,\|\, q \mid x) \right] < 0$, while after the change it becomes positive under the alternative. This sign switch ensures that, as in log-likelihood ratio settings, accumulated score-difference statistics naturally drift upward post-change and downward pre-change, a property essential for sequential detection and model comparison (Chen et al., 6 Nov 2025).
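
A quick Monte Carlo sanity check of this sign switch, under assumed Gaussian AR(1) kernels of our own choosing, is sketched below: the mean of $s(B_n)$ comes out negative when data follow $p$ and positive when they follow $q$.

```python
# Monte Carlo check (our construction, not from the cited papers) that
# E[s(B_n)] < 0 under the pre-change kernel p(y|x) = N(0.5*x, 1) and
# E[s(B_n)] > 0 under the post-change kernel q(y|x) = N(0.9*x, 1).
import numpy as np

rng = np.random.default_rng(0)

def s_increment(y, x, rho_p, rho_q, sigma=1.0):
    grad_p = -(y - rho_p * x) / sigma**2
    grad_q = -(y - rho_q * x) / sigma**2
    return grad_p**2 - grad_q**2   # Laplacians cancel for equal variances

def mean_drift(rho_true, rho_p=0.5, rho_q=0.9, n=200_000):
    x = rng.standard_normal(n)     # stand-in for the stationary law of x
    y = rho_true * x + rng.standard_normal(n)
    return s_increment(y, x, rho_p, rho_q).mean()

print(mean_drift(rho_true=0.5))    # data from p: negative expected drift
print(mean_drift(rho_true=0.9))    # data from q: positive expected drift
```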

In the i.i.d. prediction context, sample-averaged conditional Hyvärinen score differences converge almost surely to differences in risk divergences $D_{\mathcal{H}}(p_*, M_j)$ between the true generating measure $p_*$ and the competing models, with

$$D_{\mathcal{H}}(p_*, M_j) = \int \left( \frac{\partial}{\partial y} \log p_*(y) - \frac{\partial}{\partial y} \log p_j(y \mid \theta_j^*) \right)^2 p_*(y)\, dy,$$

yielding a consistency theorem for model comparison procedures based on summed conditional score differences (Shao et al., 2017).

3. Applications: Change Detection, Model Comparison, and Estimation

Quickest Change Detection

Conditional Hyvärinen score differences underpin the score-based CUSUM (SCUSUM) algorithm for quickest change detection in both i.i.d. and Markovian regimes, extending classical CUSUM to unnormalized and high-dimensional models. The test statistic recursively accumulates truncated score differences,

$$W_n = \max\{0,\; W_{n-1} + \varphi(s(B_n))\},$$

with the truncation $\varphi$ controlling for unbounded increments. The stopping time is $T = \inf\{n \geq 1 : W_n \geq b\}$, where the threshold $b$ is chosen to control false alarms (Chen et al., 6 Nov 2025; Wu et al., 2023).
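
A minimal sketch of the SCUSUM recursion follows, assuming a stream of precomputed score differences; the clipping function used as the truncation $\varphi$ and the threshold value are illustrative placeholders, not choices from the cited papers.

```python
# SCUSUM recursion W_n = max(0, W_{n-1} + phi(s(B_n))), with simple clipping
# as the truncation phi; clip level and threshold b are illustrative.
import numpy as np

def scusum(increments, b, clip=10.0):
    """Return the stopping time T = inf{n : W_n >= b}, or None if never reached."""
    w = 0.0
    for n, s in enumerate(increments, start=1):
        w = max(0.0, w + np.clip(s, -clip, clip))  # truncated score difference
        if w >= b:
            return n
    return None

# Usage: feed a stream of score differences s(B_n); detection fires once the
# accumulated positive drift crosses the threshold b.
rng = np.random.default_rng(1)
pre = rng.normal(-0.2, 1.0, size=500)    # negative mean drift before change
post = rng.normal(0.2, 1.0, size=500)    # positive mean drift after change
print(scusum(np.concatenate([pre, post]), b=20.0))
```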

Bayesian Model Comparison

Conditional Hyvärinen score sums provide an alternative to Bayes factors, especially when models involve vague priors or intractable likelihoods. For two models with predictive densities $p_1, p_2$, cumulative score differences across out-of-sample data asymptotically distinguish models by their conditional Hyvärinen risks, even in non-nested settings. Sequential Monte Carlo (SMC) methods can be used to estimate the conditional scores and their differences online, leveraging weighted particle representations of posterior parameters (Shao et al., 2017).

Linear Time Series Estimation

In Gaussian linear processes (AR, MA, ARFIMA), conditional score differences reduce to quadratic functions of prediction errors and can be used for both parameter estimation and comparison:

$$\Delta S_c(t; \theta, \theta') = \left[ -\frac{1}{\sigma^2} + \frac{e_t(\theta)^2}{2\sigma^4} \right] - \left[ -\frac{1}{\sigma'^2} + \frac{e_t(\theta')^2}{2\sigma'^4} \right],$$

where $e_t(\theta)$ is the prediction error under parameter $\theta$. This enables consistent and (sometimes) fully efficient estimation without recourse to normalizing constants or full likelihoods (Columbu et al., 2019).
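
The sketch below illustrates this reduction for a Gaussian AR(1) model on our own simulated data: minimizing the summed per-observation score $-1/\sigma^2 + e_t(\theta)^2/(2\sigma^4)$ over $(\rho, \sigma)$ recovers estimates close to the OLS/conditional-MLE fit, consistent with the equivalence noted below.

```python
# Estimation sketch under our assumptions: for a Gaussian AR(1) with
# e_t(theta) = y_t - rho * y_{t-1}, minimize the summed per-observation
# Hyvarinen score -1/sigma^2 + e_t^2/(2*sigma^4) over (rho, sigma).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = np.zeros(2000)
for t in range(1, len(y)):                # simulate AR(1): rho=0.6, sigma=0.5
    y[t] = 0.6 * y[t - 1] + 0.5 * rng.standard_normal()

def total_score(params):
    rho, log_sigma = params               # parametrize sigma > 0 via its log
    sigma2 = np.exp(2 * log_sigma)
    e = y[1:] - rho * y[:-1]              # one-step prediction errors
    return np.sum(-1.0 / sigma2 + e**2 / (2 * sigma2**2))

fit = minimize(total_score, x0=[0.0, 0.0])
print(fit.x[0], np.exp(fit.x[1]))         # estimates of rho and sigma
```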

4. Computation and Implementation

Score Estimation

Direct computation of conditional scores $\nabla_y \log p(y \mid x)$ is generally infeasible in high dimensions or for energy-based models. These gradients can be estimated consistently via conditional score matching, in which a neural network $\psi(y, x; \theta)$ is trained to minimize

$$\widetilde{J}(\theta) = \mathbb{E}_{(x, y) \sim p} \left[ \frac{1}{2} \left\| \psi(y, x; \theta) \right\|^2 + \sum_{i=1}^{d} \frac{\partial}{\partial y_i} \psi_i(y, x; \theta) \right],$$

exploiting integration by parts to bypass normalization (Chen et al., 6 Nov 2025).
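
A hedged PyTorch sketch of this objective is given below; the network architecture, dimensions, and synthetic data are placeholders. The divergence term is computed exactly via per-coordinate autograd passes, which is practical only for moderate $d$; sliced or denoising variants scale better.

```python
# Sketch of the conditional score-matching objective
# J(theta) = E[ 0.5 * ||psi(y, x)||^2 + div_y psi(y, x) ].
import torch
import torch.nn as nn

d_y, d_x = 2, 3
psi = nn.Sequential(nn.Linear(d_y + d_x, 64), nn.Tanh(), nn.Linear(64, d_y))

def score_matching_loss(y, x):
    y = y.requires_grad_(True)
    out = psi(torch.cat([y, x], dim=-1))  # psi(y, x): estimate of grad_y log p(y|x)
    div = 0.0
    for i in range(d_y):                  # divergence: sum_i d psi_i / d y_i
        g = torch.autograd.grad(out[:, i].sum(), y, create_graph=True)[0]
        div = div + g[:, i]
    return (0.5 * out.pow(2).sum(dim=-1) + div).mean()

# One gradient step on synthetic (x, y) pairs; a real pipeline would loop
# over minibatches drawn from the process of interest.
opt = torch.optim.Adam(psi.parameters(), lr=1e-3)
x = torch.randn(128, d_x)
y = 0.5 * x[:, :d_y] + torch.randn(128, d_y)
loss = score_matching_loss(y, x)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```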

Sequential Monte Carlo Approximation

For Bayesian model comparison and state-space inference, particle-based SMC approximates the predictive/posterior distributions and their gradients. At each time $t$, given $N$ weighted samples $\{\theta_t^{(\ell)}, W_t^{(\ell)}\}$, the conditional Hyvärinen score is estimated by accumulating weighted Laplacians and gradients of log-predictive densities, with a variance-reduction correction:

$$\hat{S}_H(p_j, y_t) \approx 2 \sum_{\ell} W_t^{(\ell)} \Delta_{y_t} \log p_j(y_t \mid \theta_t^{(\ell)}) + \sum_{\ell} W_t^{(\ell)} \left\| \nabla_{y_t} \log p_j(y_t \mid \theta_t^{(\ell)}) \right\|^2 - \left\| \sum_{\ell} W_t^{(\ell)} \nabla_{y_t} \log p_j(y_t \mid \theta_t^{(\ell)}) \right\|^2.$$

The resulting differences $\Delta S_H(t)$ are accumulated for testing or model selection (Shao et al., 2017).
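
Given per-particle gradients and Laplacians of the log-predictive density, the weighted estimator above is a few lines of NumPy; the inputs in this sketch are placeholders for quantities a particle filter would supply.

```python
# Weighted SMC estimator of the conditional Hyvarinen score shown above,
# from per-particle gradients and Laplacians of log p_j(y_t | theta^(l)).
import numpy as np

def smc_hyvarinen_score(weights, grads, laps):
    """weights: (N,) summing to 1; grads: (N, d); laps: (N,)."""
    term1 = 2.0 * np.sum(weights * laps)                   # weighted Laplacians
    term2 = np.sum(weights * np.sum(grads**2, axis=1))     # weighted squared norms
    mean_grad = np.sum(weights[:, None] * grads, axis=0)
    return term1 + term2 - np.sum(mean_grad**2)            # variance-reduction term

# Example with N = 100 equally weighted particles in d = 2.
rng = np.random.default_rng(3)
w = np.full(100, 0.01)
print(smc_hyvarinen_score(w, rng.standard_normal((100, 2)), -np.ones(100)))
```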

Truncation and Boundedness

For proper theoretical guarantees in Markov settings, increments are truncated to ensure boundedness, enabling the use of concentration inequalities (e.g., Hoeffding's inequality for Markov chains) and robustifying finite-sample behavior (Chen et al., 6 Nov 2025).

5. Theoretical Properties

Consistency and Optimality

Under regularity conditions (ergodicity, differentiability, bounded envelope conditions, and posterior concentration), conditional Hyvärinen score-based procedures are consistent: sample averages converge to the true risk differences, and, in change detection, drift direction and detection delay scalings are preserved relative to classical KL-based procedures. For Gaussian models, Fisher and KL divergences coincide, making Hyvärinen and likelihood-based approaches identical in efficiency (Columbu et al., 2019, Wu et al., 2023, Shao et al., 2017).

Control of Error Rates

For change detection, thresholds on accumulated (truncated) scores yield exponential lower bounds on the mean time to false alarm and asymptotic upper bounds on detection delay. Under uniform ergodicity (Doeblin's condition) and light-tail conditions, these bounds provide rigorous performance guarantees in Markov settings, with explicit expressions for the average run length (ARL) and worst-case average detection delay (WADD) in terms of the score-difference drift and boundedness constants (Chen et al., 6 Nov 2025).

Extensions to Discrete Data

On discrete state spaces, a finite-difference analogue of the Hyvärinen score (using central and one-sided differences) provides a proper, local, homogeneous score for model comparison, maintaining the same prequential and asymptotic properties as in the continuous setting (Shao et al., 2017).
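
As an illustration of one natural finite-difference analogue (our reading of the construction; the exact form in Shao et al., 2017 may differ), the sketch below replaces the derivative and Laplacian of $\log p$ with central first and second differences on an integer lattice. The normalizing constant still cancels, since only differences of $\log p$ enter.

```python
# Hypothetical finite-difference analogue of the Hyvarinen score on an
# integer lattice, using central differences of log p at interior points.
import numpy as np
from math import lgamma, log

def discrete_hyvarinen_score(log_p, y):
    """log_p: mapping y -> log p(y) (possibly unnormalized); y an interior point."""
    grad = (log_p[y + 1] - log_p[y - 1]) / 2.0           # central first difference
    lap = log_p[y + 1] - 2.0 * log_p[y] + log_p[y - 1]   # central second difference
    return 2.0 * lap + grad**2                           # normalizer cancels in both

# Example: unnormalized Poisson(4)-shaped log-mass evaluated on a grid.
log_p = {k: k * log(4.0) - lgamma(k + 1) for k in range(30)}
print(discrete_hyvarinen_score(log_p, y=5))
```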

6. Practical Significance and Scope

Conditional Hyvärinen score differences enable a unified, likelihood-free framework for model comparison, parameter estimation, and sequential detection in high-dimensional, intractable, or unnormalized models. They are well-matched to modern settings such as high-dimensional transition kernels, energy-based generative models, and time-series with latent or implicit dynamics. Key empirical findings include:

  • SCUSUM achieves false alarm control matching classical CUSUM, with detection delays within a constant factor, outperforming kernel MMD-based tests in high-dimensional, non-Gaussian, and unnormalized regimes (Wu et al., 2023).
  • In linear-Gaussian models, conditional Hyvärinen estimators are equivalent to OLS/conditional MLE in AR and nearly as efficient in ARFIMA and MA models without requiring full covariance evaluation (Columbu et al., 2019).
  • SMC-based estimation of score differences is feasible for both tractable and intractable models, with demonstrated model selection consistency in large samples (Shao et al., 2017).

A plausible implication is the extension of sequential analysis, Bayesian model comparison, and score-based learning to application domains previously inaccessible due to normalization or likelihood challenges.
