Distributional Shrinkage II: Optimal Transport Denoisers with Higher-Order Scores (2512.09295v1)

Published 10 Dec 2025 in math.ST, cs.LG, and stat.ML

Abstract: We revisit the signal denoising problem through the lens of optimal transport: the goal is to recover an unknown scalar signal distribution $X \sim P$ from noisy observations $Y = X + \sigma Z$, with $Z$ being standard Gaussian independent of $X$ and $\sigma > 0$ a known noise level. Let $Q$ denote the distribution of $Y$. We introduce a hierarchy of denoisers $T_0, T_1, \ldots, T_\infty : \mathbb{R} \to \mathbb{R}$ that are agnostic to the signal distribution $P$, depending only on higher-order score functions of $Q$. Each denoiser $T_K$ is progressively refined using the $(2K-1)$-th order score function of $Q$ at noise resolution $\sigma^{2K}$, achieving better denoising quality measured by the Wasserstein metric $W(T_K \sharp Q, P)$. The limiting denoiser $T_\infty$ identifies the optimal transport map with $T_\infty \sharp Q = P$. We provide a complete characterization of the combinatorial structure underlying this hierarchy through Bell polynomial recursions, revealing how higher-order score functions encode the optimal transport map for signal denoising. We study two estimation strategies with convergence rates for higher-order scores from i.i.d. samples drawn from $Q$: (i) plug-in estimation via Gaussian kernel smoothing, and (ii) direct estimation via higher-order score matching. This hierarchy of agnostic denoisers opens new perspectives in signal denoising and empirical Bayes.

Summary

  • The paper introduces a hierarchy of denoisers that bridges the identity map and the optimal transport map using higher-order score expansions.
  • It leverages explicit Bell polynomial relationships to construct estimable polynomial functions solely from the observable data distribution.
  • Theoretical guarantees show that as the expansion order $K$ increases, the Wasserstein error contracts at a rate of $O(\eta^{K+1})$ under smoothness assumptions.

Distributional Shrinkage and Hierarchical Optimal Transport Denoisers

Problem Definition and Motivations

The paper "Distributional Shrinkage II: Optimal Transport Denoisers with Higher-Order Scores" (2512.09295) addresses signal denoising from a distributional perspective. The setup consists of recovering a scalar signal $X \sim P$ from noisy measurements $Y = X + \sigma Z$, with $Z \sim \mathcal{N}(0,1)$ and $\sigma > 0$ known, where only the distribution $Q$ of $Y$ is observable. The paper frames the goal as distributional recovery: rather than minimizing mean squared error (MSE), focus is placed on distributional proximity between the denoised output and the original distribution $P$, measured by the Wasserstein distance $W_r$.

The work motivates the investigation through the inadequacy of classical denoisers—such as the Bayes-optimal and empirical Bayes denoisers—which, while optimal for MSE, induce excessive shrinkage at the distributional level (cf. [liang2025distributional, garcia2024new, jaffe2025constrained]). This motivates a hierarchy of optimal transport (OT) denoisers, constructed to achieve close distributional alignment in Wasserstein distance, with improved performance as more structure is incorporated.

Hierarchy of Distributional Denoisers: Higher-Order Score Expansion

The central contribution is a hierarchy of denoisers $\{T_K\}_{K=0}^\infty$ that interpolates between the identity map ($T_0(y) = y$) and the optimal transport map $T_\infty$, defined via the quantile-matching transformation $T_\infty(y) = F^{-1} \circ G(y)$, where $F^{-1}$ and $G$ are the quantile function of $P$ and the CDF of $Q$, respectively.
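As a minimal illustration of the quantile-matching map (not from the paper's code, and assuming a Gaussian prior so that everything is available in closed form), take $P = \mathcal{N}(0,1)$, so that $Q = \mathcal{N}(0, 1+\sigma^2)$ and $T_\infty = F^{-1} \circ G$ can be evaluated with standard-library normal distributions:

```python
import math
from statistics import NormalDist

# Illustrative sketch: for a Gaussian prior P = N(0, 1) and known noise level
# sigma, the observation Y = X + sigma*Z has distribution Q = N(0, 1 + sigma^2),
# so the quantile-matching map T_inf = F^{-1} o G is computable in closed form.
sigma = 0.5
P = NormalDist(0.0, 1.0)                       # signal distribution P
Q = NormalDist(0.0, math.sqrt(1 + sigma**2))   # observation distribution Q

def T_inf(y: float) -> float:
    """Optimal transport (quantile-matching) denoiser F^{-1}(G(y))."""
    return P.inv_cdf(Q.cdf(y))

# For centered Gaussians the OT map is linear: T_inf(y) = y / sqrt(1 + sigma^2).
print(T_inf(1.0), 1.0 / math.sqrt(1 + sigma**2))
```

In this linear-Gaussian case $T_\infty$ rescales $y$ by $1/\sqrt{1+\sigma^2}$, which is exactly the shrinkage that maps $Q$ back onto $P$.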

Each denoiser $T_K$ is constructed via a truncated series expansion

$$T_K(y) = y + \sum_{k=1}^K \frac{\eta^k}{k!}\, h_k(y), \qquad \eta = \sigma^2/2,$$

where the $h_k(y)$ are explicitly defined polynomial functions of higher-order score functions of $Q$ (with $q = dQ/dy$). These higher-order scores, recursively structured through partial Bell polynomials, encode information about the series expansion of the optimal transport map.

Crucially, these $h_k$ depend exclusively on $Q$; knowledge of $P$ is not required. Explicitly:

  • $h_1(y) = \frac{q'(y)}{q(y)}$,
  • $h_2(y)$ involves $q''(y)$, $q'(y)^2$, and the derivative of $h_1$,
  • higher orders involve further derivatives and cross-terms, recursively determined via Bell polynomial relations.

This combinatorial structure, characterized in closed form, is a key technical insight, as it links the denoising map to estimable objects from observable data.
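To make the combinatorial objects concrete (without reproducing the paper's exact $h_k$ formulas, which are not given here), the following sketch evaluates partial Bell polynomials $B_{n,k}(x_1, x_2, \ldots)$ via the classical recurrence $B_{n,k} = \sum_i \binom{n-1}{i-1} x_i B_{n-i,k-1}$:

```python
from functools import lru_cache
from math import comb

# Hedged sketch: the paper characterizes the h_k through partial Bell
# polynomial recursions. This helper only evaluates partial Bell polynomials
# B_{n,k}(x_1, ..., x_{n-k+1}) numerically via the standard recurrence,
# with the arguments x supplied as a tuple of numbers.

def partial_bell(n: int, k: int, x: tuple) -> float:
    """B_{n,k}(x[0], ..., x[n-k]) via the classical recurrence."""
    @lru_cache(maxsize=None)
    def B(n: int, k: int) -> float:
        if n == 0 and k == 0:
            return 1.0
        if n == 0 or k == 0:
            return 0.0
        return sum(comb(n - 1, i - 1) * x[i - 1] * B(n - i, k - 1)
                   for i in range(1, n - k + 2))
    return B(n, k)

# Sanity checks: B_{3,2}(x1, x2) = 3*x1*x2, and with all x_i = 1 the value
# B_{n,k} reduces to the Stirling number of the second kind, e.g. S(4,2) = 7.
print(partial_bell(3, 2, (2.0, 5.0)))       # 3 * 2 * 5 = 30
print(partial_bell(4, 2, (1.0, 1.0, 1.0)))  # S(4, 2) = 7
```

In the paper's setting, the arguments of these polynomials are built from derivatives of $\log q$, which is what links the denoiser coefficients $h_k$ to estimable quantities.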

Theoretical Guarantees: Distributional Accuracy and Noise Asymptotics

The paper establishes that for each $K$, $T_K$ achieves Wasserstein approximation to $P$ of order $O(\eta^{K+1})$, assuming sufficient smoothness and regularity. Specifically, for compact $\mathrm{supp}(P)$ and sufficiently smooth densities,

$$W_r(T_K \sharp Q, \; P) \lesssim \eta^{K+1},$$

and similarly, the sup-norm error between $T_K$ and $T_\infty$ contracts at the same rate as $\sigma \to 0$. This quantifies a precise tradeoff between the order of expansion and denoising fidelity in Wasserstein space.

In the small-noise limit, increasing $K$ yields markedly better distributional matching, a property not shared by traditional MSE-minimizing denoisers.
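The contraction rate can be checked numerically in the Gaussian case (an illustrative assumption, not the paper's code): with $P = \mathcal{N}(0,1)$ one has $h_1(y) = q'(y)/q(y) = -y/(1+\sigma^2)$, so $T_0$ and $T_1$ are linear maps, their pushforwards of $Q$ are centered Gaussians, and $W_2(\mathcal{N}(0,a^2), \mathcal{N}(0,b^2)) = |a - b|$ gives the Wasserstein errors in closed form:

```python
import math

# Hedged numeric check (assumes a Gaussian prior P = N(0,1)): then
# Q = N(0, 1 + sigma^2), h_1(y) = -y / (1 + sigma^2), and both T_0(y) = y and
# T_1(y) = y + eta * h_1(y) are linear, so each pushforward T_K # Q is a
# centered Gaussian whose W_2 distance to P is just |std - 1|.
def w2_errors(sigma: float):
    eta = sigma**2 / 2
    tau = math.sqrt(1 + sigma**2)            # std of Q
    s0 = tau                                 # std of T_0 # Q (identity map)
    s1 = (1 - eta / (1 + sigma**2)) * tau    # std of T_1 # Q
    return abs(s0 - 1), abs(s1 - 1), eta

for sigma in (0.4, 0.2, 0.1):
    e0, e1, eta = w2_errors(sigma)
    # e0 scales like eta (order K = 0); e1 scales like eta^2 (order K = 1).
    print(f"sigma={sigma}: W2(T0)={e0:.2e} (eta={eta:.2e}), "
          f"W2(T1)={e1:.2e} (eta^2={eta**2:.2e})")
```

As $\sigma$ is halved, the $T_0$ error drops by roughly $4\times$ while the $T_1$ error drops by roughly $16\times$, matching the $O(\eta^{K+1})$ rates.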

Denoiser Estimation: Plug-In and Score Matching Approaches

Implementation of these denoisers from data relies on estimation of higher-order score functions of $Q$, specifically the ratios $q^{(m)}/q$.

Two estimation paradigms are studied:

  • Gaussian Kernel Smoothing: For each derivative $q^{(m)}$, a plug-in estimator $\hat q^{(m)}$ is constructed from i.i.d. data, with optimal rates of convergence established (e.g., an $n^{-4/(2m+5)}$ MSE rate for the $m$-th derivative at fixed $y$).
  • Higher-Order Score Matching: Direct estimation of the score functions in function space via generalized score matching [hyvarinen2005estimation], with risk rates governed by the Hölder smoothness of the true score function (the parametric rate $n^{-1/2}$ is attainable for sufficiently smooth scores).

Both approaches yield empirically tractable procedures, with theoretical performance guarantees, for realizing the entire denoiser hierarchy from observed data.
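A minimal sketch of the plug-in route (illustrative only; bandwidth, sample size, and the Gaussian simulation setup are assumptions, not the paper's choices) estimates $q$ and $q'$ by Gaussian kernel smoothing, forms $\hat h_1 = \hat q'/\hat q$, and builds the first-order denoiser $\hat T_1(y) = y + \eta\, \hat h_1(y)$:

```python
import numpy as np

# Hedged sketch of plug-in estimation via Gaussian kernel smoothing:
# estimate q and q' from i.i.d. samples of Y, form h1_hat = q'_hat / q_hat,
# then build T1_hat(y) = y + (sigma^2 / 2) * h1_hat(y).
rng = np.random.default_rng(0)
sigma, n, bw = 0.5, 20_000, 0.3           # noise level, sample size, bandwidth
x = rng.standard_normal(n)                # latent signal X ~ N(0, 1)
y = x + sigma * rng.standard_normal(n)    # observations Y ~ Q

def h1_hat(t: float) -> float:
    """Plug-in estimate of q'(t)/q(t) from a Gaussian KDE and its derivative."""
    u = (t - y) / bw
    phi = np.exp(-0.5 * u**2)             # unnormalized Gaussian kernel
    q = phi.sum() / (n * bw)              # ~ q(t), up to the 1/sqrt(2 pi) factor
    dq = (-u * phi).sum() / (n * bw**2)   # ~ q'(t); the factor cancels in the ratio
    return dq / q

def T1_hat(t: float) -> float:
    return t + 0.5 * sigma**2 * h1_hat(t)

# In this Gaussian simulation the true first-order score is
# h_1(t) = -t / (1 + sigma^2), i.e. -0.8 at t = 1.
print(h1_hat(1.0), T1_hat(1.0))
```

The estimate carries the usual KDE smoothing bias (it targets the score of $Q$ convolved with the kernel), which is why bandwidth choice drives the $n^{-4/(2m+5)}$ rates quoted above.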

Relation to Prior Literature

The approach fundamentally differs from classical empirical Bayes and shrinkage methods (Stein, James-Stein, etc.), which either assume prior knowledge or estimate the unobserved $P$ (g-modeling) before denoising. The current paper demonstrates that direct modeling and higher-order estimation in the observation space ($Y$) can recover the optimal transport map in distributional metrics without explicit prior estimation.

Prior work (e.g., [liang2025distributional]) analyzed lower-order OT denoisers or imposed structural assumptions on the prior. The present work generalizes the analytic framework to arbitrary order, makes explicit the combinatorics underpinning the construction, and characterizes the statistical estimation landscape for these higher-order objects.

Implications and Future Directions

These results provide a principled methodology for nonparametric, data-driven denoising that explicitly targets distributional reconstruction, with rapidly contracting error in Wasserstein space using only observable statistics.

From a practical standpoint, this framework is highly relevant to modern generative modeling, where distributional metrics (rather than pointwise error) are fundamental—for instance, in diffusion-based models, where denoising score estimators play a central role. The characterizations here suggest a principled route to estimator design with performance guarantees at the optimal transport level.

Theoretically, uncovering the combinatorial Bell polynomial structure behind denoising maps opens directions for further exploration, particularly in higher dimensions (where optimal transport maps are more complex), for other noise models, or in non-Euclidean settings.

Conclusion

The paper introduces and analyzes a hierarchy of distributional denoisers, constructed using higher-order score functions of the observed noisy distribution, with explicit characterization via Bell polynomials and rigorous guarantees on denoising quality in Wasserstein distance. It provides both the mathematical underpinnings and practical estimation tools to achieve distribution-matching denoising, representing a significant extension of distributional shrinkage theory and methodology (2512.09295).
