Distributional Shrinkage II: Optimal Transport Denoisers with Higher-Order Scores (2512.09295v1)

Published 10 Dec 2025 in math.ST, cs.LG, and stat.ML

Abstract: We revisit the signal denoising problem through the lens of optimal transport: the goal is to recover an unknown scalar signal distribution $X \sim P$ from noisy observations $Y = X + \sigma Z$, with $Z$ being standard Gaussian independent of $X$ and $\sigma > 0$ a known noise level. Let $Q$ denote the distribution of $Y$. We introduce a hierarchy of denoisers $T_0, T_1, \ldots, T_\infty : \mathbb{R} \to \mathbb{R}$ that are agnostic to the signal distribution $P$, depending only on higher-order score functions of $Q$. Each denoiser $T_K$ is progressively refined using the $(2K-1)$-th order score function of $Q$ at noise resolution $\sigma^{2K}$, achieving better denoising quality measured by the Wasserstein metric $W(T_K \sharp Q, P)$. The limiting denoiser $T_\infty$ identifies the optimal transport map with $T_\infty \sharp Q = P$. We provide a complete characterization of the combinatorial structure underlying this hierarchy through Bell polynomial recursions, revealing how higher-order score functions encode the optimal transport map for signal denoising. We study two estimation strategies with convergence rates for higher-order scores from i.i.d. samples drawn from $Q$: (i) plug-in estimation via Gaussian kernel smoothing, and (ii) direct estimation via higher-order score matching. This hierarchy of agnostic denoisers opens new perspectives in signal denoising and empirical Bayes.

Summary

  • The paper introduces a hierarchy of denoisers that bridges the identity map and the optimal transport map using higher-order score expansions.
  • It leverages explicit Bell polynomial relationships to construct estimable polynomial functions solely from the observable data distribution.
  • Theoretical guarantees show that as the expansion order $K$ increases, the Wasserstein error contracts at a rate of $O(\eta^{K+1})$ under smoothness assumptions.

Distributional Shrinkage and Hierarchical Optimal Transport Denoisers

Problem Definition and Motivations

The paper "Distributional Shrinkage II: Optimal Transport Denoisers with Higher-Order Scores" (2512.09295) addresses signal denoising from a distributional perspective. The setup consists of recovering a scalar signal $X \sim P$ from noisy measurements $Y = X + \sigma Z$, with $Z \sim \mathcal{N}(0,1)$ and $\sigma > 0$ known, where only the distribution $Q$ of $Y$ is observable. The paper frames the goal as distributional recovery: rather than minimizing mean squared error (MSE), focus is placed on distributional proximity between the denoised output and the original distribution $P$, measured by the Wasserstein distance $W_r$.

The work motivates the investigation through the inadequacy of classical denoisers—such as the Bayes-optimal and empirical Bayes denoisers—which, while optimal for MSE, induce excessive shrinkage at the distributional level (cf. [liang2025distributional, garcia2024new, jaffe2025constrained]). This motivates a hierarchy of optimal transport (OT) denoisers, constructed to achieve close distributional alignment in Wasserstein distance, with improved performance as more structure is incorporated.

Hierarchy of Distributional Denoisers: Higher-Order Score Expansion

The central contribution is a hierarchy of denoisers $\{T_K\}_{K=0}^\infty$ that interpolates between the identity map ($T_0(y) = y$) and the optimal transport map $T_\infty$, defined via the quantile-matching transformation $T_\infty(y) = F^{-1} \circ G(y)$, where $F^{-1}$ and $G$ are the quantile function of $P$ and the CDF of $Q$, respectively.
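As a minimal illustration of the quantile-matching map (not from the paper's code, and assuming a Gaussian prior so that everything is available in closed form), take $P = \mathcal{N}(0,1)$, so that $Q = \mathcal{N}(0, 1+\sigma^2)$ and $T_\infty = F^{-1} \circ G$ can be evaluated with standard-library normal distributions:

```python
import math
from statistics import NormalDist

# Illustrative sketch: for a Gaussian prior P = N(0, 1) and known noise level
# sigma, the observation Y = X + sigma*Z has distribution Q = N(0, 1 + sigma^2),
# so the quantile-matching map T_inf = F^{-1} o G is computable in closed form.
sigma = 0.5
P = NormalDist(0.0, 1.0)                       # signal distribution P
Q = NormalDist(0.0, math.sqrt(1 + sigma**2))   # observation distribution Q

def T_inf(y: float) -> float:
    """Optimal transport (quantile-matching) denoiser F^{-1}(G(y))."""
    return P.inv_cdf(Q.cdf(y))

# For centered Gaussians the OT map is linear: T_inf(y) = y / sqrt(1 + sigma^2).
print(T_inf(1.0), 1.0 / math.sqrt(1 + sigma**2))
```

In this linear-Gaussian case $T_\infty$ rescales $y$ by $1/\sqrt{1+\sigma^2}$, which is exactly the shrinkage that maps $Q$ back onto $P$.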

Each denoiser $T_K$ is constructed via a truncated series expansion

$$T_K(y) = y + \sum_{k=1}^K \frac{\eta^k}{k!}\, h_k(y), \qquad \eta = \sigma^2/2,$$

where the $h_k(y)$ are explicitly defined polynomial functions of higher-order score functions of $Q$ (with $q = dQ/dy$). These higher-order scores, recursively structured through partial Bell polynomials, encode information about the series expansion of the optimal transport map.

Crucially, these $h_k$ depend exclusively on $Q$; knowledge of $P$ is not required. Explicitly:

  • $h_1(y) = \frac{q'(y)}{q(y)}$,
  • $h_2(y)$ involves $q''(y)$, $q'(y)^2$, and the derivative of $h_1$,
  • higher orders involve further derivatives and cross-terms, recursively determined via Bell polynomial relations.

This combinatorial structure, characterized in closed form, is a key technical insight, as it links the denoising map to estimable objects from observable data.
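To make the combinatorial objects concrete (without reproducing the paper's exact $h_k$ formulas, which are not given here), the following sketch evaluates partial Bell polynomials $B_{n,k}(x_1, x_2, \ldots)$ via the classical recurrence $B_{n,k} = \sum_i \binom{n-1}{i-1} x_i B_{n-i,k-1}$:

```python
from functools import lru_cache
from math import comb

# Hedged sketch: the paper characterizes the h_k through partial Bell
# polynomial recursions. This helper only evaluates partial Bell polynomials
# B_{n,k}(x_1, ..., x_{n-k+1}) numerically via the standard recurrence,
# with the arguments x supplied as a tuple of numbers.

def partial_bell(n: int, k: int, x: tuple) -> float:
    """B_{n,k}(x[0], ..., x[n-k]) via the classical recurrence."""
    @lru_cache(maxsize=None)
    def B(n: int, k: int) -> float:
        if n == 0 and k == 0:
            return 1.0
        if n == 0 or k == 0:
            return 0.0
        return sum(comb(n - 1, i - 1) * x[i - 1] * B(n - i, k - 1)
                   for i in range(1, n - k + 2))
    return B(n, k)

# Sanity checks: B_{3,2}(x1, x2) = 3*x1*x2, and with all x_i = 1 the value
# B_{n,k} reduces to the Stirling number of the second kind, e.g. S(4,2) = 7.
print(partial_bell(3, 2, (2.0, 5.0)))       # 3 * 2 * 5 = 30
print(partial_bell(4, 2, (1.0, 1.0, 1.0)))  # S(4, 2) = 7
```

In the paper's setting, the arguments of these polynomials are built from derivatives of $\log q$, which is what links the denoiser coefficients $h_k$ to estimable quantities.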

Theoretical Guarantees: Distributional Accuracy and Noise Asymptotics

The paper establishes that for each $K$, $T_K$ achieves Wasserstein approximation to $P$ of order $O(\eta^{K+1})$, assuming sufficient smoothness and regularity. Specifically, for compact $\mathrm{supp}(P)$ and sufficiently smooth densities,

$$W_r(T_K \sharp Q, \; P) \lesssim \eta^{K+1},$$

and similarly, the sup-norm error between $T_K$ and $T_\infty$ contracts at the same rate as $\sigma \to 0$. This quantifies a precise tradeoff between the order of expansion and denoising fidelity in Wasserstein space.

In the small-noise limit, increasing $K$ yields markedly better distributional matching, a property not shared by traditional MSE-minimizing denoisers.
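The contraction rate can be checked numerically in the Gaussian case (an illustrative assumption, not the paper's code): with $P = \mathcal{N}(0,1)$ one has $h_1(y) = q'(y)/q(y) = -y/(1+\sigma^2)$, so $T_0$ and $T_1$ are linear maps, their pushforwards of $Q$ are centered Gaussians, and $W_2(\mathcal{N}(0,a^2), \mathcal{N}(0,b^2)) = |a - b|$ gives the Wasserstein errors in closed form:

```python
import math

# Hedged numeric check (assumes a Gaussian prior P = N(0,1)): then
# Q = N(0, 1 + sigma^2), h_1(y) = -y / (1 + sigma^2), and both T_0(y) = y and
# T_1(y) = y + eta * h_1(y) are linear, so each pushforward T_K # Q is a
# centered Gaussian whose W_2 distance to P is just |std - 1|.
def w2_errors(sigma: float):
    eta = sigma**2 / 2
    tau = math.sqrt(1 + sigma**2)            # std of Q
    s0 = tau                                 # std of T_0 # Q (identity map)
    s1 = (1 - eta / (1 + sigma**2)) * tau    # std of T_1 # Q
    return abs(s0 - 1), abs(s1 - 1), eta

for sigma in (0.4, 0.2, 0.1):
    e0, e1, eta = w2_errors(sigma)
    # e0 scales like eta (order K = 0); e1 scales like eta^2 (order K = 1).
    print(f"sigma={sigma}: W2(T0)={e0:.2e} (eta={eta:.2e}), "
          f"W2(T1)={e1:.2e} (eta^2={eta**2:.2e})")
```

As $\sigma$ is halved, the $T_0$ error drops by roughly $4\times$ while the $T_1$ error drops by roughly $16\times$, matching the $O(\eta^{K+1})$ rates.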

Denoiser Estimation: Plug-In and Score Matching Approaches

Implementation of these denoisers from data relies on estimation of higher-order score functions of $Q$, specifically the ratios $q^{(m)}/q$.

Two estimation paradigms are studied:

  • Gaussian Kernel Smoothing: For each derivative $q^{(m)}$, a plug-in estimator $\hat q^{(m)}$ is constructed from i.i.d. data, with optimal rates of convergence established (e.g., an $n^{-4/(2m+5)}$ MSE rate for the $m$-th derivative at fixed $y$).
  • Higher-Order Score Matching: Direct estimation of the score functions in function space via generalized score matching [hyvarinen2005estimation], with risk rates governed by the Hölder smoothness of the true score function (the parametric rate $n^{-1/2}$ is attainable for sufficiently smooth scores).

Both approaches yield empirically tractable procedures, with theoretical performance guarantees, for realizing the entire denoiser hierarchy from observed data.
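A minimal sketch of the plug-in route (illustrative only; bandwidth, sample size, and the Gaussian simulation setup are assumptions, not the paper's choices) estimates $q$ and $q'$ by Gaussian kernel smoothing, forms $\hat h_1 = \hat q'/\hat q$, and builds the first-order denoiser $\hat T_1(y) = y + \eta\, \hat h_1(y)$:

```python
import numpy as np

# Hedged sketch of plug-in estimation via Gaussian kernel smoothing:
# estimate q and q' from i.i.d. samples of Y, form h1_hat = q'_hat / q_hat,
# then build T1_hat(y) = y + (sigma^2 / 2) * h1_hat(y).
rng = np.random.default_rng(0)
sigma, n, bw = 0.5, 20_000, 0.3           # noise level, sample size, bandwidth
x = rng.standard_normal(n)                # latent signal X ~ N(0, 1)
y = x + sigma * rng.standard_normal(n)    # observations Y ~ Q

def h1_hat(t: float) -> float:
    """Plug-in estimate of q'(t)/q(t) from a Gaussian KDE and its derivative."""
    u = (t - y) / bw
    phi = np.exp(-0.5 * u**2)             # unnormalized Gaussian kernel
    q = phi.sum() / (n * bw)              # ~ q(t), up to the 1/sqrt(2 pi) factor
    dq = (-u * phi).sum() / (n * bw**2)   # ~ q'(t); the factor cancels in the ratio
    return dq / q

def T1_hat(t: float) -> float:
    return t + 0.5 * sigma**2 * h1_hat(t)

# In this Gaussian simulation the true first-order score is
# h_1(t) = -t / (1 + sigma^2), i.e. -0.8 at t = 1.
print(h1_hat(1.0), T1_hat(1.0))
```

The estimate carries the usual KDE smoothing bias (it targets the score of $Q$ convolved with the kernel), which is why bandwidth choice drives the $n^{-4/(2m+5)}$ rates quoted above.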

Relation to Prior Literature

The approach fundamentally differs from classical empirical Bayes and shrinkage methods (Stein, James-Stein, etc.), which either assume prior knowledge or estimate the unobserved $P$ (g-modeling) before denoising. The current paper demonstrates that direct modeling and higher-order estimation in the observation space ($Y$) can recover the optimal transport map in distributional metrics without explicit prior estimation.

Prior work (e.g., [liang2025distributional]) analyzed lower-order OT denoisers or imposed structural assumptions on the prior. The present work generalizes the analytic framework to arbitrary order, makes explicit the combinatorics underpinning the construction, and characterizes the statistical estimation landscape for these higher-order objects.

Implications and Future Directions

These results provide a principled methodology for nonparametric, data-driven denoising that explicitly targets distributional reconstruction, with rapidly contracting error in Wasserstein space using only observable statistics.

From a practical standpoint, this framework is highly relevant to modern generative modeling, where distributional metrics (rather than pointwise error) are fundamental—for instance, in diffusion-based models, where denoising score estimators play a central role. The characterizations here suggest a principled route to estimator design with performance guarantees at the optimal transport level.

Theoretically, uncovering the combinatorial Bell polynomial structure behind denoising maps opens directions for further exploration, particularly in higher dimensions (where optimal transport maps are more complex), for other noise models, or in non-Euclidean settings.

Conclusion

The paper introduces and analyzes a hierarchy of distributional denoisers, constructed using higher-order score functions of the observed noisy distribution, with explicit characterization via Bell polynomials and rigorous guarantees on denoising quality in Wasserstein distance. It provides both the mathematical underpinnings and practical estimation tools to achieve distribution-matching denoising, representing a significant extension of distributional shrinkage theory and methodology (2512.09295).
