Non-Asymptotic Excess Risk Bounds
- The paper extends traditional mutual information risk bounds by using generalized divergences (Rényi, α-Jensen-Shannon, Sibson) to derive non-asymptotic excess risk bounds.
- It employs variational characterizations and data-dependent sub-Gaussian parameters to rigorously bound the degradation in prediction performance under noisy or compressed observations.
- These bounds enhance flexibility and tightness, offering practical performance guarantees that can be tighter than standard MI-based bounds in complex, high-variance settings.
Bounds on the Excess Minimum Risk via Generalized Information Divergence Measures
The theory of non-asymptotic excess minimum risk bounds quantifies the degradation in best achievable prediction performance when an estimator relies on a stochastically degraded feature vector (such as a compressed or noise-corrupted observation) rather than the full-information variable. Recent research has advanced these bounds using generalized information-theoretic divergences, extending previous approaches that depended solely on mutual information and assumed constant sub-Gaussianity. This article provides a comprehensive account of such generalized excess risk bounds, centering on bounds expressed through the Rényi divergence, the α-Jensen-Shannon divergence, and Sibson's mutual information, and compares these results with existing approaches, highlighting their advantages in flexibility, tightness, and generality.
1. Problem Setting and Excess Risk Formulation
Let random vectors $Y$, $X$, and $Z$ form a Markov chain ($Y \to X \to Z$), where $Y$ is the target variable to be estimated from $X$ or from the degraded observation $Z$. For a loss function $l$, the excess minimum risk is
$L^*_l(Y|Z) - L^*_l(Y|X),$
where $L^*_l(Y|X) = \inf_{g} \E[l(Y, g(X))]$ denotes the minimum possible expected loss for predicting $Y$ from $X$ (with $L^*_l(Y|Z)$ defined analogously). The central goal is to upper bound this excess in terms of divergences between the conditional distributions $P_{X|Y,Z}$ and $P_{X|Z}$, while allowing a general (possibly non-constant) sub-Gaussian parameter for the loss.
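It is worth making explicit why the excess minimum risk is non-negative, since it is the quantity everything else bounds. The short derivation below is a standard argument (a sketch, not quoted from the paper) using only the Markov property $Y \to X \to Z$, i.e. $P_{Y|X,Z} = P_{Y|X}$.

```latex
% Non-negativity of the excess minimum risk under the Markov chain Y -> X -> Z.
% For any estimator h of Y based on Z, condition on (X, Z) and use P_{Y|X,Z} = P_{Y|X}:
\[
\mathbb{E}\big[l(Y, h(Z))\big]
  = \mathbb{E}\Big[ \mathbb{E}\big[l(Y, h(Z)) \mid X, Z\big] \Big]
  \;\ge\; \mathbb{E}\Big[ \inf_{\hat{y}} \mathbb{E}\big[l(Y, \hat{y}) \mid X\big] \Big]
  = L^*_l(Y \mid X),
\]
% where the last equality is the usual identification of the Bayes risk with the
% expected conditional Bayes risk (valid under standard measurability conditions).
% Taking the infimum over h gives L^*_l(Y|Z) >= L^*_l(Y|X): the excess minimum risk
% is non-negative and quantifies the performance lost by observing Z instead of X.
```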
2. Non-Asymptotic Excess Risk Bounds via Generalized Divergences
2.1 Conditional Rényi Divergence Bound
Under the condition that, for each $y$, the random variable $l(y, g^*(X))$ (with $g^*$ an optimal estimator of $Y$ from $X$) is conditionally $\sigma(y)$-sub-Gaussian given $Z$, with $\E[\sigma^2(Y)] < \infty$, the main theorem establishes that for any $\alpha \in (0,1)$,
$L^*_l(Y|Z) - L^*_l(Y|X) \leq \sqrt{\frac{2\E[\sigma^2(Y)]}{\alpha} D_\alpha (P_{X|Y,Z} \| P_{X|Z} \mid P_{Y,Z})},$
where $D_\alpha(P_{X|Y,Z} \| P_{X|Z} \mid P_{Y,Z})$ is the conditional Rényi divergence of order $\alpha$.
For bounded loss, say $0 \le l \le M$ (so that $\sigma^2(y) \le M^2/4$ for every $y$), the corollary gives
$L^*_l(Y|Z) - L^*_l(Y|X) \leq M\sqrt{\frac{1}{2\alpha} D_\alpha (P_{X|Y,Z} \| P_{X|Z} \mid P_{Y,Z})}.$
As $\alpha \to 1$, these bounds recover mutual information–based inequalities such as those in Györfi et al. (2023):
$L^*_l(Y|Z) - L^*_l(Y|X) \leq \sqrt{2\,\E[\sigma^2(Y)]\, I(X;Y \mid Z)},$
which, for a constant sub-Gaussian parameter $\sigma$, is exactly the classical mutual information bound.
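The limiting step rests on two standard facts, sketched below for completeness rather than quoted from the paper; the averaging convention used for the conditional divergence is an assumption of this sketch.

```latex
% Two standard facts behind the alpha -> 1 limit.
% First, the Renyi divergence tends to the KL divergence:
\[
\lim_{\alpha \to 1} D_\alpha(P \,\|\, Q) = D_{\mathrm{KL}}(P \,\|\, Q).
\]
% Second, the P_{Y,Z}-averaged KL divergence between P_{X|Y,Z} and P_{X|Z} is the
% conditional mutual information:
\[
\mathbb{E}_{P_{Y,Z}}\!\big[ D_{\mathrm{KL}}\big(P_{X|Y,Z} \,\|\, P_{X|Z}\big) \big]
  = I(X; Y \mid Z).
\]
% Substituting both into the Renyi bound gives the mutual-information bound displayed above.
% (For alpha < 1 the paper's conditional Renyi divergence may use a different averaging
% convention, but the common conventions converge to the same conditional KL limit.)
```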
2.2 Bounds via Conditional α–Jensen-Shannon and Sibson Information
For the α-Jensen-Shannon (JS) divergence, under sub-Gaussianity with respect to a convex mixture of the joint conditional distribution $P_{Y,X|Z}$ and the product of conditional marginals $P_{Y|Z}P_{X|Z}$, the excess risk is bounded by $L^*_l(Y|Z) - L^*_l(Y|X) \leq \sqrt{ \frac{2\E[\sigma^2(Y)]}{\alpha(1-\alpha)} JS_\alpha(P_{Y,X|Z}\,\|\,P_{Y|Z}P_{X|Z}\,|\,P_Z) }.$ Again, for bounded loss $0 \le l \le M$, this becomes $L^*_l(Y|Z) - L^*_l(Y|X) \leq M\sqrt{ \frac{1}{2\alpha(1-\alpha)} JS_\alpha(P_{Y,X|Z}\,\|\,P_{Y|Z}P_{X|Z}\,|\,P_Z) }.$
In the limit $\alpha \to 1$, the JS bound recovers the mutual information result; as $\alpha \to 0$, it yields a bound involving the Lautum information (reverse KL).
Similarly, an excess risk bound is derived in terms of the conditional Sibson mutual information $I_\alpha^S$ (its form is shown in the summary table below). All these bounds recover the mutual information-based upper bound as $\alpha \to 1$.
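For orientation, one parameterization of the α-JS divergence that is consistent with the $\alpha(1-\alpha)$ normalization and the limits just stated is sketched below; the paper's exact convention may differ in notation, so treat this as an assumed form rather than a quotation.

```latex
% One parameterization of the alpha-Jensen-Shannon divergence (assumed form, for
% orientation only). M_alpha is the convex mixture appearing in the sub-Gaussianity condition.
\[
M_\alpha = (1-\alpha)\,P + \alpha\,Q,
\qquad
JS_\alpha(P \,\|\, Q)
  = (1-\alpha)\, D_{\mathrm{KL}}(P \,\|\, M_\alpha)
  + \alpha\, D_{\mathrm{KL}}(Q \,\|\, M_\alpha).
\]
% With P the conditional joint P_{Y,X|Z} and Q the product P_{Y|Z} P_{X|Z}:
%   JS_alpha / (alpha(1-alpha)) -> D_KL(P || Q) = I(Y; X | Z)          as alpha -> 1,
%   JS_alpha / (alpha(1-alpha)) -> D_KL(Q || P)  (Lautum information)  as alpha -> 0,
% matching the two limiting cases described in the text.
```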
2.3 Construction and Proof Techniques
Derivations combine:
- Risk representation via the difference of conditional (or joint vs. product) distributions.
- Variational characterizations of divergences: Donsker-Varadhan-type formulas for the Rényi divergence, with analogous forms for JS and Sibson (see the sketch after this list).
- Conditional and possibly data-dependent sub-Gaussianity assumptions.
- Auxiliary mixture distributions for JS and Sibson via techniques first systematized for generalization error in learning theory [Aminian et al., 2024].
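To make the template concrete in its simplest, classical KL form (this is the textbook argument, not the paper's actual proof; the generalized bounds replace the first step with the variational representation of the Rényi, JS, or Sibson quantity):

```latex
% Donsker-Varadhan variational formula:
%   D_KL(P || Q) = sup_f { E_P[f] - log E_Q[ exp(f) ] }.
% Apply it with f = lambda (g - E_Q[g]) for a function g that is sigma-sub-Gaussian under Q:
\[
\mathbb{E}_P[g] - \mathbb{E}_Q[g]
  \;\le\; \frac{D_{\mathrm{KL}}(P \,\|\, Q)}{\lambda}
        + \frac{1}{\lambda}\log \mathbb{E}_Q\big[e^{\lambda (g - \mathbb{E}_Q[g])}\big]
  \;\le\; \frac{D_{\mathrm{KL}}(P \,\|\, Q)}{\lambda} + \frac{\lambda \sigma^2}{2},
  \qquad \lambda > 0.
\]
% Optimizing over lambda (lambda = sqrt(2 D_KL / sigma^2)) yields
%   E_P[g] - E_Q[g] <= sqrt( 2 sigma^2 D_KL(P || Q) ).
% Applied conditionally with P = P_{X|Y,Z}, Q = P_{X|Z} and g = l(y, g*(.)), then averaged
% over P_{Y,Z}, this produces the square-root-of-divergence shape of the bounds in Section 2.
```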
3. Relation to Prior Work and Advantages
The generalized divergence framework advances earlier results by:
- Removing the restriction that the sub-Gaussian parameter be a global constant (allowing, e.g., $\sigma^2$ to depend on the realization of $Y$).
- Encompassing the entire canonical range $\alpha \in (0,1)$, permitting tighter numerical optimization of the bound in applications.
- Reducing to the standard mutual information bound as $\alpha \to 1$, thus strictly generalizing Györfi et al. (2023), Modak et al. (2021), and Aminian et al. (2024) (see Omanwar et al., 30 May 2025).
- Providing strictly tighter bounds in challenging regimes, such as high-cardinality discrete models and certain heavy-tailed continuous settings.
4. Application Examples and Comparison
Example 1: Symmetric Discrete Channel, Bounded Loss
For a symmetric discrete channel with bounded loss, the α–JS bound is provably tighter than the mutual information bound for intermediate values of α, with a more pronounced advantage as the alphabet size grows (see paper Figures 1–3).
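As a purely illustrative numerical sketch (not the paper's example, which compares the α-JS bound), the script below evaluates the Rényi-divergence bound of Section 2.1 against the mutual information bound on a small symmetric-channel model. The model parameters, the bounded loss, and the averaging convention for the conditional Rényi divergence are all assumptions of this sketch.

```python
# Toy numerical sketch (not the paper's exact example): compare the mutual-information
# bound with the Renyi-divergence bound of Section 2.1 on a small symmetric-channel model.
# Assumptions made here and NOT taken from the paper:
#   * Y -> X -> Z with Y uniform on {0,...,q-1}, X = Y passed through a q-ary symmetric
#     channel with error probability eps1, Z = X passed through another one with eps2;
#   * loss bounded in [0, 1], so sigma^2(y) <= 1/4 for every y (Hoeffding's lemma);
#   * the conditional Renyi divergence is taken as the P_{Y,Z}-average of the pointwise
#     divergences D_alpha(P_{X|y,z} || P_{X|z}); the paper's convention may differ.
import numpy as np

q, eps1, eps2 = 4, 0.1, 0.3

def qsc(eps: float) -> np.ndarray:
    """Transition matrix of a q-ary symmetric channel with error probability eps."""
    K = np.full((q, q), eps / (q - 1))
    np.fill_diagonal(K, 1.0 - eps)
    return K

# Joint pmf p(y, x, z) = p(y) p(x|y) p(z|x)  (Markov chain Y -> X -> Z).
Pyx = qsc(eps1)          # P(x | y)
Pzx = qsc(eps2)          # P(z | x)
p_yxz = np.einsum('yx,xz->yxz', Pyx / q, Pzx)

p_yz = p_yxz.sum(axis=1)                    # P(y, z)
p_xz = p_yxz.sum(axis=0)                    # P(x, z)
p_z = p_xz.sum(axis=0)                      # P(z)
P_x_given_yz = p_yxz / p_yz[:, None, :]     # P(x | y, z)
P_x_given_z = p_xz / p_z[None, :]           # P(x | z)

def renyi(p: np.ndarray, r: np.ndarray, alpha: float) -> float:
    """Renyi divergence D_alpha(p || r) for strictly positive discrete distributions."""
    return float(np.log(np.sum(p**alpha * r**(1.0 - alpha))) / (alpha - 1.0))

def cond_renyi(alpha: float) -> float:
    """P_{Y,Z}-average of D_alpha(P_{X|y,z} || P_{X|z})  (averaging convention assumed)."""
    total = 0.0
    for y in range(q):
        for z in range(q):
            total += p_yz[y, z] * renyi(P_x_given_yz[y, :, z], P_x_given_z[:, z], alpha)
    return total

sigma2 = 0.25                               # loss bounded in [0, 1]  =>  sigma^2 <= 1/4
cond_mi = cond_renyi(1.0 - 1e-9)            # alpha -> 1 recovers I(X;Y|Z) numerically
mi_bound = np.sqrt(2.0 * sigma2 * cond_mi)  # classical MI-based excess risk bound

print(f"I(X;Y|Z) ~= {cond_mi:.4f},  MI bound = {mi_bound:.4f}")
for alpha in (0.25, 0.5, 0.75, 0.99):
    renyi_bound = np.sqrt(2.0 * sigma2 / alpha * cond_renyi(alpha))
    print(f"alpha = {alpha:4.2f}:  Renyi bound = {renyi_bound:.4f}")
# One can also minimize the bound over alpha numerically, using alpha as the
# trade-off dial described in Section 5.
```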
Example 2: Additive Gaussian Model, Heterogeneous Sub-Gaussianity
With a Gaussian target $Y$, observation $X = Y + N_1$, and degraded observation $Z = X + N_2$ for independent Gaussian noise terms $N_1$ and $N_2$ (the natural additive model for the chain $Y \to X \to Z$), together with a heavy-tailed or capped loss, the conditions for the generalized bounds are satisfied, whereas constant-parameter MI-based bounds may be vacuous due to unbounded tails or the lack of uniform control.
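One standard mechanism (a general fact, not specific to the paper) by which a data-dependent sub-Gaussian parameter arises in such capped-loss models is Hoeffding's lemma:

```latex
% Hoeffding's lemma: a random variable with values in [a, b] is ((b-a)/2)-sub-Gaussian.
% If the loss is capped at a y-dependent level c(y), i.e. 0 <= l(y, .) <= c(y), then
\[
\sigma(y) = \frac{c(y)}{2},
\qquad
\mathbb{E}\big[\sigma^2(Y)\big] = \tfrac{1}{4}\,\mathbb{E}\big[c(Y)^2\big] < \infty
\quad \text{whenever } \mathbb{E}\big[c(Y)^2\big] < \infty,
\]
% even if the uncapped loss has heavy tails. A single worst-case constant
% sigma = sup_y c(y)/2 may be infinite or extremely large, which is what can make the
% constant-parameter mutual-information bound vacuous in this setting.
```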
Example 3: Reverse Markov Chain
Swapping the roles of the degraded variable and the label, the method still provides sharper bounds than the mutual information bound for suitable values of $\alpha$.
Summary Table
| Bound Type | Excess Risk Bound Formulation (α ∈ (0,1)) | Bounded-loss corollary? | Allows non-constant sub-Gaussianity? | Recovers MI bound as α → 1? | Often tighter than MI bound? |
|---|---|---|---|---|---|
| Rényi divergence | $\sqrt{\frac{2\E[\sigma^2(Y)]}{\alpha} D_\alpha(\cdot)}$ | Yes | Yes | Yes | Yes, for some α |
| α–Jensen-Shannon | $\sqrt{\frac{2\E[\sigma^2(Y)]}{\alpha(1-\alpha)} JS_\alpha(\cdot)}$ | Yes | Yes | Yes | Yes, for some α |
| Sibson information | $\sqrt{ \frac{ 2((1-\alpha)\sigma^2 + \alpha \E[\Phi_{Y^*|Z}(\gamma^2(Y^*) ) ]) }{\alpha} \E[I_\alpha^S(\cdot)]}$ | Yes | Yes | Yes | Sometimes |
5. Flexibility, Practicality, and Impact
- Sub-Gaussian parameters may be data-dependent random variables, accommodating much broader classes of models (including those with unbounded losses or unbounded supports).
- The order parameter $\alpha$ serves as a trade-off dial, allowing practitioners to numerically optimize the upper bound for their scenario.
- In practice, for large-alphabet channels or continuous models with non-constant noise, these generalized divergence bounds can offer substantially tighter excess risk guarantees, as shown via simulation in the paper.
- Practitioners are no longer restricted to crude worst-case sub-Gaussian constants, which can render classical mutual-information bounds vacuous in realistic high-variance settings.
6. Concluding Remarks
The generalized divergence approach to excess minimum risk delivers a unifying, parameterized family of non-asymptotic risk bounds. These strictly encompass and frequently improve upon the standard mutual information bounds by modulating both the divergence and the moment assumptions. The framework covers the Rényi, Jensen-Shannon, and Sibson mutual informations, and is more flexible and frequently sharper, as supported by both the theory and the paper's numerical examples. It stands as a significant advance beyond lossless or MI-only generalization error analysis, with broad relevance to information theory, generalization in learning, and downstream inference under measurement or communication constraints.
References
- Györfi, L., et al. "Excess Risk Bounds in Statistical Inference via Mutual Information," Entropy, 2023.
- Modak, S., et al., IEEE Information Theory Workshop (ITW), 2021.
- Aminian, M., et al., IEEE Journal on Selected Areas in Information Theory (JSAIT), 2024.
- Omanwar et al., "Bounds on the Excess Minimum Risk via Generalized Information Divergence Measures," 30 May 2025.
- Donsker, M. D. and Varadhan, S. R. S., "Asymptotic evaluation of certain Markov process expectations for large time," 1975.