Non-Asymptotic Excess Risk Bounds
- The paper extends traditional mutual information risk bounds by using generalized divergences (Rényi, α-Jensen-Shannon, Sibson) to derive non-asymptotic excess risk bounds.
- It employs variational characterizations and data-dependent sub-Gaussian parameters to rigorously bound the degradation in prediction performance under noisy or compressed observations.
- These bounds enhance flexibility and tightness, offering practical performance guarantees that can be tighter than standard MI-based bounds in complex, high-variance settings.
Bounds on the Excess Minimum Risk via Generalized Information Divergence Measures
The theory of non-asymptotic excess minimum risk bounds quantifies the degradation in best achievable prediction performance when an estimator relies on a stochastically degraded feature vector (such as a compressed or noise-corrupted observation) rather than the full-information variable. Recent research has advanced these bounds using generalized information-theoretic divergences, extending previous approaches that depended solely on mutual information and assumed constant sub-Gaussianity. This article provides a comprehensive account of such generalized excess risk bounds, centering on bounds expressed through the Rényi divergence, the α-Jensen-Shannon divergence, and Sibson's mutual information, and compares these results with existing approaches, highlighting their advantages in flexibility, tightness, and generality.
1. Problem Setting and Excess Risk Formulation
Let random vectors $Y$, $X$, and $Z$ form a Markov chain ($Y \to X \to Z$), where $Y$ is the target variable to be estimated from $X$ or from the degraded observation $Z$. For a loss function $l$, the excess minimum risk is
$L^*_l(Y|Z) - L^*_l(Y|X),$
where $L^*_l(Y|X) = \inf_{g} \E[l(Y, g(X))]$ denotes the minimum possible expected loss for predicting $Y$ from $X$ (with $L^*_l(Y|Z)$ defined analogously). The central goal is to upper bound this excess in terms of divergences between the conditional distributions $P_{X|Y,Z}$ and $P_{X|Z}$, while allowing a general (possibly non-constant) sub-Gaussian parameter for the loss.
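It is worth making explicit why the excess minimum risk is non-negative, since it is the quantity everything else bounds. The short derivation below is a standard argument (a sketch, not quoted from the paper) using only the Markov property $Y \to X \to Z$, i.e. $P_{Y|X,Z} = P_{Y|X}$.

```latex
% Non-negativity of the excess minimum risk under the Markov chain Y -> X -> Z.
% For any estimator h of Y based on Z, condition on (X, Z) and use P_{Y|X,Z} = P_{Y|X}:
\[
\mathbb{E}\big[l(Y, h(Z))\big]
  = \mathbb{E}\Big[ \mathbb{E}\big[l(Y, h(Z)) \mid X, Z\big] \Big]
  \;\ge\; \mathbb{E}\Big[ \inf_{\hat{y}} \mathbb{E}\big[l(Y, \hat{y}) \mid X\big] \Big]
  = L^*_l(Y \mid X),
\]
% where the last equality is the usual identification of the Bayes risk with the
% expected conditional Bayes risk (valid under standard measurability conditions).
% Taking the infimum over h gives L^*_l(Y|Z) >= L^*_l(Y|X): the excess minimum risk
% is non-negative and quantifies the performance lost by observing Z instead of X.
```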
2. Non-Asymptotic Excess Risk Bounds via Generalized Divergences
2.1 Conditional Rényi Divergence Bound
Under the condition that, for each $y$, the random variable $l(y, g^*(X))$ (with $g^*$ an optimal estimator of $Y$ from $X$) is conditionally $\sigma(y)$-sub-Gaussian given $Z$, with $\E[\sigma^2(Y)] < \infty$, the main theorem establishes that for any $\alpha \in (0,1)$,
$L^*_l(Y|Z) - L^*_l(Y|X) \leq \sqrt{\frac{2\E[\sigma^2(Y)]}{\alpha} D_\alpha (P_{X|Y,Z} \| P_{X|Z} \mid P_{Y,Z})},$
where $D_\alpha(P_{X|Y,Z} \| P_{X|Z} \mid P_{Y,Z})$ is the conditional Rényi divergence of order $\alpha$.
For bounded loss, say $0 \le l \le M$ (so that $\sigma^2(y) \le M^2/4$ for every $y$), the corollary gives
$L^*_l(Y|Z) - L^*_l(Y|X) \leq M\sqrt{\frac{1}{2\alpha} D_\alpha (P_{X|Y,Z} \| P_{X|Z} \mid P_{Y,Z})}.$
As $\alpha \to 1$, these bounds recover mutual information–based inequalities such as those in Györfi et al. (2023):
$L^*_l(Y|Z) - L^*_l(Y|X) \leq \sqrt{2\,\E[\sigma^2(Y)]\, I(X;Y \mid Z)},$
which, for a constant sub-Gaussian parameter $\sigma$, is exactly the classical mutual information bound.
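The limiting step rests on two standard facts, sketched below for completeness rather than quoted from the paper; the averaging convention used for the conditional divergence is an assumption of this sketch.

```latex
% Two standard facts behind the alpha -> 1 limit.
% First, the Renyi divergence tends to the KL divergence:
\[
\lim_{\alpha \to 1} D_\alpha(P \,\|\, Q) = D_{\mathrm{KL}}(P \,\|\, Q).
\]
% Second, the P_{Y,Z}-averaged KL divergence between P_{X|Y,Z} and P_{X|Z} is the
% conditional mutual information:
\[
\mathbb{E}_{P_{Y,Z}}\!\big[ D_{\mathrm{KL}}\big(P_{X|Y,Z} \,\|\, P_{X|Z}\big) \big]
  = I(X; Y \mid Z).
\]
% Substituting both into the Renyi bound gives the mutual-information bound displayed above.
% (For alpha < 1 the paper's conditional Renyi divergence may use a different averaging
% convention, but the common conventions converge to the same conditional KL limit.)
```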
2.2 Bounds via Conditional α–Jensen-Shannon and Sibson Information
For the α-Jensen-Shannon (JS) divergence, under sub-Gaussianity with respect to a convex mixture of the joint conditional distribution $P_{Y,X|Z}$ and the product of conditional marginals $P_{Y|Z}P_{X|Z}$, the excess risk is bounded by $L^*_l(Y|Z) - L^*_l(Y|X) \leq \sqrt{ \frac{2\E[\sigma^2(Y)]}{\alpha(1-\alpha)} JS_\alpha(P_{Y,X|Z}\,\|\,P_{Y|Z}P_{X|Z}\,|\,P_Z) }.$ Again, for bounded loss $0 \le l \le M$, this becomes $L^*_l(Y|Z) - L^*_l(Y|X) \leq M\sqrt{ \frac{1}{2\alpha(1-\alpha)} JS_\alpha(P_{Y,X|Z}\,\|\,P_{Y|Z}P_{X|Z}\,|\,P_Z) }.$
In the limit $\alpha \to 1$, the JS bound recovers the mutual information result; as $\alpha \to 0$, it yields a bound involving the Lautum information (reverse KL).
Similarly, an excess risk bound is derived in terms of the conditional Sibson mutual information $I_\alpha^S$ (its form is shown in the summary table below). All these bounds recover the mutual information-based upper bound as $\alpha \to 1$.
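For orientation, one parameterization of the α-JS divergence that is consistent with the $\alpha(1-\alpha)$ normalization and the limits just stated is sketched below; the paper's exact convention may differ in notation, so treat this as an assumed form rather than a quotation.

```latex
% One parameterization of the alpha-Jensen-Shannon divergence (assumed form, for
% orientation only). M_alpha is the convex mixture appearing in the sub-Gaussianity condition.
\[
M_\alpha = (1-\alpha)\,P + \alpha\,Q,
\qquad
JS_\alpha(P \,\|\, Q)
  = (1-\alpha)\, D_{\mathrm{KL}}(P \,\|\, M_\alpha)
  + \alpha\, D_{\mathrm{KL}}(Q \,\|\, M_\alpha).
\]
% With P the conditional joint P_{Y,X|Z} and Q the product P_{Y|Z} P_{X|Z}:
%   JS_alpha / (alpha(1-alpha)) -> D_KL(P || Q) = I(Y; X | Z)          as alpha -> 1,
%   JS_alpha / (alpha(1-alpha)) -> D_KL(Q || P)  (Lautum information)  as alpha -> 0,
% matching the two limiting cases described in the text.
```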
2.3 Construction and Proof Techniques
Derivations combine:
- Risk representation via the difference of conditional (or joint vs. product) distributions.
- Variational characterizations of divergences: Donsker-Varadhan-type formulas for the Rényi divergence, with analogous forms for JS and Sibson (see the sketch after this list).
- Conditional and possibly data-dependent sub-Gaussianity assumptions.
- Auxiliary mixture distributions for JS and Sibson via techniques first systematized for generalization error in learning theory [Aminian et al., 2024].
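To make the template concrete in its simplest, classical KL form (this is the textbook argument, not the paper's actual proof; the generalized bounds replace the first step with the variational representation of the Rényi, JS, or Sibson quantity):

```latex
% Donsker-Varadhan variational formula:
%   D_KL(P || Q) = sup_f { E_P[f] - log E_Q[ exp(f) ] }.
% Apply it with f = lambda (g - E_Q[g]) for a function g that is sigma-sub-Gaussian under Q:
\[
\mathbb{E}_P[g] - \mathbb{E}_Q[g]
  \;\le\; \frac{D_{\mathrm{KL}}(P \,\|\, Q)}{\lambda}
        + \frac{1}{\lambda}\log \mathbb{E}_Q\big[e^{\lambda (g - \mathbb{E}_Q[g])}\big]
  \;\le\; \frac{D_{\mathrm{KL}}(P \,\|\, Q)}{\lambda} + \frac{\lambda \sigma^2}{2},
  \qquad \lambda > 0.
\]
% Optimizing over lambda (lambda = sqrt(2 D_KL / sigma^2)) yields
%   E_P[g] - E_Q[g] <= sqrt( 2 sigma^2 D_KL(P || Q) ).
% Applied conditionally with P = P_{X|Y,Z}, Q = P_{X|Z} and g = l(y, g*(.)), then averaged
% over P_{Y,Z}, this produces the square-root-of-divergence shape of the bounds in Section 2.
```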
3. Relation to Prior Work and Advantages
The generalized divergence framework advances earlier results by:
- Removing the restriction that the sub-Gaussian parameter be a global constant (allowing, e.g., $\sigma^2$ to depend on the realization of $Y$).
- Encompassing the entire canonical range $\alpha \in (0,1)$, permitting tighter numerical optimization of the bound in applications.
- Reducing to the standard mutual information bound as $\alpha \to 1$, thus strictly generalizing Györfi et al. (2023), Modak et al. (2021), and Aminian et al. (2024) (see Omanwar et al., 30 May 2025).
- Providing strictly tighter bounds in challenging regimes, such as high-cardinality discrete models and certain heavy-tailed continuous settings.
4. Application Examples and Comparison
Example 1: Symmetric Discrete Channel, Bounded Loss
For a symmetric discrete channel with bounded loss, the α–JS bound is provably tighter than the mutual information bound for intermediate values of α, with a more pronounced advantage as the alphabet size grows (see paper Figures 1–3).
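As a purely illustrative numerical sketch (not the paper's example, which compares the α-JS bound), the script below evaluates the Rényi-divergence bound of Section 2.1 against the mutual information bound on a small symmetric-channel model. The model parameters, the bounded loss, and the averaging convention for the conditional Rényi divergence are all assumptions of this sketch.

```python
# Toy numerical sketch (not the paper's exact example): compare the mutual-information
# bound with the Renyi-divergence bound of Section 2.1 on a small symmetric-channel model.
# Assumptions made here and NOT taken from the paper:
#   * Y -> X -> Z with Y uniform on {0,...,q-1}, X = Y passed through a q-ary symmetric
#     channel with error probability eps1, Z = X passed through another one with eps2;
#   * loss bounded in [0, 1], so sigma^2(y) <= 1/4 for every y (Hoeffding's lemma);
#   * the conditional Renyi divergence is taken as the P_{Y,Z}-average of the pointwise
#     divergences D_alpha(P_{X|y,z} || P_{X|z}); the paper's convention may differ.
import numpy as np

q, eps1, eps2 = 4, 0.1, 0.3

def qsc(eps: float) -> np.ndarray:
    """Transition matrix of a q-ary symmetric channel with error probability eps."""
    K = np.full((q, q), eps / (q - 1))
    np.fill_diagonal(K, 1.0 - eps)
    return K

# Joint pmf p(y, x, z) = p(y) p(x|y) p(z|x)  (Markov chain Y -> X -> Z).
Pyx = qsc(eps1)          # P(x | y)
Pzx = qsc(eps2)          # P(z | x)
p_yxz = np.einsum('yx,xz->yxz', Pyx / q, Pzx)

p_yz = p_yxz.sum(axis=1)                    # P(y, z)
p_xz = p_yxz.sum(axis=0)                    # P(x, z)
p_z = p_xz.sum(axis=0)                      # P(z)
P_x_given_yz = p_yxz / p_yz[:, None, :]     # P(x | y, z)
P_x_given_z = p_xz / p_z[None, :]           # P(x | z)

def renyi(p: np.ndarray, r: np.ndarray, alpha: float) -> float:
    """Renyi divergence D_alpha(p || r) for strictly positive discrete distributions."""
    return float(np.log(np.sum(p**alpha * r**(1.0 - alpha))) / (alpha - 1.0))

def cond_renyi(alpha: float) -> float:
    """P_{Y,Z}-average of D_alpha(P_{X|y,z} || P_{X|z})  (averaging convention assumed)."""
    total = 0.0
    for y in range(q):
        for z in range(q):
            total += p_yz[y, z] * renyi(P_x_given_yz[y, :, z], P_x_given_z[:, z], alpha)
    return total

sigma2 = 0.25                               # loss bounded in [0, 1]  =>  sigma^2 <= 1/4
cond_mi = cond_renyi(1.0 - 1e-9)            # alpha -> 1 recovers I(X;Y|Z) numerically
mi_bound = np.sqrt(2.0 * sigma2 * cond_mi)  # classical MI-based excess risk bound

print(f"I(X;Y|Z) ~= {cond_mi:.4f},  MI bound = {mi_bound:.4f}")
for alpha in (0.25, 0.5, 0.75, 0.99):
    renyi_bound = np.sqrt(2.0 * sigma2 / alpha * cond_renyi(alpha))
    print(f"alpha = {alpha:4.2f}:  Renyi bound = {renyi_bound:.4f}")
# One can also minimize the bound over alpha numerically, using alpha as the
# trade-off dial described in Section 5.
```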
Example 2: Additive Gaussian Model, Heterogeneous Sub-Gaussianity
With a Gaussian target $Y$, observation $X = Y + N_1$, and degraded observation $Z = X + N_2$ for independent Gaussian noise terms $N_1$ and $N_2$ (the natural additive model for the chain $Y \to X \to Z$), together with a heavy-tailed or capped loss, the conditions for the generalized bounds are satisfied, whereas constant-parameter MI-based bounds may be vacuous due to unbounded tails or the lack of uniform control.
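One standard mechanism (a general fact, not specific to the paper) by which a data-dependent sub-Gaussian parameter arises in such capped-loss models is Hoeffding's lemma:

```latex
% Hoeffding's lemma: a random variable with values in [a, b] is ((b-a)/2)-sub-Gaussian.
% If the loss is capped at a y-dependent level c(y), i.e. 0 <= l(y, .) <= c(y), then
\[
\sigma(y) = \frac{c(y)}{2},
\qquad
\mathbb{E}\big[\sigma^2(Y)\big] = \tfrac{1}{4}\,\mathbb{E}\big[c(Y)^2\big] < \infty
\quad \text{whenever } \mathbb{E}\big[c(Y)^2\big] < \infty,
\]
% even if the uncapped loss has heavy tails. A single worst-case constant
% sigma = sup_y c(y)/2 may be infinite or extremely large, which is what can make the
% constant-parameter mutual-information bound vacuous in this setting.
```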
Example 3: Reverse Markov Chain
Swapping the roles of the degraded variable and the label, the method still provides sharper bounds than the mutual information bound for suitable values of $\alpha$.
Summary Table
| Bound Type | Excess Risk Bound Formulation (α ∈ (0,1)) | Bounded-loss corollary? | Allows non-constant sub-Gaussianity? | Recovers MI bound as α → 1? | Often tighter than MI bound? |
|---|---|---|---|---|---|
| Rényi divergence | $\sqrt{\frac{2\E[\sigma^2(Y)]}{\alpha} D_\alpha(\cdot)}$ | Yes | Yes | Yes | Yes, for some α |
| α–Jensen-Shannon | $\sqrt{\frac{2\E[\sigma^2(Y)]}{\alpha(1-\alpha)} JS_\alpha(\cdot)}$ | Yes | Yes | Yes | Yes, for some α |
| Sibson information | $\sqrt{ \frac{ 2((1-\alpha)\sigma^2 + \alpha \E[\Phi_{Y^*|Z}(\gamma^2(Y^*) ) ]) }{\alpha} \E[I_\alpha^S(\cdot)]}$ | Yes | Yes | Yes | Sometimes |
5. Flexibility, Practicality, and Impact
- Sub-Gaussian parameters may be data-dependent random variables, accommodating much broader classes of models (including those with unbounded losses or unbounded supports).
- The order parameter $\alpha$ serves as a trade-off dial, allowing practitioners to numerically optimize the upper bound for their scenario.
- In practice, for large-alphabet channels or continuous models with non-constant noise, these generalized divergence bounds can offer substantially tighter excess risk guarantees, as shown via simulation in the paper.
- Practitioners are no longer restricted to crude worst-case sub-Gaussian constants, which can render classical mutual-information bounds vacuous in realistic high-variance settings.
6. Concluding Remarks
The generalized divergence approach to excess minimum risk delivers a unifying, parameterized family of non-asymptotic risk bounds. These strictly encompass and frequently improve upon the standard mutual information bounds by modulating both the divergence and the moment assumptions. The framework covers the Rényi, Jensen-Shannon, and Sibson mutual informations, and is more flexible and frequently sharper, as supported by both the theory and the paper's numerical examples. It stands as a significant advance beyond lossless or MI-only generalization error analysis, with broad relevance to information theory, generalization in learning, and downstream inference under measurement or communication constraints.
References
- Györfi, L., et al. "Excess Risk Bounds in Statistical Inference via Mutual Information," Entropy, 2023.
- Modak, S., et al., IEEE Information Theory Workshop (ITW), 2021.
- Aminian, M., et al., IEEE Journal on Selected Areas in Information Theory (JSAIT), 2024.
- Omanwar et al., "Bounds on the Excess Minimum Risk via Generalized Information Divergence Measures," 30 May 2025.
- Donsker, M. D. and Varadhan, S. R. S., "Asymptotic evaluation of certain Markov process expectations for large time," 1975.