KL Rescaling-Invariant PAC-Bayes Bounds
- The paper introduces a framework that integrates KL divergence with rescaling invariance to derive non-vacuous PAC-Bayes risk bounds.
- It employs stability analysis and instance-dependent priors to achieve tight, data-driven generalization guarantees, notably in overparameterized models.
- The approach is validated across randomized SVMs, robust estimators, stochastic neural networks, and deep ReLU classifiers, demonstrating its broad applicability.
KL-Based Rescaling-Invariant PAC-Bayes is a line of theory and algorithmic methodology for generalization risk certification in statistical learning that systematically integrates Kullback–Leibler (KL) divergence into PAC-Bayes risk bounds in a manner that is invariant under rescaling transformations of the loss, model parameters, or hypothesis representations. This approach combines stability analysis, data-dependent or instance-dependent priors, and rescaling strategies—leading to risk bounds that are sensitive to algorithmic stability and intrinsic problem geometry, and often yield non-vacuous guarantees even for high-dimensional predictors or overparameterized models. The theory is realized in several domains: randomized SVMs, robust estimators under heavy tails, stochastic neural networks, and functionally invariant representations for deep ReLU classifiers.
1. Foundations: PAC-Bayes Bounds and KL Divergence
Classical PAC-Bayes theory provides high-probability upper bounds on the expected generalization risk of randomized predictors, balancing empirical error with a complexity penalty often instantiated as KL(Q‖P), the divergence between a data-dependent posterior Q and a fixed or data-dependent prior P. The canonical form is:
R(Q) ≤ R_S(Q) + √[(KL(Q‖P) + f(n, δ)) / (2n)],
where R(Q) is the true risk, R_S(Q) the empirical risk on the sample S of size n, and f(n, δ) encapsulates sample and confidence parameters. The use of KL divergence allows the theory to account for the informational complexity of moving from a prior belief to a posterior concentrated around a learned hypothesis.
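To make the role of each quantity concrete, the following minimal Python sketch evaluates a bound of this shape, instantiating f(n, δ) as ln(2√n/δ) (the Maurer–McAllester form); the function name and the example numbers are illustrative, not taken from any particular paper.

```python
import math

def pac_bayes_bound(emp_risk: float, kl: float, n: int, delta: float) -> float:
    """McAllester/Maurer-style PAC-Bayes bound:
    R(Q) <= R_S(Q) + sqrt((KL(Q||P) + ln(2*sqrt(n)/delta)) / (2n)),
    holding with probability at least 1 - delta over the sample S."""
    f_n_delta = math.log(2.0 * math.sqrt(n) / delta)
    return emp_risk + math.sqrt((kl + f_n_delta) / (2.0 * n))

# Illustrative numbers: 10,000 samples, 5% empirical error, KL(Q||P) = 20 nats, delta = 0.05.
print(pac_bayes_bound(emp_risk=0.05, kl=20.0, n=10_000, delta=0.05))
```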
However, the influence of rescaling—whether in the loss landscape, parameterizations (such as for deep ReLU networks), or instance-dependent geometry—means that the KL-based complexity can overestimate the true generalization challenge if it does not properly adapt to inherent problem invariances. This motivates the development of rescaling-invariant PAC-Bayes approaches.
2. Instance-Dependent Priors and Algorithmic Stability
A key methodology is the calibration of the prior not as a fixed, uninformative distribution, but as an instance-dependent or oracle prior reflecting the distribution of the algorithm's hypothesis outputs. For a learning algorithm A with output A(S) in a Hilbert space ℋ, stability is quantified by a hypothesis sensitivity coefficient β_n:
‖A(S) − A(S′)‖_ℋ ≤ β_n whenever the size-n samples S and S′ differ in a single example.
With this, a Gaussian prior P = N(E_S[A(S)], σ²I) centered at the expected output and a Gaussian posterior Q = N(A(S), σ²I) centered at the empirical output yield:
KL(Q‖P) = ‖A(S) − E_S[A(S)]‖² / (2σ²).
The concentration induced by stability allows upper-bounding the KL term in the PAC-Bayes inequality by a function of β_n, giving risk bounds that are not only tighter but also sharply reflect the algorithm's sensitivity to resampling. In the case of SVMs with regularization parameter λ, whose sensitivity coefficient scales as β_n = O(1/(λn)), this specialization gives nontrivial, non-vacuous bounds with an explicit, computable complexity term.
This represents a PAC-Bayes bound whose KL-based penalty is adaptively "rescaled" to the intrinsic stability (and regularized complexity) of the algorithm (Rivasplata et al., 2018).
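The numpy sketch below illustrates the mechanics under simplifying assumptions: both Gaussians share the same variance σ², and the deviation ‖A(S) − E[A(S)]‖ is taken to concentrate at the O(β_n·√n) scale implied by a per-example sensitivity of β_n. The helper names, the omitted constants, and the numerical example are hypothetical rather than the exact construction of Rivasplata et al. (2018).

```python
import numpy as np

def isotropic_gaussian_kl(w_post: np.ndarray, w_prior: np.ndarray, sigma: float) -> float:
    """KL( N(w_post, sigma^2 I) || N(w_prior, sigma^2 I) ) = ||w_post - w_prior||^2 / (2 sigma^2)."""
    diff = w_post - w_prior
    return float(diff @ diff) / (2.0 * sigma ** 2)

def stability_kl_proxy(beta_n: float, n: int, sigma: float) -> float:
    """Crude proxy for the KL term when ||A(S) - E[A(S)]|| is taken to concentrate
    at the O(beta_n * sqrt(n)) scale implied by per-example sensitivity beta_n
    (a simplification of the concentration argument; constants omitted)."""
    return (beta_n * np.sqrt(n)) ** 2 / (2.0 * sigma ** 2)

# SVM-style illustration: beta_n ~ 1/(lam * n) keeps the KL proxy of order
# 1/(lam^2 * n * sigma^2), which vanishes as n grows, so the bound stays non-vacuous.
lam, n, sigma = 0.1, 10_000, 1.0
print(stability_kl_proxy(beta_n=1.0 / (lam * n), n=n, sigma=sigma))
```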
3. Robustness, Rescaling, and Heavy-Tailed Losses
When empirical averages may be misleading due to heavy-tailed losses, the approach is generalized by replacing the empirical mean with a robust, loss-rescaled estimator:
R̂_S(h) = (s/n) Σ_{i=1}^n ψ(ℓ(h; z_i)/s),
where ψ is a soft-truncation (Catoni-type) function and s > 0 is a scale parameter selected according to an upper bound M₂ on the second moment of the losses. This estimator enables PAC-Bayes bounds that require only the first three moments of the loss and achieve near-sub-Gaussian deviation rates.
The corresponding Gibbs posterior, dQ̂(h) ∝ exp(−γ R̂_S(h)) dP(h) for an inverse temperature γ > 0, retains the rescaling-invariance property: losses enter only through the dimensionless ratios ℓ/s, so the resulting statistical error bound is nearly invariant to the overall loss scale (Holland, 2019).
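Below is a minimal sketch of such an estimator, using one standard Catoni-type choice of ψ and one common heuristic for the scale s; the specific ψ and scale rule are assumptions for illustration and may differ in detail from (Holland, 2019).

```python
import numpy as np

def catoni_psi(x: np.ndarray) -> np.ndarray:
    """One standard Catoni-type soft truncation:
    psi(x) = log(1 + x + x^2/2) for x >= 0 and -log(1 - x + x^2/2) for x < 0.
    It behaves like the identity near zero but grows only logarithmically."""
    return np.where(x >= 0.0,
                    np.log1p(x + 0.5 * x ** 2),
                    -np.log1p(-x + 0.5 * x ** 2))

def robust_risk(losses: np.ndarray, s: float) -> float:
    """Loss-rescaled robust mean: (s/n) * sum_i psi(loss_i / s)."""
    return float(s * np.mean(catoni_psi(losses / s)))

# Heavy-tailed illustration: the soft truncation damps the influence of extreme
# losses relative to the plain empirical mean.
rng = np.random.default_rng(0)
losses = np.abs(rng.standard_t(df=2.5, size=1000))            # heavy-tailed losses
m2 = 5.0                                  # E[loss^2] for t_{2.5} is df/(df-2) = 5
s = np.sqrt(len(losses) * m2 / (2.0 * np.log(1.0 / 0.05)))    # one common scale heuristic
print(float(np.mean(losses)), robust_risk(losses, s))
```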
4. Rescaling-Invariant Complexity via Invariant Lifts and Functional Representations
For systems with inherent parameter redundancy—such as ReLU networks, where the mapping from weight space to function space is non-injective—a KL-based PAC-Bayes analysis in parameter space can be vacuous or misleading. The rescaling-invariant framework proposes using "lifts": measurable mappings φ from weight space to a representation space, chosen so that the prediction function implemented by weights w is determined solely by φ(w). Pushing the posterior and prior forward through φ, the complexity is measured by KL(φ_#Q ‖ φ_#P).
This results in bounds that are invariant to symmetry-induced parameter transformations and thus measure only the "effective" generalization complexity. Theoretical work establishes an entire chain of inequalities comparing the standard parameter-space KL, the deterministic and stochastic rescaling-invariant KLs, and the lift-induced KL, with the lift-induced term always being the smallest (Rouchouse et al., 30 Sep 2025), schematically:
KL(φ_#Q ‖ φ_#P) ≤ KL_inv-stoch(Q‖P) ≤ KL_inv-det(Q‖P) ≤ KL(Q‖P).
| KL term | Representation | Properties |
|---|---|---|
| KL(Q‖P) | Weight space | Standard, may overcount symmetry-redundant directions |
| KL_inv-det / KL_inv-stoch | Rescaled weight spaces | Collapses scaling symmetry |
| KL(φ_#Q ‖ φ_#P) | Lifted/invariant space | Smallest, maximally invariant |
A practical deterministic rescaling proxy is computable via strictly convex optimization (e.g., block coordinate descent in log-parameters for Gaussian priors/posteriors), yielding non-vacuous bounds in overparameterized networks.
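The sketch below is not the block-coordinate-descent algorithm referenced above; it is a minimal numpy illustration for a single two-layer ReLU rescaling symmetry (W1, W2) → (c·W1, W2/c), minimizing the diagonal-Gaussian KL over log c by grid search. All shapes, variances, and the grid range are illustrative assumptions.

```python
import numpy as np

def diag_gauss_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) for diagonal Gaussians."""
    return 0.5 * float(np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

def rescaled_kl(log_c, mu_q1, var_q1, mu_q2, var_q2, mu_p1, var_p1, mu_p2, var_p2):
    """KL after applying the function-preserving ReLU rescaling (W1, W2) -> (c*W1, W2/c)
    to the posterior, with c = exp(log_c); the prior is left untouched."""
    c = np.exp(log_c)
    kl1 = diag_gauss_kl(c * mu_q1, c ** 2 * var_q1, mu_p1, var_p1)
    kl2 = diag_gauss_kl(mu_q2 / c, var_q2 / c ** 2, mu_p2, var_p2)
    return kl1 + kl2

def invariant_kl_proxy(params, log_c_grid=np.linspace(-3.0, 3.0, 601)):
    """Deterministic rescaling-invariant proxy: minimise the KL over a grid of log c."""
    return min(rescaled_kl(t, *params) for t in log_c_grid)

# Toy example: a badly scaled posterior whose KL to the prior shrinks once rescaled.
rng = np.random.default_rng(0)
d1, d2 = 50, 50
params = (10.0 * rng.normal(size=d1), np.full(d1, 1e-2),   # layer-1 posterior mean, variance
          0.1 * rng.normal(size=d2), np.full(d2, 1e-4),    # layer-2 posterior mean, variance
          np.zeros(d1), np.ones(d1),                       # layer-1 prior mean, variance
          np.zeros(d2), np.ones(d2))                       # layer-2 prior mean, variance
print(rescaled_kl(0.0, *params), invariant_kl_proxy(params))
```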
5. Data-Dependent Priors and Empirical Improvements
Further advances exploit the ability to tune priors in a data-dependent or partially data-dependent way. For nonconvex learners (e.g., deep neural nets trained by SGD), data-dependent oracle priors—priors defined as the conditional expectation of the posterior given a subset of the data, which minimizes the expected KL penalty—can dramatically reduce the KL term. This yields sharper, sometimes non-vacuous, generalization bounds even in difficult high-dimensional regimes and for nonconvex models.
Empirical tests show that using such priors with appropriate partitioning (tuning the fraction of data for the prior) can reduce PAC-Bayes generalization error bounds from values as high as 46% to approximately 11% on MNIST, and the optimal data split ratio depends intricately on architecture and training batch size (Dziugaite et al., 2020).
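A schematic Python sketch of the data-splitting recipe follows; the callables fit, risk, and kl_between are hypothetical stand-ins for the user's training procedure, risk evaluation, and KL computation. The essential point it encodes is that the bound must be evaluated only on the data the prior never saw.

```python
import math

def pac_bayes_bound(emp_risk, kl, n, delta):
    """Same Maurer/McAllester form as in the earlier sketch."""
    return emp_risk + math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))

def split_data_bound(X, y, alpha, delta, fit, risk, kl_between):
    """Data-dependent prior via splitting (schematic):
      * the first alpha-fraction of S is used only to build the prior,
      * the posterior may use all of S,
      * the bound is evaluated on the remaining (1 - alpha)-fraction,
        which is exactly the data the prior has never seen.
    `fit(X, y)`, `risk(model, X, y)`, and `kl_between(post, prior)` are
    hypothetical user-supplied callables."""
    n = len(y)
    m = int(alpha * n)
    prior = fit(X[:m], y[:m])              # prior centred on a run over the prefix only
    posterior = fit(X, y)                  # posterior trained on all the data
    emp = risk(posterior, X[m:], y[m:])    # empirical risk on the held-out part only
    return pac_bayes_bound(emp, kl_between(posterior, prior), n - m, delta)
```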
6. Robustness and Generalization in Heavy-Tailed and Multiclass Settings
KL-based rescaling-invariant PAC-Bayes bounds have also been extended to control entire error vectors (for multiclass or structured losses) rather than a single scalar. The key result bounds the KL divergence between the empirical and true error vectors, viewed as distributions over the error types induced by any partition of outcomes:
kl(ê_S(Q) ‖ e(Q)) ≤ (KL(Q‖P) + f(n, δ)) / n.
This generality permits simultaneous control over uncountably many error weightings and ensures scale-invariance via the vectorized KL structure (Adams et al., 2022). Robustness is enhanced by incorporating sub-gamma (as opposed to sub-exponential) concentration, as exemplified in recent work on selective risk certification under heavy-tailed statistics (Akter et al., 16 Sep 2025).
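The sketch below illustrates, under assumptions, how a single vector-kl constraint simultaneously certifies many weighted error rates: it reuses the generic f(n, δ) = ln(2√n/δ) from above and a crude Dirichlet random search in place of the exact convex maximization used in practice, so the names and numbers are illustrative only.

```python
import numpy as np

def certified_weighted_error(e_hat, weights, rhs, n_samples=100_000, seed=0):
    """The vector-kl bound asserts kl(e_hat || e) <= rhs for the true error vector e.
    A real certificate would maximise weights @ e exactly over that convex set;
    here a Dirichlet random search merely approximates the supremum from below."""
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones_like(e_hat), size=n_samples)
    kls = np.sum(e_hat * np.log(e_hat / candidates), axis=1)  # assumes e_hat > 0 entrywise
    ok = kls <= rhs
    return float(np.max(candidates[ok] @ weights)) if ok.any() else float(np.max(weights))

# Three error types (say: correct, mistake type A, mistake type B) observed on n = 2000 points.
e_hat = np.array([0.90, 0.07, 0.03])
n, kl_qp, delta = 2000, 15.0, 0.05
rhs = (kl_qp + np.log(2 * np.sqrt(n) / delta)) / n          # generic f(n, delta) as above
weights = np.array([0.0, 1.0, 5.0])                         # type-B mistakes are 5x more costly
print(e_hat @ weights, certified_weighted_error(e_hat, weights, rhs))
```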
7. Limitations, Alternative Divergences, and Extensions
Despite these advances, KL-based PAC-Bayes techniques remain sensitive to more subtle forms of representation redundancy and sometimes fail to yield non-vacuous bounds in situations where combinatorial or geometric complexity is not captured by the divergence—such as learning 1D thresholds, which possess simple sample complexity but can yield arbitrarily large KL penalties due to parameterization artifacts (Livni et al., 2020). This has motivated exploration of alternative complexity measures, e.g., Integral Probability Metrics (IPMs) (Amit et al., 2022), Wasserstein distances (Haddouche et al., 2023), and even "better-than-KL" divergences based on coin-betting and regret analysis (Kuzborskij et al., 14 Feb 2024), all designed to further decouple the generalization penalty from problematic scaling or parameter redundancy.
A further development is recursive PAC-Bayes, which generalizes the entire machinery for sequential prior updates without loss of confidence, via clever decompositions of the expected loss and generalizations of split-kl inequalities to handle non-binary losses (Wu et al., 23 May 2024). This allows bounding the generalization risk of posteriors produced by multiple training stages, with each updated prior capturing a potentially different rescaling of the loss landscape.
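For intuition only, here is a deliberately naive Python sketch of sequential prior updates: the data arrives in T chunks, the stage-(t−1) posterior serves as the stage-t prior, each stage's empirical risk is measured on its own chunk, and the confidence budget is split as δ/T. This union-bound scheme is valid but looser than the excess-loss decomposition of (Wu et al., 23 May 2024); all numbers are illustrative.

```python
import math

def staged_bounds(emp_risks, kls, chunk_sizes, delta):
    """Naive sequential-prior scheme: the stage-(t-1) posterior is the stage-t prior,
    stage t's empirical risk and sample size come from its own chunk only, and the
    confidence budget is split as delta / T across the T stages (union bound).
    The recursive bound of Wu et al. (2024) replaces this with a sharper
    excess-loss decomposition; this sketch only shows the plumbing."""
    T = len(chunk_sizes)
    bounds = []
    for emp, kl, m in zip(emp_risks, kls, chunk_sizes):
        eps = math.sqrt((kl + math.log(2.0 * math.sqrt(m) * T / delta)) / (2.0 * m))
        bounds.append(emp + eps)          # bound certifying that stage's posterior
    return bounds

# Three stages: the KL to the previous posterior shrinks as training stabilises.
print(staged_bounds(emp_risks=[0.20, 0.10, 0.06],
                    kls=[80.0, 20.0, 5.0],
                    chunk_sizes=[2000, 2000, 2000],
                    delta=0.05))
```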
In summary, KL-based rescaling-invariant PAC-Bayes is a unifying analytical and algorithmic framework for deriving and computing tight, data-dependent generalization risk guarantees for randomized predictors. It achieves this by leveraging invariance under parameter/representation rescaling, stability analysis, instance-dependent priors, and advanced change-of-measure strategies, often outperforming earlier bounds in both tightness and interpretability across a diverse range of modern learning scenarios.