Rescaling-Invariant PAC-Bayes Bounds

Updated 2 October 2025
  • The paper introduces function-space PAC-Bayes bounds that overcome inflated KL divergence caused by rescaling in deep learning models.
  • It employs invariant representations and optimization over rescaling orbits to tighten generalization guarantees, in practice reducing the KL complexity term by up to a factor of four and roughly halving the resulting bound.
  • The study integrates the data processing inequality within a DPI-PAC-Bayes framework to accommodate diverse f-divergences, ensuring bounds reflect intrinsic function complexity.

Rescaling-invariant PAC-Bayes bounds are generalization guarantees for learning algorithms constructed to remain stable under certain deterministic transformations—primarily, rescaling operations—of the hypothesis space or loss function. Originating from the intersection of information theory and statistical learning theory, these bounds address pathologies in traditional PAC-Bayesian formulations: when the hypothesis parametrization admits group invariances (as in ReLU neural networks), standard PAC-Bayes complexity terms (such as the Kullback–Leibler divergence between algorithm-dependent posterior and data-independent prior) can be artificially inflated by rescaling, even though the function represented remains unchanged. Recent research resolves this mismatch using invariant and optimized representations, advanced divergence measures, and by explicitly leveraging the data processing inequality (DPI) to ensure the bounds reflect the intrinsic functional—rather than parametric—complexity.

1. Rescaling Invariances in Function Classes and the PAC-Bayes Pathology

For function classes such as deep ReLU networks, many distinct parameter settings (e.g., related by layer- or neuron-wise positive rescalings) represent the same function. The classic PAC-Bayes framework considers a posterior $Q$ and a prior $P$ in parameter (weight) space and penalizes their “distance” via $D_{\mathrm{KL}}(Q\|P)$, which can be drastically large under rescaling, even as the induced predictors are identical: $f_w = f_{\diamond^\lambda(w)}$, where $\diamond^\lambda$ denotes the group action of neuron-wise rescalings. As a result, weight-space PAC-Bayes bounds can be arbitrarily loose or vacuous (Rouchouse et al., 30 Sep 2025).
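
This pathology can be reproduced in a few lines. The sketch below is illustrative only; the layer shapes, the rescaling magnitudes, and the isotropic Gaussian prior/posterior are assumptions, not taken from the paper. It rescales the hidden neurons of a bias-free two-layer ReLU network: the predictor is unchanged, while the weight-space Gaussian KL term grows sharply.

```python
# Illustrative sketch (assumed setup, not the paper's experiment): neuron-wise
# rescaling leaves a bias-free two-layer ReLU network unchanged but inflates
# the weight-space Gaussian KL term used in standard PAC-Bayes bounds.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(1, 16))

def relu_net(x, A, B):
    # Bias-free two-layer ReLU network: B @ relu(A @ x).
    return B @ np.maximum(A @ x, 0.0)

# Neuron-wise positive rescaling: scale hidden unit k by lam[k] in layer 1 and
# by 1/lam[k] in layer 2; ReLU's positive homogeneity keeps the function fixed.
lam = np.exp(rng.normal(scale=2.0, size=16))
V1, V2 = np.diag(lam) @ W1, W2 @ np.diag(1.0 / lam)

x = rng.normal(size=8)
print(np.allclose(relu_net(x, W1, W2), relu_net(x, V1, V2)))   # True: same function

def iso_gauss_kl(mu_q, mu_p, sigma2=1.0):
    # KL( N(mu_q, sigma2 I) || N(mu_p, sigma2 I) ) = ||mu_q - mu_p||^2 / (2 sigma2).
    return 0.5 * np.sum((mu_q - mu_p) ** 2) / sigma2

flat = lambda A, B: np.concatenate([A.ravel(), B.ravel()])
mu_p = np.zeros(W1.size + W2.size)                              # zero-mean prior
print(iso_gauss_kl(flat(W1, W2), mu_p))   # weight-space KL before rescaling
print(iso_gauss_kl(flat(V1, V2), mu_p))   # typically orders of magnitude larger after
```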

To counter this, the “lifted” framework replaces the traditional parametric representation with an invariant one: a space $Z$ of path-product features (or, more generally, features modded out by the invariance group), combined with an output function $g: Z \rightarrow F$. Writing $\Phi$ for the lift from weights to $Z$, the pushforward distributions $\Phi_\sharp Q$ and $\Phi_\sharp P$ on $Z$ can then serve as the complexity measures, and the data-processing inequality (DPI) ensures

$D(\Phi_\sharp Q \,\|\, \Phi_\sharp P) \le D(Q\|P)$

Given this structure, generalization bounds computed in $Z$ are invariant under group actions (such as rescalings) and thus faithfully reflect the function complexity.
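
As a minimal illustration of why the lift helps, the sketch below uses a simplified stand-in for the paper's full path+sign representation: for a bias-free two-layer ReLU network, plain path-product features are unchanged by the neuron-wise rescaling that inflated the weight-space KL above. The network and rescaling are the same assumed setup as in the previous sketch.

```python
# Minimal sketch (assumed, simplified lift): path-product features of a
# bias-free two-layer ReLU network are invariant under neuron-wise rescaling,
# so divergences computed on their distribution do not see the rescaling.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(1, 16))
lam = np.exp(rng.normal(scale=2.0, size=16))
V1, V2 = np.diag(lam) @ W1, W2 @ np.diag(1.0 / lam)   # same function, different weights

def path_products(A, B):
    # z[j, k, i] = B[j, k] * A[k, i]: one coordinate per input -> hidden -> output path.
    return np.einsum('jk,ki->jki', B, A)

print(np.allclose(path_products(W1, W2), path_products(V1, V2)))  # True: lift is rescaling-invariant
```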

2. The DPI-PAC-Bayesian Framework and f-Divergences

The DPI-PAC-Bayesian framework (Guan et al., 20 Jul 2025) systematizes the integration of the data processing inequality into PAC-Bayes change-of-measure arguments. It extends classical KL-based bounds to any $f$-divergence (Rényi, Hellinger-$p$, chi-squared), provided $D_f$ obeys the DPI: $D_f(P_W \| Q_W) \geq D_f(P_Y \| Q_Y)$ for all measurable functions (channels) from $W$ to $Y$. This enables researchers to state generalization bounds where the complexity of a posterior is measured not simply at the parameter level, but in an information-theoretically “compressed” (coarser) or lifted space, which is crucial for functional invariances.
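
The contraction property itself is easy to check numerically. The sketch below is not taken from either paper; the discrete distributions and the atom-merging channel are arbitrary choices. It verifies that a deterministic coarse-graining of a discrete sample space can only decrease the KL and chi-squared divergences.

```python
# Hedged numerical illustration (arbitrary example, not from the papers): the
# data processing inequality for two f-divergences under a deterministic
# coarse-graining channel T that merges atoms of a discrete sample space.
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(6))                # two arbitrary distributions on 6 atoms
Q = rng.dirichlet(np.ones(6))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

# Deterministic channel T: merge the 6 atoms into 3 coarser cells.
T = np.zeros((3, 6))
T[0, :2] = T[1, 2:4] = T[2, 4:] = 1.0
P_Y, Q_Y = T @ P, T @ Q

print(kl(P, Q) >= kl(P_Y, Q_Y))      # True: KL contracts under the channel
print(chi2(P, Q) >= chi2(P_Y, Q_Y))  # True: chi-squared contracts as well
```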

For any undesirable event $E$ (e.g., the generalization error exceeding a threshold), the key DPI-PAC-Bayesian lemma for the Rényi divergence yields

$P(E) \le [Q(E)]^{(\alpha-1)/\alpha} \exp\!\left(\frac{\alpha-1}{\alpha} D_\alpha(P\|Q)\right), \quad \alpha > 1,$

and, with uniform priors, recovers the classical Occam bound and eliminates the extraneous slack term (such as $\log(2\sqrt{n})/n$) of standard PAC-Bayes analyses.
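
The inequality can be sanity-checked numerically. The sketch below uses arbitrary discrete distributions $P$, $Q$ and an arbitrary event $E$ (all assumptions for illustration) and evaluates both sides of the stated Rényi change-of-measure bound for several values of $\alpha$.

```python
# Small numerical check (illustrative) of the stated Renyi change-of-measure
# inequality P(E) <= Q(E)^((a-1)/a) * exp(((a-1)/a) * D_a(P||Q)) on a discrete space.
import numpy as np

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(8))
Q = rng.dirichlet(np.ones(8))
E = rng.random(8) < 0.3                      # an arbitrary "bad" event

def renyi(p, q, a):
    # Renyi divergence of order a > 1 between discrete distributions p and q.
    return float(np.log(np.sum(p**a * q**(1.0 - a))) / (a - 1.0))

for a in (1.5, 2.0, 4.0):
    lhs = P[E].sum()
    rhs = Q[E].sum() ** ((a - 1) / a) * np.exp((a - 1) / a * renyi(P, Q, a))
    print(a, lhs <= rhs + 1e-12)             # True for every alpha > 1
```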

3. Rescaling-Invariant KL Optimization and Practical Algorithms

Given a group $\mathcal{G}$ of rescaling transformations acting on weights $W$, and denoting the group action by $\diamond^\lambda(w)$, (Rouchouse et al., 30 Sep 2025) introduces two practical approaches:

  • Lifted Representation: PAC-Bayes bounds are computed in the invariant space $Z$ (e.g., path-products plus signs), yielding

$D(\Phi_\sharp Q \,\|\, \Phi_\sharp P),$

which is invariant by construction.

  • Optimization over Rescaling Orbits: The (rescaled) divergence is minimized over the orbit of the group:

$\inf_{\lambda, \lambda'} D(\diamond^\lambda_\sharp Q \,\|\, \diamond^{\lambda'}_\sharp P)$

Under common distributional choices (e.g., Gaussian priors/posteriors), this reduces to a convex problem that can be solved by block coordinate descent, with update rules exploiting layer-wise scaling symmetries. In square fully connected networks, for instance,

$\lambda_{\ell, k} \leftarrow \left(\frac{C_\ell}{A_\ell}\right)^{1/4}$

with explicit formulas for $A_\ell$ and $C_\ell$ in terms of prior/posterior variances and the current layer scaling.

Algorithmically, this can reduce the PAC-Bayes complexity term (KL) by a factor of four and thus halve the generalization bound—often transforming vacuous bounds into nonvacuous ones in large and deep networks.
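
The following sketch conveys the flavor of orbit minimization in a heavily simplified setting, all of it assumed for illustration: a single scalar rescaling $t$ for a two-layer network, the prior held fixed, and a generic one-dimensional minimization in place of the paper's closed-form block-coordinate updates. Pushing a diagonal Gaussian posterior through the function-preserving rescaling and minimizing the resulting KL over $t$ can only shrink the complexity term relative to $t = 1$.

```python
# Hedged sketch of orbit minimization (simplified stand-in, not the paper's
# algorithm): scale layer 1 by t and layer 2 by 1/t, push the diagonal Gaussian
# posterior through this map, and minimize the KL to a fixed prior over t > 0.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
mu1 = rng.normal(size=(16, 8))               # posterior mean, layer 1 (made up)
mu2 = rng.normal(scale=3.0, size=(1, 16))    # posterior mean, layer 2 (badly scaled)
sd1, sd2 = np.full(mu1.shape, 0.1), np.full(mu2.shape, 0.1)   # posterior stds
prior_sd = 1.0                               # N(0, prior_sd^2 I) prior, held fixed

def diag_gauss_kl(mu, sd, prior_sd):
    # KL( N(mu, diag(sd^2)) || N(0, prior_sd^2 I) ), summed over coordinates.
    return float(np.sum(np.log(prior_sd / sd) + (sd**2 + mu**2) / (2 * prior_sd**2) - 0.5))

def kl_on_orbit(log_t):
    # Rescaling preserves the ReLU network's function but changes the pushforward
    # posterior (means and stds are both scaled), and hence the KL term.
    t = np.exp(log_t)
    return (diag_gauss_kl(t * mu1, t * sd1, prior_sd)
            + diag_gauss_kl(mu2 / t, sd2 / t, prior_sd))

res = minimize_scalar(kl_on_orbit)            # one-dimensional stand-in for BCD
print("KL at t = 1:          ", kl_on_orbit(0.0))
print("KL after orbit search:", res.fun)      # no larger than the t = 1 value
```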

4. Generalization Error Bounds via DPI and f-Divergences

The DPI-PAC-Bayesian framework yields explicit, invariant generalization error bounds in terms of the lifted (or optimized) divergence. Representative forms include:

| Divergence | PAC-Bayes Bound on KL Generalization Gap | Notes |
|---|---|---|
| Rényi (order $\alpha > 1$) | $KL(\hat L(S,w) \,\Vert\, L(w)) \leq \dfrac{\log(1/Q_{\min}) + \frac{\alpha}{\alpha-1}\log(1/\delta)}{n}$ | Tune $\alpha$ for tightness |
| Hellinger-$p$ ($p > 1$) | $KL(\hat L(S,w) \,\Vert\, L(w)) \leq \dfrac{\log\!\big(\frac{Q_{\min}^{1-p}}{\delta^{p}} - 1\big)}{(p-1)\,n}$ | For small event probabilities |
| Chi-squared | $KL(\hat L(S,w) \,\Vert\, L(w)) \leq \dfrac{\log\!\big(\frac{1+Q_{\min}}{Q_{\min}}\big) + 2\log(1/\delta)}{n}$ | Parameter-free |

When the prior is uniform, the limit as $\alpha \to \infty$ or $p \to \infty$ recovers the Occam's Razor bound

$KL(\hat L(S,w) \,\|\, L(w)) \leq \frac{\log(1/Q(w)) + \log(1/\delta)}{n},$

demonstrating sharpness and the elimination of classical slack terms.
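
Bounds of this form constrain the binary KL between empirical and true risk; to obtain an explicit upper bound on the true risk, one inverts the binary KL numerically. The sketch below uses made-up values for $n$, $\delta$, the empirical risk, and $\log(1/Q(w))$, evaluates the Occam-type budget, and inverts it by bisection.

```python
# Illustrative sketch (made-up numbers): evaluate the Occam-type budget
# (log(1/Q(w)) + log(1/delta)) / n and invert the binary kl by bisection
# to get an explicit upper bound on the true risk.
import math

def binary_kl(q, p):
    # kl( Bernoulli(q) || Bernoulli(p) ), with the usual 0 log 0 = 0 convention.
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(train_risk, budget):
    # Largest p in [train_risk, 1] with kl(train_risk || p) <= budget (bisection).
    lo, hi = train_risk, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if binary_kl(train_risk, mid) <= budget else (lo, mid)
    return lo

n, delta = 10_000, 0.05
train_risk = 0.02
log_inv_Qw = 2_000.0                   # hypothetical -log Q(w) for the chosen hypothesis
budget = (log_inv_Qw + math.log(1 / delta)) / n
print(kl_inverse(train_risk, budget))  # upper bound on the true risk
```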

5. Information-Theoretic and Statistical Implications

Embedding DPI within PAC-Bayes delivers multiplicative penalties (as exponentials of divergence terms) that contract under coarse-grainings or invariance-aware lifts. The resulting generalization guarantees are thus not only tighter but also more robust: they capture only the intrinsic function complexity—immune to redundancy induced by model parametrization.

This approach clarifies the link between information theory (channels, data-processing) and statistical learning, explaining why bounds in the lifted or quotient space can be strictly tighter (never worse) than their parameter-space analogues: $D(\Phi_\sharp Q \,\|\, \Phi_\sharp P) \leq \inf_{\lambda, \lambda'} D(\diamond^\lambda_\sharp Q \,\|\, \diamond^{\lambda'}_\sharp P) \leq D(Q\|P).$

6. Practical Impact and Future Challenges

By collapsing rescaling redundancies, rescaling-invariant PAC-Bayes bounds yield nonvacuous generalization guarantees for overparameterized models—most notably for ReLU deep networks. They provide a mathematically rigorous account for why “function-space” complexity—rather than parameter count or norm—controls generalization.

Outstanding challenges include:

  • Designing efficient algorithms for distributions where pushforwards (under path+sign or similar lifts) are analytically intractable.
  • Extending invariant frameworks to broader transformation groups, including those for more complex network architectures with nontrivial symmetries.
  • Investigating implications in online, heavy-tailed, or robust learning settings, where further integration with stability or truncation methodologies may be required.

7. Summary Table: Key Approaches to Rescaling-Invariant PAC-Bayes Bounds

| Approach | Invariance Mechanism | Algorithmic Resolution |
|---|---|---|
| Lifted (functionally invariant) PAC-Bayes | Pushforward to invariant space $Z$ | Path+sign lift, channel mapping |
| Orbit minimization (rescaling group) | KL optimized over rescaling orbits | Block coordinate descent (BCD) |
| DPI $f$-divergence bounds | DPI applied to the $f$-divergence | Occam, Hellinger, chi-squared bounds |

Rescaling-invariant PAC-Bayes bounds thus unify advances in information theory, optimization, and statistical learning to produce generalization guarantees faithful to the true capacity of function classes, resolving longstanding pathologies of parameter-space-based bounds in the presence of symmetry. Recent works (Rouchouse et al., 30 Sep 2025, Guan et al., 20 Jul 2025) provide rigorous theoretical and practical tools for their application to modern deep networks and other invariant-rich hypothesis spaces.
