Rescaling-Invariant PAC-Bayes Bounds
- The paper introduces function-space PAC-Bayes bounds that overcome inflated KL divergence caused by rescaling in deep learning models.
- It employs invariant representations and optimizes over rescaling orbits to tighten generalization guarantees, in practice reducing the complexity term enough to roughly halve the resulting bound.
- The study integrates the data processing inequality within a DPI-PAC-Bayes framework to accommodate diverse f-divergences, ensuring bounds reflect intrinsic function complexity.
Rescaling-invariant PAC-Bayes bounds are generalization guarantees for learning algorithms constructed to remain stable under certain deterministic transformations—primarily, rescaling operations—of the hypothesis space or loss function. Originating from the intersection of information theory and statistical learning theory, these bounds address pathologies in traditional PAC-Bayesian formulations: when the hypothesis parametrization admits group invariances (as in ReLU neural networks), standard PAC-Bayes complexity terms (such as the Kullback–Leibler divergence between algorithm-dependent posterior and data-independent prior) can be artificially inflated by rescaling, even though the function represented remains unchanged. Recent research resolves this mismatch using invariant and optimized representations, advanced divergence measures, and by explicitly leveraging the data processing inequality (DPI) to ensure the bounds reflect the intrinsic functional—rather than parametric—complexity.
1. Rescaling Invariances in Function Classes and the PAC-Bayes Pathology
For function classes such as deep ReLU networks, many distinct parameter settings (e.g., through layer- or neuron-wise positive rescalings) represent the same function. The classic PAC-Bayes framework considers a posterior $Q$ and a prior $P$ in parameter (weight) space and penalizes their "distance" via $\mathrm{KL}(Q\|P)$, which can be drastically large under rescaling, even as the induced predictors are identical: $f_{\diamond^\lambda w} = f_w$ for every admissible $\lambda$, where $\diamond^\lambda$ denotes the group action of neuron-wise rescalings on the weights $w$. As a result, weight-space PAC-Bayes bounds can be arbitrarily loose or vacuous (Rouchouse et al., 30 Sep 2025).
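The pathology is easy to reproduce numerically. The following minimal sketch uses a toy one-hidden-layer ReLU network with an isotropic Gaussian posterior centered at the weights and a zero-mean isotropic Gaussian prior; the sizes and names are illustrative assumptions, not taken from the cited papers. It shows that a neuron-wise rescaling leaves the predictor unchanged while inflating the weight-space KL term:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(1, 16))

def relu(z):
    return np.maximum(z, 0.0)

def f(x, W1, W2):
    return W2 @ relu(W1 @ x)

# Neuron-wise positive rescaling: scale hidden row i by lam[i],
# divide the outgoing column i by lam[i]; the function is unchanged.
lam = np.exp(3.0 * rng.normal(size=16))
W1_r, W2_r = W1 * lam[:, None], W2 / lam[None, :]

x = rng.normal(size=8)
print(np.allclose(f(x, W1, W2), f(x, W1_r, W2_r)))   # True: same predictor

# KL between posterior N(w, sigma^2 I) and prior N(0, sigma^2 I) equals ||w||^2 / (2 sigma^2);
# it can blow up under the rescaling even though the predictor did not move.
def weight_space_kl(W1, W2, sigma=0.1):
    return (np.sum(W1**2) + np.sum(W2**2)) / (2 * sigma**2)

print(weight_space_kl(W1, W2), weight_space_kl(W1_r, W2_r))   # second value is far larger
```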
To counter this, the "lifted" framework replaces the traditional parametric representation with an invariant one: a space $\mathcal{Z}$ of path-product features (or, more generally, of features modded out by the invariance group), together with a lifting map $\Phi$ through which the network output factorizes, so that $f_w$ depends on the weights $w$ only via $\Phi(w)$. The pushforward distributions $\Phi_\sharp Q$ and $\Phi_\sharp P$ on $\mathcal{Z}$ can then serve as the complexity measures, and the data-processing inequality (DPI) ensures
$D(\Phi_\sharp Q\,\|\,\Phi_\sharp P) \le D(Q\,\|\,P)$
Given this structure, generalization bounds computed in $\mathcal{Z}$ are invariant under group actions (such as rescalings) and thus faithfully reflect the function complexity.
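As an illustration of such a lift, the sketch below (again a toy one-hidden-layer network without biases; the path-product lift follows the description above, and the variable names are illustrative) computes path-product features and checks that they are untouched by neuron-wise rescaling:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(1, 16))

# Path-product lift for a one-hidden-layer network:
# Pi[k, i, j] = W2[k, i] * W1[i, j], the product of weights along the path j -> i -> k.
def path_products(W1, W2):
    return np.einsum('ki,ij->kij', W2, W1)

# Positive neuron-wise rescaling of the weights.
lam = np.exp(3.0 * rng.normal(size=16))
W1_r, W2_r = W1 * lam[:, None], W2 / lam[None, :]

print(np.allclose(path_products(W1, W2), path_products(W1_r, W2_r)))  # True: lift is invariant
print(np.allclose(W1, W1_r))                                          # False: weights moved
```

A Gaussian distribution on weights generally does not push forward to a tractable closed-form distribution on such features, which is one reason the orbit-optimization route of Section 3 is useful in practice.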
2. The DPI-PAC-Bayesian Framework and f-Divergences
The DPI-PAC-Bayesian framework (Guan et al., 20 Jul 2025) systematizes the integration of the Data Processing Inequality into PAC-Bayes change-of-measure arguments. It extends classical KL-based bounds to any $f$-divergence $D_f$ (Rényi, Hellinger-$p$, chi-squared), provided $D_f$ obeys the DPI: $D_f(T_\sharp Q\,\|\,T_\sharp P) \le D_f(Q\,\|\,P)$ for all measurable functions (channels) $T$ from the parameter space to a measurable output space. This enables researchers to state generalization bounds where the complexity of a posterior is measured not simply at the parameter level, but in an information-theoretically "compressed" (coarser) or lifted space—crucial for functional invariances.
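A quick numerical check of the DPI property that these bounds rely on is given below; it is a self-contained sketch with discrete distributions and a deterministic coarse-graining channel, where the distributions and bucket structure are arbitrary illustrations:

```python
import numpy as np

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def renyi(q, p, alpha):
    return float(np.log(np.sum(q**alpha * p**(1.0 - alpha))) / (alpha - 1.0))

rng = np.random.default_rng(2)
q = rng.dirichlet(np.ones(12))   # "posterior" on 12 states
p = rng.dirichlet(np.ones(12))   # "prior" on 12 states

# Channel T: deterministically merge the 12 states into 3 buckets (a coarse-graining).
buckets = np.repeat(np.arange(3), 4)
def push(d):
    return np.array([d[buckets == b].sum() for b in range(3)])

# Data processing inequality: the divergence cannot increase after passing through T.
print("KL     :", kl(push(q), push(p)) <= kl(q, p))                     # True
print("Renyi-2:", renyi(push(q), push(p), 2.0) <= renyi(q, p, 2.0))     # True
```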
For any undesirable event (e.g., the generalization error exceeding a threshold), the key DPI-PAC-Bayesian lemma for Rényi divergence bounds the posterior probability of the event by the prior probability of that event, inflated multiplicatively by an exponential of the Rényi divergence; with uniform priors, this recovers the classical Occam bound and eliminates the extraneous slack term (such as the $\ln(2\sqrt{n})$ factor) of standard PAC-Bayes analyses.
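For concreteness, one change-of-measure inequality of this flavor, which follows directly from Hölder's inequality (shown here as an illustrative sketch rather than as the exact lemma of Guan et al.), is
$$Q(E) \;\le\; \exp\!\Big(\tfrac{\alpha-1}{\alpha}\,D_\alpha(Q\,\|\,P)\Big)\, P(E)^{\frac{\alpha-1}{\alpha}}, \qquad \alpha>1,$$
so a small prior probability of a bad event $E$ forces a small posterior probability, up to a multiplicative penalty that is exponential in the Rényi divergence and that only shrinks under lifts or coarse-grainings by the DPI.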
3. Rescaling-Invariant KL Optimization and Practical Algorithms
Given a group $\Lambda$ of rescaling transformations acting on weights $w$, and denoting the group action by $\diamond^\lambda$, (Rouchouse et al., 30 Sep 2025) introduces two practical approaches:
- Lifted Representation: PAC-Bayes bounds are computed in the invariant space $\mathcal{Z}$ (e.g., path-products + signs), yielding the complexity term $D(\Phi_\sharp Q\,\|\,\Phi_\sharp P)$, which is invariant by construction.
- Optimization over Rescaling Orbits: The divergence is minimized over the orbits of the group, $\inf_{\lambda,\lambda' \in \Lambda} D(\diamond^\lambda_\sharp Q\,\|\,\diamond^{\lambda'}_\sharp P)$.
Under common distributional choices (e.g., Gaussian priors/posteriors), this reduces to a convex problem that can be solved by block coordinate descent, with update rules exploiting layer-wise scaling symmetries. In square fully connected networks, for instance, each coordinate update admits explicit formulas in terms of prior/posterior variances and the current layer scalings. A numerical sketch of one such coordinate-descent loop is given below.
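The following is a minimal numerical sketch of the orbit-minimization idea, under simplifying assumptions not taken from the cited paper: Gaussian posterior/prior with isotropic per-layer standard deviations, layer-wise scalar rescalings only, the prior left unrescaled (a special case of the orbit infimum), and a generic bounded 1-D minimizer in place of closed-form updates. All names (`post_mean`, `kl_rescaled`, ...) are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)

# Toy 3-layer network. A layer-wise rescaling multiplies W_l by c_l / c_{l-1},
# with c_0 = c_3 = 1, so two free positive scalars (c_1, c_2) parametrize the orbit.
shapes = [(16, 8), (16, 16), (1, 16)]
post_mean = [rng.normal(size=s) for s in shapes]           # posterior mean (trained weights)
prior_mean = [0.3 * rng.normal(size=s) for s in shapes]    # prior mean (e.g., initialization)
post_std, prior_std = 0.05, 0.10                           # per-coordinate standard deviations

def gauss_kl(mu_q, s_q, mu_p, s_p):
    """KL between diagonal Gaussians N(mu_q, s_q^2 I) and N(mu_p, s_p^2 I)."""
    d = mu_q.size
    return (d * np.log(s_p / s_q)
            + (d * s_q**2 + np.sum((mu_q - mu_p)**2)) / (2 * s_p**2)
            - d / 2)

def kl_rescaled(c):
    """KL after pushing the posterior through the rescaling (prior kept fixed here)."""
    cs = np.concatenate(([1.0], c, [1.0]))
    total = 0.0
    for l, (mu_q, mu_p) in enumerate(zip(post_mean, prior_mean)):
        a = cs[l + 1] / cs[l]                               # positive factor applied to layer l
        total += gauss_kl(a * mu_q, a * post_std, mu_p, prior_std)
    return total

# Block coordinate descent: optimize one layer scaling at a time, in log-space.
c = np.ones(2)
for sweep in range(20):
    for l in range(len(c)):
        def objective(t, l=l):
            trial = c.copy()
            trial[l] = np.exp(t)
            return kl_rescaled(trial)
        c[l] = np.exp(minimize_scalar(objective, bounds=(-4.0, 4.0), method="bounded").x)

print("KL before orbit optimization:", kl_rescaled(np.ones(2)))
print("KL after  orbit optimization:", kl_rescaled(c), "scalings:", c)
```

Each 1-D step can only decrease the objective, so the loop monotonically tightens the complexity term while leaving the represented function (and hence the empirical risk term of the bound) untouched.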
Algorithmically, this can reduce the PAC-Bayes complexity term (KL) by a factor of four and thus halve the generalization bound—often transforming vacuous bounds into nonvacuous ones in large and deep networks.
4. Generalization Error Bounds via DPI and f-Divergences
The DPI-PAC-Bayesian framework yields explicit, invariant generalization error bounds in terms of the lifted (or optimized) divergence. Representative forms include:
Divergence | PAC-Bayes Bound on KL Generalization Gap | Notes |
---|---|---|
Rényi (order $\alpha$) | Stated in terms of $D_\alpha(\Phi_\sharp Q\,\|\,\Phi_\sharp P)$ | Tune $\alpha$ for tightness |
Hellinger-$p$ | Stated in terms of the order-$p$ Hellinger divergence of the lifted measures | For small event probabilities |
Chi-squared | Stated in terms of $\chi^2(\Phi_\sharp Q\,\|\,\Phi_\sharp P)$ | Parameter-free |
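For reference, the divergences appearing in the table can be written in their standard forms (assuming $Q \ll P$ with density ratio $r = \mathrm{d}Q/\mathrm{d}P$):
$$D_\alpha(Q\,\|\,P) = \tfrac{1}{\alpha-1}\ln \mathbb{E}_P\!\left[r^\alpha\right], \qquad
\mathcal{H}_p(Q\,\|\,P) = \tfrac{\mathbb{E}_P\!\left[r^p\right]-1}{p-1}, \qquad
\chi^2(Q\,\|\,P) = \mathbb{E}_P\!\left[r^2\right]-1,$$
so the chi-squared divergence is the Hellinger divergence of order $p=2$, and $D_\alpha = \tfrac{1}{\alpha-1}\ln\!\big(1+(\alpha-1)\mathcal{H}_\alpha\big)$.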
When the prior is uniform, taking the appropriate limit of the divergence order recovers the Occam's Razor bound, demonstrating sharpness and the elimination of classical slack terms.
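As a point of reference, the Occam bound referred to here can be written in its standard one-sided form (a sketch for a countable hypothesis class $\mathcal{H}$ with prior weights $P(h)$ and $n$ i.i.d. samples; with a uniform prior, $\ln\tfrac{1}{P(h)} = \ln|\mathcal{H}|$): with probability at least $1-\delta$, simultaneously for all $h \in \mathcal{H}$ with $R(h) \ge \hat R_n(h)$,
$$\mathrm{kl}\big(\hat R_n(h)\,\big\|\,R(h)\big) \;\le\; \frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{n},$$
where $\mathrm{kl}$ denotes the binary relative entropy between the empirical risk $\hat R_n(h)$ and the true risk $R(h)$; note the absence of any $\ln(2\sqrt{n})$-type slack.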
5. Information-Theoretic and Statistical Implications
Embedding DPI within PAC-Bayes delivers multiplicative penalties (as exponentials of divergence terms) that contract under coarse-grainings or invariance-aware lifts. The resulting generalization guarantees are thus not only tighter but also more robust: they capture only the intrinsic function complexity—immune to redundancy induced by model parametrization.
This approach clarifies the link between information theory (channels, data-processing) and statistical learning, explaining why bounds in the lifted or quotient space can be strictly tighter, and are never worse, than their parameter-space analogues: $D(\Phi_\sharp Q\,\|\,\Phi_\sharp P)\leq\inf_{\lambda, \lambda'} D(\diamond^\lambda_\sharp Q\,\|\,\diamond^{\lambda'}_\sharp P)\leq D(Q\,\|\,P)$
6. Practical Impact and Future Challenges
By collapsing rescaling redundancies, rescaling-invariant PAC-Bayes bounds yield nonvacuous generalization guarantees for overparameterized models—most notably for ReLU deep networks. They provide a mathematically rigorous account for why “function-space” complexity—rather than parameter count or norm—controls generalization.
Outstanding challenges include:
- Designing efficient algorithms for distributions where pushforwards (under path+sign or similar lifts) are analytically intractable.
- Extending invariant frameworks to broader transformation groups, including those for more complex network architectures with nontrivial symmetries.
- Investigating implications in online, heavy-tailed, or robust learning settings, where further integration with stability or truncation methodologies may be required.
7. Summary Table: Key Approaches to Rescaling-Invariant PAC-Bayes Bounds
Approach | Invariance Mechanism | Algorithmic Resolution |
---|---|---|
Lifted (functionally-invariant) PAC-Bayes | Pushforward to invariant space $\mathcal{Z}$ | Path+sign, channel mapping |
Orbit Minimization (Rescaling Group) | KL optimized over orbits | Block coordinate descent (BCD) |
DPI $f$-Divergence Bounds | DPI on $f$-divergence | Occam, Hellinger, chi-squared |
Rescaling-invariant PAC-Bayes bounds thus unify advances in information theory, optimization, and statistical learning to produce generalization guarantees faithful to the true capacity of function classes, resolving longstanding pathologies of parameter-space-based bounds in the presence of symmetry. Recent works (Rouchouse et al., 30 Sep 2025, Guan et al., 20 Jul 2025) provide rigorous theoretical and practical tools for their application to modern deep networks and other invariant-rich hypothesis spaces.