PAC-Bayes Bounds for Multivariate Regression
- The paper introduces non-asymptotic PAC-Bayes bounds that combine empirical risk, KL-divergence penalties, and data-dependent concentration terms to guarantee generalization.
- It employs truncation techniques and Gaussian priors to derive robust, dimension-free error rates even for heavy-tailed and unbounded data designs.
- The study illustrates computational tractability with closed-form solutions and validates the approach through applications in large-scale linear autoencoder systems.
The PAC-Bayes bound for multivariate linear regression provides non-asymptotic, high-probability guarantees on the generalization error of stochastic linear regression predictors. These results are foundational for understanding the statistical learning properties of linear models in both classical and modern, high-dimensional regimes. The bounds unify empirical risk, complexity penalties (via Kullback–Leibler divergence with respect to a prior), and data-dependent concentration terms. They also extend beyond bounded/small-noise settings, encompassing heavy-tailed and unbounded designs under minimal moment assumptions.
1. Problem Setting and General PAC-Bayes Framework
In multivariate linear regression, the data consists of i.i.d. samples $(X_i, Y_i)_{i=1}^{n}$, with $X_i \in \mathbb{R}^{d}$ and $Y_i \in \mathbb{R}^{k}$. The hypothesis class is the set of linear maps $W \in \mathbb{R}^{k \times d}$, producing predictions $\hat{Y} = W X$. For a fixed predictor $W$, the empirical and expected risks are defined as
$$\hat{R}_n(W) = \frac{1}{n}\sum_{i=1}^{n} \lVert Y_i - W X_i \rVert^2, \qquad R(W) = \mathbb{E}\,\lVert Y - W X \rVert^2.$$
The PAC-Bayes methodology controls the generalization risk of predictors randomly drawn from a "posterior" $\rho$ over $\mathbb{R}^{k \times d}$, relative to a fixed "prior" $\pi$. For a loss function $\ell$ with associated risks as above, the PAC-Bayes bound takes the form
$$\mathbb{E}_{W \sim \rho}\big[R(W)\big] \;\le\; \mathbb{E}_{W \sim \rho}\big[\hat{R}_n(W)\big] + \frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln(1/\delta) + \Psi_{\pi}(\lambda, n)}{\lambda},$$
with probability at least $1-\delta$ over the sample, for any fixed $\lambda > 0$ and any posterior $\rho$. The concentration term $\Psi_{\pi}(\lambda, n) = \ln \mathbb{E}_{W \sim \pi}\,\mathbb{E}\big[\exp\big(\lambda\,(R(W) - \hat{R}_n(W))\big)\big]$ quantifies the deviation of empirical from expected risk, averaged under the prior (Guo et al., 15 Dec 2025); this is critical for tight, nonvacuous bounds.
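As a minimal illustration of how such a bound is evaluated in practice, the sketch below computes the empirical risk, the KL term for an entrywise isotropic Gaussian prior and posterior, and the resulting right-hand side, with the concentration term supplied as a placeholder value. The function names, the plug-in use of the posterior mean for the expected empirical risk, and the toy data are assumptions for demonstration, not the construction of any cited paper.

```python
import numpy as np

def empirical_risk(W, X, Y):
    """Mean squared error (1/n) * sum ||Y_i - W X_i||^2 for W in R^{k x d}."""
    residuals = Y - X @ W.T          # shape (n, k)
    return np.mean(np.sum(residuals ** 2, axis=1))

def kl_gaussian_isotropic(W_post, W_prior, sigma2):
    """KL(N(W_post, sigma2 I) || N(W_prior, sigma2 I)) for entrywise Gaussians."""
    return np.sum((W_post - W_prior) ** 2) / (2.0 * sigma2)

def pac_bayes_bound(W_post, X, Y, sigma2, lam, delta, psi=0.0, W_prior=None):
    """Alquier-style right-hand side: E_rho[R_hat] + (KL + ln(1/delta) + psi) / lam.
    'psi' stands in for the prior-averaged concentration term; using the
    posterior mean for E_rho[R_hat] is a plug-in simplification."""
    if W_prior is None:
        W_prior = np.zeros_like(W_post)
    emp = empirical_risk(W_post, X, Y)
    kl = kl_gaussian_isotropic(W_post, W_prior, sigma2)
    return emp + (kl + np.log(1.0 / delta) + psi) / lam

# Toy usage on synthetic data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, d, k = 500, 20, 5
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(k, d))
Y = X @ W_true.T + 0.1 * rng.normal(size=(n, k))
W_hat = np.linalg.solve(X.T @ X, X.T @ Y).T     # least-squares fit
print(pac_bayes_bound(W_hat, X, Y, sigma2=1.0, lam=float(n), delta=0.05))
```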
2. Classical and Truncated Excess Risk Bounds
The canonical PAC-Bayes bound for linear regression (Audibert–Catoni, 2009) focuses on the least-squares setting with a fixed feature map $\phi : \mathcal{X} \to \mathbb{R}^{d}$, hypothesis class $\{f_\theta = \langle \theta, \phi(\cdot)\rangle : \theta \in \Theta\}$, and square loss (Audibert et al., 2010). Audibert–Catoni develop a truncation technique that replaces the standard Gibbs exponential weighting in the PAC-Bayes posterior with a soft-truncated (bounded-influence) transform applied to pairwise loss differences. This approach requires only bounded conditional output variance (not exponential moments), ensuring robustness to heavy tails. Under $L^\infty$-boundedness of the feature map (diameter $D$) and a conditional variance bound $\sigma^2$, with probability at least $1-\varepsilon$ over the sample and the realization of the randomized estimator, the excess risk satisfies a bound of order
$$\frac{d}{n}\,\log(\varepsilon^{-1})$$
up to explicit problem-dependent constants, and its expectation yields a pure $d/n$ rate. Notably, this result eliminates extraneous $\log n$ and condition-number factors common in ERM and Rademacher complexity analyses.
The truncation-based PAC-Bayes analysis extends to vector-valued outputs $Y \in \mathbb{R}^{k}$ either coordinatewise or via matrix-valued features, leveraging the strong convexity of the quadratic loss. Sharp oracle inequalities of order $\tfrac{s \log d}{n}$ for $s$-sparse parameters are available in sparse regimes using suitably structured (e.g., Laplace) priors (Audibert et al., 2010).
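The exact truncation map of Audibert–Catoni is not reproduced here; as an illustrative stand-in, the sketch below uses a Catoni-style bounded influence function $\psi(x) = \log(1 + x + x^2/2)$ for $x \ge 0$ and $-\log(1 - x + x^2/2)$ for $x < 0$ (an assumption, not necessarily the paper's exact choice) to form a soft-truncated empirical comparison of two linear predictors under heavy-tailed noise.

```python
import numpy as np

def catoni_psi(x):
    """Catoni-type bounded influence function: behaves like x near 0 and
    grows only logarithmically for large |x| (illustrative choice)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0,
                    np.log1p(x + 0.5 * x ** 2),
                    -np.log1p(-x + 0.5 * x ** 2))

def truncated_risk_difference(theta1, theta2, X, y, scale):
    """Soft-truncated empirical comparison of two linear predictors:
    mean of psi(scale * (loss(theta1) - loss(theta2))) / scale."""
    diffs = (y - X @ theta1) ** 2 - (y - X @ theta2) ** 2
    return np.mean(catoni_psi(scale * diffs)) / scale

# Toy check with heavy-tailed noise (Student-t), illustrative values only.
rng = np.random.default_rng(1)
n, d = 1000, 10
X = rng.normal(size=(n, d))
theta_star = np.ones(d)
y = X @ theta_star + rng.standard_t(df=2.5, size=n)
print(truncated_risk_difference(theta_star, np.zeros(d), X, y, scale=0.01))
```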
3. Dimension-Free and Heavy-Tailed Generalization
The Catoni–Giulini machinery generalizes the PAC-Bayes approach to multivariate regression and matrix mean estimation under only weak polynomial moment assumptions, allowing for heavy-tailed $X$ and $Y$ (Catoni et al., 2017). The key result is a dimension-free PAC-Bayes bound, with the learning rate $\lambda$ and the prior/posterior variance $\sigma^2$ as tuning parameters and "moment terms" depending on higher-order moments of the design. For isotropic Gaussian priors and posteriors ($\pi = \mathcal{N}(0, \sigma^2 I)$, $\rho = \mathcal{N}(\hat\theta, \sigma^2 I)$), the bound is closed-form and, under only weak (e.g., fourth) moment conditions, yields an excess-risk guarantee of order $n^{-1/2}$, uniformly over norm-bounded parameters. Adding a ridge penalty improves this to order $n^{-1}$ for the regularized risk as long as the empirical/true Gram matrix's minimal eigenvalue is bounded away from zero.
This framework decouples empirical fit, the complexity penalty ($\mathrm{KL}(\rho\,\|\,\pi)$), and a confidence penalty, and is insensitive to the ambient dimension. The trade-off is tunable via the prior/posterior variance ($\sigma^2$) and the learning rate ($\lambda$).
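A minimal sketch of this decoupling, assuming an isotropic Gaussian prior $\mathcal{N}(0,\sigma^2 I)$ and posterior $\mathcal{N}(\hat\theta,\sigma^2 I)$ so that the KL penalty is $\lVert\hat\theta\rVert^2/(2\sigma^2)$, and a generic Alquier-style right-hand side with a placeholder moment term; the grid values and bound shape are illustrative assumptions, not the exact Catoni–Giulini expression.

```python
import numpy as np

def gaussian_bound(theta_hat, emp_risk, sigma2, lam, delta=0.05, moment_term=1.0):
    """Fit + complexity + confidence decomposition for isotropic Gaussians:
    emp_risk + (||theta||^2 / (2 sigma^2) + ln(1/delta) + moment_term) / lam."""
    kl = np.sum(np.asarray(theta_hat) ** 2) / (2.0 * sigma2)
    return emp_risk + (kl + np.log(1.0 / delta) + moment_term) / lam

# Scanning (sigma^2, lambda) shows how the complexity penalty and the 1/lam
# scaling move; in the full bound the moment term also grows with sigma^2,
# which is what creates the actual trade-off.
theta_hat = np.full(50, 0.3)
for sigma2 in (0.1, 1.0, 10.0):
    for lam in (10.0, 100.0, 1000.0):
        b = gaussian_bound(theta_hat, emp_risk=0.5, sigma2=sigma2, lam=lam)
        print(f"sigma2={sigma2:<5} lam={lam:<7} bound={b:.3f}")
```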
4. Extensions: Multivariate Output, Generalization with Unbounded Loss
Recent results extend PAC-Bayes theory to fully multivariate linear regression with unconstrained outputs and unbounded (squared-error) losses under minimal regularity. The framework in (Guo et al., 15 Dec 2025) generalizes the bound to matrix-valued predictors $W \in \mathbb{R}^{k \times d}$, yielding (with the same notation as above)
$$\mathbb{E}_{W \sim \rho}\big[R(W)\big] \;\le\; \mathbb{E}_{W \sim \rho}\big[\hat{R}_n(W)\big] + \frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln(1/\delta) + \Psi_{\pi}(\lambda, n)}{\lambda},$$
where, under a Gaussian data generative model, $\Psi_{\pi}(\lambda, n)$ can be expressed explicitly via the moment generating function of a non-central chi-squared distribution (i.e., in terms of traces and quadratic forms involving $W$ and the population covariance). Under mild conditions (bounded-support or small-variance Gaussian prior), $\Psi_{\pi}(\lambda, n)/\lambda \to 0$ as $n \to \infty$, and the bound converges at rate $O(n^{-1/2})$ (Guo et al., 15 Dec 2025).
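For reference, the non-central chi-squared moment generating function that such expressions build on is $M(t) = \exp\!\big(\mu t/(1-2t)\big)\,(1-2t)^{-k/2}$ for $t < 1/2$, where $k$ is the degrees of freedom and $\mu$ the non-centrality. The helper below simply evaluates this MGF and is an illustrative utility, not the paper's full derivation of $\Psi_{\pi}(\lambda, n)$.

```python
import numpy as np

def noncentral_chi2_mgf(t, k, noncentrality):
    """MGF of a non-central chi-squared variable with k degrees of freedom and
    non-centrality mu: E[exp(t Z)] = exp(mu t / (1 - 2t)) / (1 - 2t)^(k/2),
    valid only for t < 1/2."""
    t = np.asarray(t, dtype=float)
    if np.any(t >= 0.5):
        raise ValueError("MGF of a chi-squared variable diverges for t >= 1/2")
    return np.exp(noncentrality * t / (1.0 - 2.0 * t)) / (1.0 - 2.0 * t) ** (k / 2.0)

# Example: the log-MGF is the quantity that enters concentration penalties
# via traces and quadratic forms of the population covariance.
print(np.log(noncentral_chi2_mgf(t=0.01, k=5, noncentrality=2.0)))
```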
Parallel developments address the case of unbounded losses by introducing a HYPE (hypothesis-dependent range) condition (Haddouche et al., 2020). Here, for bounded design ($\lVert X \rVert \le B_X$, $\lVert Y \rVert \le B_Y$), the squared-error loss is HYPE$(K)$-compliant with $K(W) = (B_Y + \lVert W \rVert\, B_X)^2$, yielding generalized PAC-Bayes bounds whose terms explicitly account for exponential moments of $K$ under the prior.
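To make the hypothesis-dependent range concrete under the bounded-design assumption above, note that $\lVert Y - WX\rVert \le B_Y + \lVert W\rVert_{\mathrm{op}} B_X$, so the squared loss is bounded by $K(W) = (B_Y + \lVert W\rVert_{\mathrm{op}} B_X)^2$. The helper below is a sketch of such a range function, not necessarily the exact $K$ used in the cited paper.

```python
import numpy as np

def hype_range(W, B_X, B_Y):
    """Hypothesis-dependent range K(W) for squared-error loss under bounded design:
    ||y - W x||^2 <= (B_Y + ||W||_op * B_X)^2 whenever ||x|| <= B_X, ||y|| <= B_Y."""
    op_norm = np.linalg.norm(W, ord=2)   # spectral norm of the weight matrix
    return (B_Y + op_norm * B_X) ** 2

# Toy check: for W = I_3, ||W||_op = 1, so K = (2 + 1)^2 = 9.
print(hype_range(np.eye(3), B_X=1.0, B_Y=2.0))
```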
5. Computational Aspects and Practical Implementation
Efficient optimization of the PAC-Bayes bound in the multivariate regime often employs Gaussian priors and posteriors, enabling closed-form expressions and tractable (if sometimes cubic-in-dimension) computation. For entrywise Gaussian priors/posteriors, the optimal posterior mean is given explicitly in terms of the data, prior mean, and variance, reducing the bound minimization to solving linear systems and matrix inversions (Guo et al., 15 Dec 2025). The tightness and empirical utility of such bounds have been established for large-scale systems (e.g., linear autoencoders for recommendation), with the observed bounds being non-vacuous (the right-hand side within a small constant factor of the left-hand side) and strongly correlated with practical metrics like Recall@K and NDCG@K.
Key computation steps include a grid search over $\lambda$ (with a corresponding union-bound penalty), eigendecompositions to evaluate the concentration term $\Psi_{\pi}(\lambda, n)$, and possible further acceleration via structured priors or parameter constraints.
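A sketch of this workflow under simplifying assumptions: an entrywise zero-mean Gaussian prior, a Gaussian posterior centered at a ridge-type mean (one natural closed form; the cited paper's exact formula may differ), a single eigendecomposition of the Gram matrix reused across the $\lambda$ grid, a union-bound penalty $\ln(|\Lambda|/\delta)$, and a user-supplied placeholder for the concentration term.

```python
import numpy as np

def optimize_pac_bayes_bound(X, Y, sigma2, lambdas, delta=0.05, psi_fn=None):
    """Grid-search sketch: for each lambda, form a ridge-type posterior mean,
    evaluate empirical risk + (KL + ln(|grid|/delta) + psi) / lambda,
    and return the best (lambda, W, bound) triple."""
    gram = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(gram)        # reused across the grid
    XtY = X.T @ Y
    union_penalty = np.log(len(lambdas) / delta)
    best = None
    for lam in lambdas:
        # Ridge-type posterior mean (X'X + shift I)^{-1} X'Y via the
        # precomputed eigendecomposition; illustrative simplification,
        # not the exact minimizer of the bound.
        shift = 1.0 / (lam * sigma2)
        W = (eigvecs @ ((eigvecs.T @ XtY) / (eigvals + shift)[:, None])).T
        emp = np.mean(np.sum((Y - X @ W.T) ** 2, axis=1))
        kl = np.sum(W ** 2) / (2.0 * sigma2)       # prior mean is zero
        psi = psi_fn(lam) if psi_fn is not None else 0.0
        bound = emp + (kl + union_penalty + psi) / lam
        if best is None or bound < best[2]:
            best = (lam, W, bound)
    return best

# Usage with a zero placeholder concentration term (illustrative only).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 30))
Y = rng.normal(size=(200, 8))
lam_star, W_star, bound_star = optimize_pac_bayes_bound(
    X, Y, sigma2=1.0, lambdas=[50.0, 200.0, 800.0])
print(lam_star, bound_star)
```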
6. Comparison to Classical and Other Approaches
The PAC-Bayes bounds for multivariate regression provide several advantages relative to ERM and Rademacher complexity analyses:
- No condition-number or log-factor inflation: Excess-risk bounds avoid $\log n$ factors and spectral-gap (condition-number) terms, depending primarily on intrinsic geometry (e.g., the $L^\infty$ diameter of the feature map or norm constraints) (Audibert et al., 2010).
- Minimal moment assumptions: Results hold under bounded conditional variance or weak polynomial moments, not requiring sub-Gaussianity or exponential moments (Catoni et al., 2017, Haddouche et al., 2020).
- Dimension-free generalization: Rates such as $n^{-1/2}$ (or $n^{-1}$ with regularization) are independent of the ambient dimension if parameter domains or moments are controlled.
- Extension to unbounded loss: The HYPE-based approach demonstrates the flexibility of PAC-Bayes methods beyond bounded losses, crucial in practical regression (Haddouche et al., 2020).
Other modern directions include privacy-preserving regression where PAC-Bayes bounds serve as statistical certificates within secure multiparty computation protocols (Gundersen et al., 2021).
7. Empirical Properties and Applications
Recent empirical evaluations validate the nonvacuousness and tightness of PAC-Bayes bounds for multivariate regression, especially in large-scale settings. In linear autoencoders for recommendation systems, PAC-Bayes bounds are not only computationally tractable but also correlate strongly with downstream metrics, confirming their practical informativeness (Guo et al., 15 Dec 2025). Earlier concerns about vacuous bounds for deep networks (bounds exceeding test error by large factors) contrast with the small constant-factor ratios observed in modern linear regimes. The ability to obtain non-vacuous, theoretically certified excess risk under weak data assumptions reinforces the value of the PAC-Bayes approach for both theoretical analysis and ML system certification.