PAC-tuning: PAC-Bayes Perturbed Fine-Tuning
- PAC-tuning is a training method that optimizes neural network fine-tuning and pruning by directly minimizing PAC-Bayes risk bounds.
- It formulates a stochastic optimization problem where both parameter noise and pruning masks are learned to provide explicit generalization certificates.
- Empirical findings show that PAC-tuning improves test accuracy and robustness in sparse networks and large pretrained language models.
PAC-Bayes–Driven Perturbed Fine-Tuning (PAC-tuning) is a class of training algorithms for neural networks grounded in the PAC-Bayes framework. It models parameters or pruning masks probabilistically and explicitly minimizes PAC-Bayes generalization bounds. PAC-tuning departs from standard heuristic fine-tuning or pruning workflows by constructing a stochastic optimization problem in which both the learning objective and the injected parameter perturbations are dictated by bounds on generalization risk. The approach covers two principal settings: (1) the optimization of stochastic pruning masks for sparse neural networks with PAC-Bayes certificates (Hayou et al., 2021), and (2) two-stage perturbed fine-tuning of pretrained large language models (LLMs) with bound-driven parameter noise (Liu et al., 2023).
1. PAC-Bayes Foundations and Generalization Bounds
PAC-tuning builds on the PAC-Bayes theory of generalization, which can provide non-vacuous risk bounds for Gibbs predictors parameterized by learned stochastic posteriors. Given a prior $\P$ and a posterior $\Q$ over model parameters or masks, the canonical risk bound takes the form
$\E_{\theta\sim\Q}[L_{\D}(\theta)] \leq \frac{1}{n}\sum_{i=1}^n\E_{\theta\sim\Q}[\ell(x_i, y_i; \theta)] + \frac{\ln(1/\delta) + \mathrm{KL}(\Q\Vert\P)}{\gamma n} + \gamma K^2,$
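where $n$ is the sample size, $\ell$ the loss, $\delta \in (0,1)$ the confidence parameter (the bound holds with probability at least $1-\delta$ over the draw of the sample), $\gamma > 0$ a trade-off parameter, and $K$ bounds the variance of $\ell$ under $\P$. Minimizing this upper bound over both weights (or mask probabilities) and posterior variances yields predictors with explicit generalization guarantees (Hayou et al., 2021, Liu et al., 2023).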
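To make the trade-off between the three terms of the bound concrete, the following minimal sketch evaluates the right-hand side for given values of the empirical Gibbs risk, the KL divergence, $n$, $\delta$, $\gamma$, and $K$. The function names and the numbers are hypothetical and chosen only for illustration. The helper `best_gamma` returns the unconstrained minimizer of the two $\gamma$-dependent terms and ignores any admissible range for $\gamma$ that the underlying theorem may impose.

```python
import math

def pac_bayes_bound(empirical_risk, kl, n, delta, gamma, K):
    """Right-hand side of the bound: empirical risk + complexity term + gamma * K^2."""
    if not (0.0 < delta < 1.0) or n <= 0 or gamma <= 0.0:
        raise ValueError("need 0 < delta < 1, n > 0, gamma > 0")
    complexity = (math.log(1.0 / delta) + kl) / (gamma * n)
    return empirical_risk + complexity + gamma * K ** 2

def best_gamma(kl, n, delta, K):
    """Unconstrained minimizer of complexity/(gamma) + gamma*K^2 (an assumption:
    any range restriction on gamma from the theorem is ignored here)."""
    return math.sqrt((math.log(1.0 / delta) + kl) / (n * K ** 2))

# Hypothetical numbers, for illustration only.
risk, kl, n, delta, K = 0.30, 50.0, 10_000, 0.05, 1.0
g = best_gamma(kl, n, delta, K)
bound = pac_bayes_bound(risk, kl, n, delta, g, K)
print(f"gamma = {g:.4f}, bound = {bound:.4f}")
```

At the unconstrained minimizer the two $\gamma$-dependent terms are equal, so the bound reduces to the empirical risk plus $2K\sqrt{(\ln(1/\delta) + \mathrm{KL})/n}$. A larger KL divergence or a smaller sample therefore loosens the certificate.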
2. Stochastic Modeling: Pruning Masks and Parameter Uncertainty
In the context of neural network sparsification, PAC-tuning employs a stochastic pruning mask: for a given weight vector $w$, the pruned weights $w \odot b$ are determined by a random binary vector $b$ whose entries are drawn independently, $b_i \sim \mathrm{Bernoulli}(\lambda_i)$. This corresponds to a spike-and-slab distribution on the pruned weights, inducing a Gibbs classifier that samples a mask for each prediction (Hayou et al., 2021).
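The mask-sampling step of such a Gibbs predictor can be sketched in a few lines. The example below is a toy illustration for a linear scorer, not the implementation of Hayou et al.: the weights, the per-weight Bernoulli probabilities, and the function names are hypothetical, and each probability is assumed to be the probability of keeping the corresponding weight.

```python
import random

def sample_mask(keep_probs, rng):
    """Draw one binary mask with independent Bernoulli entries."""
    return [1 if rng.random() < p else 0 for p in keep_probs]

def gibbs_predict(weights, keep_probs, x, rng):
    """Linear score computed with a freshly sampled mask (one Gibbs draw)."""
    mask = sample_mask(keep_probs, rng)
    return sum(w * b * xi for w, b, xi in zip(weights, mask, x))

rng = random.Random(0)
weights = [0.5, -1.2, 2.0, 0.3]      # hypothetical weights
keep_probs = [0.9, 0.1, 0.8, 0.0]    # hypothetical keep probabilities
x = [1.0, 1.0, 1.0, 1.0]

# Averaging many Gibbs draws approaches the score of the mean weights w_i * p_i.
draws = [gibbs_predict(weights, keep_probs, x, rng) for _ in range(20000)]
mean_score = sum(draws) / len(draws)
expected = sum(w * p * xi for w, p, xi in zip(weights, keep_probs, x))
print(mean_score, expected)
```

A weight with keep probability 0 is always pruned, and one with probability 1 is always retained, so deterministic masks are a special case of this parameterization.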
For parameter fine-tuning of pretrained LLMs, PAC-tuning adopts blockwise Gaussian posteriors over parameters: $\Q^\theta_\xi = \mathcal{N}(\theta_t, \mathrm{diag}(\xi))$ and $\Q^\omega_\epsilon = \mathcal{N}(\omega_t, \mathrm{diag}(\epsilon))$ for the pretrained "body" and linear head, respectively (Liu et al., 2023). Optimizing these variances in Stage 1 determines the level of parameter perturbation injected during subsequent gradient updates.
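As a minimal sketch of what a blockwise diagonal Gaussian posterior involves, the code below samples perturbed parameters for a "body" block and a "head" block and evaluates the closed-form KL divergence between diagonal Gaussians, which is the quantity that enters the bound. All numeric values are hypothetical, and the prior used here (same means, one shared variance) is an assumption made only for the example, not a choice attributed to Liu et al.

```python
import math
import random

def sample_diag_gaussian(mean, var, rng):
    """Draw one sample from N(mean, diag(var))."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]

def kl_diag_gaussians(mean_q, var_q, mean_p, var_p):
    """KL( N(mean_q, diag(var_q)) || N(mean_p, diag(var_p)) ), summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mean_q, var_q, mean_p, var_p):
        total += 0.5 * (vq / vp + (mq - mp) ** 2 / vp - 1.0 + math.log(vp / vq))
    return total

rng = random.Random(0)
body_mean, body_var = [0.2, -0.1, 0.4], [0.01, 0.02, 0.01]   # hypothetical body block
head_mean, head_var = [1.0, -1.0], [0.05, 0.05]              # hypothetical head block

perturbed_body = sample_diag_gaussian(body_mean, body_var, rng)
perturbed_head = sample_diag_gaussian(head_mean, head_var, rng)

# Independent blocks: the total KL is the sum of the per-block KLs.
prior_var = 0.1  # hypothetical shared prior variance
kl_total = (kl_diag_gaussians(body_mean, body_var, body_mean, [prior_var] * 3)
            + kl_diag_gaussians(head_mean, head_var, head_mean, [prior_var] * 2))
print(perturbed_body, perturbed_head, kl_total)
```

Because the posterior factorizes across blocks and dimensions, the KL term decomposes into a sum, which keeps its cost linear in the number of parameters.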
3. Optimization Objectives and Algorithmic Structure
Pruning
PAC-tuning for pruning is instantiated in three stages:
- Prior pre-training: Train a dense network on a data subset to initialize the mean weights and sparsity probabilities.
- Prior fine-tuning: Jointly optimize prior parameters to minimize expected loss over mask samples, utilizing the Gumbel-Softmax trick for differentiable mask probabilities.
- Posterior training: Initialize the posterior from the prior, then jointly optimize its weights and mask probabilities to minimize the PAC-Bayes bound over the full data (Hayou et al., 2021).
The bound minimization involves the term $\mathrm{KL}(\Q\Vert\P)$, with explicit factorized spike-and-slab parameterizations for both prior and posterior.
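For factorized spike-and-slab distributions the KL divergence has a simple closed form, sketched below under stated assumptions: each weight is, independently, zero with probability $1 - \lambda$ (the spike) and Gaussian with probability $\lambda$ (the slab), for both prior and posterior. Since the spike and the slab do not overlap, the per-weight KL is the Bernoulli KL between the keep probabilities plus the posterior keep probability times the Gaussian KL between the slabs. The Gaussian slab and all names are assumptions for illustration; the exact parameterization in Hayou et al. may differ.

```python
import math

def kl_bernoulli(q, p):
    """KL(Bern(q) || Bern(p)) with the convention 0 * log 0 = 0; requires 0 < p < 1."""
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def kl_gaussian(mq, vq, mp, vp):
    """KL( N(mq, vq) || N(mp, vp) ) for scalar Gaussians."""
    return 0.5 * (vq / vp + (mq - mp) ** 2 / vp - 1.0 + math.log(vp / vq))

def kl_spike_and_slab(post, prior):
    """Each argument is a list of (keep_prob, slab_mean, slab_var), one tuple per weight."""
    total = 0.0
    for (lq, mq, vq), (lp, mp, vp) in zip(post, prior):
        total += kl_bernoulli(lq, lp) + lq * kl_gaussian(mq, vq, mp, vp)
    return total

# Hypothetical two-weight example.
prior = [(0.5, 0.0, 1.0), (0.5, 0.0, 1.0)]
post = [(0.9, 0.8, 0.5), (0.1, 0.0, 1.0)]
print(kl_spike_and_slab(post, prior))
```

The Gaussian part is weighted by the posterior keep probability, so weights that the posterior almost always prunes contribute little slab cost to the bound.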
Fine-tuning Large Models
For large pretrained models, the PAC-tuning algorithm proceeds in two stages:
- Stage 1: Jointly minimize the PAC-Bayes bound over both weights and posterior variances; parameterize posterior as block-diagonal Gaussians centered at current iterate, with learned variances per dimension (Liu et al., 2023).
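- Stage 2: Inject Gaussian noise with the fixed variances learned in Stage 1 into the weights for perturbed gradient descent. At each iteration, sample noise from $\mathcal{N}(0, \mathrm{diag}(\xi))$ and $\mathcal{N}(0, \mathrm{diag}(\epsilon))$, add it to the current body and head parameters to obtain perturbed parameters, compute the loss at the perturbed parameters, and update the unperturbed base parameters using the gradient of that perturbed loss. All other stochastic regularizers (e.g., dropout) are disabled to isolate the effect of bound-driven perturbations.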
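The Stage 2 update can be sketched on a toy problem. The example below runs perturbed gradient descent on a small linear regression with a single parameter block. It is an illustration of the update rule only, not the procedure of Liu et al. for language models: the noise variances are hypothetical fixed values standing in for the output of Stage 1, and the data, learning rate, and function names are assumptions.

```python
import math
import random

def mse_and_grad(theta, X, y):
    """Mean squared error of a linear model and its gradient with respect to theta."""
    n = len(X)
    grad = [0.0] * len(theta)
    loss = 0.0
    for row, target in zip(X, y):
        err = sum(t * v for t, v in zip(theta, row)) - target
        loss += err * err / n
        for j, v in enumerate(row):
            grad[j] += 2.0 * err * v / n
    return loss, grad

def perturbed_gd(theta, noise_var, X, y, lr, steps, rng):
    """Gradient is evaluated at theta + noise, but the update is applied to theta."""
    theta = list(theta)
    for _ in range(steps):
        noise = [math.sqrt(v) * rng.gauss(0.0, 1.0) for v in noise_var]
        perturbed = [t + e for t, e in zip(theta, noise)]
        _, grad = mse_and_grad(perturbed, X, y)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

rng = random.Random(0)
true_theta = [2.0, -1.0]                                   # hypothetical ground truth
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [sum(t * v for t, v in zip(true_theta, row)) for row in X]

noise_var = [1e-4, 1e-4]   # hypothetical; would come from Stage 1 in PAC-tuning
theta0 = [0.0, 0.0]
theta = perturbed_gd(theta0, noise_var, X, y, lr=0.05, steps=300, rng=rng)

loss_before, _ = mse_and_grad(theta0, X, y)
loss_after, _ = mse_and_grad(theta, X, y)
print(loss_before, loss_after)
```

The noise is resampled at every step and never accumulated into the parameters. Only the gradient is affected, which is what distinguishes this scheme from simply adding noise to the final weights.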
4. Data-Adaptive Regularization and Theoretical Insights
PAC-tuning for mask optimization exhibits implicit data-adaptive regularization in linear regression: optimizing stochastic mask probabilities with respect to the risk yields a penalty that depends on the empirical feature covariance, in contrast to the penalty induced by dropout (Hayou et al., 2021). For parameter noise, the learned posterior variances indicate the directions along which the model is robust to perturbation and correspond to flatness in the loss landscape, a property empirically linked to improved generalization (Liu et al., 2023). The process thus favors broader, flatter optima over sharp minima, in line with PAC-Bayes theory.
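One way to see where a data-dependent penalty comes from is the standard bias-variance calculation for squared loss under independent Bernoulli masks: the expected loss equals the loss of the mean weights $w_j \lambda_j$ plus $\sum_j \lambda_j (1-\lambda_j)\, w_j^2\, \hat\Sigma_{jj}$, where $\hat\Sigma = X^\top X / n$ is the (uncentered) empirical second-moment matrix of the features. The sketch below checks this identity exactly by enumerating all masks on a tiny hypothetical dataset. It is a generic calculation offered for intuition, and the exact form of the penalty stated by Hayou et al. is not reproduced here and may be written differently.

```python
import itertools

def expected_masked_mse(w, probs, X, y):
    """Exact expectation of the MSE over all 2^d Bernoulli masks (small d only)."""
    n, d = len(X), len(w)
    total = 0.0
    for mask in itertools.product([0, 1], repeat=d):
        p_mask = 1.0
        for b, p in zip(mask, probs):
            p_mask *= p if b else (1.0 - p)
        mse = sum((sum(wj * bj * xj for wj, bj, xj in zip(w, mask, row)) - yi) ** 2
                  for row, yi in zip(X, y)) / n
        total += p_mask * mse
    return total

def mean_loss_plus_penalty(w, probs, X, y):
    """MSE of the mean weights w_j * p_j plus the mask-variance penalty."""
    n = len(X)
    mean_w = [wj * pj for wj, pj in zip(w, probs)]
    fit = sum((sum(m * xj for m, xj in zip(mean_w, row)) - yi) ** 2
              for row, yi in zip(X, y)) / n
    penalty = 0.0
    for j, (wj, pj) in enumerate(zip(w, probs)):
        sigma_jj = sum(row[j] ** 2 for row in X) / n  # diagonal of X^T X / n
        penalty += pj * (1.0 - pj) * wj ** 2 * sigma_jj
    return fit + penalty

# Hypothetical data, weights, and keep probabilities.
X = [[1.0, 2.0, 0.5], [0.0, 1.0, -1.0], [2.0, -1.0, 1.5], [1.0, 0.5, 0.0]]
y = [1.0, -0.5, 2.0, 0.3]
w = [0.8, -0.4, 1.1]
probs = [0.9, 0.5, 0.2]

lhs = expected_masked_mse(w, probs, X, y)
rhs = mean_loss_plus_penalty(w, probs, X, y)
print(lhs, rhs)
```

The penalty vanishes for keep probabilities of 0 or 1 and scales with the feature energy in each coordinate, which is the sense in which the regularization adapts to the data.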
5. Empirical Findings and Practical Guidelines
Experiments on pruned networks and LLM fine-tuning establish several key properties:
- Fine-tuning stochastic masks consistently reduces test error relative to one-shot masks (e.g., magnitude, SNIP, or random) in the reported experiments, especially at extreme sparsity (e.g., 95–99%). Thresholded masks derived after probabilistic fine-tuning outperform the masks they were initialized from.
- Mask overlap between initial and optimized stochastic masks declines steeply with sparsity, often falling to 0–5% at 99% sparsity, suggesting that standard heuristics do not approximate local optima under the stochastic risk (Hayou et al., 2021).
- Noise-robustness: Pruned and probabilistically fine-tuned models exhibit greater robustness to additive Gaussian weight noise than dense baselines.
- PAC-tuning for pretrained language models: On 5 GLUE tasks, PAC-tuning of BERT-base achieves a mean score of 0.573, versus 0.547 for LoRA and 0.533 for vanilla fine-tuning. PAC-tuning of GPT-2 achieves 0.486, compared to 0.462 for data augmentation and 0.461 for vanilla fine-tuning. Per-task results on CoLA, SST-2, QNLI, and RTE are consistently best or near-best. PAC-tuning is competitive or superior in few-shot regimes (20–100 examples) and on BERT-large (Liu et al., 2023).
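The mask-overlap statistic reported above can be computed with a short helper. The definition used here, the fraction of weights kept by the initial mask that the optimized mask also keeps, is an assumption made for illustration; the source does not spell out the exact definition, and the masks below are hypothetical.

```python
def mask_overlap(initial_mask, optimized_mask):
    """Fraction of weights kept by the initial mask that the optimized mask also keeps."""
    kept_initial = sum(initial_mask)
    if kept_initial == 0:
        return 0.0
    shared = sum(1 for a, b in zip(initial_mask, optimized_mask) if a and b)
    return shared / kept_initial

# Hypothetical masks at 75% sparsity (2 of 8 weights kept).
initial = [1, 1, 0, 0, 0, 0, 0, 0]
optimized = [1, 0, 0, 1, 0, 0, 0, 0]
overlap = mask_overlap(initial, optimized)
print(overlap)
```

Under this definition, two masks with the same sparsity can have an overlap anywhere between 0 and 1, so a low value indicates that optimization moved the retained weights rather than merely rescaling them.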
Recommended hyperparameters include separate learning rates for the pretrained body and the head, variance learning rates (pretrained: 0.1; head: decayed from 0.5 to 0.01), batch size 32, and disabling dropout. Stage 1 typically requires 100–250 epochs for variance convergence; Stage 2 uses an additional 35–50 epochs.
6. Extensions and Theoretical Guarantees
Key theoretical and practical insights:
- Viewing pruning and fine-tuning as stochastic operations with optimized parameter or mask distributions turns sparsification and adaptation into PAC-Bayes bound minimization, yielding both tight generalization certificates and data-driven regularization effects absent from deterministic heuristics (Hayou et al., 2021, Liu et al., 2023).
- PAC-Bayes self-bounded learning returns not only a stochastic model but also a non-vacuous upper bound on its generalization error, empirically tight (within a few percentage points of test error) on standard vision benchmarks.
- Robustness and alignment effects can be related to feature realignment in wide networks, and the KL component in the bound controls complexity via the drift between prior and posterior alignments.
- Extensions proposed include replacing fully factorized spike-and-slab distributions with structured group sparsity, applying the methodology to iterative magnitude pruning (e.g., for lottery-ticket hypotheses), and scaling to large-scale architectures with approximate PAC-Bayes bounds (second-order or mutual information-based).
7. Significance, Limitations, and Open Problems
The PAC-tuning paradigm subsumes stochastic regularization, explicit generalization risk control, and principled fine-tuning within a unified optimization problem. In both mask-based pruning (Hayou et al., 2021) and parameter perturbation for LLMs (Liu et al., 2023), it demonstrates superior empirical accuracy and robustness, while providing statistically meaningful risk certificates. Limitations include the computational cost of PAC-Bayes bound minimization, especially for large models, and the stringency of bounds under extreme distributional shift or very limited data. A plausible implication is that continued development of tractable PAC-Bayes objectives and scalable posterior parameterizations will broaden the applicability of PAC-tuning across modalities and regimes.