
Perturbed Supervised Fine-Tuning (PSFT)

Updated 6 December 2025
  • PSFT is a two-stage optimization framework that fine-tunes pretrained language models by injecting parameter-aware noise to enhance robustness, especially in few-shot settings.
  • It leverages PAC-Bayes theory to calibrate parameter variances, providing distributional generalization guarantees while balancing empirical loss and regularization.
  • Empirical results on GLUE benchmarks show that PSFT outperforms traditional fine-tuning methods, achieving notable gains in both BERT-base and GPT-2 models.

Perturbed Supervised Fine-Tuning (PSFT) is a two-stage optimization framework that fine-tunes pretrained LLMs by coupling distributional generalization guarantees from PAC-Bayes theory with explicit noise injection into gradient descent. This methodology is designed to enhance generalization, particularly in few-shot settings, by calibrating parameter-wise noise levels derived from PAC-Bayes bounds and applying them as stochastic perturbations during supervised fine-tuning. Its efficacy is empirically validated via the PAC-tuning algorithm, which achieves notable gains over common baselines on GLUE tasks (Liu et al., 2023).

1. Conceptual Foundation and Motivation

Supervised fine-tuning of pretrained language models (PLMs) is challenged by overfitting and suboptimal generalization, especially when training data is scarce. Traditional approaches employ regularization techniques such as data augmentation or pruning, but these require extensive hyperparameter tuning and can be incompatible with advanced optimizers. PSFT addresses these challenges by:

  • Leveraging PAC-Bayes theory to directly minimize a generalization bound, yielding parameter distributions rather than point estimates.
  • Automatically learning parameter-wise posterior variances that encode uncertainty and confidence in weight settings.
  • Utilizing these learned variances to inform noise injection during gradient updates, promoting robust generalization in downstream supervised tasks.

A plausible implication is that PAC-Bayes–motivated noise injection subsumes ad hoc regularization mechanisms, offering a theoretically justified, adaptive alternative for few-shot learning scenarios.

2. Two-Stage PAC-tuning Framework

PSFT is instantiated as the PAC-tuning pipeline, comprising the following sequential stages:

Stage 1: PAC-Bayes Training

  • Joint optimization of the model body ($\theta$, typically frozen or slow-moving layers) and task-head parameters ($\omega$).
  • Posterior variances ($\xi$ for $\theta$, $\epsilon$ for $\omega$) are learned by minimizing the bound jointly with the empirical task loss.
  • The objective function is formulated as a PAC-Bayes generalization bound:

J(D; \xi, \epsilon, \theta, \omega) = \frac{1}{m} \sum_{i=1}^m \ell(x_i, y_i; \theta, \omega) + \frac{\ln \frac{1}{\delta} + \mathrm{KL}(Q^\theta_\xi \Vert P^\theta_\lambda) + \mathrm{KL}(Q^\omega_\epsilon \Vert P^\omega_\beta)}{\gamma m} + \gamma K^2

where $Q^\theta_\xi, Q^\omega_\epsilon$ are the learned Gaussian posteriors, $P^\theta_\lambda, P^\omega_\beta$ are priors, $\gamma$ is a scaling parameter, and $K^2$ bounds the task-loss variance.

  • Output is a parameter initialization $(\theta^*, \omega^*)$ and parameter-wise noise variances $(\xi^*, \epsilon^*)$.
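Under the diagonal-Gaussian choice for posteriors and priors, the bound terms of $J$ can be sketched in plain Python; the function names and argument layout below are illustrative, not from the paper:

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))) for diagonal Gaussians."""
    return sum(
        0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def pac_bayes_objective(emp_loss, m, delta, gamma, K,
                        theta, xi, prior_theta, lam,
                        omega, eps, prior_omega, beta):
    """J(D; xi, eps, theta, omega): empirical loss plus the PAC-Bayes bound terms."""
    kl = (kl_diag_gauss(theta, xi, prior_theta, lam)
          + kl_diag_gauss(omega, eps, prior_omega, beta))
    return emp_loss + (math.log(1.0 / delta) + kl) / (gamma * m) + gamma * K ** 2
```

As a sanity check, when each posterior equals its prior both KL terms vanish and $J$ reduces to the empirical loss plus $\ln(1/\delta)/(\gamma m) + \gamma K^2$.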

Stage 2: Perturbed Supervised Fine-Tuning

  • Drop the bound term and fix $\xi \leftarrow \xi^*$, $\epsilon \leftarrow \epsilon^*$.
  • Apply supervised gradient descent with explicit noise injection:

\theta_{t+1} = \theta_t - \eta_\theta \nabla_\theta \ell(x, y; \theta'_t, \omega'_t) + \xi_t, \quad \xi_t \sim \mathcal{N}(0,\, \Sigma^\theta)

with noisy parameters

\theta'_t = \theta_t + \xi_t, \quad \omega'_t = \omega_t + \zeta_t, \quad \zeta_t \sim \mathcal{N}(0,\, \Sigma^\omega)

where $\Sigma^\theta, \Sigma^\omega$ are the diagonal posterior variances.

A key insight is that the level of injected noise is not global but parameter-aware, linked directly to the PAC-Bayes bound's assessment of each parameter's safe variance.
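A single Stage-2 update can be sketched as follows, following the perturb-evaluate-revert description in Section 3: sample fresh noise, evaluate the gradient at the perturbed point, and apply it to the unperturbed parameters. The function `perturbed_sgd_step` and its signature are illustrative, not from the paper:

```python
import random

def perturbed_sgd_step(theta, sigma2, grad_fn, lr):
    """One perturbed update: sample xi_t ~ N(0, diag(sigma2)), take the gradient
    at theta + xi_t, and apply it to the unperturbed parameters theta."""
    xi_t = [random.gauss(0.0, s ** 0.5) for s in sigma2]      # parameter-wise noise
    theta_pert = [t + n for t, n in zip(theta, xi_t)]         # theta'_t
    grad = grad_fn(theta_pert)                                # gradient at theta'_t
    return [t - lr * g for t, g in zip(theta, grad)]          # update theta_t
```

With all variances set to zero the step degenerates to ordinary gradient descent, which makes the role of the learned $\Sigma^\theta$ easy to isolate.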

3. Optimization Objectives and Update Mechanisms

The core Stage 1 objective, $J(D; \xi, \epsilon, \theta, \omega)$, is minimized jointly over the parameter means ($\theta$, $\omega$) and variances ($\xi$, $\epsilon$), subject to positivity constraints (projection $\xi \geq 0$, $\epsilon \geq 0$). The KL-divergence terms penalize deviation of the learned posteriors from the priors, anchoring parameter values to prior beliefs unless a shift is strongly justified by the data. Minimizing this objective yields parameter settings that balance data fit and capacity control.

Stage 2 focuses on standard supervised loss minimization, but each step includes stochastic perturbation:

  • Noise $\xi_t$ and $\zeta_t$ are sampled anew for each batch from the previously learned variances.
  • Parameters are temporarily shifted, the gradient estimated, and then updated after reverting to the unperturbed state.

This mechanism acts analogously to classical perturbed gradient descent (PGD) but with theoretically derived and parameter-specific noise.

4. Empirical Results and Comparative Performance

PAC-tuning demonstrates strong performance gains on GLUE few-shot benchmarks:

  • For BERT-base: PAC-tuning achieves an average metric score of $0.573$, outperforming vanilla fine-tuning ($0.533$), DataAug ($0.536$), NoiseInject ($0.536$), LoRA ($0.547$), Prefix ($0.497$), and BitFit ($0.510$).
  • On GPT-2, PAC-tuning yields an average of $0.486$ versus $0.461$ for vanilla fine-tuning, a $+2.5$-point improvement.
  • Per-task, PAC-tuning improves CoLA to $0.335$ (vs. $0.235$), SST to $0.834$ (vs. $0.773$), and RTE to $0.601$ (vs. $0.589$).
| Method | BERT-base Avg. | GPT-2 Avg. |
| --- | --- | --- |
| Vanilla | 0.533 | 0.461 |
| LoRA | 0.547 | n/a |
| PAC-tuning | 0.573 | 0.486 |

The results indicate robust superiority versus conventional and competitive baselines, especially in few-shot scenarios.

5. Practical Implementation Details

Empirical configuration employs the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-3}$) with learning-rate schedules and regularization as follows:

  • Stage 1: lr$_\theta = 5\times10^{-5}$, lr$_\omega = 1\times10^{-2}$, lr$_\xi = 0.1$, lr$_\epsilon$ decayed $0.5 \rightarrow 0.01$, batch size 32, max 300 epochs.
  • Stage 2: $\eta_\theta = 5\times10^{-5}$, $\eta_\omega = 1\times10^{-2}$, $\approx 35$ epochs.
  • Bound parameters: $\gamma = 10$ for CoLA/SST, $\gamma = 5$ otherwise; $\delta = 1\times10^{-6}$; prior variances initialized to parameter magnitudes.
  • Regularization: weight decay $0.01$; dropout is disabled to avoid interference with explicit noise.
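For reference, the reported hyperparameters can be collected into plain Python dictionaries; the names and grouping below are ours, not a published configuration schema:

```python
# Illustrative grouping of the reported PAC-tuning hyperparameters.
STAGE1_CONFIG = {
    "optimizer": "AdamW", "betas": (0.9, 0.98), "adam_eps": 1e-3,
    "lr_theta": 5e-5, "lr_omega": 1e-2, "lr_xi": 0.1,
    "lr_eps_schedule": (0.5, 0.01),   # decayed 0.5 -> 0.01
    "batch_size": 32, "max_epochs": 300,
    "delta": 1e-6, "weight_decay": 0.01,
    "dropout": 0.0,                   # disabled to avoid interfering with noise
}

STAGE2_CONFIG = {
    "lr_theta": 5e-5, "lr_omega": 1e-2, "epochs": 35,
}

def gamma_for_task(task):
    """Bound scale gamma: 10 for CoLA/SST, 5 for the other GLUE tasks."""
    return 10.0 if task in {"CoLA", "SST"} else 5.0
```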

Below is condensed pseudocode highlighting the two-stage process:

for epoch in 1..epochs1:
    Compute J and gradients w.r.t. θ, ω, ξ, ε
    Update θ, ω, ξ, ε (project ξ, ε ≥ 0)
Save Σ^θ, Σ^ω

for epoch in 1..epochs2:
    for batch:
        Sample noise ξ_t ~ N(0, Σ^θ), ζ_t ~ N(0, Σ^ω)
        θ', ω' = θ + ξ_t, ω + ζ_t
        Compute gradients at (θ', ω') and update θ, ω
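The two-stage pseudocode can be made concrete as a runnable single-parameter toy. All constants (loss shape, prior $\mathcal{N}(0,1)$, learning rates, step counts) are illustrative, and this is in no way the paper's implementation:

```python
import math, random

GAMMA, M, DELTA, K = 5.0, 100, 1e-6, 0.1

def emp_loss(theta):
    """Toy empirical loss with its minimum at theta = 2."""
    return (theta - 2.0) ** 2

def J(theta, xi):
    """Toy Stage-1 objective with KL(N(theta, xi) || N(0, 1)) for one parameter."""
    kl = 0.5 * (math.log(1.0 / xi) + xi + theta ** 2 - 1.0)
    return emp_loss(theta) + (math.log(1.0 / DELTA) + kl) / (GAMMA * M) + GAMMA * K ** 2

random.seed(0)
theta, xi = 0.0, 0.5

# Stage 1: jointly minimize J over theta and xi (analytic gradients),
# projecting xi back to positive values after each step.
for _ in range(500):
    g_theta = 2.0 * (theta - 2.0) + theta / (GAMMA * M)
    g_xi = 0.5 * (1.0 - 1.0 / xi) / (GAMMA * M)
    theta -= 0.05 * g_theta
    xi = max(xi - 10.0 * g_xi, 1e-4)

theta_star, sigma2 = theta, xi   # initialization and learned noise variance

# Stage 2: perturbed fine-tuning; the gradient is taken at the perturbed
# point and applied to the unperturbed parameter.
for _ in range(200):
    xi_t = random.gauss(0.0, sigma2 ** 0.5)
    theta -= 0.05 * 2.0 * ((theta + xi_t) - 2.0)
```

In this toy, Stage 1 drives $\theta$ near the loss minimum while $\xi$ drifts toward the prior variance, and Stage 2 then hovers noisily around the minimum rather than collapsing onto a sharp point estimate.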

6. Theoretical Properties and Intuition

PAC-driven noise injection confers several theoretically grounded benefits:

  • Principled, parameter-wise variance selection supersedes untuned, isotropic perturbations.
  • The noise regularizes optimization trajectories, favoring flat minima evidenced by low Hessian trace—an established proxy for generalization.
  • Stochastic perturbations facilitate escape from sharp local minima and saddle points, improving optimization robustness.
  • PAC-Bayes bound terms counteract excessive variance, maintaining a tradeoff between training fit and model complexity.
  • Compared to vanilla fine-tuning (which minimizes empirical loss) and untuned PGD, the approach yields consistent generalization improvements due to its capacity-aware and theoretically justified noise calibration.
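The Hessian-trace flatness proxy mentioned above can be estimated without materializing the Hessian via Hutchinson's estimator, $\mathrm{tr}(H) = \mathbb{E}[z^\top H z]$ for Rademacher $z$. The sketch below assumes only a Hessian-vector-product routine; the function names are ours:

```python
import random

def hutchinson_trace(hvp, dim, n_samples=100, rng=random):
    """Estimate tr(H) as the average of z^T H z over Rademacher vectors z,
    using only a Hessian-vector-product function hvp(z)."""
    total = 0.0
    for _ in range(n_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Hz = hvp(z)
        total += sum(zi * hzi for zi, hzi in zip(z, Hz))
    return total / n_samples
```

For a quadratic loss with diagonal Hessian the estimator is exact for every sample, since $z_i^2 = 1$, which makes it a convenient sanity check before applying it to a real model.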

A plausible implication is that PSFT fundamentally reconfigures fine-tuning optimizers toward distributional smoothness, with PAC-Bayes bounds as a governing principle.

7. Context and Extensions

The PAC-tuning instantiation of PSFT is grounded in the work of Liu et al. (2023), addressing limitations of prior regularization methods and extending applicability to any downstream setting where Adam or standard optimizers are deployed. The methodology is notably effective for few-shot tasks but may generalize to broader domains where overfitting and generalization error are concerns.

Dropout and other regularizations are intentionally disabled during PSFT, suggesting that explicit, bound-informed noise perturbation can substitute for or even outperform conventional schemes. Further investigation may consider alternative prior/posterior constructions and their impact on learned variances and generalization properties.

In summary, Perturbed Supervised Fine-Tuning via the PAC-tuning paradigm constitutes a PAC-Bayes-grounded, empirically validated, and theoretically motivated method for robust adaptation of PLMs, with measurable improvements over established fine-tuning approaches.
