Learning to Perturb Gradients (LPG)

Updated 4 July 2026

Learning to Perturb Gradients (LPG) is a class-aware method that adaptively adjusts logit gradients during training to improve deep neural network performance.
It unifies techniques like SAM, clipping, and noise injection by replacing raw gradients with class-conditioned perturbed signals computed using closed-form or iterative methods.
Empirical evaluations on balanced, long-tail, and noisy label tasks demonstrate LPG’s ability to boost accuracy and generalization while reducing overfitting.

Learning to Perturb Gradients (LPG) most specifically denotes a class-aware gradient perturbation method for adaptive training of deep neural networks, in which the raw backward signal $g$ is replaced by a perturbed gradient $\tilde g = g + \delta_g$ and the perturbation is learned in logit-gradient space rather than directly in parameter space (Li, 28 May 2026). In a broader perturbation-based lineage, closely related ideas appear in decision-focused learning, where finite perturbations approximate directional derivatives of downstream decision loss, and in embedded optimization layers, where backward signals are obtained by re-solving a perturbed optimization problem (Huang et al., 2024, Paulus et al., 2024). Earlier perturb-and-MAP learning for structured prediction is not LPG in the modern sense, but it is a close ancestor because it uses random perturbations and perturbed MAP solutions to construct stochastic gradient estimates (Shpakova et al., 2018).

1. Conceptual scope and nomenclature

The term “Learning to Perturb Gradients” is not attached to a single perturbation mechanism across the literature. In its narrow contemporary use, it refers to adaptive perturbation of gradients during deep network training. In a wider methodological sense, it names a family resemblance: gradients or gradient surrogates are made informative by perturbing an optimization or inference problem and observing the induced change in solver output, objective value, or effective update direction. The common thread is not a shared implementation, but the use of controlled perturbations to replace degenerate, discontinuous, or insufficiently task-aligned derivatives (Li, 28 May 2026, Huang et al., 2024, Paulus et al., 2024, Shpakova et al., 2018).

Strand	Perturbation object	Primary role
LPG (Li, 28 May 2026)	Logit-level gradients at the class level	Category-aware adaptive training
PG losses (Huang et al., 2024)	Optimization objective in the label direction	Decision-aware surrogate loss
LPGD (Paulus et al., 2024)	Embedded optimization problem in the backward pass	Replacement derivative for optimization layers
Marginal perturb-and-MAP (Shpakova et al., 2018)	Fixed Gumbel perturbations in MAP inference	Stochastic learning of structured probabilistic models

A recurrent misconception is to treat these strands as identical algorithms. They are better understood as related perturbation-based approaches with different objects of perturbation, different optimization targets, and different theoretical guarantees. The 2026 LPG method is explicitly about adaptive training by perturbing backward signals; PG losses and LPGD instead perturb optimization problems to obtain informative gradients; marginal perturb-and-MAP uses fixed Gumbel noise to approximate probabilistic quantities rather than learning perturbations themselves.

2. Unified gradient-perturbation viewpoint

The 2026 LPG formulation starts from a simple abstraction of training: the backward chain supplies gradients that determine parameter updates, and many apparently different optimization heuristics can be rewritten as replacing $g$ by a perturbed gradient $\tilde g = g + \delta_g$ (Li, 28 May 2026).

Within this view, Sharpness-Aware Minimization, gradient clipping, and gradient noise injection are interpreted as specific instances of gradient perturbation. The paper states that SAM induces a curvature-dependent perturbation of the gradient via Taylor expansion, clipping is a norm-dependent perturbation, and gradient noise injection is an isotropic perturbation. The unifying claim is that these methods differ only in how they define $\delta_g$, while sharing the same structural effect: they alter the backward signal before the parameter update.
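The reading of these three heuristics as choices of $\delta_g$ can be illustrated with a small sketch. It is not taken from the paper: the helper names (`clip_delta`, `noise_delta`, `sam_delta`) are hypothetical, and the SAM case assumes a toy quadratic loss with a diagonal Hessian so that the curvature dependence is visible without a Taylor expansion.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip_delta(g, max_norm):
    """Norm-dependent perturbation: g + delta is g rescaled to norm <= max_norm."""
    norm = l2_norm(g)
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [(scale - 1.0) * x for x in g]

def noise_delta(g, sigma, rng):
    """Isotropic perturbation: Gaussian noise that ignores the direction of g."""
    return [rng.gauss(0.0, sigma) for _ in g]

def sam_delta(w, hess_diag, rho):
    """Curvature-dependent perturbation for the toy loss 0.5 * sum(a_k * w_k**2).

    SAM evaluates the gradient at w + rho * g / ||g||; the difference from the
    raw gradient g is the induced perturbation (assumes g is nonzero).
    """
    g = [a * x for a, x in zip(hess_diag, w)]
    norm = l2_norm(g)
    w_adv = [x + rho * gk / norm for x, gk in zip(w, g)]
    g_adv = [a * x for a, x in zip(hess_diag, w_adv)]
    return [ga - gk for ga, gk in zip(g_adv, g)]

def perturbed(g, delta):
    """Apply a perturbation: g_tilde = g + delta."""
    return [x + d for x, d in zip(g, delta)]

g = [3.0, 4.0]  # raw gradient with L2 norm 5
g_clipped = perturbed(g, clip_delta(g, max_norm=1.0))
g_noisy = perturbed(g, noise_delta(g, sigma=0.1, rng=random.Random(0)))
d_sam = sam_delta(w=[1.0, 1.0], hess_diag=[1.0, 3.0], rho=0.05)
```

In this toy setting the SAM perturbation equals $\rho\, A g / \|g\|$ for the diagonal Hessian $A$, so the coordinate with larger curvature receives the larger perturbation, whereas the clipping perturbation depends only on $\|g\|$ and the noise perturbation on neither.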

The paper argues that what these methods lack is class-aware adaptability. SAM, clipping, and noise do not adapt perturbations according to class properties such as accuracy, frequency, or label quality. LPG introduces this missing degree of freedom through two conjectures inspired by Logit Perturbation Learning: amplifying the gradient norm for a class acts as positive augmentation, whereas dampening it acts as negative augmentation. In this formulation, gradient magnitude becomes an explicit proxy for learning intensity: stronger gradients enhance learning for a class, and weaker gradients suppress overfitting for that class.

This emphasis on the backward chain marks a shift in perturbation methodology. Feature perturbation, logit perturbation, and label perturbation operate on the forward path; LPG treats gradient perturbation as the backward-path counterpart.

3. Logit-gradient formulation of LPG

LPG is implemented not in parameter space but in the lower-dimensional logit-gradient space. For a sample $x_i$ with label $y_i$, the network outputs logits

$u_i = f(x_i; W) \in \mathbb{R}^C,$

with cross-entropy loss

$l(u_i, y_i) = -\log p_{y_i}, \qquad p_{y_i} = \frac{\exp(u_{i,y_i})}{\sum_{j=1}^C \exp(u_{i,j})}.$
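For softmax cross-entropy, the gradient of the loss with respect to the logits has the standard form $p - e_{y}$, where $e_y$ is the one-hot vector of the label. The following minimal sketch computes it and checks it against a central finite difference; the function names are illustrative and not part of the LPG paper.

```python
import math

def softmax(u):
    """Numerically stable softmax over a list of logits."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(u, y):
    """l(u, y) = -log p_y."""
    return -math.log(softmax(u)[y])

def logit_gradient(u, y):
    """Gradient of cross-entropy w.r.t. the logits: softmax(u) - one_hot(y)."""
    p = softmax(u)
    return [p_j - (1.0 if j == y else 0.0) for j, p_j in enumerate(p)]

u, y = [2.0, 0.5, -1.0], 0
h = logit_gradient(u, y)

# Central finite-difference check of each coordinate.
eps = 1e-6
h_fd = []
for j in range(len(u)):
    up, dn = list(u), list(u)
    up[j] += eps
    dn[j] -= eps
    h_fd.append((cross_entropy(up, y) - cross_entropy(dn, y)) / (2 * eps))
```

The vector `h` lives in $\mathbb{R}^C$ regardless of the size of the network, which is what makes the logit layer a cheap place to intervene on the backward signal.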

The paper defines the parameter gradient $g_i = \nabla_W l(f(x_i; W), y_i)$, the class average gradient $\bar g_c = \frac{1}{|S_c|} \sum_{i \in S_c} g_i$ over the set $S_c$ of samples with label $c$, the logit gradient $h_i = \nabla_{u_i} l(u_i, y_i) \in \mathbb{R}^C$, the class average logit gradient $\bar h_c = \frac{1}{|S_c|} \sum_{i \in S_c} h_i$, and the Jacobian $J_i = \partial u_i / \partial W$. By the chain rule,

$g_i = J_i^\top h_i.$

This factorization is the central design choice of LPG: perturb $h_i$ in $\mathbb{R}^C$, then let the Jacobian propagate the perturbation into parameter space (Li, 28 May 2026).

Standard training performs

$W \leftarrow W - \eta\, g_i,$

with learning rate $\eta$. LPG replaces this with

$W \leftarrow W - \eta\, (g_i + \delta_{g,i}),$

where $\delta_{g,i}$ is induced by a class-level perturbation. Rather than learning $\delta_{g,i}$ directly in parameter space, LPG perturbs the logit gradient, $\tilde h_i = h_i + \delta_{y_i}$ with $\delta_c \in \mathbb{R}^C$ shared by all samples of class $c$, and induces the parameter-space perturbation through

$\delta_{g,i} = J_i^\top \delta_{y_i}.$

All samples in a class therefore share the same perturbation direction in logit-gradient space, while the induced perturbation in parameter space remains sample-specific because $J_i$ is sample-dependent.

For classes assigned to positive augmentation, LPG seeks to maximize the effective gradient norm: $\max_{\|\delta_c\| \le \epsilon_c} \big\| \frac{1}{|S_c|} \sum_{i \in S_c} J_i^\top (h_i + \delta_c) \big\|.$ For classes assigned to negative augmentation, it minimizes the same quantity: $\min_{\|\delta_c\| \le \epsilon_c} \big\| \frac{1}{|S_c|} \sum_{i \in S_c} J_i^\top (h_i + \delta_c) \big\|.$

The paper gives a closed-form perturbation by restricting the perturbation direction to align with $\bar h_c$. Positive augmentation uses

$\delta_c^{+} = \epsilon_c\, \frac{\bar h_c}{\|\bar h_c\|},$

and negative augmentation uses

$\delta_c^{-} = -\epsilon_c\, \frac{\bar h_c}{\|\bar h_c\|}.$

Equivalently,

$\bar h_c + \delta_c^{\pm} = \Big(1 \pm \frac{\epsilon_c}{\|\bar h_c\|}\Big)\, \bar h_c.$

The paper states that this is optimal when the perturbation is restricted to the direction of $\bar h_c$. Because it avoids explicit Jacobian computation, it is presented as the efficient default implementation.

A more expressive variant refines $\delta_c$ with projected gradient descent. Starting from an initial $\delta_c^{(0)}$, the update iteratively performs projected ascent for positive augmentation and projected descent for negative augmentation, using a fixed budget of $K$ PGD steps in practice.
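A closed-form perturbation aligned with the class-average logit gradient can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: it assumes an L2 radius `eps` per class, a shift of magnitude `eps` along the normalized class mean, and hypothetical helper names throughout.

```python
import math

def class_mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def closed_form_delta(h_bar, eps, positive):
    """Shift of size eps along (positive) or against (negative) h_bar."""
    norm = l2_norm(h_bar)
    if norm == 0.0:
        return [0.0] * len(h_bar)  # no direction to align with
    sign = 1.0 if positive else -1.0
    return [sign * eps * x / norm for x in h_bar]

def perturb_class(logit_grads, eps, positive):
    """Add one shared delta to every per-sample logit gradient of a class."""
    delta = closed_form_delta(class_mean(logit_grads), eps, positive)
    return [[x + d for x, d in zip(h, delta)] for h in logit_grads]

# Two samples of the same class in a 3-class problem (illustrative values).
grads = [[-0.4, 0.3, 0.1], [-0.2, 0.1, 0.1]]
base = l2_norm(class_mean(grads))
amplified = l2_norm(class_mean(perturb_class(grads, eps=0.1, positive=True)))
damped = l2_norm(class_mean(perturb_class(grads, eps=0.1, positive=False)))
```

Under these assumptions the class-mean logit-gradient norm moves by exactly `eps` in either direction, and no Jacobian is needed because the perturbation is applied before backpropagation continues through the network.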

4. Class partitioning, perturbation bounds, and theory

LPG is task-adaptive because the partition of classes into positive and negative augmentation sets changes with the training regime (Li, 28 May 2026).

For balanced or general classification, classes are split into positive and negative augmentation sets by their running average accuracy. For long-tail classification, classes are split by frequency, with tail classes receiving positive augmentation. For noisy label learning, classes are split by intra-class gradient variance, that is, the average squared deviation of a class's per-sample gradients from their class mean, with high-variance classes receiving negative augmentation.

The perturbation radius is also class-dependent: it is set as a function of the splitting statistic relevant to the regime, namely accuracy, frequency, or gradient variance.
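The partitioning step can be sketched with a generic per-class statistic. Everything specific here is an assumption made for illustration: the threshold is taken to be the mean of the statistic across classes, classes with a low value (low accuracy, low frequency, or low gradient variance) are placed in the positive set, and the function names are hypothetical. The actual thresholds and radius schedule used by LPG are not reproduced.

```python
def intra_class_variance(grads):
    """Mean squared deviation of per-sample gradients from their class mean."""
    n, dim = len(grads), len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(dim)]
    return sum(sum((g[j] - mean[j]) ** 2 for j in range(dim)) for g in grads) / n

def partition_classes(stat):
    """Split classes into (positive, negative) sets around the mean statistic.

    Assumption: a low statistic means the class should be amplified.
    """
    threshold = sum(stat.values()) / len(stat)
    positive = {c for c, s in stat.items() if s < threshold}
    negative = set(stat) - positive
    return positive, negative

# Long-tail regime: the statistic is the class frequency.
freq = {0: 5000, 1: 500, 2: 50}
tail, head = partition_classes(freq)

# Noisy-label regime: the statistic is the intra-class gradient variance.
variance = {
    "clean": intra_class_variance([[1.0, 0.0], [1.0, 0.0]]),
    "noisy": intra_class_variance([[1.0, 0.0], [-1.0, 0.0]]),
}
amplify, dampen = partition_classes(variance)
```

With these toy numbers the two rare classes are amplified and the frequent class is dampened, while in the noisy-label example the class whose gradients disagree is the one that is dampened.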

The paper provides two principal theoretical connections. The first is a duality between logit perturbation and gradient perturbation. It shows that Logit Perturbation Learning induces a structured logit-gradient perturbation through a mixed Hessian term, whereas LPG generalizes this by allowing arbitrary perturbations of the logit gradient. Because the cross-entropy Hessian is positive semidefinite, the paper states that LPL is strictly less expressive than LPG in terms of achievable gradient-direction changes.

The second is a PAC-Bayesian generalization connection. The bound relates population risk to empirical loss under LPG, a complexity term, and a class-weighted perturbation penalty that grows with the class-wise perturbation radii. The proof sketch states that a class-wise gradient perturbation with a bounded radius changes the parameter update by a bounded amount per step, so that the deviation from SGD accumulated over the training steps is bounded as well.

The explicit interpretation is a trade-off: larger perturbations may strengthen augmentation but also enlarge the generalization penalty.

5. Empirical regimes and reported behavior

The empirical evaluation of LPG spans balanced classification, long-tail classification, and noisy label learning, with the method presented as consistently outperforming existing baselines and as usable as a plug-in module (Li, 28 May 2026).

For balanced classification, the reported datasets are CIFAR-10 and CIFAR-100, the models are WRN-28-10 and ResNet-110, and the compared methods include CE, Label Smoothing, Mixup, ISDA, LA, LPL, SAM, Gradient Noise, and LPG variants. On CIFAR-100 with ResNet-110, the reported results include LPL (mean + varied) at 22.65%, LPG (mean + varied) at 22.28%, and LPG-PGD (mean + varied) at 22.15%. The paper interprets this as evidence that backward-chain perturbation is complementary to forward-chain perturbation.

For long-tail classification, the datasets are CIFAR-10-LT and CIFAR-100-LT, the imbalance ratios are 100:1 and 10:1, and the backbone is ResNet-32. On CIFAR-100-LT with imbalance ratio 100, LPG reaches 32.0% overall accuracy, which is reported as +3.2% over LPL and +6.1% over SAM. The stated interpretation is that amplifying tail-class gradients provides positive augmentation while preserving head-class performance.

For noisy label learning, the datasets are CIFAR-10 and CIFAR-100 with symmetric noise at rates 20%, 50%, and 80%, again using ResNet-32. At 80% noise on CIFAR-10, LPG achieves 75.5% accuracy, reported as 2.7% better than DivideMix. The paper attributes this robustness to dampening gradients of high-variance classes, which are more likely to contain corrupted labels.

The paper also emphasizes plug-in compatibility. It explicitly reports that LA + LPG improves over LA alone, LPL + LPG improves over LPL alone, and SAM + LPG improves over SAM. Because LPG modifies the backward path while LPL modifies the forward path, the two are presented as additive rather than redundant. Implementation is described as intercepting gradients at the logit layer using a custom PyTorch hook.

6. Related perturbation-based frameworks

In decision-focused learning, a closely related development is the family of Perturbation Gradient (PG) losses, which replaces a discontinuous downstream decision loss with a finite-perturbation directional-derivative surrogate (Huang et al., 2024). In the predict-then-optimize setting, a predicted cost vector $t$ is mapped to a plug-in decision

$\pi(t) \in \arg\min_{z \in \mathcal{Z}} t^\top z,$

and the true decision-aware loss under the realized cost $y$ is

$\ell(t, y) = y^\top \pi(t).$

The key identity is that $\ell(t, y)$ is the directional derivative of the plug-in value function

$V(t) = \min_{z \in \mathcal{Z}} t^\top z$

along direction $y$, via Danskin’s theorem. The paper defines two finite-difference surrogates with step size $h > 0$, a backward and a central difference: $\ell^{b}_h(t, y) = \frac{V(t) - V(t - h y)}{h}$ and $\ell^{c}_h(t, y) = \frac{V(t + h y) - V(t - h y)}{2h}.$ Their gradients require only optimizer outputs: $\nabla_t \ell^{b}_h(t, y) = \frac{\pi(t) - \pi(t - h y)}{h}$ and $\nabla_t \ell^{c}_h(t, y) = \frac{\pi(t + h y) - \pi(t - h y)}{2h}.$

The paper proves that these gradients are informative and unbiased at the sample level for the population PG objective, that the surrogates are Lipschitz and bounded, that backward PG upper-bounds the true loss, and that the approximation error vanishes as $h \to 0$ and, with appropriate balancing, as $n \to \infty$. Its principal theoretical claim is asymptotic best-in-class policy optimality even under misspecification. This is conceptually aligned with LPG because informative gradients are obtained by perturbing the optimization problem in a task-relevant direction, but the object being learned is a decision-aware surrogate loss rather than a class-aware perturbation of neural-network gradients.
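A finite-difference surrogate of this kind can be illustrated on a tiny problem. The sketch assumes a linear minimization over a finite feasible set solved by enumeration, a backward and a central difference of the plug-in value function, and hypothetical function names; it is not the authors' code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(t, feasible):
    """Plug-in decision: a minimizer of t . z over a finite feasible set."""
    return min(feasible, key=lambda z: dot(t, z))

def value(t, feasible):
    """Plug-in value function V(t) = min_z t . z."""
    return dot(t, solve(t, feasible))

def shift(t, y, step):
    return [ti + step * yi for ti, yi in zip(t, y)]

def pg_backward(t, y, h, feasible):
    """Backward difference of V along y: (V(t) - V(t - h*y)) / h."""
    return (value(t, feasible) - value(shift(t, y, -h), feasible)) / h

def pg_central(t, y, h, feasible):
    """Central difference of V along y: (V(t + h*y) - V(t - h*y)) / (2h)."""
    return (value(shift(t, y, h), feasible) - value(shift(t, y, -h), feasible)) / (2 * h)

def pg_central_grad(t, y, h, feasible):
    """Gradient of the central surrogate in t, using only solver outputs."""
    z_plus = solve(shift(t, y, h), feasible)
    z_minus = solve(shift(t, y, -h), feasible)
    return [(a - b) / (2 * h) for a, b in zip(z_plus, z_minus)]

feasible = [(1, 0), (0, 1)]        # pick exactly one of two items
t, y, h = [1.0, 1.2], [2.0, 1.0], 0.5
true_loss = dot(y, solve(t, feasible))   # prediction picks the costlier item
loss_b = pg_backward(t, y, h, feasible)
loss_c = pg_central(t, y, h, feasible)
grad_c = pg_central_grad(t, y, h, feasible)
```

In this example the true decision loss is locally constant in `t`, so its gradient is zero, while the central surrogate has gradient `[-1.0, 1.0]`: a descent step raises the predicted cost of the first item and lowers that of the second, which moves the prediction toward the better decision.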

A second related framework is LPGD, or Lagrangian Proximal Gradient Descent, for training models with embedded optimization layers (Paulus et al., 2024). The setup embeds the solution of a parameterized saddle-point problem

$z^*(w) = (x^*(w), y^*(w)) \in \arg\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \mathcal{L}(x, y, w)$

inside a larger model, and training minimizes losses of the form

$\ell(x^*(w)).$

Because the derivative of $x^*(w)$ with respect to $w$ may be zero almost everywhere, undefined at kinks, or numerically uninformative, LPGD replaces the exact backward derivative by solving a perturbed optimization problem in the backward pass. With a linearized loss at the forward solution $x^*$,

$\tilde\ell(x) = \ell(x^*) + \langle x - x^*, \nabla \ell(x^*) \rangle,$

the paper shows that if

$\mathcal{L}(x, y, w) = \langle x, c \rangle + \Omega(x, y, v), \qquad w = (c, v),$

then the perturbed solve becomes

$z_\tau = z^*(c + \tau \nabla \ell(x^*), v).$

The replacement gradient is formed from the difference between the perturbed and unperturbed solver outputs, scaled by $1/\tau$. The paper frames this as gradient descent on a Lagrangian Moreau envelope, proves recovery of the true gradient as $\tau \to 0$ under differentiability assumptions, and states that the method captures Blackbox Backpropagation, implicit differentiation by perturbation, Identity with Projection, SPO+, and Fenchel-Young losses as special or limiting cases. In relation to LPG, LPGD embodies the same principle of perturbing the problem to obtain a meaningful backward signal, but it operates at the level of embedded optimization layers rather than class-conditioned gradient shaping.
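The idea of a backward pass that re-solves a perturbed problem can be sketched in a deliberately simplified setting. The code below assumes a linear objective over a finite feasible set with no dual variables, which is closer to the Blackbox Backpropagation style special case than to the full saddle-point formulation; `solve` and `perturbed_solver_gradient` are hypothetical names.

```python
def solve(c, feasible):
    """Forward pass of the layer: a minimizer of c . x over a finite set."""
    return min(feasible, key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

def perturbed_solver_gradient(c, loss_grad_fn, tau, feasible):
    """Replacement gradient w.r.t. the cost vector c.

    The cost is shifted by tau times the downstream loss gradient at the
    forward solution, the problem is solved again, and the difference of the
    two solutions is scaled by 1 / tau.
    """
    x_star = solve(c, feasible)
    grad_x = loss_grad_fn(x_star)
    c_tau = [ci + tau * gi for ci, gi in zip(c, grad_x)]
    x_tau = solve(c_tau, feasible)
    return [(a - b) / tau for a, b in zip(x_tau, x_star)]

feasible = [(1, 0), (0, 1)]
c = [1.0, 1.2]

def loss_grad(x):
    # Gradient of the linear downstream loss 2*x[0] + 1*x[1] (illustrative).
    return [2.0, 1.0]

grad_large_tau = perturbed_solver_gradient(c, loss_grad, tau=0.5, feasible=feasible)
grad_small_tau = perturbed_solver_gradient(c, loss_grad, tau=0.01, feasible=feasible)
```

For this piecewise-constant solver the small perturbation leaves the solution unchanged and returns a zero gradient, whereas the larger perturbation flips the solution and yields `[-2.0, 2.0]`, which illustrates why a finite perturbation size can be more informative than the exact derivative.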

7. Structured-prediction antecedents and conceptual boundaries

An important antecedent is the perturb-and-MAP line for structured probabilistic models, especially the marginal weighted maximum log-likelihood framework for Hamming and weighted Hamming losses (Shpakova et al., 2018). The model is a Gibbs distribution

$P(y) = \exp\big(f(y) - A(f)\big)$

over structured outputs $y = (y_1, \dots, y_D)$. Because exact learning is hard when the log-partition $A(f) = \log \sum_{y} \exp f(y)$ is intractable, the paper uses Gumbel perturbations to upper-bound the log-partition through perturb-and-MAP: $A(f) \le \mathbb{E}_{z}\Big[\max_{y} \Big(f(y) + \sum_{d=1}^{D} z_d(y_d)\Big)\Big],$ where the $z_d(y_d)$ are independent zero-mean Gumbel variables. MAP is then solved repeatedly under sampled perturbations.
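The perturb-and-MAP bound on the log-partition can be checked numerically on a model small enough to enumerate. The sketch below assumes a binary chain with an attractive coupling, one independent zero-mean Gumbel variable per coordinate and state, and brute-force MAP in place of graph cuts; the model and all names are illustrative.

```python
import itertools
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel variable

def zero_mean_gumbel(rng):
    """Standard Gumbel sample (-log of an Exp(1) draw), shifted to mean zero."""
    e = max(rng.expovariate(1.0), 1e-300)  # guard against log(0)
    return -math.log(e) - EULER_GAMMA

def score(y, unary, coupling):
    """f(y): unary terms plus a reward for equal neighbours on a chain."""
    s = sum(unary[d][y[d]] for d in range(len(y)))
    s += sum(coupling for d in range(len(y) - 1) if y[d] == y[d + 1])
    return s

def log_partition(unary, coupling):
    """Exact A(f) by enumeration (feasible only for tiny models)."""
    scores = [score(y, unary, coupling)
              for y in itertools.product((0, 1), repeat=len(unary))]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def perturbed_map_value(unary, coupling, rng):
    """Value of one MAP problem with Gumbel noise added to every unary term."""
    noisy = [[u + zero_mean_gumbel(rng) for u in row] for row in unary]
    return max(score(y, noisy, coupling)
               for y in itertools.product((0, 1), repeat=len(unary)))

def estimate_bound(unary, coupling, n_samples, seed=0):
    """Monte Carlo estimate of the expected perturbed MAP value."""
    rng = random.Random(seed)
    total = sum(perturbed_map_value(unary, coupling, rng) for _ in range(n_samples))
    return total / n_samples

unary = [[0.0, 0.5], [0.2, 0.0], [0.0, 0.3]]
exact = log_partition(unary, coupling=0.8)
bound = estimate_bound(unary, coupling=0.8, n_samples=4000)
```

Up to Monte Carlo error, `bound` should not fall below `exact`, and the gap reflects the price of using per-coordinate rather than per-configuration perturbations.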

The main generalization is from ordinary likelihood to objectives aligned with Hamming and weighted Hamming loss. For Hamming loss, the objective decomposes across coordinates through marginal log-probabilities; for weighted Hamming loss, the coordinate terms are weighted by per-coordinate weights. This creates a loss-aware probabilistic objective that better matches evaluation metrics such as superpixel-size-weighted segmentation loss. The approximation is not a convex upper bound on the marginal likelihood; it becomes a difference of convex terms and is optimized stochastically.

Computation is the central issue because each perturbation sample induces one global perturbed MAP problem and, naively, one conditional perturbed MAP problem per coordinate. For log-supermodular pairwise models, the paper uses dynamic graph cuts to exploit the similarity between these related inference problems. An additional acceleration reuses the same Gumbel perturbation for global and conditional terms and skips coordinates where the global perturbed MAP already matches the ground truth, because the gradient contribution cancels.

Learning is performed by double stochastic gradient descent, with stochasticity over both data minibatches and fresh Gumbel perturbations. The paper also extends the method to weak supervision, including partial labels and unlabeled data, by estimating marginals or conditional marginals from perturbed MAP samples. Experiments on OCR and HorseSeg are reported as supporting both the loss-aware modeling claim and the computational advantages of dynamic cuts and Gumbel reduction, with runtime reduced by about an order of magnitude for large numbers of iterations.

This work is best regarded as an ancestor or close relative rather than an instance of modern LPG. Its perturbations are fixed Gumbel perturbations used to approximate log-partition functions and marginals, not learned perturbations in the backward chain. The conceptual overlap is nonetheless substantial: perturbations are injected into inference, gradients are estimated from perturbed solutions, and stochastic optimization proceeds over both samples and perturbations. A plausible implication is that LPG is part of a broader perturbation-based tradition in which informative learning signals are recovered by solving nearby optimization problems when exact or naive derivatives are unavailable, degenerate, or poorly aligned with the target loss.