
GradNorm-Based Dynamic Loss Balancing

Updated 26 November 2025
  • The paper introduces a method that dynamically adapts loss weights by equalizing gradient norms, ensuring balanced training progress across tasks.
  • GradNorm-based dynamic loss balancing is defined as an algorithm that adapts weights in multi-task deep learning, particularly in PINNs and PDE-constrained problems, using network scaling and physical normalization.
  • Empirical results show that while GradNorm outperforms static methods, its performance significantly improves when combined with explicit normalization to handle extreme loss scale disparities.

GradNorm-based dynamic loss balancing refers to a class of algorithms that adaptively reweight multiple loss components in deep neural network training, with the specific goal of maintaining balanced optimization progress across tasks. These methods are especially relevant in multi-task and multi-physics contexts, including physics-informed neural networks (PINNs) and multi-objective PDE-constrained learning, where loss terms may differ by several orders of magnitude and simple static weighting schemes fail to achieve satisfactory convergence or generalization. GradNorm is the archetypal algorithm in this class, dynamically tuning loss weights to equalize per-task gradient magnitudes, often in conjunction with network normalization or scaling layers to further address divergent loss scales. Recent developments further integrate explicit physical scaling and scale normalization to yield robust, principled loss balancing across highly heterogeneous domains.

1. Motivation and Problem Formulation

In multi-task learning, PINNs, and multi-physics models, the training objective takes the form of a weighted sum of N losses,

L(\theta) = \sum_{i=1}^N w_i(t)\, L_i(\theta),

where \theta are the network parameters, L_i is the i-th task or physics loss (e.g., data misfit, PDE residual, boundary/initial condition), and w_i(t) is a task weight, possibly time-dependent or adaptive. The individual losses L_i may arise from fundamentally distinct physical laws, observables, or constraints, and thus exhibit non-comparable scales and gradients. Unbalanced losses often lead to premature overfitting, stagnation, or domination of one task at the expense of others. In poroelastography, for instance, L_i may correspond to real/imaginary components of Biot momentum/mass PDEs, and material parameters may span several decades in value (Xu et al., 27 Oct 2024).
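As a minimal numerical illustration of this weighted objective (a NumPy sketch with hypothetical loss values, chosen only to show the kind of scale disparity described above):

```python
import numpy as np

# Hypothetical per-task losses at some training step; the terms differ by
# orders of magnitude, as is typical of multi-physics problems.
losses = np.array([3.2e-4, 1.7e1, 9.5e-2])   # e.g. data misfit, PDE residual, BC term
weights = np.ones_like(losses)               # static equal weighting, w_i = 1

total_loss = np.sum(weights * losses)
# With equal weights the largest term dominates the gradient signal,
# which is exactly the imbalance dynamic schemes try to correct.
```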

2. GradNorm: Algorithmic Principles

GradNorm, introduced by Chen et al. (Chen et al., 2017), automates loss weighting by equalizing the relative training rates of all loss components via dynamic adaptation of w_i. The approach includes:

  • Relative training rate: For each task, define the normalized inverse rate,

r_i(t) = \frac{L_i(t)/L_i(0)}{\sum_{j=1}^N L_j(t)/L_j(0)},

where L_i(0) is the initial loss value, used for normalization.

  • Gradient norm computation: For every weighted task, compute the gradient norm with respect to the shared parameters \mathbf{w},

G_i(t) = \left\| \nabla_{\mathbf{w}} [\, w_i(t)\, L_i(t) \,] \right\|_2.

The mean gradient norm is \bar{G}(t) = \frac{1}{N} \sum_{i=1}^N G_i(t).

  • Target matching and auxiliary loss: For a hyperparameter \alpha > 0,

G_i^*(t) = \bar{G}(t) \cdot [r_i(t)]^{\alpha},

and define the auxiliary loss,

L_{\nabla}(t) = \sum_{i=1}^N \left| G_i(t) - G_i^*(t) \right|.

  • Weight update: Update w_i by descending L_\nabla,

w_i \leftarrow w_i - \eta_w \frac{\partial L_\nabla}{\partial w_i},

followed by renormalization so that \sum_i w_i = N, preserving the overall loss scale.

The algorithm ensures that tasks with slower progress (higher r_i) receive greater gradient magnitude, thereby accelerating their learning; rapidly converging or over-represented tasks have their weights w_i reduced.
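The full update can be sketched as a single function (a NumPy sketch; a real implementation obtains the per-task gradient norms G_i from an autograd framework, and the hyperparameter values and inputs below are illustrative):

```python
import numpy as np

def gradnorm_update(w, G, L0, Lt, alpha=1.5, eta_w=0.025):
    """One GradNorm weight update (sketch).

    w  : current task weights w_i
    G  : per-task weighted gradient norms ||grad_w [w_i L_i]||_2
    L0 : initial losses L_i(0);  Lt : current losses L_i(t)
    """
    N = len(w)
    ratios = Lt / L0
    r = ratios / ratios.sum()              # relative inverse training rate
    target = G.mean() * r**alpha           # target norms G_i^* (held constant)
    # d L_nabla / d w_i = sign(G_i - G_i^*) * G_i / w_i, since G_i scales
    # linearly with w_i for fixed network parameters.
    grad_w = np.sign(G - target) * (G / w)
    w = np.clip(w - eta_w * grad_w, 1e-6, None)
    return w * N / w.sum()                 # renormalize so sum(w_i) = N

w_new = gradnorm_update(np.ones(3),
                        G=np.array([1.0, 5.0, 2.0]),
                        L0=np.array([1.0, 1.0, 1.0]),
                        Lt=np.array([0.5, 0.9, 0.2]))
# The over-represented task (largest G_i) is downweighted; the sum stays at N.
```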

3. Integration with Network Scaling and Physical Normalization

A key challenge identified in multi-physics and multi-scale problems is that gradient-based balancing alone is insufficient when loss terms are separated by large intrinsic scale differences (Xu et al., 27 Oct 2024). To address this, the network scaling approach represents each property map θ \to ϑ as the product of a unit shape function learned by an MLP (with \mathcal{O}(1) weights) and a scaling factor s_m tailored to the physical property (e.g., permeability, shear modulus). The final output for each property is

(ϑ_1, \ldots, ϑ_6)^T = \mathrm{diag}(s_1, \ldots, s_6)\, x^{N_\ell - 1},

with s_m selected from plausible physical scales. This architectural normalization stabilizes parameter magnitudes, enforces explicit correspondence between network output scales and physical reality, and underpins explicit scale estimation for each L_i and its gradient, enabling fair balancing by GradNorm or related methods.
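A sketch of such a scaling output layer (NumPy; the scale values below are hypothetical placeholders for property magnitudes such as permeability or shear modulus):

```python
import numpy as np

# Fixed per-property scale factors s_1, ..., s_6 (hypothetical magnitudes).
s = np.array([1e-12, 1e9, 1e6, 1.0, 1e3, 1e-3])

def scaled_output(x):
    """Map O(1) MLP outputs x of shape (batch, 6) to physical magnitudes.

    Elementwise multiplication by s broadcasts across the batch and is
    equivalent to applying diag(s_1, ..., s_6) to each sample."""
    return x * s

out = scaled_output(np.ones((2, 6)))   # unit shape functions -> physical scales
```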

Dynamic scaling (termed "DynScl" in (Xu et al., 27 Oct 2024)) further extends this by analytically setting the weights to equalize both the scale and Lipschitz constants of all loss terms, automatically normalizing derivatives even before dynamic balancing.
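A minimal sketch of the scale-equalization half of this idea (the Lipschitz-constant matching is omitted, and the scale estimates below are hypothetical):

```python
import numpy as np

def analytic_weights(loss_scales):
    """Set w_i = 1 / c_i from estimated intrinsic loss scales c_i,
    so that every weighted term w_i * L_i is O(1)."""
    return 1.0 / np.asarray(loss_scales, dtype=float)

w = analytic_weights([1e-4, 1e2, 1e-1])   # e.g. residual, data, BC scales
# w == [1e4, 1e-2, 1e1]: the tiny residual term is boosted and the large
# data term damped, before any dynamic balancing runs.
```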

4. Implementation Workflows and Algorithmic Details

A canonical training loop for GradNorm-based loss balancing includes:

  1. Forward computation of each L_i and the total loss L = \sum_i w_i L_i.
  2. Backward pass to update the standard network parameters \theta.
  3. Computation of the per-loss weighted gradient norms G_i.
  4. Calculation of the normalized rates r_i and targets G_i^*.
  5. Formulation and backward computation of the auxiliary loss L_\nabla over the weights w.
  6. Gradient step on w, followed by renormalization.
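The six steps can be exercised end-to-end on a toy problem with two analytic losses of very different scale (a sketch with hand-coded gradients; in practice an autograd framework supplies them, and all step sizes here are illustrative):

```python
import numpy as np

# Two scalar losses with a ~100x scale gap: L_1 = (theta-1)^2, L_2 = 100(theta+1)^2.
def L(theta): return np.array([(theta - 1.0)**2, 100.0 * (theta + 1.0)**2])
def g(theta): return np.array([2.0 * (theta - 1.0), 200.0 * (theta + 1.0)])

theta, w = 0.0, np.ones(2)
alpha, eta, eta_w = 1.0, 1e-3, 1e-3
L0 = L(theta)
for _ in range(200):
    Li = L(theta)                                    # 1. forward losses
    theta -= eta * np.sum(w * g(theta))              # 2. update parameters
    G = np.abs(w * g(theta))                         # 3. weighted gradient norms
    r = (Li / L0) / np.sum(Li / L0)                  # 4. rates and targets
    target = G.mean() * r**alpha
    grad_w = np.sign(G - target) * np.abs(g(theta))  # 5. d L_nabla / d w_i
    w = np.clip(w - eta_w * grad_w, 1e-6, None)      # 6. step on w, then
    w *= len(w) / w.sum()                            #    renormalize sum(w) = N
# Invariants: weights stay positive and sum to N; theta remains bounded.
```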

By contrast, algorithms such as SoftAdapt set w_i using a softmax over recent loss decrease rates, but do not reference gradient norms or physical scale, resulting in heuristic rather than principled balancing (Xu et al., 27 Oct 2024).
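For contrast, a SoftAdapt-style update can be sketched as follows (a simplified form; the exponent scaling beta and the sign convention are assumptions of this sketch):

```python
import numpy as np

def softadapt_weights(L_prev, L_curr, beta=0.1):
    """Weights from a softmax over recent loss changes: tasks whose loss
    decreased least (or grew) are emphasized. No gradient norms or
    physical scales enter the computation."""
    rates = np.asarray(L_curr) - np.asarray(L_prev)
    e = np.exp(beta * (rates - rates.max()))     # numerically stabilized softmax
    return len(rates) * e / e.sum()              # normalized so sum(w) = N

w = softadapt_weights([1.0, 1.0, 1.0], [0.5, 1.0, 1.5])
# The stagnating/worsening tasks receive the larger weights.
```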

5. Empirical Evidence and Comparative Performance

The effectiveness of GradNorm, especially when combined with network scaling, has been evaluated in a variety of test cases (Xu et al., 27 Oct 2024, Chen et al., 2017, Bischof et al., 2021). Notable outcomes include:

| Method          | Max Rel. Error (High-Permeability Region) | Observed Characteristics                    |
|-----------------|-------------------------------------------|---------------------------------------------|
| Equal Weights   | 100%                                      | Divergence or stagnation without scaling    |
| SoftAdapt       | 9.8%                                      | Unstable; some parameter estimates diverge  |
| GradNorm (GN)   | 18.4%                                     | O(1) weights; some tasks remain unconverged |
| Dynamic Scaling | 2.2%                                      | Near-uniform convergence, low final error   |

These results demonstrate that network scaling is essential: without architectural output normalization, all balancing schemes fail. GradNorm outperforms static weights but can leave some tasks unconverged. Purely physics-driven scaling (dynamic scaling) yields the most consistent and robust accuracy, with all sublosses decaying at matched rates and parameter errors \leq 2\% (Xu et al., 27 Oct 2024).

In PINN benchmarks (Bischof et al., 2021), GradNorm provides reliable training-progress balancing for moderate numbers of comparably scaled losses but incurs extra computational overhead (e.g., ~130 seconds per 1000 steps for 9 loss terms). Performance deteriorates when loss components differ by more than 10^3 in scale, unless combined with additional normalization.

6. Strengths, Limitations, and Practical Guidance

GradNorm-based dynamic loss balancing offers principled adaptation without ad hoc hyperparameter search for weights:

  • Strengths:
    • Adapts weights automatically during training, removing manual grid search over w_i.
    • Equalizes per-task training rates via gradient-norm matching, preventing any single loss from dominating.
    • Introduces only one extra hyperparameter (\alpha) and a lightweight auxiliary loss.
  • Limitations:
    • Computational cost increases linearly with the number of losses.
    • Efficacy diminishes in the presence of extreme loss scale imbalance unless architectural normalization (network scaling) or explicit physics-based weights ("dynamic scaling") are applied (Xu et al., 27 Oct 2024, Bischof et al., 2021).
    • Requires tuning of an additional hyperparameter \alpha (the restoring-force exponent).
    • In larger-scale problems or with highly heterogeneous task scaling, lighter-weight or analytic normalization methods may outperform GradNorm.
  • Practical advice:
    • Initialize all w_i = 1 and set \alpha \in [0.5, 2.0]; use lower values if loss scales differ by several orders of magnitude.
    • Employ network scaling to ensure all outputs, derivatives, and losses operate on comparable numerical scales.
    • When N \gtrsim 10, consider updating w_i at reduced frequency to control overhead (Bischof et al., 2021).
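One way to implement the reduced-frequency update, as a trivial sketch (the cadence helper and its name are illustrative, not from the cited papers):

```python
def should_update_weights(step, every=10):
    """Run the GradNorm weight update only every `every` optimizer steps,
    amortizing the extra per-loss backward passes it requires."""
    return step % every == 0

update_steps = [s for s in range(100) if should_update_weights(s, every=10)]
# 10 weight updates per 100 training steps instead of 100.
```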

7. Synthesis and Outlook

GradNorm-based dynamic loss balancing combines adaptive gradient norm matching with, increasingly, explicit scale normalization at both the network and loss levels. This synergy is particularly effective in multi-physics and multi-objective regimes with intrinsic scale disparity, as in poroelastography and PINNs. While GradNorm remains a leading method for dynamic adaptation across tasks, empirical evidence is unequivocal that its real-world applicability hinges on normalization—either via neural architectural design (network scaling), analytic physical weighting, or a hybrid. In scenarios where losses are comparable and the number of tasks is moderate, GradNorm provides an automated and robust solution. With increasing complexity, direct analytic scale-matching or lighter-weight schemes become essential to maintain convergence, stability, and computational efficiency (Xu et al., 27 Oct 2024, Bischof et al., 2021).
