GradNorm Dynamic Loss Balancing

Updated 6 March 2026

GradNorm-based dynamic loss balancing is an adaptive technique that mitigates gradient scale imbalances in multi-task learning by dynamically reweighting task losses.
It relies on per-task gradient-norm computations and target gradient magnitudes to stabilize optimization across tasks.
Empirical results on benchmarks like NYUv2 and Cityscapes demonstrate its effectiveness over traditional static weighting methods.

GradNorm-based dynamic loss balancing is an adaptive methodology for training multi-task neural networks, specifically addressing the challenge that disparate loss scales and gradient magnitudes across tasks can produce suboptimal, biased or unstable optimization. The approach operates by dynamically controlling the per-task gradients in the shared parameters’ update, either via direct normalization procedures or by adaptive weighting schemes that are responsive to real-time learning dynamics. Originating with the “GradNorm” algorithm and further extended by variants such as direct gradient normalization and hybrid loss-scale reparameterizations, these methods have become central in state-of-the-art multi-task learning (MTL) and scientific deep learning contexts.

1. Mathematical Foundations

GradNorm-based techniques target the scalarization of the multi-task objective,

$J(\theta) = \sum_{i=1}^T w_i L_i(\theta)$

where $L_i$ is the loss for task $i$, and $w_i$ is an adaptive, potentially time-dependent weight. The central idea is to balance gradients with respect to the shared parameters $\theta$, preventing any single task from dominating training.

Classic GradNorm: For each task, compute the weighted gradient norm:

$G_i(t) = \big\| \nabla_\theta [w_i(t) L_i(\theta)] \big\|_2$

Define the relative inverse training rate:

$r_i(t) = \frac{L_i(t)}{\frac{1}{T} \sum_j L_j(t)}$

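In the original formulation (Chen et al., 2017), each $L_i(t)$ in this ratio is first normalized by its initial value, $L_i(t)/L_i(0)$, so that tasks are compared by relative progress rather than raw loss scale. Set the target gradient norm for each task as: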

$G_i^*(t) = \bar G(t) [r_i(t)]^{\alpha}$

where $\alpha$ tunes how aggressively slow tasks are up-weighted and $\bar G(t) = \frac{1}{T} \sum_j G_j(t)$. The GradNorm loss,

$L_{\text{grad}}(t) = \sum_{i=1}^T \big| G_i(t) - G_i^*(t) \big|$

is minimized with respect to the weights $w_i$, typically via a gradient step followed by renormalization ($\sum_i w_i = T$) (Chen et al., 2017, Bischof et al., 2021, Xu et al., 2024).
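The quantities above can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes the GradNorm loss takes the L1 form of Chen et al. (2017), represents gradients as flat Python lists, and uses hypothetical names (`gradnorm_quantities`, `l2_norm`) and made-up numbers. The `losses` argument is whatever enters $r_i(t)$, either raw losses or loss ratios.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a gradient stored as a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def gradnorm_quantities(task_grads, losses, weights, alpha):
    """Return (G, r, targets, L_grad) for one training step.

    task_grads[i] is the unweighted gradient of L_i w.r.t. the shared
    parameters; for w_i >= 0, ||grad(w_i * L_i)|| = w_i * ||grad L_i||.
    """
    T = len(losses)
    G = [w * l2_norm(g) for w, g in zip(weights, task_grads)]   # G_i(t)
    mean_loss = sum(losses) / T
    r = [loss / mean_loss for loss in losses]                   # r_i(t)
    G_bar = sum(G) / T                                          # average gradient norm
    targets = [G_bar * (r_i ** alpha) for r_i in r]             # G_i^*(t)
    L_grad = sum(abs(g - t) for g, t in zip(G, targets))        # assumed L1 form
    return G, r, targets, L_grad

# Hypothetical numbers: task 0 has the larger gradient, task 1 the larger loss.
task_grads = [[3.0, 4.0], [0.0, 1.0]]   # gradient norms 5 and 1
losses = [2.0, 6.0]
weights = [1.0, 1.0]
G, r, targets, L_grad = gradnorm_quantities(task_grads, losses, weights, alpha=1.0)
print(G, r, targets, L_grad)
```

With these numbers the gradient norms are 5 and 1, the relative rates are 0.5 and 1.5, and the targets are 1.5 and 4.5, so the slower task 1 is asked for a larger gradient share and the GradNorm loss is 7.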

Direct Gradient Normalization: “Dual-Balancing MTL” (DB-MTL) modifies the MTL objective by applying a log-transform to each loss:

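$\tilde L_i(\theta) = \log L_i(\theta)$

and replaces each per-task gradient with a version normalized to the maximal gradient norm:

$\tilde g_i = \Big( \max_j \|\hat g_j\|_2 \Big) \frac{\hat g_i}{\|\hat g_i\|_2 + \epsilon}$

where $g_i = \nabla_\theta \tilde L_i(\theta)$, $\hat g_i$ is an EMA-smoothed version of $g_i$, and $\epsilon$ is a small constant for numerical stability (Lin et al., 2023). The shared update uses the sum of these normalized gradients.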
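One way to see why a log transform balances loss scales: by the chain rule, the gradient of $\log L$ is $\nabla L / L$, so multiplying a loss by a positive constant leaves the gradient of its logarithm unchanged. The sketch below checks this on a toy one-dimensional quadratic; the function names and the scale factors are illustrative assumptions, not taken from Lin et al. (2023).

```python
def grad_log_loss(loss_value, loss_grad):
    """Chain rule: grad(log L) = grad(L) / L, valid for L > 0."""
    return [g / loss_value for g in loss_grad]

def scaled_quadratic(theta, scale):
    """Toy loss L(theta) = scale * (theta - 1)^2 and its analytic gradient."""
    loss = scale * (theta - 1.0) ** 2
    grad = [2.0 * scale * (theta - 1.0)]
    return loss, grad

theta = 3.0
raw_grads, log_grads = [], []
for scale in (1.0, 1000.0):          # same task, two different loss scales
    loss, grad = scaled_quadratic(theta, scale)
    raw_grads.append(grad[0])
    log_grads.append(grad_log_loss(loss, grad)[0])

print(raw_grads)   # raw gradients differ by the scale factor
print(log_grads)   # gradients of log L coincide
```

The raw gradients differ by the factor 1000, while the gradients of the log-transformed loss are identical, which is the scale-balancing effect the transform is meant to provide.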

2. Algorithmic Procedures

The GradNorm procedure requires, for each iteration:

Forward pass: compute all task losses, $L_i(\theta)$.
Compute the aggregate loss $J(\theta) = \sum_i w_i L_i(\theta)$ using the current weights $w_i$.
Backward pass: compute parameter gradients for $J(\theta)$; for classic GradNorm, also compute each $G_i(t)$ via separate backward passes.
Compute average loss and average gradient norm, $\frac{1}{T} \sum_j L_j(t)$, $\bar G(t)$.
Update $w_i$ to minimize the GradNorm loss $L_{\text{grad}}(t) = \sum_i | G_i(t) - G_i^*(t) |$.
Renormalize $w_i$; update $\theta$.
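The steps above can be sketched end to end on a toy problem. This is an illustrative sketch under stated assumptions, not a reference implementation: the two quadratic tasks, their centers and scales, the learning rates, and all function names are hypothetical; gradients are analytic instead of coming from backward passes; the targets $G_i^*(t)$ are held constant when differentiating, as in Chen et al. (2017); and raw losses are used in $r_i(t)$.

```python
import math

CENTERS = [[1.0, 2.0], [-1.0, 0.5]]   # hypothetical task optima
SCALES = [10.0, 1.0]                  # task 0 has a 10x larger loss scale

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sign(x):
    return (x > 0) - (x < 0)

def losses_and_grads(theta):
    """Toy tasks L_i = 0.5 * s_i * ||theta - c_i||^2 with analytic gradients."""
    losses, grads = [], []
    for s, c in zip(SCALES, CENTERS):
        diff = [t - ci for t, ci in zip(theta, c)]
        losses.append(0.5 * s * sum(d * d for d in diff))
        grads.append([s * d for d in diff])
    return losses, grads

def gradnorm_step(theta, w, alpha=1.0, lr_theta=0.01, lr_w=0.005):
    T = len(w)
    losses, grads = losses_and_grads(theta)                  # forward pass
    # Gradient of J = sum_i w_i L_i using the current weights.
    theta_grad = [sum(w[i] * grads[i][k] for i in range(T))
                  for k in range(len(theta))]
    raw_norms = [norm(g) for g in grads]
    G = [wi * n for wi, n in zip(w, raw_norms)]              # G_i(t)
    mean_loss = sum(losses) / T
    G_bar = sum(G) / T
    targets = [G_bar * (L / mean_loss) ** alpha for L in losses]  # held constant
    # d|G_i - G_i^*| / dw_i = sign(G_i - G_i^*) * ||grad L_i||
    w_grad = [sign(g - t) * n for g, t, n in zip(G, targets, raw_norms)]
    new_w = [max(wi - lr_w * gw, 1e-6) for wi, gw in zip(w, w_grad)]
    total = sum(new_w)
    new_w = [T * wi / total for wi in new_w]                 # renormalize: sum = T
    new_theta = [t - lr_theta * g for t, g in zip(theta, theta_grad)]
    return new_theta, new_w, losses

theta, w = [0.0, 0.0], [1.0, 1.0]
for step in range(200):
    theta, w, losses = gradnorm_step(theta, w)
print(theta, w, losses)
```

Clamping the weights at a small positive value before renormalizing is a design choice of this sketch to keep $w_i > 0$; in an autodiff framework the closed-form weight gradient would be replaced by differentiating the GradNorm loss directly.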

Pseudocode for DB-MTL (direct normalization) involves:

Forward pass: compute per-task $\tilde L_i = \log L_i$ as log-transformed losses.
Compute task gradients $g_i = \nabla_\theta \tilde L_i$, smooth with EMA to get $\hat g_i$.
Normalize gradients so all contribute with equal (max) norm.
Aggregate and apply parameter update with summed normalized gradients (Lin et al., 2023).
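A minimal sketch of one such update loop follows, under loudly stated assumptions: the toy quadratic tasks, the EMA form $\hat g_i \leftarrow \beta \hat g_i + (1 - \beta) g_i$ with a zero initial state, the value `beta=0.9`, the use of $\log(L_i + \epsilon)$ for numerical safety, and all function names are illustrative choices, not details confirmed by the source.

```python
import math

CENTERS = [[1.0, 2.0], [-1.0, 0.5]]   # hypothetical task optima
SCALES = [1000.0, 1.0]                # deliberately mismatched loss scales

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def losses_and_grads(theta):
    """Toy tasks L_i = 0.5 * s_i * ||theta - c_i||^2 with analytic gradients."""
    losses, grads = [], []
    for s, c in zip(SCALES, CENTERS):
        diff = [t - ci for t, ci in zip(theta, c)]
        losses.append(0.5 * s * sum(d * d for d in diff))
        grads.append([s * d for d in diff])
    return losses, grads

def normalize_to_max(ema_grads, eps=1e-8):
    """Rescale every smoothed gradient to the largest norm among them."""
    norms = [norm(g) for g in ema_grads]
    max_norm = max(norms)
    return [[max_norm * x / (n + eps) for x in g]
            for g, n in zip(ema_grads, norms)]

def dbmtl_step(theta, ema_grads, beta=0.9, lr=0.01, eps=1e-8):
    losses, grads = losses_and_grads(theta)
    # Gradient of log(L_i + eps) via the chain rule.
    log_grads = [[x / (L + eps) for x in g] for L, g in zip(losses, grads)]
    # EMA smoothing of each task gradient (assumed update form).
    ema_grads = [[beta * e + (1.0 - beta) * x for e, x in zip(e_i, g_i)]
                 for e_i, g_i in zip(ema_grads, log_grads)]
    normalized = normalize_to_max(ema_grads, eps)
    # Shared update: sum of the normalized per-task gradients.
    update = [sum(g[k] for g in normalized) for k in range(len(theta))]
    theta = [t - lr * u for t, u in zip(theta, update)]
    return theta, ema_grads, losses

theta = [0.0, 0.0]
ema = [[0.0, 0.0], [0.0, 0.0]]
for step in range(50):
    theta, ema, losses = dbmtl_step(theta, ema)
print(theta, losses)
```

Note that no loss weights are learned anywhere in this loop; the only quantity carried between iterations is the EMA of the per-task gradients.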

DB-MTL’s normalization involves no learned weights (its only state is the EMA of the per-task gradients), whereas GradNorm involves a meta-optimization each step over the weights $w_i$.

3. Comparative Analysis and Scope

Both original GradNorm and DB-MTL aim to prevent gradient imbalance and enable effective learning across tasks. GradNorm employs an auxiliary, data-driven subproblem for updating $w_i$, which introduces computational overhead due to the per-step inner loop and the need for additional backward passes. The hyperparameter $\alpha$ plays a critical role in controlling the strength of adaptive reweighting; improper tuning can induce oscillation or insufficient correction of imbalance (Chen et al., 2017, Bischof et al., 2021).

DB-MTL achieves similar objectives through log-transform loss-scaling and explicit per-step gradient norm equalization, avoiding any learned weights and reducing complexity. This simplifies implementation, incurs negligible computational overhead (mainly an extra $\ell_2$ norm calculation per task), and does not require careful tuning of meta-hyperparameters, though an EMA smoothing factor $\beta$ is recommended (Lin et al., 2023).

A summary comparison is shown below:

Method	Loss Scaling	Gradient Normalization	Meta-Optimization Overhead	Adaptive Weights
GradNorm	None	Target per-task norm	Yes (weight update step)	Yes ($w_i$)
DB-MTL	Log transform	Per-iteration max norm	No	No

4. Hyperparameterization and Implementation

Original GradNorm: Key hyperparameters include $\alpha$ and the step size for the $w_i$ updates. Best practice is to renormalize the weights so that $\sum_i w_i = T$ after each update. Empirical studies recommend fixing $\alpha$ at a single default value that is robust across tasks (Chen et al., 2017, Xu et al., 2024).

DB-MTL: Uses only the standard optimizer learning rate, a gradient EMA smoothing factor $\beta$ (fixed or with adaptive decay), and a small $\epsilon$ for numerical stability. The method is insensitive to $\beta$ over a broad range and does not require loss weight hyperparameters (Lin et al., 2023).

Efficient implementation of GradNorm may exploit batched auto-differentiation and deferred $w_i$ updates (e.g., every few steps) to limit computational cost (Bischof et al., 2021). Both approaches need gradients only with respect to the shared (backbone) parameters, which in GradNorm are commonly restricted to the last shared layer; task-specific heads are updated via unnormalized per-task losses.

5. Empirical Results and Practical Impact

Substantial empirical evidence demonstrates the effectiveness of GradNorm-based loss balancing:

On NYUv2, classic GradNorm improved mIoU and other metrics by 3–12% over equal weighting and uncertainty-based schemes (Chen et al., 2017).
In physics-informed and PDE learning contexts, GradNorm outperforms static weights and SoftAdapt for boundary and multi-physics tasks, but can struggle when tasks with smaller gradients (e.g., fine-scale physics) are underweighted, motivating alternate normalization strategies (Xu et al., 2024, Bischof et al., 2021).
DB-MTL yields higher gains than classic GradNorm in multi-task benchmarks, measured by the average relative improvement $\Delta_p$ (higher is better):
- NYUv2: DB-MTL achieves +1.15% $\Delta_p$ vs. GradNorm's −1.24%.
- Cityscapes: +0.20% vs. −1.55%.
- Office-31: +1.05% vs. −0.59%.
- QM9: −58.10% vs. −227.5%; both values are negative, but DB-MTL degrades far less than GradNorm (Lin et al., 2023).
Ablations indicate that both standalone gradient-norm balancing and combined log-loss transformation are beneficial, but the combined method always yields the best task-balance and overall performance (Lin et al., 2023).

6. Limitations and Contexts of Application

While GradNorm-based methods significantly outperform static schemes, they have several limitations:

The original GradNorm approach adds per-task backward passes and an inner optimization per iteration, increasing training time (Bischof et al., 2021).
For highly multiscale or physics-constrained problems, GradNorm may insufficiently weight tasks with inherently weak signals, leading to subpar convergence for those quantities (Xu et al., 2024).
In such multiscale scientific settings, explicit scale normalization at the loss or network-output level (e.g., via network scaling and dynamic scaling) can outperform pure gradient normalization.
Both approaches show sensitivity to task heterogeneity; tuning (especially of $\alpha$ in GradNorm) may be necessary for extreme task variance (Chen et al., 2017, Xu et al., 2024).

DB-MTL offers a robust, lightweight alternative when computational cost or ease of deployment is paramount.

GradNorm and its direct-normalization descendants have been compared with other approaches such as SoftAdapt and learning rate annealing across diverse domains including standard MTL, multi-physics PINNs, and scientific surrogate modeling (Bischof et al., 2021, Xu et al., 2024). In scientific ML, emerging evidence suggests that combining scale-aware, physics-driven normalization with adaptive gradient balancing may yield the best trade-off between stability, accuracy, and ease of use, especially as the number and scale disparity of loss terms increases.

A plausible implication is that hybrid schemes integrating explicit loss-scale normalization, automatic gradient norm balancing, and diagnostic task performance metrics may further improve the reliability and automation of multi-objective learning architectures in practical and scientific contexts (Lin et al., 2023, Xu et al., 2024).


References

Chen et al. (2017). GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks.

Bischof et al. (2021). Multi-Objective Loss Balancing for Physics-Informed Deep Learning.

Xu et al. (2024). Network scaling and scale-driven loss balancing for intelligent poroelastography.

Lin et al. (2023). Dual-Balancing for Multi-Task Learning.
