Multi-Term Adam (MTAdam)

Updated 11 March 2026

Multi-Term Adam (MTAdam) is an adaptive optimization algorithm that automatically rescales gradients from different loss terms to ensure balanced influence per layer.
It uses per-term moment estimation and dynamic per-layer balancing to stabilize multi-objective training without manual loss-weight tuning.
MTAdam aggregates updates with a max-based second moment for robust step-size control and has shown improved performance on multi-loss benchmarks.

Multi-Term Adam (MTAdam) is an adaptive optimization algorithm designed to address the challenges of dynamically balancing multiple loss terms in deep neural network training. Unlike standard Adam, which operates on a single overall loss, MTAdam automatically rescales the magnitude of gradients arising from each distinct loss term so that their influence is balanced per network layer and throughout training. This procedure removes the need for manual loss-weight tuning, improves robustness to poor initial loss weighting, and stabilizes optimization in multi-objective and adversarial settings (Malkiel et al., 2020).

1. Motivation and Problem Statement

Modern deep learning pipelines frequently involve composite objectives of the form

$L(\theta) = \sum_{k=1}^K \lambda_k\,L_k(\theta),$

where each $L_k$ represents a distinct loss term and $\lambda_k$ are scalar loss weights. Hand-tuning the loss weights $\lambda_k$ is time-consuming, inflexible to changes during training, and may not generalize across architectures or datasets. Furthermore, different network layers can exhibit disparate sensitivity to gradients from each loss, so a single global weight $\lambda_k$ is insufficient to ensure local balance. This is especially problematic for adversarial losses or in settings like multi-task learning and conditional GANs, where the optimal tradeoff between loss components is highly dynamic. MTAdam addresses these issues by enforcing that for every network layer, all loss terms produce gradients of (roughly) equal magnitude, while still leveraging Adam's adaptive moment estimation.

2. Notation and Per-Term Moment Estimation

Let $L_1,\dots,L_K$ be the loss terms, $\theta$ the full parameter vector, and $g_t^{(k)} = \nabla_\theta L_k(\theta_{t-1})$ the raw gradient of term $k$ at step $t$. A subscript $i$ selects the $i$-th parameter, as in $\theta_{t-1,i}$ and $g_{t,i}^{(k)}$. Layers are indexed by $\ell=1,\dots,L$, each being a subset of the parameters, and $g_{t,\ell}^{(k)}$ denotes the restriction of $g_t^{(k)}$ to layer $\ell$. (The layer count $L$ is unrelated to the loss $L(\theta)$.)

For each loss term $k$ and parameter $i$, MTAdam maintains separate first- and second-moment estimates analogous to those in Adam:

First moment:

$m_{t,i}^{(k)} = \beta_1 m_{t-1,i}^{(k)} + (1-\beta_1) g_{t,i}^{(k)}$

Second moment:

$v_{t,i}^{(k)} = \beta_2 v_{t-1,i}^{(k)} + (1-\beta_2) [g_{t,i}^{(k)}]^2$

Bias correction:

$\hat m_{t,i}^{(k)} = m_{t,i}^{(k)} / (1-\beta_1^t), \quad \hat v_{t,i}^{(k)} = v_{t,i}^{(k)} / (1-\beta_2^t)$

With a single loss term ($K=1$), and with the balancing step described later omitted, the scheme reduces exactly to Adam.
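The per-term bookkeeping above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the helper name `update_moments`, the toy gradients, and the zero initialization of the moments (borrowed from standard Adam) are all assumptions made for the example.

```python
def update_moments(m, v, g, t, beta1=0.9, beta2=0.999):
    """One moment update for a single loss term (hypothetical helper).

    m, v, g are equal-length lists of floats, one entry per parameter;
    t is the 1-based step counter. Returns (m, v, m_hat, v_hat).
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction, as in Adam
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    return m, v, m_hat, v_hat


# Toy setup: K = 2 loss terms, 3 parameters, made-up gradients
grads = {1: [0.5, -1.0, 2.0], 2: [100.0, 50.0, -25.0]}
# Assumption: moments start at zero, as in standard Adam
state = {k: ([0.0] * 3, [0.0] * 3) for k in grads}
hats = {}
for k, g in grads.items():
    m, v, m_hat, v_hat = update_moments(state[k][0], state[k][1], g, t=1)
    state[k] = (m, v)          # each term keeps its own (m, v)
    hats[k] = (m_hat, v_hat)
```

Each loss term carries its own `(m, v)` pair, so memory grows linearly with the number of terms $K$.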

3. Per-Layer Dynamic Balancing

To compare the impact of different losses at the layer level, MTAdam computes the per-layer $\ell_2$-norm of each term's gradient:

$M_{t,\ell}^{(k)} = \|g_{t,\ell}^{(k)}\|_2$

and maintains exponentially weighted averages for each:

$n_{t,\ell}^{(k)} = \beta_3 n_{t-1,\ell}^{(k)} + (1-\beta_3) M_{t,\ell}^{(k)}.$

Initialization uses $n_{0,\ell}^{(k)}=1$ to avoid division by zero. Dynamic balancing coefficients are constructed to locally rescale each loss term so that, after rescaling, all losses' gradients have similar norm:

$\alpha_{t,\ell}^{(k)} = \frac{n_{t,\ell}^{(1)}}{n_{t,\ell}^{(k)}}, \qquad \tilde{g}_{t,\ell}^{(k)} = \alpha_{t,\ell}^{(k)} g_{t,\ell}^{(k)}.$

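By anchoring to term $k=1$ (for which $\alpha_{t,\ell}^{(1)} = 1$), MTAdam maintains $\|\tilde{g}_{t,\ell}^{(k)}\|_2 \approx n_{t,\ell}^{(1)}$ for all $k$ at every layer $\ell$. For the balancing to affect the parameter update, the rescaled gradient $\tilde{g}_{t}^{(k)}$ takes the place of the raw gradient $g_{t}^{(k)}$ in the moment estimates of Section 2.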
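As a rough illustration of the balancing step for a single layer, the sketch below tracks the running norms and rescales each term's gradient toward the first term. The function names `l2_norm` and `balance_layer` and the toy gradients are assumptions made for this example, not part of any published API.

```python
import math


def l2_norm(xs):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in xs))


def balance_layer(n, grads, beta3=0.9):
    """Balance the gradients of one layer (hypothetical helper).

    n:     dict term -> running average of the gradient norm (start at 1.0)
    grads: dict term -> list of gradient entries for this layer
    Returns (updated n, rescaled gradients), anchored to term 1.
    """
    n = {k: beta3 * n[k] + (1 - beta3) * l2_norm(g) for k, g in grads.items()}
    scaled = {k: [n[1] / n[k] * x for x in g] for k, g in grads.items()}
    return n, scaled


# Toy layer with two terms whose gradient norms differ by a factor of 100
grads = {1: [3.0, 4.0], 2: [300.0, 400.0]}   # norms 5 and 500
n = {1: 1.0, 2: 1.0}                          # n_0 = 1, as in the text
for _ in range(200):                          # constant gradients, many steps
    n, scaled = balance_layer(n, grads)

norm_1 = l2_norm(scaled[1])
norm_2 = l2_norm(scaled[2])
```

With constant gradients the running averages converge to the true norms, so `norm_1` and `norm_2` end up nearly equal even though the raw norms differ by two orders of magnitude.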

4. Aggregated Parameter Update Mechanism

For each parameter, MTAdam forms a single update by aggregating the per-loss contributions. Importantly, the second moment used in the denominator is taken as the maximum across all loss terms for each parameter, providing robust step-size control in high-variance regimes:

$V_{t,i} = \max_{k=1{\ldots}K} \hat v_{t,i}^{(k)}$

and the parameter update is

$\theta_{t,i} = \theta_{t-1,i} - \eta \sum_{k=1}^K \left(\frac{\hat m_{t,i}^{(k)}}{\sqrt{V_{t,i}} + \epsilon}\right)$

or equivalently in vector form,

$\theta_t = \theta_{t-1} - \eta \sum_{k=1}^K \frac{\hat m_t^{(k)}}{\sqrt{\max_j \hat v_t^{(j)}} + \epsilon}$

This "worst-case" denominator prevents overshooting in any direction where gradients are volatile under any loss term.
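The aggregation rule can be sketched as follows, assuming the bias-corrected moments have already been computed for every term. The helper name `aggregate_step` and the numeric values are illustrative assumptions only.

```python
import math


def aggregate_step(theta, m_hats, v_hats, lr=1e-3, eps=1e-8):
    """One aggregated parameter update (hypothetical helper).

    theta:  list of parameters
    m_hats: list over loss terms of bias-corrected first moments
    v_hats: list over loss terms of bias-corrected second moments
    """
    new_theta = []
    for i, th in enumerate(theta):
        v_max = max(v[i] for v in v_hats)        # max over loss terms
        step = sum(m[i] for m in m_hats) / (math.sqrt(v_max) + eps)
        new_theta.append(th - lr * step)
    return new_theta


theta = [1.0, -2.0]
m_hats = [[0.1, -0.2], [0.3, 0.1]]       # two loss terms
v_hats = [[0.04, 0.01], [0.01, 0.09]]
new_theta = aggregate_step(theta, m_hats, v_hats)

# With a single term the rule is the usual Adam step
adam_like = aggregate_step([1.0], [[0.1]], [[0.04]])
```

For the first parameter the denominator uses $\sqrt{0.04}$ from the first term; for the second it uses $\sqrt{0.09}$ from the second term, so the more volatile term sets the step size in each coordinate.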

5. Hyperparameter Recommendations

MTAdam reuses canonical Adam hyperparameters for ease of adoption by practitioners and compatibility with existing setups:

$\eta$: base step size (typical Adam value, e.g., $10^{-3}$ for CNNs)
$\beta_1 = 0.9$: first-moment decay rate
$\beta_2 = 0.999$: second-moment decay rate
$\beta_3 = 0.9$: per-layer magnitude decay rate (set to $\approx \beta_1$)
$\epsilon = 10^{-8}$: denominator stability constant

This design ensures MTAdam introduces minimal new tuning overhead beyond standard Adam settings.
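One way to carry these defaults around in code is a small configuration object. The class name `MTAdamConfig`, its field names, and the validation ranges are assumptions made for illustration; they do not correspond to a published API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MTAdamConfig:
    """Hypothetical container for the defaults listed above."""
    lr: float = 1e-3      # eta, base step size
    beta1: float = 0.9    # first-moment decay
    beta2: float = 0.999  # second-moment decay
    beta3: float = 0.9    # per-layer norm decay, set close to beta1
    eps: float = 1e-8     # denominator stability constant

    def __post_init__(self):
        # Assumed sanity checks: decay rates in [0, 1), positive lr and eps
        for name in ("beta1", "beta2", "beta3"):
            if not 0.0 <= getattr(self, name) < 1.0:
                raise ValueError(f"{name} must lie in [0, 1)")
        if self.lr <= 0 or self.eps <= 0:
            raise ValueError("lr and eps must be positive")


cfg = MTAdamConfig()
```

Only `beta3` is new relative to Adam, which is why existing Adam settings can usually be carried over unchanged.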

6. Empirical Evaluation Results

MTAdam was evaluated on:

A controlled ten-term unbalanced MNIST classifier (with each digit as a separate loss) where randomly sampled per-class weights in $[1,1000]$ were used. MTAdam achieved $\sim 98\%$ accuracy and avoided the underfitting of low-weight classes observed with traditional Adam, RMSProp, or SGD using the same (unbalanced) weights.
Image-to-image translation (pix2pix, CycleGAN) and super-resolution (SRGAN), where MTAdam, initialized with uniform loss weights, consistently matched or outperformed the best hand-tuned Adam baselines from respective papers. Competing optimizers with unbalanced weights suffered from either mode collapse (GANs) or degraded FID, PSNR, and SSIM scores.

Ablation studies confirmed that the per-layer dynamic re-scaling, use of the first-term anchor, and the max-variance denominator are each essential to MTAdam's stability and convergence.

7. Relationship to Other Adam Variants and Extensions

MTAdam generalizes Adam by introducing per-term, per-parameter moment tracking, dynamic per-layer balancing, and a conservative variance-based denominator for the update rule (Malkiel et al., 2020). Other Adam extensions, such as Multiple Integral Adam (MIAdam) (Jin et al., 2024), target generalization through regularization or integration-based smoothing. MTAdam instead focuses on multi-term loss handling and per-layer balance, not on promoting flat minima or filtering noise. Both lines manipulate moment estimates, but their targets (loss balance versus landscape smoothing) and underlying mechanisms differ fundamentally.

MTAdam is positioned as a universal drop-in optimizer for multi-loss deep learning scenarios, dispensing with the need for manual loss-weight searches while retaining the convergence speed and adaptivity of Adam. Its instance-specific, dynamic, and per-layer balancing framework represents a distinct methodological advance in optimization for composite deep learning objectives (Malkiel et al., 2020).

References

Malkiel et al. (2020). MTAdam: Automatic Balancing of Multiple Training Loss Terms.

Jin et al. (2024). A Method for Enhancing Generalization of Adam by Multiple Integrations.
