2000 character limit reached

SAM-GS: Similarity-Aware Gradient Surgery

Updated 5 November 2025

The paper introduces SAM-GS, which regularizes gradient aggregation in multi-task deep learning by adaptively switching between gradient equalisation and momentum modulation.
It leverages a formally defined gradient magnitude similarity metric to detect conflicts and ensure fair convergence across diverse tasks.
Experimental results on synthetic and real-world benchmarks demonstrate SAM-GS’s superior stability, fairness, and efficiency compared to conventional methods.

Similarity-Aware Momentum Gradient Surgery (SAM-GS) is an optimization methodology developed for multi-task deep learning (MTDL) that regularizes the gradient aggregation process based on gradient magnitude similarity across tasks. Its primary motivation is to resolve conflicts that arise from disparities in the magnitude of task-specific gradients, ensuring fair and efficient convergence when training a single model on multiple heterogeneous objectives. SAM-GS adaptively switches between gradient equalisation and momentum modulation, driven by a formally defined similarity metric. The approach is applicable to both synthetic multi-task optimization and real-world deep learning workloads involving multiple supervised or reinforcement learning objectives (Borsani et al., 6 Jun 2025).

1. Motivation and Problem Context

In the MTDL setting, the training process often suffers from conflicting gradients, a phenomenon where loss functions from different tasks yield gradients with either disparate magnitudes (causing some tasks to dominate updates and others to be neglected) or opposing directions. Traditional optimization schemes such as uniform averaging, static task weighting, and canonical aggregation do not resolve such gradient conflicts and can result in optimization bias, inefficient convergence, and suboptimal generalization.

Classic gradient surgery techniques (e.g., projection-based approaches) address directional conflicts but typically remain agnostic to magnitude-based conflicts, which are prevalent and increasingly problematic as the number of tasks grows. SAM-GS explicitly targets magnitude-based conflicts, using gradient magnitude similarity as its principal indicator of when corrective action is necessary.

2. Mathematical Foundations: Gradient Magnitude Similarity

SAM-GS introduces a gradient magnitude similarity measure between any two gradients $g_i, g_j \in \mathbb{R}^d$ :

$\psi(g_i, g_j) = \frac{2 \|g_i\|_2 \|g_j\|_2}{\|g_i\|_2^2 + \|g_j\|_2^2}$

This measure is symmetric, bounded within $(0,1]$ , and attains its maximum value when both gradients have identical magnitudes. For a collection of $K$ tasks, the average pairwise magnitude similarity at iteration $t$ is

$\Psi_t = \frac{1}{K^2} \sum_{i,j} \psi(g_i, g_j)$

Low $\Psi_t$ indicates pronounced magnitude disparities (i.e., conflict), while high $\Psi_t$ indicates similar gradient strengths.

SAM-GS intentionally ignores directional (angle-based) conflicts, focusing instead on the regularization of magnitude spread.

3. Mechanisms: Algorithmic Structure of SAM-GS

The SAM-GS update protocol operates in four primary stages per iteration:

Gradient Computation & Similarity Assessment: Compute gradients $g_k$ for each task $k$ , then calculate $\Psi_t$ .
Momentum (EMA) Maintenance: For each task, update momentum variable $m_{k,t}$ :

$m_{k,t} \leftarrow \beta_1 m_{k,t-1} + (1-\beta_1) g_k$

Track a similarity momentum coefficient $h_t$ using an EMA:

$h_t \leftarrow \beta_2 h_{t-1} + (1-\beta_2)(1-\Psi_t)^2 + \epsilon$

Adaptive Regularization:
- If $\Psi_t < \gamma$ (dissimilar, i.e., magnitude conflict): apply gradient equalisation
$w_k = \frac{\overline{\|g_{k}\|}_2}{\|g_k\|_2} g_k$

Here, $\overline{\|g_{k}\|}_2$ denotes the average $L_2$ -norm across tasks. - If $\Psi_t \geq \gamma$ (similar magnitude): switch to momentum modulation

$w_k = \frac{|\widehat{m_{k,t}}|}{\sqrt{\widehat{h}_t} + \epsilon}$

Bias-corrected EMAs $\widehat{m_{k,t}}, \widehat{h}_t$ are used as in Adam optimizer.
Parameter Update: Aggregate reweighted gradients for all tasks, then update parameters:

$\theta_t = \theta_{t-1} - \alpha \sum_{k=1}^{K} w_k \odot g_k$

( $\odot$ denotes element-wise multiplication.)

The logic of this scheme ensures that updates are cautious and equalized during magnitude conflict, and accelerated using momentum when conflicts are absent.

4. Regularization, Task Fairness, and Hyperparameter Effects

The central regularization effect in SAM-GS is controlled via the similarity threshold $\gamma$ . When $\Psi_t$ falls below $\gamma$ , equalization prevents any single task from dominating optimization, which in classical approaches can lead to slow convergence or poor minority-task performance. For large $\gamma$ , most updates use momentum; for very small $\gamma$ , equalisation dominates.

Ablation results demonstrate that intermediate values of $\gamma$ ($0.7$–$0.9$) yield the best results, whereas the system degenerates for $\gamma=0$ (always equalise) or $\gamma=1$ (always apply momentum). This highlights the necessity for adaptive, data-driven regularization tuned by the current task gradient statistics.

Hyperparameters ( $\beta_1, \beta_2, \alpha, \gamma, \epsilon$ ) are mostly inherited from standard optimizers (as for Adam), with $\gamma$ typically chosen via validation.

5. Experimental Evaluation and Benchmarks

SAM-GS was tested on synthetic Nash-MTL optimization problems and real-world MTDL benchmarks:

Synthetic Landscape (Nash-MTL):

SAM-GS robustly attained global optima across all tested initializations and retained performance in problems involving multiple global optima and saddle regions, outperforming LS (linear sum), CAGrad, Aligned-MTL, Nash-MTL, and PCGrad.

CityScapes (2 tasks):

SAM-GS is competitive with PCGrad and specialized angle-focused approaches.

NYU-v2 (3 tasks), CelebA (40 tasks):

SAM-GS achieves state-of-the-art or superior performance, especially as task count and gradient magnitude variability increase, underscoring its efficacy in scenarios prone to magnitude-based conflicts.

MetaWorld MT10 (Multi-task Reinforcement Learning):

Modifying aggregation using $\Psi_t$ as min, SAM-GS matches Nash-MTL and outperforms alternatives, as measured by mean ranking and mean percent improvement.

Across all evaluations, results show that SAM-GS yields superior optimization stability and fairness (mean ranking or $\Delta m\%$ improvement), particularly in configurations where classical approaches suffer from task overshadowing or inefficient learning dynamics.

6. Comparison to Other Gradient Surgery Schemes

SAM-GS belongs to the family of gradient surgery methods that explicitly manipulate gradient aggregation based on task-to-task similarity:

Method	Conflict Type	Regularization Mechanism	Element of Innovation
PCGrad	Directional	Projection on opposing direction	Negates cosine-conflict
GS-Agr/Agr-Sum	Sign	Strict consensus, zero-out	Enforces sign alignment
SAM-GS	Magnitude	Magnitude similarity/adaptive	Dynamic equalisation/momentum

SAM-GS is distinct in explicitly targeting magnitude-based gradient conflicts, whereas principal alternatives focus on sign or angle. This suggests SAM-GS is complementary to direction-aware surgeries and particularly suited to large or highly heterogeneous task constellations where magnitude disparity suppresses minority learning signals (Borsani et al., 6 Jun 2025).

7. Significance, Limitations, and Applicability

SAM-GS is modular, compatible with standard optimizers, and computationally light. It is applicable in any MTDL or multi-objective optimization setting where gradient aggregation is central. Its emphasis on magnitude similarity ensures equitable optimization progress without requiring explicit manual task weighting or specialized tuning of learning rates across tasks.

A plausible implication is that in pure domain generalization or tasks where direction conflicts dominate, angle-based methods may be more effective (as observed in some CityScapes results), while SAM-GS delivers its advantages most pronouncedly where task imbalance is due to norm disparities. Its performance is sensitive to the choice of $\gamma$ , but ablation studies reveal that moderate adaptation suffices for typical MTDL workloads.

SAM-GS thereby contributes a mathematically principled, empirically validated, and implementationally tractable approach to harmonizing multi-task optimization by leveraging similarity-aware adaptive regularization of momentum and magnitude.

PDF Markdown Chat (Pro)

References (1)

Gradient Similarity Surgery in Multi-Task Deep Learning (2025)

Follow Topic

Get notified by email when new papers are published related to Similarity-Aware Momentum Gradient Surgery (SAM-GS).