PCGrad: Gradient Surgery for Multi-Task Learning

Updated 2 October 2025
  • PCGrad is a gradient modification technique that projects conflicting task gradients to reduce destructive interference in multi-task learning.
  • It iteratively adjusts gradients by removing antagonistic components, enhancing convergence and improving overall performance.
  • Its model-agnostic design and successful application in both supervised and reinforcement tasks highlight its practical significance.

PCGrad (Projecting Conflicting Gradients) is a gradient modification technique developed for multi-task neural network training to directly address destructive interference between task-specific gradients. In standard multi-task learning, aggregating per-task gradients by simple averaging can yield suboptimal progress when gradients point in opposing directions. PCGrad applies a form of “gradient surgery” by removing the component of a task gradient that conflicts (i.e., has negative cosine similarity) with another. This mitigates the optimization challenges characteristic of multi-task learning and improves both training efficiency and generalization performance. Because PCGrad modifies only the gradient combination stage, it is model-agnostic and compatible with a variety of architectures and loss weighting schemes.

1. Multi-Task Optimization Challenges

Multi-task learning with shared representations forces neural networks to optimize multiple losses $\{L_i(\theta)\}$ simultaneously, where each loss is associated with a distinct task. The gradients $\{g_i = \nabla_\theta L_i(\theta)\}$ extracted from these losses may interact non-trivially:

  • Conflicting Gradients: $g_i$ and $g_j$ can be anti-aligned, i.e., $g_i \cdot g_j < 0$ or $\cos\varphi_{ij} < 0$; averaging them leads to destructive interference.
  • Dominating Gradients: when one $g_i$ is much larger in magnitude than another, a simple sum is dominated by the larger gradient, diminishing the influence of the smaller one.
  • High Curvature: Shared directions in the loss landscape can have high positive curvature, resulting in overestimated improvement for one task and underestimated degradation for another.

These phenomena frequently coexist; the paper refers to their combination as the “tragic triad.” Aggregating unaltered gradients under these conditions can cause optimization to stall or diverge for certain tasks.
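
The interplay of the first two failure modes is easy to see on a toy example. The sketch below uses two hypothetical gradient vectors (the values are purely illustrative, not from the paper) to show a negative cosine similarity and an averaged update dominated by the larger task:

```python
import numpy as np

# Two hypothetical task gradients: anti-aligned and of very different scales.
g1 = np.array([1.0, 0.2])     # task 1 gradient
g2 = np.array([-5.0, 0.1])    # task 2 gradient: conflicting and much larger

cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print(f"cosine similarity: {cos:.2f}")   # negative => conflicting gradients

avg = 0.5 * (g1 + g2)
print("averaged update:", avg)           # dominated by g2; opposes task 1's descent direction
```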

2. PCGrad Algorithmic Principle

PCGrad’s primary operation is the projection of task gradients onto the normal plane of conflicting gradients. When $g_i$ and $g_j$ conflict ($g_i \cdot g_j < 0$), the following update is applied:

$$g_i^{(PC)} = g_i - \left(\frac{g_i \cdot g_j}{\lVert g_j \rVert^2}\right) g_j$$

This “gradient surgery” removes only the portion of $g_i$ that antagonizes $g_j$, thereby preserving components that are constructively aligned.
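
Expressed in code, the per-pair operation is a single projection. The following NumPy sketch only illustrates the equation above (it is not the authors' reference implementation), and the example vectors are hypothetical:

```python
import numpy as np

def project_conflict(g_i, g_j):
    """If g_i conflicts with g_j, remove g_i's component along g_j."""
    dot = g_i @ g_j
    if dot < 0:                                   # only conflicting pairs are modified
        g_i = g_i - dot / (np.linalg.norm(g_j) ** 2) * g_j
    return g_i

g_i = np.array([1.0, 0.2])
g_j = np.array([-5.0, 0.1])
g_i_pc = project_conflict(g_i, g_j)
print(g_i_pc @ g_j)   # ~0: the surgered gradient is orthogonal to g_j (up to rounding)
```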

The procedure iterates over the tasks in a mini-batch (their order is typically shuffled randomly at each iteration) and, for each $(i, j)$ pair, updates $g_i$ if it conflicts with $g_j$. Non-conflicting pairs are left unmodified. The final network update is the sum:

$$\Delta\theta = \sum_i g_i^{(PC)}$$

PCGrad avoids any modification where gradients are constructively aligned or orthogonal, ensuring that only destructive interference is mitigated.
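
A compact sketch of the full combination step, assuming each task gradient has already been flattened into a NumPy vector, might look as follows (an illustration of the algorithm as described above, not the official implementation):

```python
import random
import numpy as np

def pcgrad_combine(grads):
    """Combine a list of per-task gradient vectors via PCGrad-style surgery."""
    projected = []
    for i, g in enumerate(grads):
        g_pc = g.copy()
        others = [j for j in range(len(grads)) if j != i]
        random.shuffle(others)                    # random task order each iteration
        for j in others:
            dot = g_pc @ grads[j]
            if dot < 0:                           # conflict: project onto the normal plane of g_j
                g_pc = g_pc - dot / (np.linalg.norm(grads[j]) ** 2) * grads[j]
        projected.append(g_pc)
    return np.sum(projected, axis=0)              # final update direction: sum of surgered gradients
```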

3. Theoretical Properties and Convergence

Theoretical analysis in the convex setting establishes convergence guarantees for PCGrad. Specifically, if the learning rate $t$ is sufficiently small ($t \leq 1/L$, with $L$ the Lipschitz constant of the gradient) and the gradients are not exactly anti-aligned ($\cos\varphi_{12} \neq -1$), the PCGrad update converges to a minimum of the aggregate loss $L(\theta) = \sum_i L_i(\theta)$.

Further, both sufficient and necessary conditions for PCGrad’s effectiveness involve the cosine similarity between task gradients, their relative magnitudes, and the local curvature of LL. In regions of high curvature and significant gradient magnitude disparity, PCGrad guarantees improved loss reduction compared to the standard summed gradient update.
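
A short calculation, following directly from the projection formula in Section 2, makes explicit why the surgered gradient cannot harm the task it conflicted with, at least to first order: after projection, $g_i^{(PC)}$ has zero inner product with $g_j$,

$$g_j \cdot g_i^{(PC)} = g_j \cdot g_i - \frac{g_i \cdot g_j}{\lVert g_j \rVert^2}\,(g_j \cdot g_j) = g_j \cdot g_i - g_i \cdot g_j = 0,$$

so an infinitesimal step along $g_i^{(PC)}$ leaves $L_j$ unchanged rather than increasing it.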

4. Implementation Features

PCGrad is a modification to the gradient aggregation phase and is agnostic to the underlying model architecture. It can be directly incorporated into existing multi-task frameworks and is compatible with previously proposed multi-task loss weighting strategies. For each mini-batch during backpropagation, the following steps are executed:

  1. Compute task-specific gradients $\{g_i\}$ for all tasks.
  2. For each $g_i$, iterate over the other task gradients $\{g_j\}_{j \neq i}$ (possibly in random order):
    • If $g_i \cdot g_j < 0$, compute $g_i^{(PC)}$ as above.
  3. Aggregate the modified gradients: $\Delta\theta = \sum_i g_i^{(PC)}$.
  4. Apply parameter update.

This design ensures that PCGrad has minimal impact on overall computational costs and can leverage existing optimization infrastructure.
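
A hedged sketch of how these four steps could be wired into a PyTorch training step is given below; `model`, `task_losses`, and `optimizer` are placeholders assumed for illustration, and this is not the authors' reference code:

```python
import random
import torch

def pcgrad_update(model, task_losses, optimizer):
    params = [p for p in model.parameters() if p.requires_grad]

    # Step 1: per-task gradients over the shared parameters, flattened into vectors.
    grads = []
    for loss in task_losses:
        g = torch.autograd.grad(loss, params, retain_graph=True)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))

    # Step 2: project each task gradient away from the gradients it conflicts with.
    projected = []
    for i in range(len(grads)):
        g_pc = grads[i].clone()
        others = [j for j in range(len(grads)) if j != i]
        random.shuffle(others)
        for j in others:
            dot = torch.dot(g_pc, grads[j])
            if dot < 0:  # conflicting pair: remove the component along g_j
                g_pc = g_pc - dot / grads[j].norm() ** 2 * grads[j]
        projected.append(g_pc)

    # Step 3: aggregate the modified gradients.
    total = torch.stack(projected).sum(dim=0)

    # Step 4: write the result back into .grad and apply the optimizer update.
    optimizer.zero_grad()
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = total[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```

Since the projections operate on flattened copies of the gradients, the surgery adds only a handful of dot products and vector operations per task pair on top of the usual backward passes.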

5. Empirical Performance

Extensive empirical evaluation demonstrates PCGrad’s benefits across both supervised and reinforcement-learning multi-task problems:

  • Supervised Learning: On datasets such as CityScapes, NYUv2, and MultiMNIST, PCGrad in conjunction with architectures like Multi-Task Attention Networks (MTAN) yields improved segmentation accuracy, lower depth estimation error, and better surface normal prediction compared to baselines without gradient surgery.
  • Reinforcement Learning: On Meta-World’s MT10 and MT50 benchmarks, combining PCGrad with Soft Actor–Critic (SAC) achieves higher success rates and faster convergence than standalone SAC or individually trained agents.

These results suggest that PCGrad effectively improves data efficiency and final task performance in diverse multi-task scenarios.

6. Contextual Significance and Applicability

PCGrad’s model-agnostic nature permits integration with a broad spectrum of multi-task learning paradigms, including advanced architectures and loss weighting heuristics. Because it modifies only conflicting gradient components, it preserves constructively aligned directions while preventing destructive interference. A plausible implication is that PCGrad can serve as a general-purpose remedy for optimization bottlenecks in settings where tasks share parameters and traditional gradient aggregation fails due to the “tragic triad.” Its utility spans both supervised and deep reinforcement learning domains, as evidenced by empirical studies.

PCGrad is distinguished by its simplicity of implementation and its theoretical grounding, situating it as a foundational optimization technique within the multi-task learning literature.
