PCGrad: Gradient Surgery for Multi-Task Learning
- PCGrad is a gradient modification technique that projects conflicting task gradients to reduce destructive interference in multi-task learning.
- It iteratively adjusts gradients by removing antagonistic components, enhancing convergence and improving overall performance.
- Its model-agnostic design and successful application in both supervised and reinforcement learning tasks highlight its practical significance.
PCGrad (Projecting Conflicting Gradients) is a gradient modification technique developed for multi-task neural network training that directly addresses destructive interference between task-specific gradients. In standard multi-task learning, aggregating per-task gradients by simple averaging can result in suboptimal progress when gradients point in opposing directions. PCGrad applies a form of “gradient surgery” by removing the component of a task gradient that conflicts with (i.e., has negative cosine similarity to) another task’s gradient. This mitigates the optimization challenges characteristic of multi-task learning and improves both training efficiency and generalization performance. PCGrad is model-agnostic: it modifies only the gradient combination stage and is therefore compatible with a variety of architectures and loss weighting schemes.
1. Multi-Task Optimization Challenges
Multi-task learning with shared representations forces neural networks to optimize multiple losses simultaneously, where each loss $L_i$ is associated with a distinct task. Writing $g_i = \nabla_\theta L_i(\theta)$ for the gradient of task $i$'s loss with respect to the shared parameters $\theta$, these gradients may interact non-trivially:
- Conflicting Gradients: $g_i$ and $g_j$ can be anti-aligned, i.e., $g_i \cdot g_j < 0$ or, equivalently, $\cos\phi_{ij} < 0$ where $\phi_{ij}$ is the angle between them; averaging these leads to destructive interference.
- Dominating Gradients: when one gradient is much larger in magnitude than another ($\|g_i\| \gg \|g_j\|$), a simple sum is dominated by the larger gradient, diminishing the influence of the smaller one.
- High Curvature: Shared directions in the loss landscape can have high positive curvature, resulting in overestimated improvement for one task and underestimated degradation for another.
These phenomena frequently coexist, a combination the paper calls the “tragic triad.” Aggregating unaltered gradients under these conditions can cause optimization to stall or diverge for certain tasks, as the numeric sketch below illustrates.
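The sketch uses two hypothetical 2-D task gradients (the values are illustrative, not taken from the paper) and shows plain averaging being dominated by the larger, conflicting gradient:

```python
import numpy as np

# Hypothetical per-task gradients over shared parameters (illustrative values only).
g1 = np.array([1.0, 0.2])    # small gradient for task 1
g2 = np.array([-4.0, 0.4])   # large gradient for task 2, conflicting with g1

cos_sim = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print(f"cosine similarity: {cos_sim:.2f}")  # negative, so the gradients conflict

avg = 0.5 * (g1 + g2)
print("averaged update direction:", avg)    # dominated by the larger gradient g2
# Negative alignment: a descent step along the averaged gradient increases
# task 1's loss to first order.
print("alignment with task 1:", g1 @ avg)
```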
2. PCGrad Algorithmic Principle
PCGrad’s primary operation is the projection of task gradients onto the normal plane of conflicting gradients. When $g_i$ and $g_j$ conflict ($g_i \cdot g_j < 0$), the following update is applied:

$$g_i \leftarrow g_i - \frac{g_i \cdot g_j}{\|g_j\|^2}\, g_j$$

This “gradient surgery” removes only the portion of $g_i$ that antagonizes $g_j$, thereby preserving components that are constructively aligned.
The procedure iterates over the tasks in a mini-batch, often visiting the other tasks in a random order for each iteration, and for each pair $(i, j)$ updates $g_i$ if it conflicts with $g_j$. Non-conflicting pairs are unmodified. The final network update is the sum of the projected gradients $g_i^{\mathrm{PC}}$:

$$\Delta\theta \propto \sum_i g_i^{\mathrm{PC}}$$
PCGrad avoids any modification where gradients are constructively aligned or orthogonal, ensuring that only destructive interference is mitigated.
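The projection and the per-task loop described above can be written compactly. Below is a minimal NumPy sketch of the procedure; the function name and structure are illustrative and do not follow the authors' released implementation:

```python
import numpy as np

def pcgrad(grads, rng=None):
    """Apply PCGrad to a list of flattened task gradients and return their sum.

    grads: list of 1-D arrays, one per task (g_1, ..., g_T).
    """
    if rng is None:
        rng = np.random.default_rng()
    projected = []
    for i, g_i in enumerate(grads):
        g = g_i.copy()
        # Visit the other tasks in random order, as in the procedure described above.
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            g_j = grads[j]
            dot = g @ g_j
            if dot < 0:  # conflicting: remove the component of g along g_j
                g = g - (dot / (g_j @ g_j)) * g_j
        projected.append(g)
    return np.sum(projected, axis=0)  # summed projected gradients

# Reusing the hypothetical gradients from the earlier sketch:
g1 = np.array([1.0, 0.2])
g2 = np.array([-4.0, 0.4])
print(pcgrad([g1, g2]))
```

Note that each task projects against the other tasks' original gradients, while only the running copy of $g_i$ is modified.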
3. Theoretical Properties and Convergence
Theoretical analysis in the convex setting establishes convergence guarantees for PCGrad. Specifically, if the learning rate is sufficiently small ($t \le 1/L$, with $L$ the Lipschitz constant of $\nabla\mathcal{L}$), and if the gradients are never exactly anti-aligned ($\cos\phi_{12} > -1$), the PCGrad update will converge to a minimum of the aggregate loss $\mathcal{L}(\theta) = L_1(\theta) + L_2(\theta)$.
Further, both sufficient and necessary conditions for PCGrad’s effectiveness involve the cosine similarity between task gradients, their relative magnitudes, and the local curvature of $\mathcal{L}$. In regions of high curvature and significant gradient magnitude disparity, PCGrad achieves a greater reduction of the multi-task loss than the standard summed gradient update.
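For reference, the convergence statement above can be written compactly as follows; this is a paraphrase of the two-task result, using the notation introduced above, rather than a verbatim quotation of the paper's theorem:

```latex
% Paraphrased two-task convergence result for PCGrad.
% Assumptions: L_1, L_2 convex and differentiable; \nabla\mathcal{L} is L-Lipschitz with L > 0.
\newtheorem{theorem}{Theorem}
\begin{theorem}
Let $\mathcal{L}(\theta) = L_1(\theta) + L_2(\theta)$ and let $\phi_{12}$ denote the angle between
$g_1 = \nabla L_1(\theta)$ and $g_2 = \nabla L_2(\theta)$. With step size $t \le 1/L$, the PCGrad
update $\theta \leftarrow \theta - t\,(g_1^{\mathrm{PC}} + g_2^{\mathrm{PC}})$ converges either to a
point where $\cos\phi_{12} = -1$ or to the optimal value $\mathcal{L}(\theta^{*})$.
\end{theorem}
```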
4. Implementation Features
PCGrad is a modification to the gradient aggregation phase and is agnostic to the underlying model architecture. It can be directly incorporated into existing multi-task frameworks and is compatible with previously proposed multi-task loss weighting strategies. For each mini-batch during backpropagation, the following steps are executed:
- Compute the task-specific gradients $g_i$ for all tasks.
- For each $g_i$, iterate over the other task gradients $g_j$ (possibly in random order):
- If $g_i \cdot g_j < 0$, project $g_i$ onto the normal plane of $g_j$ as described above.
- Aggregate the modified gradients: $\sum_i g_i^{\mathrm{PC}}$.
- Apply parameter update.
This design ensures that PCGrad has minimal impact on overall computational costs and can leverage existing optimization infrastructure.
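A sketch of how these steps could be wired into a standard training loop is shown below, using PyTorch autograd; the function name, the flattening strategy, and the surrounding training code are assumptions for illustration, not the authors' reference implementation:

```python
import torch

def pcgrad_step(losses, params, optimizer):
    """One PCGrad update for a list of per-task losses over shared parameters."""
    # 1. Compute task-specific gradients, flattened into one vector per task.
    task_grads = []
    for loss in losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        flat = torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for g, p in zip(grads, params)
        ])
        task_grads.append(flat)

    # 2. Project each task gradient away from the gradients it conflicts with.
    projected = []
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        order = torch.randperm(len(task_grads))
        for j in order.tolist():
            if j == i:
                continue
            g_j = task_grads[j]
            dot = torch.dot(g, g_j)
            if dot < 0:  # conflicting pair: remove the component of g along g_j
                g = g - dot / g_j.pow(2).sum() * g_j
        projected.append(g)

    # 3. Aggregate, write the result back into .grad, and take an optimizer step.
    total = torch.stack(projected).sum(dim=0)
    offset = 0
    for p in params:
        numel = p.numel()
        p.grad = total[offset:offset + numel].view_as(p).clone()
        offset += numel
    optimizer.step()
    optimizer.zero_grad()
```

A typical call would be `pcgrad_step([loss_a, loss_b], list(model.parameters()), optimizer)` once per mini-batch, after the per-task losses have been computed.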
5. Empirical Performance
Extensive empirical evaluation demonstrates PCGrad’s benefits across both supervised and reinforcement learning multi-task problems:
- Supervised Learning: On datasets such as CityScapes, NYUv2, and MultiMNIST, PCGrad in conjunction with architectures like Multi-Task Attention Networks (MTAN) yields improved segmentation accuracy, lower depth estimation error, and better surface normal prediction compared to baselines without gradient surgery.
- Reinforcement Learning: On Meta-World’s MT10 and MT50 benchmarks, combining PCGrad with Soft Actor–Critic (SAC) achieves higher success rates and faster convergence than standalone SAC or individually trained agents.
These results suggest that PCGrad effectively improves data efficiency and final task performance in diverse multi-task scenarios.
6. Contextual Significance and Applicability
PCGrad’s model-agnostic nature permits integration with a broad spectrum of multi-task learning paradigms, including advanced architectures and loss weighting heuristics. Because it modifies only conflicting gradient components, it preserves constructively aligned directions while preventing destructive interference. A plausible implication is that PCGrad can serve as a general-purpose remedy for optimization bottlenecks in any setting where tasks share parameters and traditional gradient aggregation fails due to the “tragic triad.” Its utility spans both supervised and deep reinforcement learning domains, as evidenced by empirical studies.
PCGrad is distinguished by its simplicity of implementation and its theoretical grounding, situating it as a foundational optimization technique within the multi-task learning literature.