PCGrad: Gradient Surgery for Multi-Task Learning
- PCGrad is a gradient modification technique that projects conflicting task gradients to reduce destructive interference in multi-task learning.
- It iteratively adjusts gradients by removing antagonistic components, enhancing convergence and improving overall performance.
- Its model-agnostic design and successful application in both supervised and reinforcement learning tasks highlight its practical significance.
PCGrad (Projecting Conflicting Gradients) is a gradient modification technique developed for multi-task neural network training that directly addresses destructive interference between task-specific gradients. In standard multi-task learning, aggregating per-task gradients by simple averaging can result in suboptimal progress when gradients point in opposing directions. PCGrad applies a form of “gradient surgery” by removing the component of a task gradient that conflicts with (i.e., has negative cosine similarity to) another task’s gradient. This mitigates the optimization challenges characteristic of multi-task learning and improves both training efficiency and generalization performance. PCGrad is model-agnostic: it modifies only the gradient combination stage and is therefore compatible with a variety of architectures and loss weighting schemes.
1. Multi-Task Optimization Challenges
Multi-task learning with shared representations forces neural networks to optimize multiple losses simultaneously, where each loss $L_i$ is associated with a distinct task. Writing $g_i = \nabla_\theta L_i(\theta)$ for the gradient of task $i$'s loss with respect to the shared parameters $\theta$, these gradients may interact non-trivially:
- Conflicting Gradients: $g_i$ and $g_j$ can be anti-aligned, i.e., $g_i \cdot g_j < 0$ or, equivalently, $\cos\phi_{ij} < 0$ where $\phi_{ij}$ is the angle between them; averaging these leads to destructive interference.
- Dominating Gradients: when one gradient is much larger in magnitude than another ($\|g_i\| \gg \|g_j\|$), a simple sum is dominated by the larger gradient, diminishing the influence of the smaller one.
- High Curvature: Shared directions in the loss landscape can have high positive curvature, resulting in overestimated improvement for one task and underestimated degradation for another.
These phenomena frequently coexist, a combination the paper calls the “tragic triad.” Aggregating unaltered gradients under these conditions can cause optimization to stall or diverge for certain tasks, as the numeric sketch below illustrates.
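The sketch uses two hypothetical 2-D task gradients (the values are illustrative, not taken from the paper) and shows plain averaging being dominated by the larger, conflicting gradient:

```python
import numpy as np

# Hypothetical per-task gradients over shared parameters (illustrative values only).
g1 = np.array([1.0, 0.2])    # small gradient for task 1
g2 = np.array([-4.0, 0.4])   # large gradient for task 2, conflicting with g1

cos_sim = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
print(f"cosine similarity: {cos_sim:.2f}")  # negative, so the gradients conflict

avg = 0.5 * (g1 + g2)
print("averaged update direction:", avg)    # dominated by the larger gradient g2
# Negative alignment: a descent step along the averaged gradient increases
# task 1's loss to first order.
print("alignment with task 1:", g1 @ avg)
```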
2. PCGrad Algorithmic Principle
PCGrad’s primary operation is the projection of task gradients onto the normal plane of conflicting gradients. When $g_i$ and $g_j$ conflict ($g_i \cdot g_j < 0$), the following update is applied:

$$g_i \leftarrow g_i - \frac{g_i \cdot g_j}{\|g_j\|^2}\, g_j$$

This “gradient surgery” removes only the portion of $g_i$ that antagonizes $g_j$, thereby preserving components that are constructively aligned.
The procedure iterates over the tasks in a mini-batch, often visiting the other tasks in a random order for each iteration, and for each pair $(i, j)$ updates $g_i$ if it conflicts with $g_j$. Non-conflicting pairs are unmodified. The final network update is the sum of the projected gradients $g_i^{\mathrm{PC}}$:

$$\Delta\theta \propto \sum_i g_i^{\mathrm{PC}}$$
PCGrad avoids any modification where gradients are constructively aligned or orthogonal, ensuring that only destructive interference is mitigated.
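The projection and the per-task loop described above can be written compactly. Below is a minimal NumPy sketch of the procedure; the function name and structure are illustrative and do not follow the authors' released implementation:

```python
import numpy as np

def pcgrad(grads, rng=None):
    """Apply PCGrad to a list of flattened task gradients and return their sum.

    grads: list of 1-D arrays, one per task (g_1, ..., g_T).
    """
    if rng is None:
        rng = np.random.default_rng()
    projected = []
    for i, g_i in enumerate(grads):
        g = g_i.copy()
        # Visit the other tasks in random order, as in the procedure described above.
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            g_j = grads[j]
            dot = g @ g_j
            if dot < 0:  # conflicting: remove the component of g along g_j
                g = g - (dot / (g_j @ g_j)) * g_j
        projected.append(g)
    return np.sum(projected, axis=0)  # summed projected gradients

# Reusing the hypothetical gradients from the earlier sketch:
g1 = np.array([1.0, 0.2])
g2 = np.array([-4.0, 0.4])
print(pcgrad([g1, g2]))
```

Note that each task projects against the other tasks' original gradients, while only the running copy of $g_i$ is modified.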
3. Theoretical Properties and Convergence
Theoretical analysis in the convex setting establishes convergence guarantees for PCGrad. Specifically, if the learning rate is sufficiently small ($t \le 1/L$, with $L$ the Lipschitz constant of $\nabla\mathcal{L}$), and if the gradients are never exactly anti-aligned ($\cos\phi_{12} > -1$), the PCGrad update will converge to a minimum of the aggregate loss $\mathcal{L}(\theta) = L_1(\theta) + L_2(\theta)$.
Further, both sufficient and necessary conditions for PCGrad’s effectiveness involve the cosine similarity between task gradients, their relative magnitudes, and the local curvature of $\mathcal{L}$. In regions of high curvature and significant gradient magnitude disparity, PCGrad achieves a greater reduction of the multi-task loss than the standard summed gradient update.
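For reference, the convergence statement above can be written compactly as follows; this is a paraphrase of the two-task result, using the notation introduced above, rather than a verbatim quotation of the paper's theorem:

```latex
% Paraphrased two-task convergence result for PCGrad.
% Assumptions: L_1, L_2 convex and differentiable; \nabla\mathcal{L} is L-Lipschitz with L > 0.
\newtheorem{theorem}{Theorem}
\begin{theorem}
Let $\mathcal{L}(\theta) = L_1(\theta) + L_2(\theta)$ and let $\phi_{12}$ denote the angle between
$g_1 = \nabla L_1(\theta)$ and $g_2 = \nabla L_2(\theta)$. With step size $t \le 1/L$, the PCGrad
update $\theta \leftarrow \theta - t\,(g_1^{\mathrm{PC}} + g_2^{\mathrm{PC}})$ converges either to a
point where $\cos\phi_{12} = -1$ or to the optimal value $\mathcal{L}(\theta^{*})$.
\end{theorem}
```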
4. Implementation Features
PCGrad is a modification to the gradient aggregation phase and is agnostic to the underlying model architecture. It can be directly incorporated into existing multi-task frameworks and is compatible with previously proposed multi-task loss weighting strategies. For each mini-batch during backpropagation, the following steps are executed:
- Compute the task-specific gradients $g_i$ for all tasks.
- For each $g_i$, iterate over the other task gradients $g_j$ (possibly in random order):
- If $g_i \cdot g_j < 0$, project $g_i$ onto the normal plane of $g_j$ as described above.
- Aggregate the modified gradients: $\sum_i g_i^{\mathrm{PC}}$.
- Apply parameter update.
This design ensures that PCGrad has minimal impact on overall computational costs and can leverage existing optimization infrastructure.
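A sketch of how these steps could be wired into a standard training loop is shown below, using PyTorch autograd; the function name, the flattening strategy, and the surrounding training code are assumptions for illustration, not the authors' reference implementation:

```python
import torch

def pcgrad_step(losses, params, optimizer):
    """One PCGrad update for a list of per-task losses over shared parameters."""
    # 1. Compute task-specific gradients, flattened into one vector per task.
    task_grads = []
    for loss in losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        flat = torch.cat([
            (g if g is not None else torch.zeros_like(p)).reshape(-1)
            for g, p in zip(grads, params)
        ])
        task_grads.append(flat)

    # 2. Project each task gradient away from the gradients it conflicts with.
    projected = []
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        order = torch.randperm(len(task_grads))
        for j in order.tolist():
            if j == i:
                continue
            g_j = task_grads[j]
            dot = torch.dot(g, g_j)
            if dot < 0:  # conflicting pair: remove the component of g along g_j
                g = g - dot / g_j.pow(2).sum() * g_j
        projected.append(g)

    # 3. Aggregate, write the result back into .grad, and take an optimizer step.
    total = torch.stack(projected).sum(dim=0)
    offset = 0
    for p in params:
        numel = p.numel()
        p.grad = total[offset:offset + numel].view_as(p).clone()
        offset += numel
    optimizer.step()
    optimizer.zero_grad()
```

A typical call would be `pcgrad_step([loss_a, loss_b], list(model.parameters()), optimizer)` once per mini-batch, after the per-task losses have been computed.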
5. Empirical Performance
Extensive empirical evaluation demonstrates PCGrad’s benefits across both supervised and reinforcement learning multi-task problems:
- Supervised Learning: On datasets such as CityScapes, NYUv2, and MultiMNIST, PCGrad in conjunction with architectures like Multi-Task Attention Networks (MTAN) yields improved segmentation accuracy, lower depth estimation error, and better surface normal prediction compared to baselines without gradient surgery.
- Reinforcement Learning: On Meta-World’s MT10 and MT50 benchmarks, combining PCGrad with Soft Actor–Critic (SAC) achieves higher success rates and faster convergence than standalone SAC or individually trained agents.
These results suggest that PCGrad effectively improves data efficiency and final task performance in diverse multi-task scenarios.
6. Contextual Significance and Applicability
PCGrad’s model-agnostic nature permits integration with a broad spectrum of multi-task learning paradigms, including advanced architectures and loss weighting heuristics. Because it modifies only conflicting gradient components, it preserves constructively aligned directions while preventing destructive interference. A plausible implication is that PCGrad can serve as a general-purpose remedy for optimization bottlenecks in any setting where tasks share parameters and traditional gradient aggregation fails due to the “tragic triad.” Its utility spans both supervised and deep reinforcement learning domains, as evidenced by empirical studies.
PCGrad is distinguished by its simplicity of implementation and its theoretical grounding, situating it as a foundational optimization technique within the multi-task learning literature.