PCGrad: Resolving Gradient Conflicts in MTL

Updated 7 May 2026

The paper demonstrates that PCGrad directly projects conflicting gradients to prevent negative interference between tasks, enhancing convergence and test accuracy.
PCGrad offers a hyperparameter-free and computationally efficient solution that requires only minor modifications to standard gradient aggregation methods.
Empirical studies reveal that PCGrad significantly improves performance across supervised, reinforcement, and physics-informed learning, achieving notable accuracy and speed gains.

Projecting Conflicting Gradients (PCGrad) is a gradient manipulation scheme designed to address the optimization challenges inherent in multi-task learning and composite neural optimization objectives. The method operates by identifying and resolving destructive interference between gradients associated with different task-specific loss terms, directly intervening in the standard gradient aggregation mechanism to promote constructive update directions and mitigate negative transfer. PCGrad was first introduced in the context of deep multi-task learning and has since found applications across supervised, reinforcement, and physics-informed learning paradigms (Yu et al., 2020, Zhou et al., 2021, Bohn et al., 2024, Xiao et al., 16 Apr 2026).

1. Motivation and Theoretical Foundations

Multi-task learning (MTL) and multi-loss frameworks, such as physics-informed neural networks (PINNs), aggregate loss terms $\{\mathcal{L}_i\}_{i=1}^K$ corresponding to different objectives, constraints, or tasks. Optimization traditionally proceeds by summing the gradients, $g = \sum_i \nabla_\theta \mathcal{L}_i$. However, when individual gradients are (i) highly imbalanced in magnitude or (ii) oriented in conflicting directions (i.e., negative cosine similarity), naïve aggregation leads to destructive interference. This results in oscillations, slow convergence, suboptimal solutions, and negative transfer, where improvement in one task comes at the expense of another (Yu et al., 2020, Zhou et al., 2021).

Theoretical analysis shows that when task gradients $g_i$ and $g_j$ satisfy

$\omega(g_i, g_j) = \frac{g_i \cdot g_j}{\|g_i\|\|g_j\|} < 0,$

inter-task gradient conflict occurs, violating aligned descent for the shared parameter vector (Yu et al., 2020). Empirical studies further reveal that such conflicts are both common and detrimental across domains, motivating algorithmic intervention.
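The conflict criterion can be checked directly from two gradient vectors. The following is a minimal sketch, assuming gradients are available as flat lists of floats and are nonzero; the helper names `dot`, `cosine_similarity`, and `is_conflicting` are illustrative and not taken from the cited papers.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors given as sequences of floats."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(g_i, g_j):
    """Cosine of the angle between two nonzero gradient vectors."""
    return dot(g_i, g_j) / (math.sqrt(dot(g_i, g_i)) * math.sqrt(dot(g_j, g_j)))

def is_conflicting(g_i, g_j):
    """True when the inner product (and hence the cosine similarity) is negative."""
    return dot(g_i, g_j) < 0.0

# Toy gradients (assumed values, for illustration only).
g_a = [1.0, 0.0]
g_b = [-1.0, 1.0]   # points partly against g_a
g_c = [1.0, 1.0]    # points partly along g_a

print(cosine_similarity(g_a, g_b), is_conflicting(g_a, g_b))
print(cosine_similarity(g_a, g_c), is_conflicting(g_a, g_c))
```

Only the sign of the inner product matters for detecting a conflict, so the normalization in `cosine_similarity` is needed only when the degree of conflict is of interest.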

2. Mathematical Formulation and Algorithm

PCGrad operates by "surgically" removing components of a task's gradient that directly oppose other task gradients. The core update for a pair of conflicting gradients is

$g_i \longleftarrow g_i - \frac{g_i \cdot g_j}{\|g_j\|^2} g_j, \quad \text{if } g_i \cdot g_j < 0,$

where $g_i$ and $g_j$ are the gradients of tasks $i$ and $j$ with respect to the shared model parameters. This projection ensures that, post-modification, the updated $g_i$ is non-conflicting with $g_j$: their inner product is zero.
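As a small worked instance of the projection formula, the sketch below applies it to two toy vectors. It assumes gradients are flat lists of floats with $g_j \neq 0$; the function name `project_if_conflicting` is a hypothetical helper, not an API from any library.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_if_conflicting(g_i, g_j):
    """Return g_i with its component along g_j removed when g_i . g_j < 0.

    When the gradients do not conflict, an unchanged copy of g_i is returned.
    """
    inner = dot(g_i, g_j)
    if inner >= 0.0:
        return list(g_i)
    scale = inner / dot(g_j, g_j)          # (g_i . g_j) / ||g_j||^2
    return [a - scale * b for a, b in zip(g_i, g_j)]

# Toy example (assumed values): g_i . g_j = -1, ||g_j||^2 = 2.
g_i = [1.0, 0.0]
g_j = [-1.0, 1.0]
projected = project_if_conflicting(g_i, g_j)

print(projected)            # [0.5, 0.5]
print(dot(projected, g_j))  # 0.0
```

The projected vector keeps the part of $g_i$ that is orthogonal to $g_j$, so a step along it no longer increases the loss of task $j$ to first order.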

The PCGrad algorithm proceeds as follows (Yu et al., 2020, Zhou et al., 2021, Bohn et al., 2024):

For each task $i$, initialize $g_i^{\mathrm{PC}} \leftarrow g_i$.
Randomly order the tasks; for each task $i$:
- Draw the other tasks $j \neq i$ one at a time in random order.
- Whenever $g_i^{\mathrm{PC}} \cdot g_j < 0$, project $g_i^{\mathrm{PC}}$ onto the normal plane of $g_j$ using the formula above, with $g_i^{\mathrm{PC}}$ in place of $g_i$.
Form the aggregated update $g^{\mathrm{PC}} = \sum_i g_i^{\mathrm{PC}}$.
Update parameters: $\theta \leftarrow \mathrm{Opt}(\theta, g^{\mathrm{PC}})$, where $\mathrm{Opt}$ is any base optimizer.

PCGrad is hyperparameter-free, requires only minor modification to the optimizer's gradient aggregation step, and is compatible with any first-order optimization method (Yu et al., 2020, Zhou et al., 2021).
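The steps above can be sketched in a few lines. This is an illustrative, dependency-free version that assumes per-task gradients have already been computed and flattened into lists of floats; `pcgrad` and `sgd_step` are hypothetical names, and plain gradient descent stands in for the base optimizer.

```python
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pcgrad(task_grads, rng=random):
    """Aggregate per-task gradients after pairwise conflict projection.

    task_grads: list of K gradient vectors over the shared parameters.
    rng: object with a shuffle() method (the random module or a random.Random).
    """
    projected = [list(g) for g in task_grads]        # working copies of each g_i
    for i, g_pc in enumerate(projected):
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)                          # random order over the other tasks
        for j in others:
            g_j = task_grads[j]                      # always the original g_j
            inner = dot(g_pc, g_j)
            if inner < 0.0:                          # conflict: remove the opposing part
                scale = inner / dot(g_j, g_j)
                for k in range(len(g_pc)):
                    g_pc[k] -= scale * g_j[k]
    # Sum the projected gradients component-wise.
    return [sum(components) for components in zip(*projected)]

def sgd_step(theta, direction, lr=0.1):
    """Plain gradient-descent update, standing in for any base optimizer."""
    return [t - lr * d for t, d in zip(theta, direction)]

# Two conflicting toy gradients (assumed values).
grads = [[1.0, 0.0], [-1.0, 1.0]]
update = pcgrad(grads)
print(update)                        # [0.5, 1.5]; the plain sum would be [0.0, 1.0]
print(sgd_step([0.0, 0.0], update))
```

Each task's working copy is projected against the original gradients of the other tasks, so the result for one task does not depend on projections already applied to another.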

3. Implementation and Computational Considerations

PCGrad introduces minimal computational overhead. For $K$ tasks, each iteration requires $K$ backward passes (unless gradient sharing is used) and $O(K^2)$ inner products for the pairwise projections. For small $K$, such as multi-component physics-informed losses or asymmetric two-task setups (LLM unlearning), the cost is negligible (Zhou et al., 2021, Xiao et al., 16 Apr 2026). The method integrates readily with standard neural optimization libraries and adds no hyperparameters.

In "A generic physics-informed neural network-based framework for reliability assessment of multi-state systems" (Zhou et al., 2021), PCGrad is used with PINNs in which the number of loss terms is tied to $M$, the number of ODE residuals. In large-scale multi-task vision or RL (e.g., MT10, MT50), the overhead is kept manageable by batching and vectorized dot-product computation (Yu et al., 2020, Bohn et al., 2024).
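To give a rough sense of how the projection cost scales, the sketch below counts conflict checks under the assumption that every task is compared once with every other task in each step. The function name is illustrative; each check is one inner product whose cost grows linearly with the number of shared parameters.

```python
def worst_case_conflict_checks(num_tasks):
    """Number of ordered task pairs (i, j) with i != j visited in one step."""
    return num_tasks * (num_tasks - 1)

for k in (2, 3, 10, 50):
    print(k, worst_case_conflict_checks(k))
# 2 2
# 3 6
# 10 90
# 50 2450
```

For a two- or three-term loss the extra work is a handful of inner products, whereas at 50 tasks it is a few thousand per step, which is where batching and vectorization start to matter.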

4. Empirical Performance and Applications

PCGrad yields consistent performance improvements in supervised learning, reinforcement learning, physics-informed learning, and LLM unlearning:

Supervised Learning: On multi-task CIFAR-100 (20 tasks), routing networks combined with PCGrad reach 77.5% average test accuracy, compared with 67.7% for single-task training (Yu et al., 2020). On NYUv2 (3 tasks), it improves mean IoU and pixel accuracy, achieving best-in-class metrics for multi-task vision backbones.
Reinforcement Learning: In the Meta-World MT10/MT50 suite, SAC with PCGrad achieves 100%/70% multi-task success rates with significantly fewer samples than independent training (Yu et al., 2020).
Physics-Informed Learning: For PINN-based reliability assessment, RMSE is reduced by up to 96.6% on a 12-state system when incorporating PCGrad, and convergence accelerates by an order of magnitude in iteration count (Zhou et al., 2021).
Unlearning in LLMs: In asymmetric two-task setups (retention vs. forgetting), module-wise PCGrad projections increase retention performance (e.g., MMLU recovery from 25.1% to 53.0%) at matched forgetting strength, shifting solutions toward the Pareto frontier (Xiao et al., 16 Apr 2026).

5. Extensions, Variants, and Theoretical Insights

PCGrad's pairwise projection mechanism can be generalized:

Weighted PCGrad (wPCGrad): Task projection order is made probabilistically dependent on task priority or loss, allowing adaptive focus on underperforming or high-loss tasks (Bohn et al., 2024). This yields further performance gains on datasets such as nuScenes, CIFAR-100, and CelebA.
Module-Wise and Layer-Wise PCGrad: Fine-grained projection is applied at the module or layer level (e.g., for LLM unlearning), improving granularity and empirical retention (Xiao et al., 16 Apr 2026).
Algorithmic Hybrids: PCGrad can be combined with dynamic weighting schemes such as GradNorm or incorporated alongside global cone-based constraints (ConicGrad) and higher-order subspace projections (GradOPS) to navigate multi-objective trade-offs (Hassanpour et al., 31 Jan 2025, Zhu et al., 5 Mar 2025).
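The difference between whole-model and module-wise projection can be illustrated with a toy example. This is a sketch under stated assumptions, not the procedure of any cited paper: gradients are stored as dictionaries from module name to a flat list of floats, the module names `attn` and `mlp` are placeholders, and the choice of which task is projected against which (`primary` against `reference`) is an assumption made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_if_conflicting(g_i, g_j):
    """Remove from g_i its component along g_j when the two conflict."""
    inner = dot(g_i, g_j)
    if inner >= 0.0:
        return list(g_i)
    scale = inner / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]

def modulewise_project(primary, reference):
    """Apply the pairwise projection separately inside each module."""
    return {name: project_if_conflicting(grad, reference[name])
            for name, grad in primary.items()}

def flatten(grads):
    """Concatenate per-module gradients in a fixed (sorted) module order."""
    return [x for name in sorted(grads) for x in grads[name]]

# Toy per-module gradients (assumed values).
primary = {"attn": [1.0, 0.0], "mlp": [1.0, 1.0]}
reference = {"attn": [-1.0, 1.0], "mlp": [1.0, 0.0]}

per_module = modulewise_project(primary, reference)
whole_model = project_if_conflicting(flatten(primary), flatten(reference))

print(per_module)   # conflict inside "attn" is resolved, "mlp" is untouched
print(whole_model)  # global inner product is 0, so nothing is projected
```

In this example the conflict in one module is cancelled by agreement in the other when the gradients are concatenated, so only the module-wise variant detects and removes it.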

Theoretical results show that, for two tasks, the aggregated gradient after projection remains a descent direction for the combined loss. For two-task convex problems with Lipschitz-continuous gradients and a sufficiently small step size, PCGrad converges either to an optimum or to a point where the two task gradients are exactly opposed (cosine similarity of $-1$). In the nonconvex regime, removing only the destructive components prevents regressive interference, stabilizes joint descent, and empirically supports faster convergence (Yu et al., 2020).
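The two-task descent property can be checked by a short calculation, given here as an illustrative derivation rather than a restatement of the cited proofs. Assume $g_1 \cdot g_2 < 0$ and write $g_1^{\mathrm{PC}}$, $g_2^{\mathrm{PC}}$ for the projected gradients (notation introduced for this sketch):

$g_1^{\mathrm{PC}} = g_1 - \frac{g_1 \cdot g_2}{\|g_2\|^2} g_2, \qquad g_2^{\mathrm{PC}} = g_2 - \frac{g_1 \cdot g_2}{\|g_1\|^2} g_1.$

By construction $g_1^{\mathrm{PC}} \cdot g_2 = 0$ and $g_2^{\mathrm{PC}} \cdot g_1 = 0$, so the inner product of the aggregated update with the plain gradient sum is

$(g_1^{\mathrm{PC}} + g_2^{\mathrm{PC}}) \cdot (g_1 + g_2) = g_1^{\mathrm{PC}} \cdot g_1 + g_2^{\mathrm{PC}} \cdot g_2 = \left(\|g_1\|^2 + \|g_2\|^2\right)\left(1 - \omega(g_1, g_2)^2\right) \geq 0.$

The expression vanishes only when $\omega(g_1, g_2) = -1$, that is, when the two gradients are exactly opposed, which is the degenerate case singled out in the convergence statement.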

6. Limitations and Potential Directions

PCGrad treats all tasks equally: projections are performed only when direct conflict is present, without explicit re-weighting of loss scales. With very large $K$, the computational cost grows, which motivates sampling or layer-wise approximations (Zhou et al., 2021, Yu et al., 2020). PCGrad does not enforce strong non-conflict among all gradients as GradOPS does, nor does it solve a global max–min problem as ConicGrad does, so PCGrad solutions can remain suboptimal with respect to some Pareto objectives (Hassanpour et al., 31 Jan 2025, Zhu et al., 5 Mar 2025). Further extensions combine projection-based conflict resolution with adaptive weighting, meta-learned task prioritization, or global geometric constraints.

Method	Principle	Computational complexity per step	Trade-off control
PCGrad	Pairwise conflict projection	$O(K^2 d)$	Implicit; projection only
wPCGrad	Weighted conflict projection	$O(K^2 d)$	Adaptive anchor selection
GradOPS	Subspace orthogonal projection	[not recovered]	Trade-off hyperparameter
ConicGrad	Cone-constrained max–min solution	[not recovered] (via SMW)	Cone width parameter

Here, $K$ is the number of tasks and $d$ the parameter count. PCGrad provides a practical, model-agnostic, hyperparameter-free deconfliction strategy effective for a broad range of multi-objective learning problems, with theory and empirical results supporting improvements in accuracy, convergence, and optimization stability (Yu et al., 2020, Zhou et al., 2021, Bohn et al., 2024, Hassanpour et al., 31 Jan 2025, Zhu et al., 5 Mar 2025, Xiao et al., 16 Apr 2026).