CAGrad: Conflict-Averse Gradient Descent
- CAGrad is a convex-optimization approach that systematically combines gradients from ensembles or tasks to avoid destructive interference.
- It computes an update direction by maximizing the minimum task improvement while constraining deviation from the average gradient.
- Empirical results demonstrate robust convergence in model-based optimization and multi-task learning, with tunable trade-offs via hyperparameter c.
Conflict-Averse Gradient Descent (CAGrad) is a convex-optimization–based approach for combining multiple gradient signals—typically from either model ensembles in offline model-based optimization (MBO) or multiple tasks in multi-task learning—in a manner that systematically avoids destructive interference between objectives while maintaining convergence guarantees to the average objective. The method provides a principled mechanism to interpolate between plain averaging of gradients and Pareto-optimal multi-objective updates, governed by a single hyperparameter.
1. Formulation and Objective
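The CAGrad update is defined, for either an ensemble of models or multiple loss functions, by constructing an update direction that maximizes the minimum improvement across tasks/models (the worst-case linear improvement) while remaining close to the average gradient. Denote the ensemble members or tasks by $i \in \{1, \dots, K\}$ and the individual gradients by $g_1, \dots, g_K$; the average gradient is $g_0 = \frac{1}{K}\sum_{i=1}^{K} g_i$. All update directions $d$ satisfy the proximity constraint $\|d - g_0\| \le c\,\|g_0\|$, with $c \ge 0$ controlling the level of allowable deviation from the average.
The primal optimization is:
$$\max_{d} \; \min_{i \in \{1, \dots, K\}} \langle g_i, d \rangle \quad \text{subject to} \quad \|d - g_0\| \le c\,\|g_0\|.$$
Strong duality yields an equivalent dual problem in the simplex-parameterized weight vector $w \in \mathcal{W} = \{w : w_i \ge 0,\ \sum_i w_i = 1\}$:
$$\min_{w \in \mathcal{W}} \; g_w^\top g_0 + \sqrt{\phi}\,\|g_w\|,$$
where $g_w = \sum_{i} w_i g_i$ and $\phi = c^2 \|g_0\|^2$. Once $w^*$ is found, the update is $d^* = g_0 + \frac{\sqrt{\phi}}{\|g_{w^*}\|}\, g_{w^*}$, or equivalently $d^* = g_0 + c\,\|g_0\|\, \frac{g_{w^*}}{\|g_{w^*}\|}$ (Kolli, 2023, Liu et al., 2021).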
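As a rough illustration of the dual formulation, the sketch below computes a CAGrad direction for a small stack of gradients. The function name `cagrad_direction`, the Frank-Wolfe solver, and the iteration count are assumptions made for this example; the sources only say that a standard constrained optimization routine is used, so this is one possible stand-in rather than the reference implementation.

```python
import numpy as np

def cagrad_direction(grads, c=0.5, iters=1000, eps=1e-12):
    """Approximate CAGrad direction for gradients stacked as rows of `grads`.

    Hypothetical helper: the dual over the simplex is solved with a plain
    Frank-Wolfe loop, which is an assumption of this sketch.
    """
    G = np.asarray(grads, dtype=float)        # shape (K, m)
    K = G.shape[0]
    g0 = G.mean(axis=0)                       # average gradient
    sqrt_phi = c * np.linalg.norm(g0)         # sqrt(phi) = c * ||g0||
    gram = G @ G.T                            # <g_i, g_j>
    b = G @ g0                                # <g_i, g0>
    w = np.full(K, 1.0 / K)                   # start from uniform weights
    for t in range(iters):
        gw_norm = np.sqrt(max(w @ gram @ w, eps))
        # Gradient of the dual objective  w.b + sqrt(phi) * ||g_w||  w.r.t. w
        grad_w = b + sqrt_phi * (gram @ w) / gw_norm
        vertex = np.zeros(K)
        vertex[np.argmin(grad_w)] = 1.0       # best simplex vertex
        gamma = 2.0 / (t + 2.0)               # standard Frank-Wolfe step
        w = (1.0 - gamma) * w + gamma * vertex
    g_w = w @ G
    d = g0 + sqrt_phi * g_w / max(np.linalg.norm(g_w), eps)
    return d, w

# Two partially conflicting gradients (illustrative values only).
G = np.array([[1.0, 0.2],
              [-0.5, 1.0]])
g0 = G.mean(axis=0)
d, w = cagrad_direction(G, c=0.5)
d_mean, _ = cagrad_direction(G, c=0.0)        # c = 0 falls back to g0
```

By construction the returned direction sits on the boundary of the proximity ball, so `np.linalg.norm(d - g0)` equals `c * np.linalg.norm(g0)` up to floating-point error, regardless of how accurately the dual weights were found.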
2. Theoretical Guarantees and Special Cases
CAGrad preserves convergence to stationary points of the average objective (the ensemble-mean objective in MBO or the mean task loss in MTL) for any $0 \le c < 1$. Formally, under standard Lipschitz-gradient (smoothness) assumptions, the iterates converge to a stationary point of that average objective, with an explicit rate for the average squared gradient norm (Kolli, 2023, Liu et al., 2021).
Adjusting $c$ allows seamless interpolation between:
- $c = 0$: $d^* = g_0$, the mean gradient (plain gradient descent/ascent)
- $c$ near the upper end of its range: updates approach maximum robustness to conflicting gradients, resembling the Multiple Gradient Descent Algorithm (MGDA)
- $c \to \infty$ (theoretical): unconstrained Pareto-optimal directions
Thus, CAGrad encompasses plain GD and MGDA as limits (Liu et al., 2021).
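The two limits can be read off the dual form. The short derivation below is a sketch in the notation $g_0$ (mean gradient), $g_w = \sum_i w_i g_i$, and $\phi = c^2\|g_0\|^2$; the large-$c$ step is a heuristic limit argument rather than a formal proof.

```latex
\begin{aligned}
c = 0:\quad
  & \phi = 0 \;\Rightarrow\;
    d^* = g_0 + \frac{\sqrt{\phi}}{\|g_{w^*}\|}\, g_{w^*} = g_0, \\[4pt]
c \to \infty:\quad
  & \frac{1}{\sqrt{\phi}}\Big( g_w^\top g_0 + \sqrt{\phi}\,\|g_w\| \Big)
    = \frac{g_w^\top g_0}{c\,\|g_0\|} + \|g_w\|
    \;\longrightarrow\; \|g_w\| .
\end{aligned}
```

Since $|g_w^\top g_0| \le \|g_w\|\,\|g_0\|$, the first term is at most $\|g_w\|/c$ and vanishes as $c$ grows, so the dual weights approach the minimum-norm weights $\arg\min_{w \in \mathcal{W}} \|g_w\|$ that define the MGDA direction.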
3. Algorithm and Implementation
A single CAGrad iteration involves:
- Computing the $K$ individual gradients $g_1, \dots, g_K$
- Calculating $g_0$ and $\phi = c^2\|g_0\|^2$
- Solving the dual quadratic program in $w$ via standard constrained optimization routines (since $K$ is typically small)
- Recovering $d^*$ from $w^*$
- Updating the design $x$ (for MBO) or the parameters $\theta$ (for MTL) via $x \leftarrow x + \eta\, d^*$ or $\theta \leftarrow \theta - \eta\, d^*$
This routine is computationally efficient: the per-iteration overhead is governed by the small dimension $K$ of the dual variable (Kolli, 2023).
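The steps above can be sketched end to end on a toy two-task problem. Everything here is illustrative: the helper `cagrad_two_task`, the grid search over the single dual weight (a simple substitute for a constrained solver when there are only two tasks), the quadratic losses, and the step size are assumptions for the example, not settings from the cited papers.

```python
import numpy as np

def cagrad_two_task(g1, g2, c=0.5, grid=201, eps=1e-12):
    """CAGrad direction for two gradients via a grid search over w = (a, 1 - a).

    Hypothetical helper for illustration; a real implementation would
    normally call a constrained optimizer instead of a grid.
    """
    g0 = 0.5 * (g1 + g2)                              # average gradient
    sqrt_phi = c * np.linalg.norm(g0)                 # sqrt(phi) = c * ||g0||
    alphas = np.linspace(0.0, 1.0, grid)
    gw = alphas[:, None] * g1 + (1.0 - alphas)[:, None] * g2   # (grid, m)
    dual = gw @ g0 + sqrt_phi * np.linalg.norm(gw, axis=1)     # dual objective
    best = gw[np.argmin(dual)]                        # g_w at the best weight
    return g0 + sqrt_phi * best / max(np.linalg.norm(best), eps)

# Toy multi-task setup: L1 = ||theta - a||^2 and L2 = ||theta - b||^2.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
theta = np.array([3.0, -2.0])
eta = 0.1                                             # assumed step size

for _ in range(300):
    g1 = 2.0 * (theta - a)                            # gradient of L1
    g2 = 2.0 * (theta - b)                            # gradient of L2
    d = cagrad_two_task(g1, g2, c=0.5)                # recover d* from the dual
    theta = theta - eta * d                           # descent step (MTL convention)

midpoint = 0.5 * (a + b)                              # minimizer of the average loss
```

With $c < 1$ the iterate ends up at the minimizer of the average loss, here the midpoint of `a` and `b`, which matches the convergence behaviour described for the average objective.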
In practice $c$ is chosen within $[0, 1)$; step sizes $\eta$ are task-specific and must be small enough for the local linear approximation of the objective to remain valid (Kolli, 2023). For continuous domains, input variables should be normalized. In discrete domains, CAGrad is applied in "soft" one-hot spaces and discretized via coordinate-wise argmax.
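The two preprocessing conventions mentioned here can be illustrated briefly. The helper names `normalize` and `discretize` and the z-score form of normalization are assumptions for this sketch; the source does not specify which normalization is used.

```python
import numpy as np

def normalize(x, mean, std, eps=1e-8):
    """Z-score continuous inputs (one common choice, assumed here)."""
    return (x - mean) / (std + eps)

def discretize(soft_design):
    """Map a soft one-hot design of shape (positions, vocab) to hard tokens."""
    idx = np.argmax(soft_design, axis=-1)             # coordinate-wise argmax
    one_hot = np.eye(soft_design.shape[-1])[idx]      # back to hard one-hot rows
    return idx, one_hot

# A relaxed design with two positions over a three-symbol vocabulary.
soft = np.array([[0.1, 0.7, 0.2],
                 [0.6, 0.3, 0.1]])
tokens, one_hot = discretize(soft)
```

Gradient steps are taken on the relaxed `soft` array, and the argmax is applied only when a concrete discrete design is needed for evaluation.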
4. Context within Ensemble and Multi-Objective Optimization
CAGrad belongs to a family of gradient aggregation strategies for ensembles and multi-objective optimization. Key alternatives include:
| Scheme | Update Direction |
|---|---|
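| Mean gradient | $g_0 = \frac{1}{K}\sum_{i} g_i$ |
| Minimum gradient | $g_{i^\star}$, the gradient of the current worst-case member |
| MGDA | $g_{w}$ with $w = \arg\min_{w \in \mathcal{W}} \lVert g_w \rVert$ (minimum-norm convex combination) |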
| CAGrad | As described by the constrained max-min |
The mean gradient is vulnerable to pathologies caused by gradient conflict, leading to oscillatory or uninformative steps. The minimum-gradient method is highly conservative—reliably reducing the worst-case but prone to instability and slow progress. MGDA achieves Pareto-stationarity but lacks explicit control over deviation from the average objective. CAGrad maintains a tunable tether to the average gradient while enforcing worst-case improvement, synthesizing robust trade-offs (Kolli, 2023).
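To make the gradient-conflict pathology concrete, the sketch below compares the mean gradient with the two-task MGDA direction on a deliberately conflicting pair. The closed-form two-task min-norm solution is a standard result; the function names and the example vectors are illustrative assumptions.

```python
import numpy as np

def mean_gradient(g1, g2):
    """Plain average of the two gradients."""
    return 0.5 * (g1 + g2)

def mgda_two_task(g1, g2):
    """Minimum-norm point on the segment between g1 and g2 (two-task MGDA)."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:                                  # identical gradients
        return g1.copy()
    alpha = np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0)
    return alpha * g1 + (1.0 - alpha) * g2

# A conflicting pair: the second gradient is larger and points mostly opposite.
g1 = np.array([1.0, 0.0])
g2 = np.array([-3.0, 1.0])
G = np.stack([g1, g2])

avg = mean_gradient(g1, g2)
mgda = mgda_two_task(g1, g2)

improve_avg = G @ avg      # per-task linear improvement along the mean
improve_mgda = G @ mgda    # per-task linear improvement along MGDA
```

Here the mean direction has a negative inner product with `g1`, so following it makes the first objective worse to first order, while the MGDA direction has a nonnegative inner product with both gradients but is much shorter. CAGrad's constrained max-min sits between these two behaviours.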
5. Empirical Evaluation and Applications
CAGrad has been evaluated across diverse domains:
- Offline Model-Based Optimization: On five black-box design tasks with ensembles of proxy models, CAGrad outperformed mean and minimum aggregation in both average and median ground-truth scores, matching MGDA on maxima, while exhibiting greater stability on discrete search spaces and faster convergence on average objectives (Kolli, 2023).
- Multi-Task Learning: On synthetic, supervised vision, reinforcement learning, and semi-supervised benchmarks, CAGrad demonstrated robust balancing of average loss and worst-case improvement. In particular, it yielded minimal performance drops on tasks typically neglected by alternative methods and achieved state-of-the-art results, e.g., about 83% success rate on Meta-World MT10 multi-task RL (vs. 72% for PCGrad), and improved semi-supervised accuracy by roughly 1–2% over strong baselines (Liu et al., 2021).
A plausible implication is that CAGrad is particularly effective in settings with significant gradient conflict across objectives, and where robust average performance is critical.
6. Hyperparameter Selection and Practical Considerations
The conflict-aversion hyperparameter $c$ lies in $[0, 1)$ and modulates the average/worst-case trade-off. Lower values prioritize fidelity to the average, whereas larger values (still below 1) allow more assertive corrections for conflict. Theory requires $c < 1$ for convergence to the average optimum, and reported empirical settings fall inside this range (Liu et al., 2021, Kolli, 2023). Step size selection follows standard practice for gradient methods, with the step kept small enough for the local linear approximation to hold.
In implementation, the dual QP overhead is negligible for small $K$. For very large ensembles or task counts, sub-sampling or approximate dual solvers may provide substantial speed-ups with modest performance cost (Liu et al., 2021).
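One simple reading of the sub-sampling idea is to solve the dual over a random subset of the gradients at each step. The helper `subsample_gradients`, its parameters, and the uniform sampling scheme below are assumptions for illustration; the cited work may use a different procedure.

```python
import numpy as np

def subsample_gradients(grads, k, seed=0):
    """Pick k of the K gradient rows uniformly at random, without replacement.

    Hypothetical helper: the dual would then be solved over the k selected
    rows only, shrinking the dual variable from K to k dimensions.
    """
    G = np.asarray(grads, dtype=float)
    rng = np.random.default_rng(seed)
    size = min(k, G.shape[0])
    idx = rng.choice(G.shape[0], size=size, replace=False)
    return G[idx], idx

# Twenty placeholder gradients of dimension two (illustrative values only).
G = np.arange(40, dtype=float).reshape(20, 2)
G_sub, idx = subsample_gradients(G, k=5)
```

Whether the average gradient is then taken over all $K$ members or only over the subset is a design choice this sketch leaves open.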
7. Relationship to Broader Research and Extensions
CAGrad’s design reflects the broader multi-objective optimization literature, unifying convergence to average-objective optima (as in standard GD) and Pareto-stationarity (as in MGDA) within a single convex-analytic framework. It advances beyond heuristic gradient conflict mitigation strategies by providing theoretical guarantees and a tunable, interpretable mechanism for managing gradient disagreement.
Both in ensemble MBO and multi-task learning, CAGrad operationalizes a practical balance between optimism and pessimism in the face of model or task uncertainty, leading to empirically strong and theoretically justified performance envelopes (Kolli, 2023, Liu et al., 2021).