CAGrad: Conflict-Averse Gradient Descent

Updated 11 May 2026

CAGrad is a convex-optimization approach that systematically combines gradients from ensembles or tasks to avoid destructive interference.
It computes an update direction by maximizing the minimum task improvement while constraining deviation from the average gradient.
Empirical results demonstrate robust convergence in model-based optimization and multi-task learning, with tunable trade-offs via hyperparameter c.

Conflict-Averse Gradient Descent (CAGrad) is a convex-optimization–based approach for combining multiple gradient signals—typically from either model ensembles in offline model-based optimization (MBO) or multiple tasks in multi-task learning—in a manner that systematically avoids destructive interference between objectives while maintaining convergence guarantees to the average objective. The method provides a principled mechanism to interpolate between plain averaging of gradients and Pareto-optimal multi-objective updates, governed by a single hyperparameter.

1. Formulation and Objective

The CAGrad update is defined, for either an ensemble of models or multiple loss functions, by constructing an update direction that maximizes the minimum improvement across tasks/models (worst-case linear improvement), yet remains close to the average gradient. Denoting the ensemble members or tasks by $i = 1, \ldots, m$, and the individual gradients by $g_i$, the average gradient is $g_0 = \frac{1}{m} \sum_{i=1}^m g_i$. All update directions $d$ satisfy the proximity constraint $\|d - g_0\|_2 \le c \|g_0\|_2$, with $c \in [0,1)$ controlling the level of allowable deviation from the average.

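The primal optimization is: $\text{maximize}_{d \in \mathbb{R}^n,\, t \in \mathbb{R}} \;\; t \quad \text{subject to} \quad \langle d, g_i \rangle \ge t,\;\;\forall i;\;\; \|d - g_0\|_2 \leq c \|g_0\|_2,$ where $n$ is the dimension of the gradients (written $n$ here to avoid a clash with the direction $d$).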
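To make the two ingredients of the primal problem concrete, the following minimal sketch evaluates the worst-case linear improvement $t = \min_i \langle d, g_i \rangle$ of a candidate direction and checks the proximity constraint. The helper names and the two example gradients are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def mean_gradient(grads):
    """Average gradient g_0 of a list of equally sized gradient vectors."""
    m = len(grads)
    return [sum(g[k] for g in grads) / m for k in range(len(grads[0]))]

def primal_value(d, grads):
    """Worst-case linear improvement t = min_i <d, g_i> of a candidate d."""
    return min(dot(d, g) for g in grads)

def is_feasible(d, grads, c, tol=1e-9):
    """Check the proximity constraint ||d - g_0|| <= c * ||g_0||."""
    g0 = mean_gradient(grads)
    diff = [a - b for a, b in zip(d, g0)]
    return norm(diff) <= c * norm(g0) + tol

# Two conflicting gradients (hypothetical values): their inner product is negative.
grads = [[1.0, 0.0], [-0.5, 1.0]]
g0 = mean_gradient(grads)
print(primal_value(g0, grads))      # worst-case improvement of plain averaging
print(is_feasible(g0, grads, 0.5))  # the average itself is always feasible
```

Because $d = g_0$ is always feasible, the optimal value of the primal problem can never be worse than the worst-case improvement of plain averaging.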

Strong duality yields an equivalent dual problem in the simplex-parameterized weight vector $w \in \Delta^m$: $\text{minimize}_{w \succeq 0,\, \sum_i w_i=1} \;\; g_w^\top g_0 + \sqrt{\phi}\,\|g_w\|_2,$ where $g_w = \sum_{i=1}^m w_i g_i$ and $\phi = c^2 \|g_0\|_2^2$. Once the minimizer $w^*$ is found, the update is $d^* = g_0 + \frac{\sqrt{\phi}}{\|g_{w^*}\|_2}\, g_{w^*}$, or equivalently $d^* = g_0 + c\,\|g_0\|_2\, \frac{g_{w^*}}{\|g_{w^*}\|_2}$ (Kolli, 2023, Liu et al., 2021).
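For two gradients the dual reduces to a one-dimensional convex problem in $w = (a, 1-a)$, which the sketch below solves by ternary search before recovering the update direction. The function name `cagrad_two` and the choice of ternary search are assumptions made for illustration, not the solver used in the cited papers.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cagrad_two(g1, g2, c, iters=200):
    """CAGrad direction for two gradients; returns (d, (w1, w2))."""
    g0 = [(a + b) / 2 for a, b in zip(g1, g2)]
    sqrt_phi = c * math.sqrt(dot(g0, g0))   # sqrt(phi) = c * ||g_0||

    def g_w(a):
        return [a * x + (1 - a) * y for x, y in zip(g1, g2)]

    def dual(a):
        gw = g_w(a)
        return dot(gw, g0) + sqrt_phi * math.sqrt(dot(gw, gw))

    # The dual objective is convex in a, so ternary search on [0, 1] applies.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if dual(m1) < dual(m2):
            hi = m2
        else:
            lo = m1
    a = (lo + hi) / 2
    gw = g_w(a)
    gw_norm = math.sqrt(dot(gw, gw))
    if gw_norm < 1e-12:          # degenerate case: fall back to the average
        return g0, (a, 1 - a)
    d = [x + sqrt_phi * y / gw_norm for x, y in zip(g0, gw)]
    return d, (a, 1 - a)

g1, g2 = [1.0, 0.0], [-0.5, 1.0]   # hypothetical conflicting gradients
d, w = cagrad_two(g1, g2, c=0.5)
print(d, w, min(dot(d, g1), dot(d, g2)))
```

By construction the recovered direction sits exactly on the boundary of the proximity ball, and its worst-case inner product is at least that of the plain average.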

2. Theoretical Guarantees and Special Cases

CAGrad preserves convergence to stationary points of the average objective (the mean ensemble prediction in MBO, or the mean task loss in multi-task learning) for any $c \in [0,1)$. Formally, under standard Lipschitz-gradient (smoothness) assumptions, iterates converge to points where $g_0 = 0$, and the bound on the accumulated squared norm of the average gradient is inversely proportional to $1 - c^2$ (Kolli, 2023, Liu et al., 2021).
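A short worked inequality shows why $c < 1$ matters. It uses only the proximity constraint $\|d - g_0\|_2 \le c \|g_0\|_2$ and the Cauchy-Schwarz inequality, and is offered as a supporting step rather than as the proof given in the cited papers:

```latex
\langle d, g_0 \rangle
  = \|g_0\|_2^2 + \langle d - g_0,\; g_0 \rangle
  \;\ge\; \|g_0\|_2^2 - \|d - g_0\|_2 \,\|g_0\|_2
  \;\ge\; (1 - c)\,\|g_0\|_2^2 .
```

Hence every feasible direction has a strictly positive inner product with the average gradient whenever $c < 1$ and $g_0 \neq 0$, so a sufficiently small step along it still improves the average objective to first order.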

Adjusting $c$ allows seamless interpolation between:

$c = 0$: $d = g_0$, the mean gradient (gradient descent/ascent)
$c \to 1$: updates approach maximum robustness to conflicting gradients, resembling the Multiple Gradient Descent Algorithm (MGDA)
$c \to \infty$ (theoretical): unconstrained Pareto-optimal directions

Thus, CAGrad encompasses plain GD and MGDA as limits (Liu et al., 2021).

3. Algorithm and Implementation

A single CAGrad iteration involves:

Computing the $m$ individual gradients $g_1, \ldots, g_m$
Calculating $g_0$ and $\phi = c^2 \|g_0\|_2^2$
Solving the dual problem in $w$ via standard constrained optimization routines (since $m$ is typically small)
Recovering $d^*$ from $w^*$
Updating $x$ (for MBO) or $\theta$ (for MTL) via $x \leftarrow x + \eta d^*$ or $\theta \leftarrow \theta - \eta d^*$, where $\eta$ is the step size
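The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the dual is solved approximately with a plain Frank-Wolfe loop on the Gram matrix (a stand-in for "standard constrained optimization routines"), the function names are hypothetical, and the toy quadratic losses $L_i(\theta) = \tfrac{1}{2}\|\theta - a_i\|^2$ are invented for the demonstration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_dual(G, c, steps=200):
    """Approximately minimize <g_w, g_0> + sqrt(phi) * ||g_w|| over the simplex.

    G is the Gram matrix G[i][j] = <g_i, g_j>. Frank-Wolfe is an assumed,
    deliberately simple solver choice.
    """
    m = len(G)
    b = [sum(row) / m for row in G]            # b_i = <g_i, g_0>
    g0_sq = sum(b) / m                         # ||g_0||^2
    sqrt_phi = c * math.sqrt(max(g0_sq, 0.0))
    w = [1.0 / m] * m
    for k in range(steps):
        Gw = [dot(row, w) for row in G]        # (G w)_i = <g_i, g_w>
        gw_norm = math.sqrt(max(dot(w, Gw), 1e-18))
        grad = [b[i] + sqrt_phi * Gw[i] / gw_norm for i in range(m)]
        j = min(range(m), key=lambda i: grad[i])   # best simplex vertex
        gamma = 2.0 / (k + 2.0)
        w = [(1 - gamma) * wi for wi in w]
        w[j] += gamma
    return w

def cagrad_direction(grads, c, steps=200):
    """Steps 2-4: average gradient, dual solve, recover d from w."""
    m, dim = len(grads), len(grads[0])
    G = [[dot(gi, gj) for gj in grads] for gi in grads]
    w = solve_dual(G, c, steps)
    g0 = [sum(g[k] for g in grads) / m for k in range(dim)]
    gw = [sum(w[i] * grads[i][k] for i in range(m)) for k in range(dim)]
    gw_norm = math.sqrt(dot(gw, gw))
    if gw_norm < 1e-12:                        # degenerate case: use the average
        return g0
    scale = c * math.sqrt(dot(g0, g0)) / gw_norm
    return [a + scale * b for a, b in zip(g0, gw)]

# Toy multi-task problem (hypothetical): L_i(theta) = 0.5 * ||theta - a_i||^2.
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
theta = [3.0, -2.0]
eta, c = 0.1, 0.5
for _ in range(300):
    grads = [[t - a for t, a in zip(theta, tgt)] for tgt in targets]  # step 1
    d = cagrad_direction(grads, c)                                    # steps 2-4
    theta = [t - eta * di for t, di in zip(theta, d)]                 # step 5 (MTL)
print(theta)  # approaches the minimizer of the average loss, the mean of the targets
```

Working with the $m \times m$ Gram matrix keeps the dual solve independent of the parameter dimension; for MBO the last line of the loop would instead add $\eta d$ to the design variable.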

This routine is computationally efficient, with per-iteration overhead governed by the small dual variable dimension $m$ rather than by the dimension of the gradients (Kolli, 2023).

Typical settings fix $c$ at a value inside $[0,1)$; step sizes $\eta$ are task-specific and must be small enough to keep the local first-order (linear) approximation of each objective valid (Kolli, 2023). For continuous domains, input variables should be normalized. In discrete domains, CAGrad is applied in “soft” one-hot spaces and discretized via coordinate-wise argmax.
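As an illustration of the discretization step, the sketch below converts a relaxed ("soft") one-hot design into a hard one by taking the argmax at each position. Reading "coordinate-wise" as "per sequence position" is an assumption, as are the function name and the example scores.

```python
def discretize(soft):
    """Map a soft one-hot design to a hard one.

    soft: list of positions, each a list of scores over the vocabulary.
    Returns a list of one-hot rows with a 1 at each position's argmax.
    """
    hard = []
    for scores in soft:
        j = max(range(len(scores)), key=lambda k: scores[k])
        hard.append([1 if k == j else 0 for k in range(len(scores))])
    return hard

# Hypothetical relaxed design with two positions and a vocabulary of size three.
soft_design = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(discretize(soft_design))
```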

4. Context within Ensemble and Multi-Objective Optimization

CAGrad belongs to a family of gradient aggregation strategies for ensembles and multi-objective optimization. Key alternatives include:

Scheme	Update Direction
Mean gradient	$d = g_0 = \frac{1}{m} \sum_{i=1}^m g_i$
Minimum gradient	Gradient of the ensemble member with the lowest (most pessimistic) prediction
MGDA	$d = g_{w^*}$ with $w^* = \arg\min_{w \in \Delta^m} \|g_w\|_2^2$
CAGrad	As described by the constrained max-min

The mean gradient is vulnerable to pathologies caused by gradient conflict, leading to oscillatory or uninformative steps. The minimum-gradient method is highly conservative—reliably reducing the worst-case but prone to instability and slow progress. MGDA achieves Pareto-stationarity but lacks explicit control over deviation from the average objective. CAGrad maintains a tunable tether to the average gradient while enforcing worst-case improvement, synthesizing robust trade-offs (Kolli, 2023).
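To make the contrast concrete, the sketch below computes the mean-gradient direction and the MGDA direction for two gradients, using the well-known closed form for the minimum-norm point on the segment between them. The function names and example gradients are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_direction(g1, g2):
    """Plain averaging of two gradients."""
    return [(a + b) / 2 for a, b in zip(g1, g2)]

def mgda_two(g1, g2):
    """Minimum-norm convex combination a*g1 + (1-a)*g2 (two-gradient MGDA)."""
    diff = [b - a for a, b in zip(g1, g2)]      # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:                            # identical gradients
        return list(g1)
    a = dot(g2, diff) / denom                   # unconstrained minimizer
    a = min(1.0, max(0.0, a))                   # clip to the simplex
    return [a * x + (1 - a) * y for x, y in zip(g1, g2)]

g1, g2 = [1.0, 0.0], [-0.5, 1.0]                # hypothetical conflicting gradients
d_mean = mean_direction(g1, g2)
d_mgda = mgda_two(g1, g2)
print(dot(d_mean, g1), dot(d_mean, g2))         # unequal improvements
print(dot(d_mgda, g1), dot(d_mgda, g2))         # balanced improvements
```

In this example the MGDA direction improves both objectives equally but is not tied to the average gradient, which is the gap the CAGrad constraint is designed to close.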

5. Empirical Evaluation and Applications

CAGrad has been evaluated across diverse domains:

Offline Model-Based Optimization: On five black-box design tasks with ensembles of proxy models, CAGrad outperformed mean and minimum aggregation in both average and median ground-truth scores, matching MGDA on maxima, while exhibiting greater stability on discrete search spaces and faster convergence on average objectives (Kolli, 2023).
Multi-Task Learning: On synthetic, supervised vision, reinforcement learning, and semi-supervised benchmarks, CAGrad demonstrated robust balancing of average loss and worst-case improvement. In particular, it yielded minimal performance drops on tasks typically neglected by alternative methods, and achieved state-of-the-art results, e.g., a success rate of roughly 83% on Meta-World MT10 multi-task RL (vs. 72% for PCGrad), and improved semi-supervised accuracy by roughly 1–2% over strong baselines (Liu et al., 2021).

A plausible implication is that CAGrad is particularly effective in settings with significant gradient conflict across objectives, and where robust average performance is critical.

6. Hyperparameter Selection and Practical Considerations

The conflict-aversion hyperparameter $c$ occupies the range $[0,1)$ and modulates the average/worst-case trade-off. Lower values prioritize fidelity to the average, whereas larger (but $< 1$) values allow more assertive corrections for conflict. Theory requires $c < 1$ for convergence to the average optimum; typical empirical values lie strictly between 0 and 1 (Liu et al., 2021, Kolli, 2023). Step size selection parallels standard practice in gradient methods, with specific adjustment to ensure local approximation validity.

In implementation, the overhead of the dual solve is negligible for small $m$. For very large ensembles or task counts, sub-sampling or approximate dual solvers may provide substantial speed-ups with modest performance cost (Liu et al., 2021).
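One simple reading of sub-sampling is to draw a random subset of the gradients at each iteration and solve the dual only over that subset. The sketch below shows just the sampling step; the function name, the uniform sampling scheme, and the example values are assumptions, and the source does not specify whether $g_0$ should be computed from all gradients or from the sample.

```python
import random

def subsample_gradients(grads, k, seed=None):
    """Return k of the m gradients, chosen uniformly without replacement."""
    rng = random.Random(seed)
    if k >= len(grads):
        return list(grads)
    idx = sorted(rng.sample(range(len(grads)), k))
    return [grads[i] for i in idx]

# Hypothetical ensemble of ten two-dimensional gradients.
all_grads = [[float(i), float(-i)] for i in range(10)]
subset = subsample_gradients(all_grads, k=4, seed=0)
print(len(subset))
```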

7. Relationship to Broader Research and Extensions

CAGrad’s design reflects the broader multi-objective optimization literature, unifying convergence to average-objective optima (as in standard GD) and Pareto-stationarity (as in MGDA) within a single convex-analytic framework. It advances beyond heuristic gradient conflict mitigation strategies by providing theoretical guarantees and a tunable, interpretable mechanism for managing gradient disagreement.

Both in ensemble MBO and multi-task learning, CAGrad operationalizes a practical balance between optimism and pessimism in the face of model or task uncertainty, leading to empirically strong and theoretically justified performance envelopes (Kolli, 2023, Liu et al., 2021).


References

Kolli (2023). Conflict-Averse Gradient Optimization of Ensembles for Effective Offline Model-Based Optimization.

Liu et al. (2021). Conflict-Averse Gradient Descent for Multi-task Learning.
