
RACO: Reward-Free Alignment for Conflicted Objectives

Updated 3 February 2026
  • The paper introduces RACO, a framework that fine-tunes models using direct preference data instead of explicit rewards for handling conflicting objectives.
  • It leverages conflict-averse gradient descent with per-coordinate clipping to resolve gradient conflicts and achieve efficient Pareto-critical updates.
  • The framework provides theoretical guarantees including sublinear regret and convergence, extending naturally to multi-group and constrained RL scenarios.

A Reward-free Alignment framework for Conflicted Objectives (RACO) provides a principled, reward-model-free approach to fine-tuning machine learning models—primarily LLMs—under multiple, potentially conflicting alignment objectives. RACO leverages direct preference data rather than explicit scalarized rewards, resolves gradient conflicts algorithmically, supports Pareto-optimal policy discovery, and extends naturally to multi-group and constrained RL settings. Its theoretical guarantees encompass sublinear regret and convergence to Pareto-critical points that respect user-specified objective weights. RACO builds on several advances including projection optimization, conflict-averse descent, direct preference optimization, and reward-free planning oracles (Chen et al., 2 Feb 2026, Xiong et al., 21 Feb 2025, Zhou et al., 2023, Miryoosefi et al., 2021).

1. Formalization: Objectives, Preference Losses, and Pareto Optimality

RACO addresses alignment in settings with $k$ conflicting objectives. Let $\theta \in \mathbb{R}^d$ denote model parameters. The $k$ preference-derived loss functions, $L_1(\theta),\ldots,L_k(\theta)$, are computed from pairwise human preference datasets for each axis (e.g., helpfulness, harmlessness, conciseness, factuality). The gradient of each objective is $g_i(\theta) = \nabla_\theta L_i(\theta)$.

A parameter point $\theta^*$ is Pareto-optimal if there does not exist a common descent direction $\mathbf{d}$ such that $\nabla L_i(\theta^*)^\top \mathbf{d} < 0$ for all $i$, i.e., no infinitesimal update improves all objectives simultaneously. User trade-off preferences are encoded as weights $w = (w_1,\ldots,w_k) \in \Delta_k$, with the anchor gradient $g_0(\theta) = \sum_{i=1}^k w_i g_i(\theta)$ corresponding to the weighted aggregate loss $L_w(\theta) = \sum_{i=1}^k w_i L_i(\theta)$. RACO seeks points that are Pareto-critical and aligned with these weights (Chen et al., 2 Feb 2026).
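As a concrete illustration, the anchor gradient and the first-order common-descent test above can be sketched in a few lines of NumPy; the two toy gradient vectors below are hypothetical, not taken from the paper:

```python
import numpy as np

def anchor_gradient(grads, w):
    """Anchor gradient g0 = sum_i w_i g_i for weights w on the simplex."""
    grads = np.asarray(grads)          # shape (k, d)
    w = np.asarray(w)
    assert np.isclose(w.sum(), 1.0) and np.all(w >= 0)
    return w @ grads                   # shape (d,)

def has_common_descent(grads, d):
    """True if direction d strictly decreases every objective to first order."""
    return bool(np.all(np.asarray(grads) @ d < 0))

# Two conflicting objectives (negatively aligned gradients), equal weights.
g = [np.array([1.0, 0.0]), np.array([-0.5, 1.0])]
g0 = anchor_gradient(g, [0.5, 0.5])    # -> [0.25, 0.5]
```

Here $-g_0$ happens to be a common descent direction, so this point is not yet Pareto-critical.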

2. Conflict-Averse Gradient Descent with Clipping

RACO's core update mechanism is a clipped variant of conflict-averse gradient descent (CAGrad). Two objectives are in conflict at $\theta$ if their gradients are negatively aligned, i.e., $\nabla L_i(\theta) \cdot \nabla L_j(\theta) < 0$. At each iteration $t$, CAGrad solves for a mixing vector $p^{(t)} \in \Delta_k$ that minimizes $G_p^{(t)\top} g_0^{(t)} + c \|g_0^{(t)}\| \|G_p^{(t)}\|$, where $G_p^{(t)} = \sum_{i=1}^k p_i g_i^{(t)}$ and $c \in [0,1)$ is the correction hyperparameter.

The descent direction without clipping is $-g_0^{(t)} - c \|g_0^{(t)}\| G_p^{(t)}/\|G_p^{(t)}\|$. To prevent excessive influence from low-weight objectives, RACO applies per-coordinate clipping, $\tilde{p}_i = \min(p_i, w_i)$, forming the clipped mixture $\hat G_p^{(t)} = \sum_{i=1}^k \tilde{p}_i g_i^{(t)}$. The final update direction is $G_0^{(t)} = g_0^{(t)} + c\|g_0^{(t)}\| \hat G_p^{(t)}/\|\hat G_p^{(t)}\|$ (when $\|\hat G_p^{(t)}\| > 0$), and the parameter update is $\theta_{t+1} = \theta_t - \eta G_0^{(t)}$ (Chen et al., 2 Feb 2026).

This approach ensures that each update direction accounts for gradient conflicts without violating user intent, and the clipped correction strictly improves per-step descent in two-objective scenarios.
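A minimal NumPy sketch of one clipped update for $k=2$ follows, using a 1-D grid search in place of a quadratic-program solver for the mixing vector; the gradients, weights, and value of $c$ are illustrative:

```python
import numpy as np

def cagrad_clip_step(grads, w, c=0.4, n_grid=1001):
    """One clipped conflict-averse update direction G0 for k=2 objectives.
    grads: (2, d) per-objective gradients; w: user weights on the simplex."""
    G = np.asarray(grads)
    w = np.asarray(w)
    g0 = w @ G                                   # anchor gradient
    # Solve min_p Gp^T g0 + c ||g0|| ||Gp|| over the 1-simplex by grid search.
    best_p, best_val = None, np.inf
    for p1 in np.linspace(0.0, 1.0, n_grid):
        p = np.array([p1, 1.0 - p1])
        Gp = p @ G
        val = Gp @ g0 + c * np.linalg.norm(g0) * np.linalg.norm(Gp)
        if val < best_val:
            best_p, best_val = p, val
    # Per-coordinate clipping bounds each objective's influence by its weight.
    p_clip = np.minimum(best_p, w)
    Gp_hat = p_clip @ G
    norm = np.linalg.norm(Gp_hat)
    if norm > 0:
        return g0 + c * np.linalg.norm(g0) * Gp_hat / norm
    return g0

G0 = cagrad_clip_step([np.array([1.0, 0.0]), np.array([-0.5, 1.0])],
                      w=[0.5, 0.5], c=0.4)
```

By Cauchy-Schwarz, $G_0^{(t)\top} g_0^{(t)} \geq (1-c)\|g_0^{(t)}\|^2 > 0$, so stepping along $-G_0^{(t)}$ always decreases the weighted loss $L_w$ to first order.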

3. Theoretical Guarantees and Convergence Properties

Under standard assumptions—$\ell_i$-Lipschitz gradients for each objective, stepsize $\eta \in (0, 1/\ell_w]$ with $\ell_w = \sum_i w_i \ell_i$, and $c \in [0,1)$—RACO iterates converge to Pareto-critical points. Specifically, every limit point $\theta^\star$ is both a stationary point of $L_w$ and Pareto-critical for $(L_1,\ldots,L_k)$. The sublinear convergence rate in terms of the squared gradient norm is

$$\min_{0 \leq t < T} \|\nabla L_w(\theta_t)\|^2 \leq \frac{2 L_w(\theta_0)}{\eta(1-c^2)\,T}.$$

A corresponding bound holds for the Pareto-criticality measure $M(\theta) = \min_{\lambda \in \Delta_k} \|\sum_i \lambda_i \nabla L_i(\theta)\|$, since $M(\theta) \leq \|\nabla L_w(\theta)\|$.
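The measure $M(\theta)$ and the bound $M(\theta) \leq \|\nabla L_w(\theta)\|$ can be checked numerically; the sketch below grid-searches the 1-simplex for $k=2$ (a small QP would be used for general $k$), on hypothetical gradients:

```python
import numpy as np

def pareto_criticality(grads, n_grid=2001):
    """M(theta) = min over lambda in the simplex of ||sum_i lambda_i g_i||,
    computed for k=2 by grid search over lambda_1 in [0, 1]."""
    g1, g2 = np.asarray(grads)
    lam = np.linspace(0.0, 1.0, n_grid)[:, None]
    mix = lam * g1 + (1.0 - lam) * g2          # all convex combinations
    return float(np.linalg.norm(mix, axis=1).min())

# Hypothetical conflicting gradients and uniform user weights w = (1/2, 1/2).
g = [np.array([1.0, 0.0]), np.array([-0.5, 1.0])]
M = pareto_criticality(g)
grad_Lw = np.array([0.5, 0.5]) @ np.asarray(g)  # anchor gradient
```

The bound holds by construction: $w$ itself lies in $\Delta_k$, so the minimum over $\lambda$ can never exceed the norm at $\lambda = w$.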

For $k=2$, explicit analysis shows clipped CAGrad strictly accelerates per-step descent whenever the gradients are not colinear and clipping is active (Chen et al., 2 Feb 2026).

4. Projection Optimization and Reward-Free Extensions

Alternative RACO instantiations leverage Blackwell approachability. The projection-based RACO method reformulates alignment as iteratively steering the vector of axis-wise utilities $S(\pi) = (f_1(\pi),\ldots,f_m(\pi))^\top$ into a convex target set $W_{p,c}^\alpha$, parameterized by the preference aggregation function $\Phi(S) = \left(\sum_i \alpha_i S_i^p\right)^{1/p}$ with $p \leq 1$, $\alpha \in \Delta_{m-1}$, and a threshold $c$.

At each round, RACO computes the projection direction $d^{t+1} = \operatorname{ProjDir}(\overline{V}^t; W)$, then solves the next sub-problem $\pi^{t+1} \leftarrow \arg\max_\pi \sum_{i=1}^m d_i^{t+1} f_i(\pi)$. The final policy is the mixture $\bar{\pi}_T = \frac{1}{T} \sum_{t=1}^T \pi^t$.
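The round structure above can be sketched on a toy problem: a bandit-style policy over a finite action set with linear utilities, and a simple box-shaped target set standing in for the paper's $\Phi$-induced set $W_{p,c}^\alpha$ (both the target shape and the utility matrix are assumptions for illustration):

```python
import numpy as np

def projdir(v_bar, target_low):
    """Unit direction from the running utility average toward the box
    W = {S : S >= target_low} (zero vector if already inside)."""
    d = np.maximum(target_low - v_bar, 0.0)      # push up on violated axes
    n = np.linalg.norm(d)
    return d / n if n > 0 else d

def projection_opt(F, target_low, T=200):
    """F: (m, A) utility of each action under each objective.
    Each round solves only a linear sub-problem max_a d^T F[:, a]."""
    m, A = F.shape
    v_bar = np.zeros(m)
    counts = np.zeros(A)
    for t in range(1, T + 1):
        d = projdir(v_bar, target_low)
        if np.linalg.norm(d) == 0:               # inside W: any direction works
            d = np.ones(m) / m
        a = int(np.argmax(d @ F))                # reward-free linear scalarization
        counts[a] += 1
        v_bar += (F[:, a] - v_bar) / t           # running average of utilities
    return counts / T, v_bar                     # mixture policy and its utilities

F = np.array([[1.0, 0.0],                        # objective 1 favors action 0
              [0.0, 1.0]])                       # objective 2 favors action 1
pi_mix, v_bar = projection_opt(F, target_low=np.array([0.4, 0.4]))
```

The mixture alternates between the two specialist actions until the averaged utility vector settles near the target set, mirroring the Blackwell-approachability argument.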

Crucially, these sub-problems use only linear combinations of the base objectives—never forming the aggregate reward explicitly—permitting DPO-style updates and eliminating the need to train or query explicit reward models at any stage (i.e., fully reward-free operation). Under standard linearity and boundedness conditions, RACO achieves regret $R_T = O(1/\sqrt{T})$ (Xiong et al., 21 Feb 2025).

5. Extensions: Multi-Group Consensus and Constrained RL Connections

RACO generalizes to multi-group alignment by compositing convex targets $W^{(n)}$ for each user group $n$, each with its own aggregation parameters. Consensus policies are obtained either by projecting into $W = \cap_n W^{(n)}$ or by minimizing a composite malfare, e.g., $(\sum_n \zeta_n\, d(S(\pi), W^{(n)})^{2q})^{1/(2q)}$, with projection steps updated accordingly. If group aggregation parameters are unknown, maximum likelihood estimation interleaved with policy updates can be used. All regret and convergence guarantees extend, up to $O(1/\sqrt{T})$ error from weight estimation (Xiong et al., 21 Feb 2025).

In the RL context, reward-free constrained RL or approachability reduces to the RACO meta-algorithm: a reward-free exploration phase (learning a dynamics model) followed by an online convex optimization loop that enforces the constraints or drives the vectorial return to a target set. Separating exploration (reward-free oracle) from constraint satisfaction (Fenchel dual/OCO) yields near-optimal sample complexity; for tabular and linearly-parameterized MDPs, this achieves the best known or matching rates for constrained RL and zero-sum Markov games (Miryoosefi et al., 2021).

6. Algorithmic Instantiations and Practical Considerations

RACO implementations can take several algorithmic forms:

  • Direct CAGrad-Clip updates for LLM alignment with pairwise preference losses (DPO objectives) and standard gradient-based optimization. The clipping parameter $c$ is typically tuned per model and task (e.g., $c=0.4$ for Qwen/Llama-Instruct, $c=0.7$–$0.8$ for Gemma) (Chen et al., 2 Feb 2026).
  • Projection Optimization with DPO-style decoding: when individual policies for each objective have been trained to convergence, RACO's mixture is computed in closed form via MOD decoding, $\pi^t(y|x) \propto \prod_i \pi_{r_i}(y|x)^{d_i^t}$, yielding nearly training-free policy integration (Xiong et al., 21 Feb 2025).
  • MODPO (Multi-Objective Direct Preference Optimization) as a reward-free surrogate: policies are trained directly using a multi-head DPO loss over all objectives, optionally parameterized by weight prompts to enable smooth interpolation on the Pareto frontier. The approach is computationally efficient and supports single-model coverage of the trade-off surface (Zhou et al., 2023).
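The MOD-decoding mixture in the second bullet amounts to a weighted geometric mean of per-objective policies, computed per decoding step in log space. A minimal sketch with hypothetical next-token distributions over a three-token vocabulary:

```python
import numpy as np

def mod_mix(log_probs, d):
    """Mix policies via pi(y|x) proportional to prod_i pi_i(y|x)^{d_i}.
    log_probs: (k, V) per-objective log-probabilities over the vocabulary."""
    mixed = np.asarray(d) @ np.asarray(log_probs)     # sum_i d_i * log pi_i(y|x)
    mixed -= mixed.max()                              # stabilize before exp
    p = np.exp(mixed)
    return p / p.sum()                                # renormalize over V

# Two toy next-token distributions that disagree, mixed with equal weights.
lp = np.log(np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.2, 0.7]]))
p = mod_mix(lp, d=[0.5, 0.5])
```

With equal weights the mixture is symmetric across the two policies' preferred tokens, illustrating how the coefficients $d_i^t$ steer the blended policy without any retraining.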

Common empirical heuristics include gradient normalization/clipping, leveraging standard TRL library practices, and batch size/sequencing optimization to stabilize large-model training (Chen et al., 2 Feb 2026).

7. Empirical Performance and Limitations

Across benchmarks in multi-objective summarization (Reddit quality-conciseness, quality-faithfulness) and safety alignment (helpfulness-harmlessness), RACO consistently yields wider trade-off fronts closer to the Pareto frontier than baseline weighted-loss schemes (DPO-LW, AMoPO) across the Qwen, Llama, and Gemma families. In particular, on the BeaverTails safety alignment set, RACO's Pareto fronts dominate alternatives, attaining win rates of 50–80% against existing multi-objective methods at various weights (Chen et al., 2 Feb 2026).

RACO operates on static preference data, and per-objective gradient computation imposes batching and memory costs that grow with the number of objectives. Open directions include online/active preference acquisition, scaling to large $k$, adaptive tuning of algorithmic hyperparameters, and rigorous extension of convergence acceleration beyond the two-objective regime.


In summary, RACO constitutes a theoretically sound, practically viable framework for multi-objective and multi-group alignment, operating without explicit reward models and with robust convergence to user-weighted Pareto-critical solutions. Its conflict resolution, projection optimization, and reward-free planning design are supported by both rigorous theoretical analysis and extensive empirical validation (Chen et al., 2 Feb 2026, Xiong et al., 21 Feb 2025, Zhou et al., 2023, Miryoosefi et al., 2021).
