
RACO: Reward-Free Alignment for Conflicted Objectives

Updated 3 February 2026
  • The paper introduces RACO, a framework that fine-tunes models using direct preference data instead of explicit rewards for handling conflicting objectives.
  • It leverages conflict-averse gradient descent with per-coordinate clipping to resolve gradient conflicts and achieve efficient Pareto-critical updates.
  • The framework provides theoretical guarantees including sublinear regret and convergence, extending naturally to multi-group and constrained RL scenarios.

A Reward-free Alignment framework for Conflicted Objectives (RACO) provides a principled, reward-model-free approach to fine-tuning machine learning models—primarily LLMs—under multiple, potentially conflicting alignment objectives. RACO leverages direct preference data rather than explicit scalarized rewards, resolves gradient conflicts algorithmically, supports Pareto-optimal policy discovery, and extends naturally to multi-group and constrained RL settings. Its theoretical guarantees encompass sublinear regret and convergence to Pareto-critical points that respect user-specified objective weights. RACO builds on several advances including projection optimization, conflict-averse descent, direct preference optimization, and reward-free planning oracles (Chen et al., 2 Feb 2026, Xiong et al., 21 Feb 2025, Zhou et al., 2023, Miryoosefi et al., 2021).

1. Formalization: Objectives, Preference Losses, and Pareto Optimality

RACO addresses alignment in settings with $k$ conflicting objectives. Let $\theta \in \mathbb{R}^d$ denote model parameters. The $k$ preference-derived loss functions, $L_1(\theta),\ldots,L_k(\theta)$, are computed from pairwise human preference datasets for each axis (e.g., helpfulness, harmlessness, conciseness, factuality). The gradient of each objective is $g_i(\theta) = \nabla_\theta L_i(\theta)$.

A parameter point $\theta^*$ is Pareto-optimal if there does not exist a common descent direction $\mathbf{d}$ such that $\nabla L_i(\theta^*)^\top \mathbf{d} < 0$ for all $i$, i.e., no infinitesimal update improves all objectives simultaneously. User trade-off preferences are encoded as weights $w = (w_1,\ldots,w_k) \in \Delta_k$, with the anchor gradient $g_0(\theta) = \sum_{i=1}^k w_i g_i(\theta)$ corresponding to the weighted aggregate loss $L_w(\theta) = \sum_{i=1}^k w_i L_i(\theta)$. RACO seeks points that are Pareto-critical and aligned with these weights (Chen et al., 2 Feb 2026).
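As a concrete illustration, the anchor gradient and the first-order common-descent test above can be sketched in a few lines of NumPy; the two toy gradient vectors below are hypothetical, not taken from the paper:

```python
import numpy as np

def anchor_gradient(grads, w):
    """Anchor gradient g0 = sum_i w_i g_i for weights w on the simplex."""
    grads = np.asarray(grads)          # shape (k, d)
    w = np.asarray(w)
    assert np.isclose(w.sum(), 1.0) and np.all(w >= 0)
    return w @ grads                   # shape (d,)

def has_common_descent(grads, d):
    """True if direction d strictly decreases every objective to first order."""
    return bool(np.all(np.asarray(grads) @ d < 0))

# Two conflicting objectives (negatively aligned gradients), equal weights.
g = [np.array([1.0, 0.0]), np.array([-0.5, 1.0])]
g0 = anchor_gradient(g, [0.5, 0.5])    # -> [0.25, 0.5]
```

Here $-g_0$ happens to be a common descent direction, so this point is not yet Pareto-critical.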

2. Conflict-Averse Gradient Descent with Clipping

RACO's core update mechanism is a clipped variant of conflict-averse gradient descent (CAGrad). Two objectives are in conflict at $\theta$ if their gradients are negatively aligned, i.e., $\nabla L_i(\theta) \cdot \nabla L_j(\theta) < 0$. At each iteration $t$, CAGrad solves for a mixing vector $p^{(t)} \in \Delta_k$ that minimizes $G_p^{(t)\top} g_0^{(t)} + c \|g_0^{(t)}\| \|G_p^{(t)}\|$, where $G_p^{(t)} = \sum_{i=1}^k p_i g_i^{(t)}$ and $c \in [0,1)$ is the correction hyperparameter.

The descent direction without clipping is $-g_0^{(t)} - c \|g_0^{(t)}\| G_p^{(t)}/\|G_p^{(t)}\|$. To prevent excessive influence from low-weight objectives, RACO applies per-coordinate clipping, $\tilde{p}_i = \min(p_i, w_i)$, forming the clipped mixture $\hat G_p^{(t)} = \sum_{i=1}^k \tilde{p}_i g_i^{(t)}$. The final update direction is $G_0^{(t)} = g_0^{(t)} + c\|g_0^{(t)}\| \hat G_p^{(t)}/\|\hat G_p^{(t)}\|$ (when $\|\hat G_p^{(t)}\| > 0$), and the parameter update is $\theta_{t+1} = \theta_t - \eta G_0^{(t)}$ (Chen et al., 2 Feb 2026).

This approach ensures that each update direction accounts for gradient conflicts without violating user intent, and the clipped correction strictly improves per-step descent in two-objective scenarios.
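A minimal NumPy sketch of one clipped update for $k=2$ follows, using a 1-D grid search in place of a quadratic-program solver for the mixing vector; the gradients, weights, and value of $c$ are illustrative:

```python
import numpy as np

def cagrad_clip_step(grads, w, c=0.4, n_grid=1001):
    """One clipped conflict-averse update direction G0 for k=2 objectives.
    grads: (2, d) per-objective gradients; w: user weights on the simplex."""
    G = np.asarray(grads)
    w = np.asarray(w)
    g0 = w @ G                                   # anchor gradient
    # Solve min_p Gp^T g0 + c ||g0|| ||Gp|| over the 1-simplex by grid search.
    best_p, best_val = None, np.inf
    for p1 in np.linspace(0.0, 1.0, n_grid):
        p = np.array([p1, 1.0 - p1])
        Gp = p @ G
        val = Gp @ g0 + c * np.linalg.norm(g0) * np.linalg.norm(Gp)
        if val < best_val:
            best_p, best_val = p, val
    # Per-coordinate clipping bounds each objective's influence by its weight.
    p_clip = np.minimum(best_p, w)
    Gp_hat = p_clip @ G
    norm = np.linalg.norm(Gp_hat)
    if norm > 0:
        return g0 + c * np.linalg.norm(g0) * Gp_hat / norm
    return g0

G0 = cagrad_clip_step([np.array([1.0, 0.0]), np.array([-0.5, 1.0])],
                      w=[0.5, 0.5], c=0.4)
```

By Cauchy-Schwarz, $G_0^{(t)\top} g_0^{(t)} \geq (1-c)\|g_0^{(t)}\|^2 > 0$, so stepping along $-G_0^{(t)}$ always decreases the weighted loss $L_w$ to first order.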

3. Theoretical Guarantees and Convergence Properties

Under standard assumptions—$\ell_i$-Lipschitz gradients for each objective, stepsize $\eta \in (0, 1/\ell_w]$ with $\ell_w = \sum_i w_i \ell_i$, and $c \in [0,1)$—RACO iterates converge to Pareto-critical points. Specifically, every limit point $\theta^\star$ is both a stationary point of $L_w$ and Pareto-critical for $(L_1,\ldots,L_k)$. The sublinear convergence rate in terms of the squared gradient norm is

$$\min_{0 \leq t < T} \|\nabla L_w(\theta_t)\|^2 \leq \frac{2 L_w(\theta_0)}{\eta(1-c^2)\,T}.$$

A corresponding bound holds for the Pareto-criticality measure $M(\theta) = \min_{\lambda \in \Delta_k} \|\sum_i \lambda_i \nabla L_i(\theta)\|$, since $M(\theta) \leq \|\nabla L_w(\theta)\|$.
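The measure $M(\theta)$ and the bound $M(\theta) \leq \|\nabla L_w(\theta)\|$ can be checked numerically; the sketch below grid-searches the 1-simplex for $k=2$ (a small QP would be used for general $k$), on hypothetical gradients:

```python
import numpy as np

def pareto_criticality(grads, n_grid=2001):
    """M(theta) = min over lambda in the simplex of ||sum_i lambda_i g_i||,
    computed for k=2 by grid search over lambda_1 in [0, 1]."""
    g1, g2 = np.asarray(grads)
    lam = np.linspace(0.0, 1.0, n_grid)[:, None]
    mix = lam * g1 + (1.0 - lam) * g2          # all convex combinations
    return float(np.linalg.norm(mix, axis=1).min())

# Hypothetical conflicting gradients and uniform user weights w = (1/2, 1/2).
g = [np.array([1.0, 0.0]), np.array([-0.5, 1.0])]
M = pareto_criticality(g)
grad_Lw = np.array([0.5, 0.5]) @ np.asarray(g)  # anchor gradient
```

The bound holds by construction: $w$ itself lies in $\Delta_k$, so the minimum over $\lambda$ can never exceed the norm at $\lambda = w$.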

For $k=2$, explicit analysis shows clipped CAGrad strictly accelerates per-step descent whenever the gradients are not colinear and clipping is active (Chen et al., 2 Feb 2026).

4. Projection Optimization and Reward-Free Extensions

Alternative RACO instantiations leverage Blackwell approachability. The projection-based RACO method reformulates alignment as iteratively steering the vector of axis-wise utilities $S(\pi) = (f_1(\pi),\ldots,f_m(\pi))^\top$ into a convex target set $W_{p,c}^\alpha$, parameterized by the preference aggregation function $\Phi(S) = \left(\sum_i \alpha_i S_i^p\right)^{1/p}$ with $p \leq 1$, $\alpha \in \Delta_{m-1}$, and a threshold $c$.

At each round, RACO computes the projection direction $d^{t+1} = \operatorname{ProjDir}(\overline{V}^t; W)$, then solves the next sub-problem $\pi^{t+1} \leftarrow \arg\max_\pi \sum_{i=1}^m d_i^{t+1} f_i(\pi)$. The final policy is the mixture $\bar{\pi}_T = \frac{1}{T} \sum_{t=1}^T \pi^t$.
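The round structure above can be sketched on a toy problem: a bandit-style policy over a finite action set with linear utilities, and a simple box-shaped target set standing in for the paper's $\Phi$-induced set $W_{p,c}^\alpha$ (both the target shape and the utility matrix are assumptions for illustration):

```python
import numpy as np

def projdir(v_bar, target_low):
    """Unit direction from the running utility average toward the box
    W = {S : S >= target_low} (zero vector if already inside)."""
    d = np.maximum(target_low - v_bar, 0.0)      # push up on violated axes
    n = np.linalg.norm(d)
    return d / n if n > 0 else d

def projection_opt(F, target_low, T=200):
    """F: (m, A) utility of each action under each objective.
    Each round solves only a linear sub-problem max_a d^T F[:, a]."""
    m, A = F.shape
    v_bar = np.zeros(m)
    counts = np.zeros(A)
    for t in range(1, T + 1):
        d = projdir(v_bar, target_low)
        if np.linalg.norm(d) == 0:               # inside W: any direction works
            d = np.ones(m) / m
        a = int(np.argmax(d @ F))                # reward-free linear scalarization
        counts[a] += 1
        v_bar += (F[:, a] - v_bar) / t           # running average of utilities
    return counts / T, v_bar                     # mixture policy and its utilities

F = np.array([[1.0, 0.0],                        # objective 1 favors action 0
              [0.0, 1.0]])                       # objective 2 favors action 1
pi_mix, v_bar = projection_opt(F, target_low=np.array([0.4, 0.4]))
```

The mixture alternates between the two specialist actions until the averaged utility vector settles near the target set, mirroring the Blackwell-approachability argument.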

Crucially, these sub-problems use only linear combinations of the base objectives—never forming the aggregate reward explicitly—permitting DPO-style updates and eliminating the need to train or query explicit reward models at any stage (i.e., fully reward-free operation). Under standard linearity and boundedness conditions, RACO achieves regret $R_T = O(1/\sqrt{T})$ (Xiong et al., 21 Feb 2025).

5. Extensions: Multi-Group Consensus and Constrained RL Connections

RACO generalizes to multi-group alignment by compositing convex targets $W^{(n)}$ for each user group $n$, each with its own aggregation parameters. Consensus policies are obtained either by projecting into $W = \cap_n W^{(n)}$ or by minimizing a composite malfare, e.g., $(\sum_n \zeta_n\, d(S(\pi), W^{(n)})^{2q})^{1/(2q)}$, with projection steps updated accordingly. If group aggregation parameters are unknown, maximum likelihood estimation interleaved with policy updates can be used. All regret and convergence guarantees extend, up to $O(1/\sqrt{T})$ error from weight estimation (Xiong et al., 21 Feb 2025).

In the RL context, reward-free constrained RL or approachability reduces to the RACO meta-algorithm: a reward-free exploration phase (learning a dynamics model) followed by an online convex optimization loop that enforces the constraints or drives the vectorial return to a target set. Separating exploration (reward-free oracle) from constraint satisfaction (Fenchel dual/OCO) yields near-optimal sample complexity; for tabular and linearly-parameterized MDPs, this achieves the best known or matching rates for constrained RL and zero-sum Markov games (Miryoosefi et al., 2021).

6. Algorithmic Instantiations and Practical Considerations

RACO implementations can take several algorithmic forms:

  • Direct CAGrad-Clip updates for LLM alignment with pairwise preference losses (DPO objectives) and standard gradient-based optimization. The clipping parameter $c$ is typically tuned per model and task (e.g., $c=0.4$ for Qwen/Llama-Instruct, $c=0.7$–$0.8$ for Gemma) (Chen et al., 2 Feb 2026).
  • Projection Optimization with DPO-style decoding: when individual policies for each objective have been trained to convergence, RACO's mixture is computed in closed form via MOD decoding, $\pi^t(y|x) \propto \prod_i \pi_{r_i}(y|x)^{d_i^t}$, yielding nearly training-free policy integration (Xiong et al., 21 Feb 2025).
  • MODPO (Multi-Objective Direct Preference Optimization) as a reward-free surrogate: policies are trained directly using a multi-head DPO loss over all objectives, optionally parameterized by weight prompts to enable smooth interpolation on the Pareto frontier. The approach is computationally efficient and supports single-model coverage of the trade-off surface (Zhou et al., 2023).
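The MOD-decoding mixture in the second bullet amounts to a weighted geometric mean of per-objective policies, computed per decoding step in log space. A minimal sketch with hypothetical next-token distributions over a three-token vocabulary:

```python
import numpy as np

def mod_mix(log_probs, d):
    """Mix policies via pi(y|x) proportional to prod_i pi_i(y|x)^{d_i}.
    log_probs: (k, V) per-objective log-probabilities over the vocabulary."""
    mixed = np.asarray(d) @ np.asarray(log_probs)     # sum_i d_i * log pi_i(y|x)
    mixed -= mixed.max()                              # stabilize before exp
    p = np.exp(mixed)
    return p / p.sum()                                # renormalize over V

# Two toy next-token distributions that disagree, mixed with equal weights.
lp = np.log(np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.2, 0.7]]))
p = mod_mix(lp, d=[0.5, 0.5])
```

With equal weights the mixture is symmetric across the two policies' preferred tokens, illustrating how the coefficients $d_i^t$ steer the blended policy without any retraining.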

Common empirical heuristics include gradient normalization/clipping, leveraging standard TRL library practices, and batch size/sequencing optimization to stabilize large-model training (Chen et al., 2 Feb 2026).

7. Empirical Performance and Limitations

Across benchmarks in multi-objective summarization (Reddit quality-conciseness, quality-faithfulness) and safety alignment (helpfulness-harmlessness), RACO consistently yields wider trade-off fronts closer to the Pareto frontier than baseline weighted-loss schemes (DPO-LW, AMoPO) across the Qwen, Llama, and Gemma families. In particular, on the BeaverTails safety alignment set, RACO's Pareto fronts dominate alternatives, attaining win rates of 50–80% against existing multi-objective methods at various weights (Chen et al., 2 Feb 2026).

RACO operates on static preference data, and per-objective gradient computation imposes batching and memory costs that grow with the number of objectives. Open directions include online/active preference acquisition, scaling to large $k$, adaptive tuning of algorithmic hyperparameters, and rigorous extension of convergence acceleration beyond the two-objective regime.


In summary, RACO constitutes a theoretically sound, practically viable framework for multi-objective and multi-group alignment, operating without explicit reward models and with robust convergence to user-weighted Pareto-critical solutions. Its conflict resolution, projection optimization, and reward-free planning design are supported by both rigorous theoretical analysis and extensive empirical validation (Chen et al., 2 Feb 2026, Xiong et al., 21 Feb 2025, Zhou et al., 2023, Miryoosefi et al., 2021).
