RACO: Reward-Free Alignment for Conflicted Objectives
- The paper introduces RACO, a framework that fine-tunes models on direct preference data, rather than explicit rewards, to handle conflicting objectives.
- It leverages conflict-averse gradient descent with per-coordinate clipping to resolve gradient conflicts and achieve efficient Pareto-critical updates.
- The framework provides theoretical guarantees including sublinear regret and convergence, extending naturally to multi-group and constrained RL scenarios.
A Reward-free Alignment framework for Conflicted Objectives (RACO) provides a principled, reward-model-free approach to fine-tuning machine learning models—primarily LLMs—under multiple, potentially conflicting alignment objectives. RACO leverages direct preference data rather than explicit scalarized rewards, resolves gradient conflicts algorithmically, supports Pareto-optimal policy discovery, and extends naturally to multi-group and constrained RL settings. Its theoretical guarantees encompass sublinear regret and convergence to Pareto-critical points that respect user-specified objective weights. RACO builds on several advances including projection optimization, conflict-averse descent, direct preference optimization, and reward-free planning oracles (Chen et al., 2 Feb 2026, Xiong et al., 21 Feb 2025, Zhou et al., 2023, Miryoosefi et al., 2021).
1. Formalization: Objectives, Preference Losses, and Pareto Optimality
RACO addresses alignment in settings with $m \ge 2$ conflicting objectives. Let $\theta \in \mathbb{R}^d$ denote model parameters. The preference-derived loss functions, $\ell_1(\theta), \dots, \ell_m(\theta)$, are computed from pairwise human preference datasets for each axis (e.g., helpfulness, harmlessness, conciseness, factuality). The gradient of each objective is $g_i = \nabla_\theta \ell_i(\theta)$.
A parameter point $\theta$ is Pareto-optimal if there does not exist a common descent direction $d$ such that $\langle g_i, d \rangle < 0$ for all $i$, i.e., no infinitesimal update will improve all objectives simultaneously. User trade-off preferences are encoded as weights $w \in \Delta_m$ (the probability simplex), with the anchor gradient $g_0 = \sum_{i=1}^m w_i g_i$ corresponding to the weighted aggregate loss $\ell_0(\theta) = \sum_{i=1}^m w_i \ell_i(\theta)$. RACO seeks points that are Pareto-critical and that align with these weights (Chen et al., 2 Feb 2026).
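For $m = 2$ the definition has a direct numerical reading: a point is Pareto-critical exactly when some convex combination of the two gradients vanishes. A minimal sketch (gradient values and the grid-search resolution are illustrative, not from the paper):

```python
import numpy as np

def min_convex_combo_norm(grads, n_grid=1001):
    """Pareto-criticality measure for m = 2 objectives: the point is
    Pareto-critical iff some convex combination of the gradients vanishes,
    i.e. min over w in [0, 1] of ||w*g1 + (1-w)*g2|| equals zero."""
    g1, g2 = grads
    ws = np.linspace(0.0, 1.0, n_grid)
    norms = [np.linalg.norm(w * g1 + (1.0 - w) * g2) for w in ws]
    return min(norms)

# Exactly opposing gradients: the measure vanishes (Pareto-critical point).
print(min_convex_combo_norm([np.array([1.0, 0.0]), np.array([-1.0, 0.0])]))  # 0.0
# Aligned gradients: a common descent direction exists, the measure is positive.
print(min_convex_combo_norm([np.array([1.0, 0.0]), np.array([1.0, 1.0])]))   # 1.0
```

Driving this measure toward zero, while staying close to the user-weighted anchor, is exactly what the update rule in the next section is designed to do.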
2. Conflict-Averse Gradient Descent with Clipping
RACO's core update mechanism is a clipped variant of conflict-averse gradient descent (CAGrad). Two objectives $i, j$ are in conflict at $\theta$ if their gradients are negatively aligned, i.e., $\langle g_i, g_j \rangle < 0$. At each iteration $t$, CAGrad solves for a mixing vector $w^t \in \Delta_m$ that minimizes $F(w) = \langle g_w, g_0 \rangle + c\,\|g_0\|\,\|g_w\|$, where $g_w = \sum_i w_i g_i$ and $c \ge 0$ is the correction hyperparameter.
The descent direction without clipping is $d = g_0 + \frac{c\,\|g_0\|}{\|g_{w^t}\|}\, g_{w^t}$. To prevent excessive influence from low-weight objectives, RACO clips each coordinate of $g_{w^t}$ to a threshold, forming the clipped mixture $\tilde g_{w^t}$. The final update direction is $\tilde d = g_0 + \frac{c\,\|g_0\|}{\|g_{w^t}\|}\, \tilde g_{w^t}$ (when $\|g_{w^t}\| > 0$), and the parameter update is $\theta^{t+1} = \theta^t - \eta_t \tilde d$ (Chen et al., 2 Feb 2026).
This approach ensures that each update direction accounts for gradient conflicts without violating user intent, and the clipped correction strictly improves per-step descent in two-objective scenarios.
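A minimal sketch of one such update for $m = 2$, assuming the standard CAGrad dual objective $F(w) = \langle g_w, g_0 \rangle + c\,\|g_0\|\,\|g_w\|$ and an illustrative per-coordinate clip of the correction term; the grid search over the simplex and all hyperparameter values are for exposition only:

```python
import numpy as np

def cagrad_clip_direction(grads, weights, c=0.5, clip=1.0, n_grid=1001):
    """One clipped conflict-averse update direction for m = 2 objectives.

    grads:   per-objective gradients [g1, g2]
    weights: user trade-off weights (w1, w2), giving anchor g0 = w1*g1 + w2*g2
    c:       CAGrad correction level; clip: per-coordinate threshold (illustrative)
    """
    g1, g2 = grads
    g0 = weights[0] * g1 + weights[1] * g2
    # Minimize F(w) = <g_w, g0> + c * ||g0|| * ||g_w|| over the 2-simplex.
    best_w, best_val = 0.0, np.inf
    for w in np.linspace(0.0, 1.0, n_grid):
        gw = w * g1 + (1.0 - w) * g2
        val = gw @ g0 + c * np.linalg.norm(g0) * np.linalg.norm(gw)
        if val < best_val:
            best_val, best_w = val, w
    gw = best_w * g1 + (1.0 - best_w) * g2
    norm_gw = np.linalg.norm(gw)
    if norm_gw < 1e-12:
        return g0  # no correction when the optimal mixture vanishes
    correction = (c * np.linalg.norm(g0) / norm_gw) * gw
    # Per-coordinate clipping limits the pull of any single coordinate.
    correction = np.clip(correction, -clip, clip)
    return g0 + correction
```

Because the correction's norm is at most $c\,\|g_0\|$ and clipping only shrinks coordinates, the returned direction satisfies $\langle \tilde d, g_0 \rangle \ge (1 - c)\,\|g_0\|^2 > 0$ for $c < 1$, so the anchor loss still decreases even under conflict.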
3. Theoretical Guarantees and Convergence Properties
Under standard assumptions—$L$-Lipschitz gradients for each objective, an appropriate constant stepsize $\eta \le 1/L$, and correction level $c < 1$—RACO iterates converge to Pareto-critical points. Specifically, every limit point is both a stationary point of $\ell_0$ and Pareto-critical for $(\ell_1, \dots, \ell_m)$. The sublinear convergence rate in terms of the squared gradient norm is
$$\min_{0 \le t < T} \|\nabla \ell_0(\theta^t)\|^2 = O(1/T).$$
A corresponding bound for the Pareto-criticality measure $\min_{w \in \Delta_m} \|g_w\|$ applies, since the anchor weights are a feasible point of that minimization and hence $\min_{w \in \Delta_m} \|g_w\| \le \|g_0\| = \|\nabla \ell_0\|$.
For $m = 2$, explicit analysis shows clipped CAGrad strictly accelerates per-step descent whenever the gradients are not colinear and clipping is active (Chen et al., 2 Feb 2026).
4. Projection Optimization and Reward-Free Extensions
Alternative RACO instantiations leverage Blackwell approachability. The projections-based RACO method reformulates alignment as iteratively steering the vector of axis-wise utilities $z(\pi) \in \mathbb{R}^m$ into a convex target set $\mathcal{C}$, parameterized by a preference aggregation function $f_\alpha$ with weights $\alpha \in \Delta_m$ and a threshold $\tau$, so that $\mathcal{C} = \{z \in \mathbb{R}^m : f_\alpha(z) \ge \tau\}$.
At each round $t$, RACO computes the projection direction $\lambda^t = \mathrm{proj}_{\mathcal{C}}(z^t) - z^t$ from the current utility vector $z^t$, then optimizes the next sub-problem $\pi^{t+1} = \arg\max_\pi \langle \lambda^t, z(\pi) \rangle$. The final policy is the uniform mixture $\bar\pi = \frac{1}{T} \sum_{t=1}^T \pi^t$.
Crucially, these sub-problems use only linear combinations of the base objectives—never forming the aggregate reward explicitly—permitting the use of DPO-style updates and eliminating the need to train or query explicit reward models at any stage (i.e., fully reward-free operation). Under standard linearity and boundedness conditions, RACO achieves $O(\sqrt{T})$ regret (Xiong et al., 21 Feb 2025).
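The projection loop can be illustrated on a toy problem: a handful of base policies, each with a fixed axis-wise utility vector, and a box-shaped target set. All numbers below are invented for illustration, and the sub-problem is solved by enumeration rather than the DPO-style updates the real method uses:

```python
import numpy as np

# Toy axis-wise utilities z(pi) for four base policies (illustrative values).
Z = np.array([[1.0, 0.0],   # strong on objective 1 only
              [0.0, 1.0],   # strong on objective 2 only
              [0.6, 0.5],
              [0.2, 0.2]])
tau = 0.45                  # target set C = {z : z_i >= tau on both axes}

def project_to_C(z):
    # Euclidean projection onto the box C (raise any coordinate below tau).
    return np.maximum(z, tau)

mix = np.zeros(len(Z))      # counts of how often each base policy is played
z_bar = np.zeros(2)         # running average of realized utilities
for t in range(1, 201):
    lam = project_to_C(z_bar) - z_bar            # steering direction
    # Sub-problem: best response to the linear objective <lam, z(pi)>.
    k = int(np.argmax(Z @ lam)) if np.linalg.norm(lam) > 0 else 0
    mix[k] += 1
    z_bar = z_bar + (Z[k] - z_bar) / t           # update the average

dist = np.linalg.norm(project_to_C(z_bar) - z_bar)
print(z_bar, dist)  # average utility vector steered to (or near) C
```

After a few hundred rounds the averaged utility vector sits essentially on the boundary of $\mathcal{C}$, with the mixture weights `mix / mix.sum()` playing the role of the final policy mixture $\bar\pi$.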
5. Extensions: Multi-Group Consensus and Constrained RL Connections
RACO generalizes to multi-group alignment by compositing convex targets $\mathcal{C}_1, \dots, \mathcal{C}_K$, one per user group, each with its own aggregation parameters. Consensus policies are obtained either by projecting into the intersection $\bigcap_k \mathcal{C}_k$ or by minimizing a composite malfare objective (e.g., the worst-off group's aggregate), updating projection steps accordingly. If group aggregation parameters are not known, maximum-likelihood estimation interleaved with policy updates can be used. All regret and convergence guarantees extend, up to the estimation error of the group weights (Xiong et al., 21 Feb 2025).
In the RL context, reward-free constrained RL and approachability problems reduce to the RACO meta-algorithm: a reward-free exploration phase (learning a dynamics model) followed by an online convex optimization loop that enforces the constraints or drives the vectorial return into a target set. This separation of exploration (reward-free oracle) and constraint satisfaction (Fenchel dual/OCO) yields near-optimal sample complexity. For tabular and linearly parameterized MDPs, the reduction matches the best known sample-complexity rates for constrained RL and zero-sum Markov games (Miryoosefi et al., 2021).
6. Algorithmic Instantiations and Practical Considerations
RACO implementations can take several algorithmic forms:
- Direct CAGrad-Clip updates for LLM alignment with pairwise preference losses (DPO objectives) and standard gradient-based optimization. Clipping parameters are typically tuned per model family and task (e.g., model-specific values for Qwen/Llama-Instruct, and up to $0.8$ for Gemma) (Chen et al., 2 Feb 2026).
- Projection Optimization with DPO-style decoding: when individual policies $\pi_i$ for each objective have been trained to convergence, RACO's mixture is computed in closed form via MOD decoding, $\pi_w(y \mid x) \propto \prod_{i=1}^m \pi_i(y \mid x)^{w_i}$, leading to nearly training-free policy integration (Xiong et al., 21 Feb 2025).
- MODPO (Multi-Objective Direct Preference Optimization) as a reward-free surrogate: policies are trained directly using a multi-head DPO loss over all objectives, optionally parameterized by weight prompts to enable smooth interpolation on the Pareto frontier. The approach is computationally efficient and supports single-model coverage of the trade-off surface (Zhou et al., 2023).
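Assuming the standard product-of-policies form of MOD decoding, the closed-form mixture reduces, at each decoding step, to a weighted sum of per-objective log-probabilities followed by renormalization. A toy sketch over a four-token vocabulary (all distributions are invented):

```python
import numpy as np

def mod_mix_logprobs(logps, weights):
    """MOD-style closed-form mixture pi(y|x) ∝ prod_i pi_i(y|x)^{w_i}:
    a weighted sum of per-objective log-probs, renormalized via softmax.

    logps:   (m, V) next-token log-probs from m single-objective policies
    weights: (m,) trade-off weights summing to 1
    """
    mixed = np.asarray(weights) @ np.asarray(logps)  # sum_i w_i * log pi_i
    mixed -= mixed.max()                             # numerical stability
    p = np.exp(mixed)
    return p / p.sum()

# Policy A prefers token 0; policy B prefers token 3.
pA = np.log([0.7, 0.1, 0.1, 0.1])
pB = np.log([0.1, 0.1, 0.1, 0.7])
print(mod_mix_logprobs([pA, pB], [0.5, 0.5]))  # mass split between tokens 0 and 3
```

Sliding the weights from $(1, 0)$ to $(0, 1)$ traces out the interpolation between the two single-objective policies without any retraining, which is what makes the integration nearly training-free.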
Common empirical heuristics include gradient normalization/clipping, leveraging standard TRL library practices, and batch size/sequencing optimization to stabilize large-model training (Chen et al., 2 Feb 2026).
7. Empirical Performance and Limitations
Across benchmarks in multi-objective summarization (Reddit quality-conciseness, quality-faithfulness) and safety alignment (helpfulness-harmlessness), RACO consistently yields wider Pareto fronts and better trade-offs than baseline weighted-loss schemes (DPO-LW, AMoPO) across the Qwen, Llama, and Gemma model families. In particular, on the BeaverTails safety alignment set, RACO's Pareto fronts dominate alternatives, attaining win rates of 50–80% against existing multi-objective methods at various weights (Chen et al., 2 Feb 2026).
RACO operates on static preference data and faces batching and gradient-efficiency constraints at high objective counts. Open directions include online/active preference acquisition, scaling to large numbers of objectives $m$, adaptive tuning of algorithmic hyperparameters, and rigorous extension of the convergence-acceleration analysis beyond the two-objective regime.
In summary, RACO constitutes a theoretically sound, practically viable framework for multi-objective and multi-group alignment, operating without explicit reward models and with robust convergence to user-weighted Pareto-critical solutions. Its conflict resolution, projection optimization, and reward-free planning design are supported by both rigorous theoretical analysis and extensive empirical validation (Chen et al., 2 Feb 2026, Xiong et al., 21 Feb 2025, Zhou et al., 2023, Miryoosefi et al., 2021).