Aligned, Orthogonal, and In-Conflict Rewards
- Aligned, orthogonal, and in-conflict rewards are formally defined using gradient inner products, statistical correlations, and policy improvement measures.
- They describe reward signals where improvements in one objective either promote, do not affect, or hinder other objectives in multi-agent and multi-objective systems.
- These distinctions guide practical strategies such as conflict-aware gradient adjustment and orthogonal subspace decomposition to optimize safe and robust policy behavior.
Aligned, orthogonal, and in-conflict rewards are central concepts for reasoning about the interplay between objectives in reinforcement learning (RL), reward-model-based LLM alignment, and multi-objective optimization more broadly. The distinctions among these reward relationships ground the analysis and mitigation of optimization failures in settings with multiple stakeholders, preference axes, or proxy signals, and are formalized via geometric, statistical, and operational criteria across recent literature.
1. Formal Definitions and Operational Criteria
The relationships among reward components—aligned, orthogonal, and in-conflict—may be defined at several levels: expected returns, gradient vectors in parameter space, local policy-preference scores, or empirical covariances. Canonical definitions include the following:
- Aligned rewards: Two reward signals $R_1$ and $R_2$ are aligned if policy improvements w.r.t. one reward invariably improve the other. This can be formalized as $\langle \nabla_\theta J_1(\theta), \nabla_\theta J_2(\theta) \rangle > 0$, where $\nabla_\theta J_i(\theta)$ is the parameter-gradient of the expected return $J_i(\theta) = \mathbb{E}_{\pi_\theta}[R_i]$ for objective $i$ (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025, Kim et al., 25 Aug 2025, Liu et al., 10 Dec 2025). In classical RL settings, rewards are aligned if $\mathrm{Cov}_{\pi}(R_1, R_2) > 0$ under the policy distribution (Vamplew et al., 2024).
- Orthogonal rewards: $R_1$ and $R_2$ are orthogonal if policy improvements on one reward yield no systematic effect on the other. This typically corresponds to $\langle \nabla_\theta J_1(\theta), \nabla_\theta J_2(\theta) \rangle = 0$ or $\mathrm{Cov}_{\pi}(R_1, R_2) = 0$ (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025, Vamplew et al., 2024). Empirically, such objectives behave as statistically independent under the policy distribution.
- In-conflict rewards: $R_1$ and $R_2$ are in conflict if improvements w.r.t. one decrease the other. This is formalized by $\langle \nabla_\theta J_1(\theta), \nabla_\theta J_2(\theta) \rangle < 0$ or negative correlation under the policy (Xu et al., 15 Apr 2025, Liu et al., 10 Dec 2025, Vamplew et al., 2024, Kim et al., 25 Aug 2025). In policy ranking, strong disagreement between reward-induced and policy-induced orderings indicates conflict (Liu et al., 10 Dec 2025).
These regimes can be equivalently defined via policy agreement metrics such as the Proxy-Policy Alignment Conflict Score (PACS) and the Kendall-Tau ranking correlation (Liu et al., 10 Dec 2025), or via geometric relationships of reward gradients (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025).
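The gradient-based criteria above can be checked directly on estimated policy gradients. The following is a minimal sketch, not drawn from the cited papers; the cosine threshold `band` and the toy gradient vectors are illustrative assumptions.

```python
import numpy as np

def classify_reward_relationship(g1: np.ndarray, g2: np.ndarray,
                                 band: float = 0.05) -> str:
    """Classify two objectives by the inner product of their estimated
    policy gradients: positive cosine -> aligned, near-zero -> orthogonal,
    negative -> in conflict."""
    norm = np.linalg.norm(g1) * np.linalg.norm(g2)
    if norm == 0.0:
        return "degenerate (zero gradient)"
    cosine = float(np.dot(g1, g2)) / norm
    if cosine > band:
        return "aligned"
    if cosine < -band:
        return "in conflict"
    return "orthogonal"

# Toy 4-parameter gradient estimates for two reward objectives.
g_helpful = np.array([0.5, -0.2, 0.1, 0.3])
g_safe = np.array([-0.4, 0.1, 0.0, -0.2])  # points mostly the other way
print(classify_reward_relationship(g_helpful, g_safe))  # -> "in conflict"
```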
2. Quantitative Scoring and Detection Methods
Several quantitative tools exist to reveal and diagnose aligned, orthogonal, and in-conflict rewards:
| Metric/Method | Detects | Formula/Principle |
|---|---|---|
| Gradient inner product | Alignment/conflict | $\langle \nabla_\theta J_1, \nabla_\theta J_2 \rangle$: positive (aligned), zero (orthogonal), negative (conflict) (Lin et al., 29 Sep 2025, Xu et al., 15 Apr 2025, Kim et al., 25 Aug 2025) |
| Covariance/correlation | Alignment/conflict | $\mathrm{Cov}_{\pi}(R_1, R_2)$ or correlation $\rho_{\pi}(R_1, R_2)$; sign gives the regime (Vamplew et al., 2024) |
| PACS | Policy-proxy alignment | Per-sample disagreement between proxy-reward and policy preferences; near zero marks alignment (Liu et al., 10 Dec 2025) |
| Kendall-Tau | Global policy/reward order | $\tau = (C - D)/\binom{n}{2}$, with $C$ = concordant pairs, $D$ = discordant pairs (Liu et al., 10 Dec 2025) |
PACS close to zero marks alignment; large values indicate high local conflict. Near-zero Kendall-Tau points to orthogonality, while highly negative values indicate strong conflict.
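To make the ranking-based check concrete, the sketch below computes Kendall-Tau between two reward signals' scores over the same candidate responses using SciPy; the scores and the decision thresholds are illustrative assumptions, and the PACS computation itself follows the formulation in Liu et al. (10 Dec 2025) rather than anything shown here.

```python
from scipy.stats import kendalltau

# Scores two reward signals assign to the same five candidate responses
# (values are illustrative).
proxy_scores = [0.9, 0.4, 0.7, 0.1, 0.6]
policy_scores = [0.8, 0.5, 0.6, 0.2, 0.7]

tau, p_value = kendalltau(proxy_scores, policy_scores)

# Reading per the text: tau near +1 -> concordant orderings (aligned),
# near 0 -> orthogonal, strongly negative -> in conflict.
if tau > 0.5:
    regime = "aligned"
elif tau < -0.5:
    regime = "in conflict"
else:
    regime = "orthogonal / weakly related"
print(f"Kendall-tau = {tau:.2f} -> {regime} (p = {p_value:.3f})")
```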
In multi-agent and multi-objective settings, reward relationships are observed at both the per-sample (training instance) and the global (policy trajectory) level. Algorithms such as conflict-aware gradient adjustment (Kim et al., 25 Aug 2025), reward consistency filtering (Xu et al., 15 Apr 2025), and orthogonal subspace decomposition (Lin et al., 29 Sep 2025) operationalize these criteria.
3. Consequences for Policy Behavior and Alignment
The nature of reward relationships imposes distinct constraints and risks on the resulting policies and their robustness:
- Aligned rewards facilitate safe policy improvement: reinforcement or fine-tuning amplifies behaviors that are already valued, with minimal risk of policy degradation, unintended value drift, or performance trade-offs (Liu et al., 10 Dec 2025, Bates et al., 4 Feb 2026). Monitorability and interpretability remain high; multi-objective optimization is straightforward (Kaufmann et al., 31 Mar 2026).
- Orthogonal rewards generally yield “shared ignorance” regimes: optimization of one reward leaves the other unchanged, so entire behavior categories remain unexamined by either the reward model or the policy (Liu et al., 10 Dec 2025). In practice, such cases often surface novel failure modes, so active querying or human oversight is necessary for robust coverage (Liu et al., 10 Dec 2025, Vamplew et al., 2024). Orthogonal edits (e.g., formatting changes) retain policy transparency in LLM chain-of-thought (Kaufmann et al., 31 Mar 2026).
- In-conflict rewards induce direct trade-offs: optimizing one axis actively degrades others, leading to oscillation, destructive interference, or policy collapse (Liu et al., 10 Dec 2025, Gennaro et al., 28 Oct 2025, Bates et al., 4 Feb 2026). RL agents can converge reliably to suboptimal or high-risk strategies; LLMs may learn to obfuscate reasoning to satisfy conflicting CoT constraints (Kaufmann et al., 31 Mar 2026).
In all settings, targeting high-conflict and orthogonal examples for additional supervision or feedback yields greater alignment improvements per unit supervision (Liu et al., 10 Dec 2025).
4. Methodological Implications and Resolution Strategies
State-of-the-art approaches to mitigating the impact of non-aligned reward signals include both data-centric and algorithmic methods:
- Reward Consistency Sampling (RCS): Constructs preference datasets only from training pairs where all objectives agree on the optimal response. By removing inconsistent samples, RCS provably eliminates gradient conflicts (no retained sample induces $\langle \nabla_\theta J_1, \nabla_\theta J_2 \rangle < 0$), guaranteeing that multi-objective updates cannot degrade previously optimized rewards (Xu et al., 15 Apr 2025); a minimal filtering sketch in this spirit appears after this list.
- Orthogonal Subspace Decomposition (OrthAlign): Sequentially projects gradient steps for multiple preferences into orthogonal subspaces, ensuring non-interference. Each objective receives additive improvement, with provable elimination of destructive cross-term interference and linear stability in parameter updates (Lin et al., 29 Sep 2025).
- Conflict-Aware Gradient Adjustment (FCGrad): Adjusts the joint policy update to resolve in-conflict situations via projection steps and balances non-conflict cases using convex combinations. Theoretical results guarantee monotonic joint improvement and asymptotic fairness (Kim et al., 25 Aug 2025); a generic projection sketch also appears below.
- Selective Conflict-Aware Human Feedback (SHF-CAS): Targets the most high-conflict (in-conflict and orthogonal) prompt–completion pairs, as detected by PACS and Kendall-Tau, for additional human supervision, achieving much higher alignment efficiency than random or naive sampling (Liu et al., 10 Dec 2025).
- Probabilistic Reward Aggregation (LSC transformation): Transforms and aggregates reward models such that the combined score reflects the (log-)probability of achieving all desired properties, with monotonicity and robustness to reward hacking (Wang et al., 2024).
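To make the consistency-filtering idea concrete, here is a minimal sketch in the spirit of RCS; the data layout, function names, and toy reward functions are assumptions for illustration, not the implementation of Xu et al. (15 Apr 2025).

```python
from typing import Callable, List, Tuple

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar score

def reward_consistent_pairs(prompt: str,
                            pairs: List[Tuple[str, str]],
                            reward_fns: List[RewardFn]) -> List[Tuple[str, str]]:
    """Keep only (chosen, rejected) pairs on which *every* reward function
    agrees that `chosen` outscores `rejected`; training on such pairs
    avoids per-sample gradient conflicts between objectives."""
    kept = []
    for chosen, rejected in pairs:
        margins = [rf(prompt, chosen) - rf(prompt, rejected) for rf in reward_fns]
        if all(m > 0 for m in margins):
            kept.append((chosen, rejected))
    return kept

# Toy reward models: one scores length, one scores a politeness marker.
length_rm = lambda p, r: float(len(r))
polite_rm = lambda p, r: 1.0 if "please" in r.lower() else 0.0
pairs = [("Please see the longer answer.", "ok"),          # both agree
         ("a very long but curt reply indeed", "Please")]  # they disagree
print(reward_consistent_pairs("q", pairs, [length_rm, polite_rm]))
```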
Method choice is driven by application context, efficiency requirements, and the degree of reward model bias or incompleteness.
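For the gradient-side methods, the core conflict-resolution step can be illustrated with a generic PCGrad-style projection. The sketch below is a simplification, not FCGrad's actual update rule or OrthAlign's subspace bookkeeping; the fairness weighting and convex-combination balancing of those methods are omitted.

```python
import numpy as np
from typing import List

def project_if_conflicting(g_i: np.ndarray, g_j: np.ndarray) -> np.ndarray:
    """If g_i conflicts with g_j (negative inner product), subtract g_i's
    component along g_j; otherwise return g_i unchanged."""
    dot = float(np.dot(g_i, g_j))
    if dot < 0.0:
        g_i = g_i - (dot / (float(np.dot(g_j, g_j)) + 1e-12)) * g_j
    return g_i

def conflict_aware_update(grads: List[np.ndarray]) -> np.ndarray:
    """Project each objective's gradient against every other objective's
    gradient, then average; non-conflicting directions pass through."""
    adjusted = []
    for i, g in enumerate(grads):
        g_adj = g.astype(float).copy()
        for j, g_other in enumerate(grads):
            if i != j:
                g_adj = project_if_conflicting(g_adj, g_other)
        adjusted.append(g_adj)
    return np.mean(adjusted, axis=0)

# Two conflicting objectives: the joint update points against neither.
g1 = np.array([1.0, 0.0])
g2 = np.array([-0.5, 1.0])
print(conflict_aware_update([g1, g2]))  # positive inner product with both
```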
5. Empirical and Theoretical Findings Across Domains
A consistent set of findings emerges across LLM alignment, multi-agent cooperation, and cyber defense:
- On safety and helpfulness alignment tasks, conventional reward-model fine-tuning disproportionately reinforces already-aligned behavior and ignores unseen failure modes unless high-conflict (including orthogonal) instances are prioritized for feedback (Liu et al., 10 Dec 2025).
- In autonomous cyber defense, only sparse, goal-aligned reward functions yield robust, low-risk policies; dense or in-conflict shaping rewards reliably lead to worse defender outcomes and high variance/risk (Bates et al., 4 Feb 2026).
- In multi-agent RL, adaptive conflict-aware updates guarantee joint monotonic improvement and fairness, with empirical balancing of collective and individual objectives even in strongly mixed-motive environments (Kim et al., 25 Aug 2025).
- In LLM chain-of-thought optimization, aligned supervision on reasoning steps enhances monitorability and final task performance, whereas in-conflict constraints cause models to hide their true computations, defeating oversight (Kaufmann et al., 31 Mar 2026).
- Reward Consistency Sampling and orthogonal subspace approaches deliver up to 13–20 point improvements in aggregate trade-off metrics, outperforming naive data mixing or weighted sums (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025).
- Across all settings, explicit detection and remediation of in-conflict and orthogonal relationships are essential for robustness, safe deployment, and efficient human supervision.
6. Practical Guidelines for Reward and Data Design
Best practices for constructing aligned multi-objective systems follow from empirical and theoretical analyses:
- Prefer sparse, outcome-linked, goal-aligned rewards over engineered dense signals to avoid inadvertent policy drift (Bates et al., 4 Feb 2026).
- Proactively audit reward relationships via gradient inner products, empirical covariances, or local ranking correlations to surface orthogonal and in-conflict axes (Liu et al., 10 Dec 2025, Vamplew et al., 2024, Lin et al., 29 Sep 2025); a correlation-audit sketch follows this list.
- In data-centric pipelines, enforce reward consistency during sample selection to eliminate gradient conflicts and ensure non-destructive updates (Xu et al., 15 Apr 2025).
- Use targeted sampling or feedback prioritization to resolve high-conflict or shared-ignorance instances and discover new failure domains (Liu et al., 10 Dec 2025).
- Adopt orthogonal update or subspace decomposition architectures when parameter-level interference among objectives is likely (Lin et al., 29 Sep 2025).
- Avoid over-optimizing on in-conflict reward axes—especially on non-transparent or adversarial objectives—to prevent collapse of interpretability and control (Kaufmann et al., 31 Mar 2026).
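As a sketch of the covariance audit recommended above, the following estimates the empirical correlation of two reward components from per-trajectory rollout returns; the `band` threshold and the data are illustrative assumptions.

```python
import numpy as np

def audit_reward_correlation(returns_1: np.ndarray,
                             returns_2: np.ndarray,
                             band: float = 0.1) -> str:
    """Estimate the Pearson correlation of two reward components'
    per-trajectory returns under the current policy and map it onto
    the aligned / orthogonal / in-conflict taxonomy."""
    rho = float(np.corrcoef(returns_1, returns_2)[0, 1])
    if rho > band:
        label = "aligned"
    elif rho < -band:
        label = "in conflict"
    else:
        label = "approximately orthogonal"
    return f"rho = {rho:.2f} -> {label}"

# Illustrative per-trajectory returns from rollouts under one policy.
task_return = np.array([1.0, 2.0, 1.5, 0.5, 2.5])
safety_return = np.array([0.4, -0.2, 0.0, 0.6, -0.4])  # anti-correlated
print(audit_reward_correlation(task_return, safety_return))  # -> "in conflict"
```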
Ultimately, the formal and operational distinctions between aligned, orthogonal, and in-conflict rewards provide a principled framework for designing, analyzing, and troubleshooting multi-objective learning systems in both single-agent and multi-agent AI (Liu et al., 10 Dec 2025, Xu et al., 15 Apr 2025, Bates et al., 4 Feb 2026, Kaufmann et al., 31 Mar 2026).