
Aligned, Orthogonal, and In-Conflict Rewards

Updated 2 April 2026
  • Aligned, orthogonal, and in-conflict rewards are formally defined using gradient inner products, statistical correlations, and policy improvement measures.
  • They describe reward signals where improvements in one objective either promote, do not affect, or hinder other objectives in multi-agent and multi-objective systems.
  • These distinctions guide practical strategies such as conflict-aware gradient adjustment and orthogonal subspace decomposition to optimize safe and robust policy behavior.

Aligned, orthogonal, and in-conflict rewards are central concepts for reasoning about the interplay between objectives in reinforcement learning (RL), reward-model-based LLM alignment, and multi-objective optimization more broadly. The distinctions among these reward relationships ground the analysis and mitigation of optimization failures in settings with multiple stakeholders, preference axes, or proxy signals, and are formalized via geometric, statistical, and operational criteria across recent literature.

1. Formal Definitions and Operational Criteria

The relationships among reward components—aligned, orthogonal, and in-conflict—may be defined at several levels: expected returns, gradient vectors in parameter space, local policy-preference scores, or empirical covariances. Canonical definitions include the following:

  • Aligned rewards: Two reward signals R_1 and R_2 are aligned if policy improvements w.r.t. one reward invariably improve the other. This can be formalized as ⟨g_1, g_2⟩ > 0, where g_i = ∇_θ L_i is the parameter gradient for objective i (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025, Kim et al., 25 Aug 2025, Liu et al., 10 Dec 2025). In classical RL settings, rewards are aligned if Corr(R_1, R_2) ≫ 0 under the policy distribution (Vamplew et al., 2024).
  • Orthogonal rewards: R_1 and R_2 are orthogonal if policy improvements on one reward yield no systematic effect on the other. This typically corresponds to ⟨g_1, g_2⟩ ≈ 0 or Cov(R_1, R_2) ≈ 0 (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025, Vamplew et al., 2024). Empirically, these objectives are statistically independent.
  • In-conflict rewards: R_1 and R_2 are in conflict if improvements w.r.t. one decrease the other. This is formalized by ⟨g_1, g_2⟩ < 0 or negative correlation under the policy (Xu et al., 15 Apr 2025, Liu et al., 10 Dec 2025, Vamplew et al., 2024, Kim et al., 25 Aug 2025). In policy ranking, strong disagreement between reward and policy indicates conflict (Liu et al., 10 Dec 2025).

These regimes can be equivalently defined via policy agreement metrics such as the Proxy-Policy Alignment Conflict Score (PACS) and the Kendall-Tau ranking correlation (Liu et al., 10 Dec 2025), or via geometric relationships of reward gradients (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025).
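The gradient-geometric criterion above is straightforward to check numerically. A minimal sketch, assuming the per-objective parameter gradients g_1, g_2 are available as flat vectors (the helper name and tolerance are illustrative):

```python
import numpy as np

def classify_reward_pair(g1, g2, tol=1e-6):
    """Classify the relationship between two objectives from their
    parameter gradients: positive inner product -> aligned, near
    zero -> orthogonal, negative -> in-conflict."""
    ip = float(np.dot(g1, g2))
    if ip > tol:
        return "aligned"
    if ip < -tol:
        return "in-conflict"
    return "orthogonal"

# Toy gradients for two reward objectives
g_helpful = np.array([1.0, 0.5, 0.0])
g_safety_same = np.array([0.8, 0.2, 0.1])    # points the same way
g_safety_opposed = np.array([-1.0, -0.5, 0.0])  # points the opposite way

print(classify_reward_pair(g_helpful, g_safety_same))     # aligned
print(classify_reward_pair(g_helpful, g_safety_opposed))  # in-conflict
```

In practice the classification is noisy for a single minibatch, so these inner products are usually averaged or tracked across training steps.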

2. Quantitative Scoring and Detection Methods

Several quantitative tools exist to reveal and diagnose aligned, orthogonal, and in-conflict rewards:

  • Gradient inner product (detects alignment/conflict): ⟨g_1, g_2⟩ positive (aligned), near zero (orthogonal), negative (conflict) (Lin et al., 29 Sep 2025, Xu et al., 15 Apr 2025, Kim et al., 25 Aug 2025)
  • Covariance/correlation (detects alignment/conflict): sign and magnitude of Cov(R_1, R_2) or Corr(R_1, R_2) under the policy distribution (Vamplew et al., 2024)
  • PACS (detects policy-proxy alignment): the Proxy-Policy Alignment Conflict Score, a local measure of disagreement between proxy-reward preferences and policy preferences (Liu et al., 10 Dec 2025)
  • Kendall-Tau (detects global policy/reward ordering): τ = (n_c − n_d) / (n(n−1)/2), where n_c and n_d count concordant and discordant pairs (Liu et al., 10 Dec 2025)

PACS close to zero marks alignment; large values indicate high local conflict. Near-zero Kendall-Tau points to orthogonality, while highly negative values indicate strong conflict.
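The Kendall-Tau criterion can be sketched directly from its definition, here comparing a reward model's scores against a policy's own preference scores over the same responses (the variable names are illustrative):

```python
from itertools import combinations

def kendall_tau(reward_scores, policy_scores):
    """Kendall-Tau rank correlation between the reward ordering and the
    policy ordering: (concordant - discordant) / total pairs. Near zero
    suggests orthogonality; strongly negative values indicate conflict."""
    concordant = discordant = 0
    for (r_i, p_i), (r_j, p_j) in combinations(zip(reward_scores, policy_scores), 2):
        s = (r_i - r_j) * (p_i - p_j)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(reward_scores)
    return (concordant - discordant) / (n * (n - 1) / 2)

# The reward ranks four responses one way; the policy prefers the reverse.
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0: maximal conflict
```

For large response sets, `scipy.stats.kendalltau` computes the same statistic more efficiently.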

In multi-agent and multi-objective settings, reward relationships are observed both at the per-sample (training instance) and global (policy trajectory) level. Algorithms such as conflict-aware gradient adjustment (Kim et al., 25 Aug 2025), reward consistency filtering (Xu et al., 15 Apr 2025), and orthogonal subspace decomposition (Lin et al., 29 Sep 2025) operationalize these criteria.
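The core move in conflict-aware gradient adjustment is a projection step. A minimal sketch in the style of PCGrad-type methods (the actual FCGrad update of Kim et al. differs in its balancing details):

```python
import numpy as np

def project_conflicting(g1, g2):
    """If g1 conflicts with g2 (negative inner product), remove the
    component of g1 that points against g2, so the adjusted update no
    longer degrades the second objective."""
    ip = np.dot(g1, g2)
    if ip < 0:
        g1 = g1 - (ip / np.dot(g2, g2)) * g2
    return g1

g1 = np.array([1.0, -1.0])   # in conflict with g2 along the second axis
g2 = np.array([0.0, 1.0])
g1_adj = project_conflicting(g1, g2)
print(g1_adj, np.dot(g1_adj, g2))  # conflicting component removed; inner product 0
```

Non-conflicting gradients pass through unchanged, so the adjustment only intervenes in the in-conflict regime.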

3. Consequences for Policy Behavior and Alignment

The nature of reward relationships imposes distinct constraints and risks on the resulting policies and their robustness. By the definitions above, aligned rewards admit joint optimization without trade-offs; orthogonal rewards neither help nor harm one another; and in-conflict rewards force explicit trade-offs and create incentives for hacking the easier axis.

In all settings, targeting high-conflict and orthogonal examples for additional supervision or feedback yields greater alignment improvements per unit supervision (Liu et al., 10 Dec 2025).
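This prioritization amounts to spending a fixed labeling budget on the most conflicted examples first. A minimal sketch, where `conflict_score` stands in for any per-example conflict metric such as PACS (names are illustrative):

```python
def prioritize_for_feedback(examples, conflict_score, budget):
    """Select the `budget` examples with the highest conflict scores
    for additional human supervision."""
    return sorted(examples, key=conflict_score, reverse=True)[:budget]

# Toy examples tagged with a precomputed conflict score
batch = [("prompt A", 0.05), ("prompt B", 0.92), ("prompt C", 0.47)]
selected = prioritize_for_feedback(batch, conflict_score=lambda ex: ex[1], budget=2)
print(selected)  # [('prompt B', 0.92), ('prompt C', 0.47)]
```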

4. Methodological Implications and Resolution Strategies

State-of-the-art approaches to mitigating the impact of non-aligned reward signals include both data-centric and algorithmic methods:

  • Reward Consistency Sampling (RCS): Constructs preference datasets only from training pairs where all objectives agree on the optimal response. By removing non-consistent samples, RCS provably eliminates gradient conflicts (no sample has ⟨g_1, g_2⟩ < 0), guaranteeing that multi-objective updates cannot degrade previously optimized rewards (Xu et al., 15 Apr 2025).
  • Orthogonal Subspace Decomposition (OrthAlign): Sequentially projects gradient steps for multiple preferences into orthogonal subspaces, ensuring non-interference. Each objective receives additive improvement, with provable elimination of destructive cross-term interference and linear stability in parameter updates (Lin et al., 29 Sep 2025).
  • Conflict-Aware Gradient Adjustment (FCGrad): Adjusts the joint policy update to resolve in-conflict situations via projection steps and balances non-conflict cases using convex combinations. Theoretical results guarantee monotonic joint improvement and asymptotic fairness (Kim et al., 25 Aug 2025).
  • Selective Conflict-Aware Human Feedback (SHF-CAS): Targets the most high-conflict (in-conflict and orthogonal) prompt–completion pairs, as detected by PACS and Kendall-Tau, for additional human supervision, achieving much higher alignment efficiency than random or naive sampling (Liu et al., 10 Dec 2025).
  • Probabilistic Reward Aggregation (LSC transformation): Transforms and aggregates reward models such that the combined score reflects the (log-)probability of achieving all desired properties, with monotonicity and robustness to reward hacking (Wang et al., 2024).
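The reward-consistency filter is the simplest of these to sketch: keep only preference pairs on which every reward model agrees. This is a minimal illustration of the idea, not the exact RCS procedure of Xu et al. (the toy reward models below are placeholders):

```python
def consistent_pairs(pairs, reward_models):
    """Keep only (chosen, rejected) pairs on which every reward model
    scores `chosen` strictly above `rejected` -- the consistency criterion
    that rules out per-sample gradient conflicts."""
    return [
        (chosen, rejected)
        for chosen, rejected in pairs
        if all(rm(chosen) > rm(rejected) for rm in reward_models)
    ]

# Toy reward models scoring strings by two different proxies
helpfulness = len                  # longer = more helpful (toy proxy)
safety = lambda s: -s.count("!")   # fewer exclamations = safer (toy proxy)

pairs = [("a detailed calm answer", "ok!"), ("hi!!!", "a short reply")]
print(consistent_pairs(pairs, [helpfulness, safety]))
```

The second pair is dropped because the two proxies disagree about which response is better, so training on it would push the two objectives in opposite directions.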

Method choice is driven by application context, efficiency requirements, and the degree of reward model bias or incompleteness.
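The subspace projection underlying approaches like OrthAlign can be sketched as sequential Gram-Schmidt-style projection of each new objective's gradient onto the orthogonal complement of earlier update directions (a simplified illustration; the published procedure differs in its details):

```python
import numpy as np

def orthogonal_update(g_new, previous_updates):
    """Project a new objective's gradient so it is orthogonal to all
    earlier update directions, preventing the new step from undoing
    progress already made on previous objectives."""
    for u in previous_updates:
        g_new = g_new - (np.dot(g_new, u) / np.dot(u, u)) * u
    return g_new

u1 = np.array([1.0, 0.0, 0.0])       # update direction used for objective 1
g2 = np.array([0.7, 0.7, 0.0])       # raw gradient for objective 2
g2_orth = orthogonal_update(g2, [u1])
print(g2_orth, np.dot(g2_orth, u1))  # component along u1 removed; inner product 0
```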

5. Empirical and Theoretical Findings Across Domains

A consistent set of findings emerges across LLM alignment, multi-agent cooperation, and cyber defense:

  • On safety and helpfulness alignment tasks, conventional reward-model fine-tuning disproportionately reinforces already-aligned behavior and ignores unseen failure modes unless high-conflict (including orthogonal) instances are prioritized for feedback (Liu et al., 10 Dec 2025).
  • In autonomous cyber defense, only sparse, goal-aligned reward functions yield robust, low-risk policies; dense or in-conflict shaping rewards reliably lead to worse defender outcomes and high variance/risk (Bates et al., 4 Feb 2026).
  • In multi-agent RL, adaptive conflict-aware updates guarantee joint monotonic improvement and fairness, with empirical balancing of collective and individual objectives even in strongly mixed-motive environments (Kim et al., 25 Aug 2025).
  • In LLM chain-of-thought optimization, aligned supervision on reasoning steps enhances monitorability and final task performance, whereas in-conflict constraints cause models to hide their true computations, defeating oversight (Kaufmann et al., 31 Mar 2026).
  • Reward Consistency Sampling and orthogonal subspace approaches deliver up to 13–20 point improvements in aggregate trade-off metrics, outperforming naive data mixing or weighted sums (Xu et al., 15 Apr 2025, Lin et al., 29 Sep 2025).
  • Across all settings, explicit detection and remediation of in-conflict and orthogonal relationships is essential for robustness, safe deployment, and efficient human supervision.

6. Practical Guidelines for Reward and Data Design

Best practices for constructing aligned multi-objective systems follow from empirical and theoretical analyses:

  1. Prefer sparse, outcome-linked, goal-aligned rewards over engineered dense signals to avoid inadvertent policy drift (Bates et al., 4 Feb 2026).
  2. Proactively audit reward relationships via gradient inner products, empirical covariances, or local ranking correlations to surface orthogonal and in-conflict axes (Liu et al., 10 Dec 2025, Vamplew et al., 2024, Lin et al., 29 Sep 2025).
  3. In data-centric pipelines, enforce reward consistency during sample selection to eliminate gradient conflicts and ensure non-destructive updates (Xu et al., 15 Apr 2025).
  4. Use targeted sampling or feedback prioritization to resolve high-conflict or shared-ignorance instances and discover new failure domains (Liu et al., 10 Dec 2025).
  5. Adopt orthogonal update or subspace decomposition architectures when parameter-level interference among objectives is likely (Lin et al., 29 Sep 2025).
  6. Avoid over-optimizing on in-conflict reward axes—especially on non-transparent or adversarial objectives—to prevent collapse of interpretability and control (Kaufmann et al., 31 Mar 2026).
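The covariance audit in guideline 2 can be run on logged per-sample reward values. A minimal sketch, assuming a matrix with one row per sample and one column per reward objective:

```python
import numpy as np

def audit_reward_relationships(reward_samples):
    """Correlation matrix over per-sample reward components. Strongly
    positive off-diagonal entries suggest aligned axes, near-zero
    orthogonal axes, and negative entries in-conflict axes."""
    return np.corrcoef(reward_samples, rowvar=False)

rng = np.random.default_rng(0)
r1 = rng.normal(size=500)
r2 = r1 + 0.1 * rng.normal(size=500)    # nearly aligned with r1
r3 = -r1 + 0.1 * rng.normal(size=500)   # in conflict with r1
print(np.round(audit_reward_relationships(np.column_stack([r1, r2, r3])), 2))
```

Such an audit is cheap relative to training and surfaces in-conflict axes before they cause optimization failures.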

Ultimately, the formal and operational distinctions between aligned, orthogonal, and in-conflict rewards provide a principled framework for designing, analyzing, and troubleshooting multi-objective learning systems in both single-agent and multi-agent AI (Liu et al., 10 Dec 2025, Xu et al., 15 Apr 2025, Bates et al., 4 Feb 2026, Kaufmann et al., 31 Mar 2026).
