
Complementary Reward Functions

Updated 10 February 2026
  • Complementary reward functions are designed to integrate multiple, heterogeneous reward signals in reinforcement learning, resolving ambiguities and mitigating issues like reward hacking.
  • They utilize techniques such as multi-objective RL, recursive aggregation, and distributional modeling to derive scalar signals that enhance policy robustness and safety.
  • Their strategic integration improves generalization and adaptability in complex tasks, with applications in multi-objective control, human-in-the-loop systems, and curriculum learning.

Complementary reward functions are a foundational concept in modern reinforcement learning (RL) and reward modeling, encompassing the design, combination, and utilization of multiple, often heterogeneous, reward sources to achieve robust, generalizable, and interpretable agent behaviors. The principle of complementarity manifests across multi-objective RL, human-in-the-loop reward specification, curriculum learning, and recent advances in reward modeling for LLMs. Complementarity is achieved by constructing, integrating, or learning reward functions such that their joint use resolves ambiguities, mitigates pathologies (e.g., reward hacking), and supports adaptive or out-of-distribution generalization.

1. Mathematical Formalisms for Complementary Reward Functions

Complementary reward functions frequently appear as vector-valued or multi-source reward models that are then composed, aggregated, or otherwise integrated to yield scalar signals suitable for learning:

  • Multi-objective RL: The standard setup defines the environment reward as a $k$-dimensional vector:

$\mathbf r(s_t, a_t) = \begin{bmatrix} r_1(s_t,a_t) \\ r_2(s_t,a_t) \\ \vdots \\ r_k(s_t,a_t) \end{bmatrix}$

Scalarization is performed via nonnegative weights $\mathbf w \in \mathbb R^k$ (with $\sum_i w_i = 1$, $w_i \geq 0$), yielding $r_{\mathbf w}(s_t, a_t) = \mathbf w^\top \mathbf r(s_t, a_t)$ and producing policies $\pi(s, \mathbf w)$ that adapt to any $\mathbf w$ in the convex hull (Friedman et al., 2018).
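As a concrete illustration, the scalarization step can be sketched in a few lines; the reward components and weight vectors below are hypothetical examples, not values from the cited work:

```python
import numpy as np

def scalarize(reward_vec, w):
    """Scalarize a k-dimensional reward with nonnegative weights summing to 1."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0), "w must lie on the simplex"
    return float(w @ np.asarray(reward_vec, dtype=float))

# Example: two objectives (say, speed and energy cost) under two preferences.
r = [1.0, -0.5]
fast = scalarize(r, [0.9, 0.1])    # prioritize speed  -> 0.85
frugal = scalarize(r, [0.2, 0.8])  # prioritize energy -> -0.2
```

A policy $\pi(s, \mathbf w)$ then simply takes the weight vector as an additional input, so one network can serve the whole family of scalarized objectives.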

  • Recursive and algebraic aggregation: Alternative aggregation operators $\oplus$—such as discounted max, Sharpe ratio, or custom statistics—recursively combine reward streams, generalizing the Bellman recursion:

$G_t = \mathrm{fold}_\oplus\,[r_t, r_{t+1}, \ldots]$

This semiring-based formalism enables policies to optimize for peak, robust, or risk-sensitive objectives without changing the structure of the MDP (Tang et al., 11 Jul 2025).
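The fold-based recursion can be sketched with ordinary closures; `disc_sum` recovers the standard discounted return, while `disc_max` is a peak-seeking variant. The operator names and the example stream are illustrative, not taken from the cited work:

```python
def fold_aggregate(rewards, op, init):
    """Right-fold a reward stream with aggregation operator `op`,
    generalizing the Bellman recursion G_t = op(r_t, G_{t+1})."""
    g = init
    for r in reversed(rewards):
        g = op(r, g)
    return g

gamma = 0.9
disc_sum = lambda r, g: r + gamma * g      # standard discounted return
disc_max = lambda r, g: max(r, gamma * g)  # discounted-max (peak-seeking) return

rs = [0.0, 5.0, 1.0]
# disc_sum: 0 + 0.9*(5 + 0.9*1) = 5.31
# disc_max: max(0, 0.9*max(5, 0.9*1)) = 4.5
```

Swapping the operator changes which statistic of the reward stream the agent optimizes, without touching the underlying MDP.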

  • Distributional RL for multi-dimensional returns: The return is modeled as a distribution over $\mathbb R^N$, capturing both marginal statistics and cross-reward dependencies (Zhang et al., 2021).
  • Additive and composite forms: Domains often explicitly construct total rewards as additive or weighted combinations of base rewards, e.g., $R_{\text{total}} = R_{\text{task}} + \lambda R_{\text{style}}$ for task and style (Escontrela et al., 2022).
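The distributional view hinges on comparing sets of vector-valued return samples. Below is a minimal sketch of a biased squared-MMD estimate under an RBF kernel, the kind of discrepancy used to match rollout and target return distributions; the function name, bandwidth, and sample sets are illustrative assumptions:

```python
import numpy as np

def mmd2_rbf(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between sample sets X, Y (rows are
    samples in R^N) under the RBF kernel exp(-||x-y||^2 / (2 sigma^2))."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Identical sample sets have zero discrepancy; shifted sets do not.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
```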

2. Algorithmic Integration and Learning Approaches

Complementary reward functions require specific algorithmic mechanisms for effective training and policy evaluation:

  • Experience replay with reward augmentation: Given closed-form access to $\mathbf r(s, a)$ and a weight distribution $p(\mathbf w)$, transitions can be re-labeled with alternative scalarizations, populating the buffer with diverse reward-weight tuples $(s, a, \mathbf w, r, s')$ (Friedman et al., 2018).
  • Recursive aggregator architectures: Q-learning and policy gradient methods are extended to integrate recursive aggregation operators, e.g., using Bellman-like targets $\mathrm{upd}(r, \max_{a'} Q(s', a'))$. Actor-critic methods use analogous critics, updating on post-processed statistics (e.g., Sharpe-ratio aggregates) (Tang et al., 11 Jul 2025).
  • Distributional joint modeling: Neural models output samples or parameters for the full joint return distribution, and training minimizes metrics such as maximum mean discrepancy (MMD) between rollout and Bellman targets, capturing cross-reward dependencies (Zhang et al., 2021).
  • Bi-level and curriculum optimization: In behavior alignment and curriculum RL, outer-loop optimization selects or tunes the blending/transition between complementary reward terms (e.g., primary/auxiliary, dense/sparse, simple/complex) to maximize performance under a fixed or original objective, often using implicit differentiation for scalability (Gupta et al., 2023, Freitag et al., 2024).
  • Adversarial and learned reward priors: Complementary style or regularization rewards are trained adversarially (e.g., via motion discriminators), then combined additively with task objectives during policy optimization (Escontrela et al., 2022).
  • Preference-based reward repair and human integration: Corrections to proxy reward functions are learned from human preference feedback, with flexible additive corrections (e.g., a neural $\Delta r$) applied only where the proxy and the human's latent reward diverge (Hatgis-Kessell et al., 14 Oct 2025).
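The reward-augmentation mechanism from the first bullet can be sketched as a buffer-relabeling pass. All names here (`relabel`, `sample_simplex`) and the uniform-Dirichlet weight sampler are illustrative choices, not the cited method's exact procedure:

```python
import numpy as np

def relabel(buffer, sample_w, copies=2):
    """Augment a replay buffer: each transition (s, a, r_vec, s') is re-labeled
    with `copies` freshly sampled weight vectors, yielding (s, a, w, w.r_vec, s')."""
    out = []
    for (s, a, r_vec, s_next) in buffer:
        for _ in range(copies):
            w = sample_w()
            out.append((s, a, w, float(np.dot(w, r_vec)), s_next))
    return out

def sample_simplex(k=2):
    """Uniform sample from the k-simplex (Dirichlet with all ones)."""
    return np.random.dirichlet(np.ones(k))

buffer = [("s0", "a0", np.array([1.0, -0.5]), "s1")]
augmented = relabel(buffer, sample_simplex)  # two relabeled copies per transition
```

Each stored transition thus trains the policy on many scalarizations at once, which is what lets a single policy generalize across the weight space.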

3. Theoretical Guarantees and Desiderata

Rigorous desiderata and theoretical results guide the combination and use of complementary reward functions:

| Desideratum | Description | Supported By |
| --- | --- | --- |
| Support on independent features (D1) | The combined distribution should support all plausible per-feature combinations of sources | MIRD-IF (Krasheninnikov et al., 2021) |
| Tradeoff/intermediate support (D2) | Mixture reward should support all convex combinations of source-induced tradeoffs | MIRD, MIRD-IF |
| Behavioral informativeness (D3) | Agreed-upon behaviors should remain enforced; combinations do not alter them | MIRD |
| Balance between input behaviors (D4) | Diverse source behaviors should have nonzero likelihood or option value in the combination | MIRD, MIRD-IF |
| Worst-case regret bounds | The minimum expected return is at least as good as the worst of the inputs, for any true reward | MIRD, MIRD-IF |
| Monotonic convergence and contraction | Aggregator Bellman operators contract in suitable metrics, guaranteeing unique solutions | (Zhang et al., 2021, Tang et al., 11 Jul 2025) |
| Preference update informativeness | Greedy information gain for query selection yields near-optimal learning and outperforms naive DemPref | (Bıyık et al., 2020) |

These guarantees ensure that policies aligned to complementary reward specifications are robust to input misspecification, generalize across tradeoffs, and optimize well-defined regret criteria (Krasheninnikov et al., 2021, Friedman et al., 2018).

4. Empirical Strategies and Case Studies

Complementary reward functions have been systematically evaluated across benchmark tasks and real-world scenarios, often demonstrating substantial empirical advantages relative to non-complementary or naive approaches:

  • Multi-objective RL with HER: Policies trained via multi-weight HER generalize over the convex hull of reward weights and exhibit near-optimal tradeoff curves in double integrator tasks (Friedman et al., 2018).
  • Recursive aggregation: Sharpe-ratio and max-aggregation produce policies with distinct stability, robustness, and exploratory properties in deep RL benchmarks (MuJoCo, gridworlds, S&P500 portfolio) (Tang et al., 11 Jul 2025).
  • Curriculum learning for complex rewards: Two-stage reward curricula, starting with simple sub-rewards and phasing in more stringent or adversarial terms, both overcome reward local minima and achieve balanced constraint/task solutions (Freitag et al., 2024).
  • Adversarial style rewards: Learned AMP style rewards, when combined with even simple task objectives, yield significant improvements in policy naturalness, sim-to-real transfer, and energy efficiency vs. hand-crafted multi-term rewards (Escontrela et al., 2022).
  • Multi-source and human feedback: Preference-based reward repair achieves near-oracle performance with orders of magnitude fewer preferences than learning from scratch, specifically correcting only critical misaligned transitions present in the proxy (Hatgis-Kessell et al., 14 Oct 2025). Integrated demonstration + preference learning (DemPref) accelerates reward discovery and policy alignment (Bıyık et al., 2020).
  • Complementary reward modeling in LLMs: Joint, multi-objective (regression) and single-objective (Bradley-Terry) reward models trained on shared embeddings achieve higher reward-modeling accuracy, greater resistance to OOD reward hacking, and effective transfer from modestly-sized multi-attribute datasets to large-scale policy improvements (Zhang et al., 10 Jul 2025).

5. Practical Design and Implementation Considerations

Principles and recommendations for leveraging complementary reward functions include:

  • Reward decomposition: Explicitly enumerate component objectives, normalize their scales, and use empirical preference inequalities to select weights or curriculum stages (Freitag et al., 2024).
  • Buffer and data management: Store all reward components and/or features to permit flexible re-sampling, off-policy finetuning, or reward function repairs as training progresses (Freitag et al., 2024, Hatgis-Kessell et al., 14 Oct 2025).
  • Regularization and correction targeting: Focus corrections or regularization (e.g., the three-term PBRR loss) on misaligned transitions, leaving already-correct behavior unchanged (Hatgis-Kessell et al., 14 Oct 2025).
  • Human-in-the-loop prioritization: Use targeted, information gain–maximizing queries, and stagger demonstrations and preference collection (demonstrations first) for optimal sample efficiency (Bıyık et al., 2020).
  • Aggregation operator selection: Treat the choice of aggregation as a design hyperparameter, balancing expressivity against statistical and computational tractability; recursive forms enable flexible alignment but may demand larger state or statistic representations (Tang et al., 11 Jul 2025).
  • Ablative control: Carefully evaluate the impact of each component via ablations; removal or substitution of complementary terms often results in catastrophic reward exploitation, poor transfer, or degenerate local optima (Escontrela et al., 2022, Freitag et al., 2024, Gupta et al., 2023).
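The scale-normalization step in the reward-decomposition recommendation can be sketched with a per-component running-variance tracker (Welford's online algorithm); the class and its interface are hypothetical:

```python
import numpy as np

class ComponentNormalizer:
    """Track a running scale per reward component and normalize before blending,
    so no single term dominates purely through its units."""
    def __init__(self, k, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, np.zeros(k), np.zeros(k), eps

    def update(self, r_vec):
        # Welford's online update of per-component mean and sum of squared deviations.
        self.count += 1
        delta = r_vec - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r_vec - self.mean)

    def normalize(self, r_vec):
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return r_vec / std

norm = ComponentNormalizer(k=2)
for r in [np.array([100.0, 0.1]), np.array([300.0, 0.3]), np.array([200.0, 0.2])]:
    norm.update(r)
scaled = norm.normalize(np.array([200.0, 0.2]))  # both components now comparable
```

After normalization, the two components—originally three orders of magnitude apart—land on the same scale, so a blending weight reflects preference rather than units.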

6. Limitations and Open Directions

Despite their demonstrated utility, complementary reward functions face several limitations in their design, learning, and deployment:

  • Expressiveness constraints: Many methods are limited to linear, additive, or recursive aggregation (e.g., not median or quantile without approximation) (Tang et al., 11 Jul 2025).
  • Scalability in high-dimensional spaces: Both optimization and posterior inference become computationally intensive as the number of sources or features expands (e.g., MIRD-IF and deep inverse RL settings) (Krasheninnikov et al., 2021).
  • Dependency on closed-form rewards: Generalization across weightings or aggregation requires closed-form or at least easily computable base reward functions (Friedman et al., 2018).
  • Data bottlenecks in joint modeling: Success of regression-based multi-attribute modeling is conditioned on the availability of sufficient fine-grained human feedback or attribute annotations (Zhang et al., 10 Jul 2025).
  • Design of reference policies and proxies: Effectiveness of preference-based repair or curriculum learning depends on the existence of non-catastrophic or decent baseline policies/rewards (Hatgis-Kessell et al., 14 Oct 2025).

Future research directions include the automated or meta-learned discovery of new aggregation operators, extension to non-linear or data-driven composition rules, more scalable inference in high-dimensional behavior and reward spaces, and deeper integration with distributional or uncertainty-aware RL paradigms (Tang et al., 11 Jul 2025, Krasheninnikov et al., 2021, Zhang et al., 10 Jul 2025).

7. Significance and Ongoing Impact

The use of complementary reward functions enables RL agents and learning systems to achieve simultaneously improved safety, robustness, and domain-specific adaptability. This paradigm underpins advances in multi-objective control, preference-based and human-in-the-loop RL, practical sim-to-real transfer, and high-stakes domains such as LLM alignment. Complementarity is characteristic of systems that must reason over uncertain, misspecified, or composite objectives, and is crucial for mitigating vulnerabilities such as reward hacking or overfitting to narrow, single-purpose signals.

The ongoing refinement of algorithms and theoretical frameworks for complementary rewards continues to shape both the scientific and practical frontiers of RL and machine learning (Friedman et al., 2018, Tang et al., 11 Jul 2025, Hatgis-Kessell et al., 14 Oct 2025, Krasheninnikov et al., 2021, Freitag et al., 2024, Escontrela et al., 2022, Zhang et al., 10 Jul 2025, Zhang et al., 2021, Bıyık et al., 2020, Gupta et al., 2023).
