Multi-Objective Reward Function

Updated 22 December 2025
  • Multi-objective reward function is a vector-valued mapping that preserves individual objectives like safety, efficiency, and user preferences without reducing them to a single scalar.
  • Deep reinforcement learning architectures condition policies on preference vectors to interpolate efficiently across trade-offs and support robust off-policy learning.
  • Nonlinear scalarization methods enable exploration of non-convex Pareto fronts while preference learning and reward normalization mitigate reward hacking and balance competing objectives.

A multi-objective reward function is a vector-valued mapping from environment states and actions to $\mathbb{R}^m$, where $m$ is the number of objectives of interest. Instead of collapsing multiple competing desiderata, such as task success, efficiency, safety, or user preferences, into a single scalar, the multi-objective reward maintains explicit structure across objectives. In reinforcement learning and sequential decision-making, multi-objective reward functions are foundational for expressing tasks with inherent trade-offs, supporting user-customizable behaviors, and enabling robust learning in environments where objectives are in tension or evolve dynamically. Modern approaches leverage these vector-valued rewards to enable explicit preference weighting, generalization across trade-off specifications, and direct exploration of the Pareto front over possible returns.

1. Mathematical Formulation and Scalarization

Let $r_t \in \mathbb{R}^m$ denote the vector reward observed at time $t$, with $m$ objectives. Preferences among objectives are encoded by a weight vector $w = (w_1, \dots, w_m)^\top$, where $w_i \geq 0$ and $\sum_i w_i = 1$. The scalarized reward is given by

$$R_w(s_t, a_t) = w^\top r_t.$$

This linear scalarization reduces the multi-objective problem to a conventional scalar-reward MDP for fixed $w$, allowing standard RL techniques to be employed (Friedman et al., 2018, Kusari et al., 2019).
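For concreteness, the linear scalarization above amounts to a single dot product between the preference vector and the vector reward; the objective names and numerical values in the sketch below are hypothetical.

```python
import numpy as np

def linear_scalarize(reward_vec: np.ndarray, weights: np.ndarray) -> float:
    """Collapse a vector reward r_t in R^m into a scalar via w^T r_t."""
    assert weights.shape == reward_vec.shape and np.isclose(weights.sum(), 1.0)
    return float(weights @ reward_vec)

# Hypothetical three-objective reward (task success, efficiency, safety) and preference.
r_t = np.array([1.0, -0.2, 0.5])
w = np.array([0.5, 0.2, 0.3])        # non-negative, sums to 1 (unit simplex)
print(linear_scalarize(r_t, w))      # 0.5*1.0 + 0.2*(-0.2) + 0.3*0.5 = 0.61
```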

The Pareto front of achievable vector returns arises by varying $w$ over the unit simplex. However, non-linear scalarizations such as lexicographic ordering, thresholded approaches (Dornheim, 2022, Rustagi et al., 13 Feb 2025), Chebyshev/Tchebycheff max-weighted objectives (Qiu et al., 24 Jul 2024), or Nash Social Welfare (Fan et al., 2022) provide additional expressivity, enabling coverage of non-convex Pareto regions and direct incorporation of fairness or context.

2. Deep RL Architectures and Policy Conditioning

Deep multi-objective RL architectures condition both actor and critic networks on the current preference vector $w$. For example, in MO-HER (Multi-Objective Hindsight Experience Replay), actors $\pi(s, w)$ and critics $Q(s, a, w)$ concatenate state and weight inputs, supporting interpolation over arbitrary trade-offs during inference (Friedman et al., 2018).

This weight conditioning enables a single network to generalize across the entire weight simplex, offering off-policy efficiency and continuous-action compatibility. Data augmentation by sampling multiple $w$ per transition and relabeling the corresponding experience provides shared gradient signals over diverse objectives without additional environment interaction.
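A minimal sketch of weight conditioning and weight relabeling, assuming a PyTorch-style critic; the layer sizes, Dirichlet weight sampling, and batch layout are illustrative choices, not the MO-HER implementation.

```python
import torch
import torch.nn as nn

class WeightConditionedCritic(nn.Module):
    """Q(s, a, w): state, action, and preference weights are concatenated at the input."""
    def __init__(self, state_dim, action_dim, n_objectives, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, w):
        return self.net(torch.cat([s, a, w], dim=-1))

def relabel_with_random_weights(batch, n_objectives):
    """Data augmentation: pair each stored transition with a freshly sampled preference
    vector and recompute the scalar target w^T r from the stored vector reward."""
    s, a, r_vec, s_next = batch                                        # r_vec: (B, m)
    w = torch.distributions.Dirichlet(torch.ones(n_objectives)).sample((r_vec.shape[0],))
    r_scalar = (w * r_vec).sum(dim=-1, keepdim=True)                   # (B, 1)
    return s, a, w, r_scalar, s_next
```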

For tasks where only a subset of objectives is relevant at a given time, stage-wise or contextual approaches segment policies by regime (e.g., per-maneuver stages in acrobatic robots), each with its own reward and cost definitions and, potentially, context-driven lexicographic or prioritization orderings (Kim et al., 24 Sep 2024, Rustagi et al., 13 Feb 2025).

3. Sample-Efficient Generalization and Interpolation

The optimal value function under a weighted-sum multi-objective reward, $V^*(s; w)$, is Lipschitz and differentiable in $w$ under standard MDP regularity (Kusari et al., 2019). This facilitates generalization techniques: once a policy or value function is trained under multiple weightings, interpolation (e.g., via Gaussian processes) can predict optimal value functions for novel preferences, circumventing costly retraining for each $w$.
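A hedged sketch of the interpolation idea, using scikit-learn's Gaussian process regressor for a two-objective problem; the trained weightings and value estimates below are placeholders, not results from the cited work.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Suppose V*(s0; w) has been estimated by separate training runs at a few weightings.
trained_weights = np.array([[1.0, 0.0], [0.7, 0.3], [0.5, 0.5], [0.2, 0.8], [0.0, 1.0]])
estimated_values = np.array([4.1, 3.6, 3.0, 2.2, 1.5])     # placeholder values

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), normalize_y=True)
gp.fit(trained_weights, estimated_values)

# Predict the optimal value at a novel preference without retraining a policy.
v_pred, v_std = gp.predict(np.array([[0.35, 0.65]]), return_std=True)
```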

Algorithms such as multi-objective GPSARSA (Ultes et al., 2017), MO-DQN with reward and discount-factor generalization (Cornalba et al., 2022), and similar deep RL variants exploit these principles, relabelling data with randomly sampled $w$ and fitting a weight-conditioned $Q$- or value network.

In the LLM alignment domain, preference-parameterized reward models (e.g., ArmoRM+MoE (Wang et al., 18 Jun 2024)) and complementary Bradley-Terry (BT) and multi-objective regression frameworks (Zhang et al., 10 Jul 2025) produce interpretable multi-attribute scores, support gating or mixture mechanisms across objectives, and explicitly mitigate reward hacking by decomposing feedback into fine-grained, human-relevant axes.
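A schematic of the attribute-scores-plus-gating pattern described above, assuming a precomputed feature vector (e.g., a pooled LLM hidden state); this is a generic sketch of the idea, not the ArmoRM architecture itself.

```python
import torch
import torch.nn as nn

class MultiAttributeRewardModel(nn.Module):
    """Produces m interpretable attribute scores plus a gated scalar combination."""
    def __init__(self, feat_dim, n_attributes, hidden=128):
        super().__init__()
        self.attribute_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_attributes))
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_attributes))

    def forward(self, features):
        scores = self.attribute_head(features)                 # per-objective scores (interpretable)
        weights = torch.softmax(self.gate(features), dim=-1)   # context-dependent mixing weights
        return (weights * scores).sum(dim=-1), scores          # scalar reward and attribute vector
```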

4. Nonlinear Scalarization and Pareto Optimality

Linear scalarization cannot traverse non-convex Pareto regions. Nonlinear approaches such as generalized Thresholded Lexicographic Ordering (gTLO) (Dornheim, 2022), contextual lexicographic RL (Rustagi et al., 13 Feb 2025), and Chebyshev/Tchebycheff scalarization (Qiu et al., 24 Jul 2024) are required.

For lexicographically ordered specifications, the agent enforces minimum thresholds on prioritized objectives, subsequently optimizing lower-priority rewards only within the admissible action set. This structure accommodates context-dependent priority orderings and can be extended via Bayesian inference of context mappings from demonstrations.
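A minimal sketch of thresholded-lexicographic action selection under stated assumptions: per-objective Q estimates are already available, and the priority order and slack thresholds below are hypothetical.

```python
import numpy as np

def lexicographic_action(q_values, priority, slacks):
    """q_values: (n_actions, m) per-objective Q estimates. For each objective in
    priority order, keep only actions within `slack` of the best admissible value,
    then break remaining ties using the next objective."""
    admissible = np.arange(q_values.shape[0])
    for obj, slack in zip(priority, slacks):
        q_obj = q_values[admissible, obj]
        admissible = admissible[q_obj >= q_obj.max() - slack]
    return int(admissible[0])

# Safety (objective 0) strictly prioritized over efficiency (objective 1).
q = np.array([[0.90, 0.10],
              [0.88, 0.70],
              [0.40, 0.95]])
print(lexicographic_action(q, priority=[0, 1], slacks=[0.05, 0.0]))   # -> action 1
```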

Tchebycheff scalarization,

$$\mathrm{TCH}_\lambda(\pi) = \max_{i} \lambda_i \left(V_i^* - V_i^\pi\right),$$

with strictly positive $\lambda$, traverses the full set of weakly Pareto-optimal policies and avoids the degeneracies of linear scalarization, yielding provably efficient algorithms for finding all Pareto solutions with $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity (Qiu et al., 24 Jul 2024).
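An illustrative evaluation of the Tchebycheff objective for a fixed policy; the ideal point and policy returns below are placeholder numbers.

```python
import numpy as np

def tchebycheff(v_star, v_pi, lam):
    """TCH_lambda(pi) = max_i lambda_i * (V_i^* - V_i^pi); lower is better."""
    assert np.all(lam > 0), "lambda must be strictly positive"
    return float(np.max(lam * (v_star - v_pi)))

v_star = np.array([10.0, 8.0])         # ideal per-objective returns
v_pi = np.array([9.0, 5.0])            # returns of a candidate policy
lam = np.array([0.6, 0.4])
print(tchebycheff(v_star, v_pi, lam))  # max(0.6*1.0, 0.4*3.0) = 1.2
```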

Welfare-based scalarizations (Nash social welfare, proportional fairness) provide fairness and scale invariance, but optimization is computationally harder and generally requires non-stationary or bi-level policy selection (Fan et al., 2022).

5. Preference Learning, Human Feedback, and Robustness

Direct specification of objective weights $w$ is often impractical; instead, preferences are inferred from human feedback using pairwise comparisons, demonstrations, or absolute ratings. Reward modeling frameworks (e.g., ArmoRM, Pb-MORL) build vector-valued reward models aligned to underlying human or task-driven trade-offs via regression and Bradley-Terry likelihoods (Wang et al., 18 Jun 2024, Zhang et al., 10 Jul 2025, Mu et al., 18 Jul 2025).

Active preference elicitation (e.g., MORAL (Peschl et al., 2021)) adaptively queries for the most informative pairwise comparisons, maintaining Bayesian posteriors over $w$ and updating the policy using the expected scalarized reward. Unified BT and regression heads in reward models are complementary: attribute-vector heads encourage fine-grained improvement and block reward hacking, while pairwise heads provide robustness and scalability when attribute annotation is scarce (Zhang et al., 10 Jul 2025).
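A simplified sketch of maintaining a posterior over preference weights from pairwise feedback via a Bradley-Terry likelihood, using simple particle reweighting; this illustrates the general mechanism rather than the MORAL algorithm itself, and the trajectory returns are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3                                                   # number of objectives
particles = rng.dirichlet(np.ones(m), size=5000)        # prior samples over the weight simplex
log_post = np.zeros(len(particles))

# One hypothetical query: trajectory A (preferred) vs. trajectory B, with vector returns.
return_a = np.array([5.0, 1.0, 0.0])
return_b = np.array([1.0, 1.0, 4.0])

# Bradley-Terry likelihood of the observed preference under each candidate weight vector:
# P(A preferred over B | w) = sigmoid(w^T (R_A - R_B)).
margin = particles @ (return_a - return_b)
log_post += -np.log1p(np.exp(-margin))                  # log sigmoid(margin)

# Posterior mean preference after reweighting the particles.
post = np.exp(log_post - log_post.max())
w_mean = (post[:, None] * particles).sum(axis=0) / post.sum()
```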

Contextual, stage-wise, or lexicographically varying policies further accommodate shifts in reward structure during operation, adapting dynamically according to environment labels (Kim et al., 24 Sep 2024, Rustagi et al., 13 Feb 2025). Mixture-of-experts gating, as in ArmoRM+MoE, enables context-sensitive weighting of objectives and decomposable, interpretable outputs.

6. Optimization and Reward Balancing: Algorithms and Practical Insights

Optimization of multi-objective reward functions is performed using off-the-shelf RL algorithms with scalarized rewards or specialized multi-objective extensions. Weight-conditioned policy/value function networks can be efficiently trained by relabelling transitions with randomly sampled weights and using data augmentation from the vector reward (Friedman et al., 2018, Ultes et al., 2017, Cornalba et al., 2022).

Advanced RLHF and multi-objective policy algorithms employ explicit normalization of reward scales (e.g., MO-GRPO variance normalization (Ichihara et al., 26 Sep 2025)) to prevent objectives with large variance from dominating gradients and to mitigate reward hacking. Stochastic exploration in the reward-parameter space, meta-gradient or bi-level optimization of reward-component weights (e.g., MORSE (Xie et al., 17 Dec 2025)), and preference-free exploration for reward-free planning (Wu et al., 2020, Qiu et al., 24 Jul 2024) further enhance robustness and efficiency.
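A generic sketch of per-objective scale normalization applied before scalarization; the running-statistics scheme below illustrates the rationale (preventing large-variance objectives from dominating the gradient) and is not the specific MO-GRPO procedure.

```python
import numpy as np

class PerObjectiveNormalizer:
    """Tracks a running variance per reward component (Welford's algorithm) and
    rescales each component so no single objective dominates purely through scale."""
    def __init__(self, n_objectives, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(n_objectives)
        self.m2 = np.zeros(n_objectives)
        self.eps = eps

    def update(self, r_vec):
        self.count += 1
        delta = r_vec - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r_vec - self.mean)

    def normalize(self, r_vec):
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return r_vec / std          # rescale only; preserve sign and relative ordering

# Usage: normalize each objective's scale, then scalarize with the preference weights.
norm = PerObjectiveNormalizer(n_objectives=2)
for r in [np.array([10.0, 0.1]), np.array([30.0, 0.2]), np.array([20.0, 0.15])]:
    norm.update(r)
r_balanced = norm.normalize(np.array([25.0, 0.12]))
```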

Table: Representative Multi-Objective Reward Function Approaches

Method | Scalarization Type | Policy Generalization
Weighted linear sum | Linear | Weight-conditioned network
Chebyshev/Tchebycheff | Nonlinear (max) | Min-max RL, dual optimization
Lexicographic ordering | Nonlinear (priority) | Threshold-conditioned network
Nash SW, log-sum | Nonlinear (concave) | Welfare Q-learning / stationary
Contextual/Stage-wise | Piecewise/Contextual | Stage-conditional multi-policy
Preference learning | Adaptive | Reward model learned from pairwise data

7. Limitations and Research Directions

Multi-objective reward function approaches face inherent scalability challenges as the number of objectives increases, requiring smarter sampling or dimensionality reduction in the simplex (Friedman et al., 2018). Capacity to generalize across the weight space is also fundamentally constrained by the expressivity of the underlying policy/value approximator.

Linear scalarization fails to recover non-convex Pareto regions and may concentrate learning on a subset of objectives. Nonlinear or lexicographic scalarizations demand specialized backup operators and careful policy extraction, and may be sensitive to misspecification or to the presence of sharp trade-off regions.

Interpretability and robustness remain major focuses, particularly in RLHF contexts where reward hacking is a concern (Wang et al., 18 Jun 2024, Zhang et al., 10 Jul 2025, Ichihara et al., 26 Sep 2025). Empirical ablations demonstrate the critical role of decomposable proxies, context-dependent weighting, and explicit normalization, while theoretical work has established both lower and upper bounds for regret, exploration complexity, and convergence to the Pareto front (Wu et al., 2020, Qiu et al., 24 Jul 2024).

Continued advances in efficient preference elicitation, scalable representation, dynamic context adaptation, and theory-guided optimization are central to making multi-objective reward functions effective tools for complex, real-world RL tasks.
