Objective Functions in Reinforcement Learning
- Objective functions in RL are criteria that guide agent behavior, traditionally defined as the expected sum of discounted rewards and extended with lambda-return and error-based formulations.
- Non-cumulative and multi-objective formulations, such as max/min and fairness-driven objectives, require state augmentation or novel Bellman operators to capture performance criteria beyond the discounted sum.
- Recent advances integrate formalism hierarchies, preference-based models, and inference methods to boost expressivity, learnability, and adaptability in complex real-world tasks.
Objective functions in reinforcement learning (RL) specify the quantitative criterion that a policy aims to maximize—these functions determine agent behavior across task domains and underlie the connection between RL, control theory, and probabilistic modeling. While the canonical objective is the expected (possibly discounted) sum of immediate rewards, a wide array of variants—including non-cumulative, multi-objective, survival-oriented, fairness-driven, and temporal logic-based definitions—has been developed to capture richer forms of agent performance. The design, expressivity, and learnability of RL objectives are now central themes in both theoretical and applied research.
1. Classical and Generalized Objective Function Definitions
The standard formulation of the RL objective is the expected sum of discounted rewards, $J(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\pi$ is the policy, $r_t$ is the reward at time $t$, and $\gamma$ is the discount factor. This objective is equivalent to maximizing the expected discounted state visitation frequency weighted by expected immediate reward (Yang, 2023).
Generalizations include:
- $\lambda$-Return Objectives: Incorporate eligibility traces (geometric weights in $\lambda$ over multi-step returns), unifying multi-step return averaging as in TD($\lambda$) (Yang, 2023).
- Error-based Objectives: More generally, use an arbitrary baseline to transform the standard objective into one expressed in terms of TD errors. This reveals connections to GAE (Generalized Advantage Estimation) and underlies surrogate objectives for policy optimization (Yang, 2023).
These general forms admit on-policy, off-policy, actor-critic, and $\lambda$-return techniques as special cases of a single encompassing mathematical framework; a minimal numerical sketch of the corresponding return computations appears below.
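To make the classical and $\lambda$-return objectives concrete, here is a minimal numerical sketch (plain Python/NumPy; the reward sequence, value estimates, and the $\gamma$, $\lambda$ settings are illustrative assumptions, not taken from the cited work) that computes the discounted return and the recursively defined $\lambda$-return for a single trajectory.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Classical objective for one trajectory: sum_t gamma^t * r_t."""
    return float(sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards)))

def lambda_returns(rewards, values, gamma, lam):
    """Backward recursion for the lambda-return:
    G_t^lambda = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}^lambda)."""
    G = values[-1]                      # bootstrap from the final value estimate
    out = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * G)
        out[t] = G
    return out

# Illustrative trajectory (hypothetical numbers); values[-1] is the terminal value.
rewards = np.array([0.0, 1.0, 0.5, 2.0])
values = np.array([0.1, 0.4, 0.6, 0.9, 0.0])
print(discounted_return(rewards, gamma=0.99))
print(lambda_returns(rewards, values, gamma=0.99, lam=0.95))
```

Subtracting the value baseline from these targets and expanding the recursion in terms of TD errors recovers GAE-style advantage estimates, which is the error-based view described above.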
2. Non-Standard and Non-Cumulative Objectives
Certain domains naturally require maximization of functionals over the whole reward sequence, with objectives that are not simple sums:
- Non-cumulative objectives: Maximizing functionals of the reward sequence such as $\max_t r_t$, $\min_t r_t$, or the Sharpe ratio
Classical MDPs cannot represent these objectives directly (Nägele et al., 22 May 2024, Cui et al., 2023, Gottipati et al., 2020). To address this, research has developed:
- Reward redefinition with state augmentation: The objective is mapped to a new MDP with augmented state $(s_t, m_t)$, where $m_t$ is a memory variable accumulating trajectory history. Adapted rewards take a telescoping form such as $\tilde r_t = f(m_{t+1}) - f(m_t)$, so that the sum of adapted rewards equals the objective functional applied to the trajectory (Nägele et al., 22 May 2024).
- Generalized Bellman operators: Replacing the summation in the Bellman equation with an operation corresponding to the non-cumulative objective (e.g., min, max, harmonic mean), leading to update rules of the form $Q(s,a) \leftarrow \mathbb{E}\big[\,r \oplus \max_{a'} Q(s',a')\,\big]$, where $\oplus$ denotes the generalized aggregation (Cui et al., 2023).
- Max-reward formulations: Specialized recursive Bellman updates that back up the maximal (rather than cumulative) reward, proven to be contraction mappings under standard conditions (Gottipati et al., 2020).
These approaches make non-cumulative objectives amenable to standard RL techniques, provided sufficient care in state augmentation or operator design, with theoretical results guaranteeing convergence under contraction mappings and practical demonstrations in domains ranging from drug discovery to control and finance (Gottipati et al., 2020, Nägele et al., 22 May 2024).
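As a concrete sketch of the state-augmentation route for the max-reward case (the environment interface `reset()`/`step()` is an assumption, and the wrapper below is a minimal illustration rather than the construction of the cited papers), a memory variable tracks the running maximum and the adapted reward is its telescoping increment, so the episode's cumulative adapted reward equals $\max_t r_t$.

```python
import math

class MaxRewardAugmentation:
    """Wraps a generic environment (assumed interface: reset() -> obs,
    step(a) -> (obs, reward, done)) so that the ordinary cumulative return of
    the adapted rewards equals the non-cumulative objective max_t r_t."""

    def __init__(self, base_env):
        self.base_env = base_env
        self.memory = float("-inf")      # m_t: running maximum of original rewards

    def reset(self):
        obs = self.base_env.reset()
        self.memory = float("-inf")
        return (obs, self.memory)        # augmented state (s_t, m_t)

    def step(self, action):
        obs, r, done = self.base_env.step(action)
        new_memory = max(self.memory, r)             # m_{t+1} = max(m_t, r_t)
        # Telescoping adapted reward: the first step contributes r_0 itself,
        # later steps contribute the increment of the running maximum,
        # so the per-episode sum telescopes to max_t r_t.
        prev = self.memory if math.isfinite(self.memory) else 0.0
        self.memory = new_memory
        return (obs, new_memory), new_memory - prev, done
```

An undiscounted (or nearly undiscounted) cumulative learner trained on the augmented transitions then optimizes the max-reward objective; analogous memory variables (a running minimum, or running first and second moments for Sharpe-type ratios) follow the same pattern.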
3. Multi-Objective, Preference-Based, and Fairness Solutions
In domains with multiple conflicting objectives, the scalar reward paradigm is inadequate:
- Linear scalarization: Weighted sums of per-objective rewards, $r = \sum_i w_i r_i$ for a weight vector $w$, remain the de facto standard but restrict solutions to convex portions of the Pareto front (Friedman et al., 2018, Dornheim, 2022). Deep RL methods now enable a single policy network to generalize over weight vectors, supporting dynamic trade-off adjustment at inference (Friedman et al., 2018, Chang et al., 2022).
- Nonlinear, thresholded, and vector-based objectives: To access non-convex Pareto regions, methods like thresholded lexicographic ordering (TLO) and its deep variants (gTLO) use non-linear action selection and modified value updates, supporting robust coverage of arbitrary Pareto sets even in high-dimensional state-action spaces (Dornheim, 2022).
- Preference-based MORL (Pb-MORL): Rather than relying on explicit reward design, Pb-MORL derives a multi-objective reward model by fitting to pairwise preferences over trajectory segments under varying weights. Theoretically, this enables full exploration of the Pareto frontier provided preferences are consistent and segments are long (Mu et al., 18 Jul 2025). Policy conditioning on the weight vector allows flexible adaptation post-training.
- Fairness and welfare optimization: Objectives such as Nash Social Welfare (geometric mean of vector-valued cumulative rewards) (Fan et al., 2022) or similar fair welfare functions require non-linear scalarizations. These are generally intractable for expected welfare optimization but can be approached using nonlinear Q-learning with non-stationary policies—tracking accumulated rewards across dimensions yields policies that maximize welfare (e.g., proportional fairness) rather than mere sum or weighted sum.
The main implication is that by moving beyond scalar or linearly scalarized objectives—using vector value functions, thresholded modes, and preference or welfare-based models—RL can robustly solve real-world multi-criteria tasks with improved sample-efficiency, adaptability, fairness, and interpretability.
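A small numerical comparison (illustrative numbers only) makes the convexity limitation and the welfare alternative concrete: for a balanced policy lying on the line segment between two extreme policies, no weight vector makes linear scalarization strictly prefer it, whereas the Nash Social Welfare score (geometric mean of the vector return) does.

```python
import numpy as np

# Hypothetical cumulative vector returns of three candidate policies
# over two objectives (e.g., throughput vs. fairness).
policy_returns = {
    "A": np.array([10.0, 1.0]),   # extreme trade-off
    "B": np.array([1.0, 10.0]),   # opposite extreme
    "C": np.array([5.0, 5.0]),    # balanced; C = (A + B) / 2
}

def linear_scalarization(ret, weights):
    return float(np.dot(weights, ret))

def nash_social_welfare(ret):
    # Geometric mean of (positive) per-objective cumulative returns.
    return float(np.prod(ret) ** (1.0 / len(ret)))

weights = np.array([0.5, 0.5])
for name, ret in policy_returns.items():
    print(name,
          "weighted sum:", linear_scalarization(ret, weights),
          "NSW:", round(nash_social_welfare(ret), 3))
# Since C = (A + B) / 2, any nonnegative weighting gives
# w.C = (w.A + w.B) / 2 <= max(w.A, w.B), so linear scalarization never strictly
# prefers the balanced policy; NSW ranks C highest (5.0 vs. about 3.16).
```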
4. Specification Formalisms, Expressivity, and Learnability
Research has extensively analyzed how different ways of specifying RL objectives relate in expressive power and learnability:
- Formalism hierarchy: Eighteen distinct objective-specification paradigms (including Markov rewards, limit average, LTL, reward machines, (non-)linear multi-objective RL, functions from trajectories, and direct policy orderings) have been partially ordered with respect to what agent behaviors they can represent (Subramani et al., 2023). Each formalism has strengths/limitations. No single formalism is maximally expressive and universally practical.
- Markov rewards (MR): Classical, easy to optimize, but restricted—cannot express certain discontinuous orderings or non-Markovian properties.
- Temporal logic objectives: LTL-based (omega-regular) objectives encode rich safety/liveness properties with automata-based reductions that convert satisfaction into reachability, enabling standard model-free RL algorithms and resolving practical issues inherent in Rabin automata (Hahn et al., 2018). A toy product-construction sketch follows this list.
- Nonlinear (inner/outer) and vector objectives: More expressive but harder to optimize and require advanced aggregation or history augmentation (Subramani et al., 2023).
- Preference and policy ordering formalisms: Maximally expressive in principle, but rarely constructive for optimization.
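To illustrate the automata-based reduction in the simplest possible case, the toy sketch below (not the algorithm of Hahn et al.; the environment interface and labeling function are assumptions) composes the environment state with a small deterministic automaton for "visit region A, then region B" and rewards the agent only upon acceptance, turning the temporal objective into reachability in the product state space.

```python
# Deterministic automaton for "eventually A, and afterwards eventually B".
DFA_TRANSITIONS = {
    ("q0", "A"): "q1",       ("q0", "B"): "q0",        ("q0", "none"): "q0",
    ("q1", "A"): "q1",       ("q1", "B"): "q_acc",     ("q1", "none"): "q1",
    ("q_acc", "A"): "q_acc", ("q_acc", "B"): "q_acc",  ("q_acc", "none"): "q_acc",
}
ACCEPTING = {"q_acc"}

class ProductEnv:
    """Product of an environment (assumed reset()/step() interface) with the DFA.
    `label(obs)` maps an observation to one of "A", "B", "none"."""

    def __init__(self, base_env, label):
        self.base_env, self.label = base_env, label
        self.q = "q0"

    def reset(self):
        obs = self.base_env.reset()
        self.q = DFA_TRANSITIONS[("q0", self.label(obs))]
        return (obs, self.q)                       # product state

    def step(self, action):
        obs, _, done = self.base_env.step(action)
        self.q = DFA_TRANSITIONS[(self.q, self.label(obs))]
        reward = 1.0 if self.q in ACCEPTING else 0.0   # reachability reward
        return (obs, self.q), reward, done or self.q in ACCEPTING
```

Any standard model-free algorithm run on the product environment then drives up the probability of satisfying the temporal goal, which is the essence of the reduction described above.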
PAC-learnability (Probably Approximately Correct learning) is tightly linked to the mathematical properties of the objective:
- Sufficient conditions: Uniform continuity or computability of the mapping from full trajectories to objective values enables PAC learning, i.e., one can guarantee $\epsilon$-optimality with finite samples and computation (Yang et al., 2023).
- Many new objectives based on automata, temporal logic, or reward machines satisfy these conditions and are thus PAC-learnable, confirming their theoretical tractability.
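These conditions are what make direct Monte Carlo evaluation of trajectory-level objectives meaningful: if the trajectory-to-value mapping is bounded and computable, finitely many rollouts give an $\epsilon$-accurate estimate of a policy's objective value. The sketch below is a generic illustration with hypothetical bounds (not the construction of Yang et al.), using the standard Hoeffding inequality to choose the sample size.

```python
import math
import random

def hoeffding_sample_size(value_range, epsilon, delta):
    """Rollouts needed so the empirical mean of a bounded trajectory functional
    is within epsilon of its expectation with probability at least 1 - delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def evaluate_policy(rollout, objective, epsilon=0.1, delta=0.05, value_range=1.0):
    """`rollout()` samples one trajectory (a reward list) under the fixed policy;
    `objective` maps the full trajectory to a bounded scalar (max, min, Sharpe, ...)."""
    n = hoeffding_sample_size(value_range, epsilon, delta)
    return sum(objective(rollout()) for _ in range(n)) / n

# Illustrative use with a dummy stochastic trajectory generator.
dummy_rollout = lambda: [random.random() for _ in range(20)]
print(evaluate_policy(dummy_rollout, objective=max))
```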
5. Advances in Objective Design: Survival, Goal-driven, and Dual-objective Formulations
Several advanced objective formulations have emerged to address specialized domains:
- Survival objectives: Survival is formalized as maximizing the probability of remaining alive over a horizon of $T$ steps, $\Pr(s_t \in \mathcal{S}_{\text{alive}} \text{ for all } t \le T)$. This is converted into expected-sum RL via a suitably constructed reward, rigorously shown to correspond to a variational lower bound on long-term survival probability (Yoshida, 2016). Empirical results confirm that such shaping leads to emergent survival behaviors.
- Outcome-driven RL via inference: Variational formulations view RL as posterior inference over trajectories conditioned on desired outcomes (Rudner et al., 2021). The resulting variational reward (log-likelihood of reaching the outcome) yields well-shaped, automatically adapting rewards with dense gradients, eliminating ad hoc shaping and enabling robust goal-directed policy acquisition via corresponding probabilistic Bellman operators.
- Dual-objective HJB formulations: Dual-objective RL tasks such as "reach-always-avoid" (max-min over extrema) and "reach-reach" (min of two pathwise maxima) require value functions and Bellman updates fundamentally different from standard cumulative RL or temporal logic-based approaches. Explicit, tractable forms can be derived from the Hamilton-Jacobi-Bellman (HJB) framework, with provable optimality and demonstrated sample efficiency (Sharpless et al., 19 Jun 2025).
These approaches highlight that tailored objective definitions—guided by problem structure, probabilistic modeling, or control-theoretic decomposition—lead to value functions and learning rules with qualitative behaviors distinct from classical cumulative-sum RL, and are often required for domains involving hard constraints, risk-sensitivity, safety, or multi-target achievement.
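As a minimal illustration of the outcome-driven, inference-based reward, the sketch below (assuming a Gaussian model of the outcome likelihood; the scale parameter and state representation are illustrative, not taken from Rudner et al.) defines the per-step reward as the log-likelihood of the desired outcome given the current state, which is dense and smoothly shaped compared to a sparse success indicator.

```python
import numpy as np

def outcome_log_likelihood_reward(state, goal, sigma=0.5):
    """Variational-style goal-conditioned reward: log N(goal | state, sigma^2 I)."""
    diff = np.asarray(goal, dtype=float) - np.asarray(state, dtype=float)
    k = diff.size
    return float(-0.5 * np.dot(diff, diff) / sigma**2
                 - 0.5 * k * np.log(2.0 * np.pi * sigma**2))

# Illustrative comparison with a sparse indicator reward.
state, goal = np.array([0.2, 0.1]), np.array([1.0, 1.0])
print("dense log-likelihood reward:", outcome_log_likelihood_reward(state, goal))
print("sparse indicator reward:   ", float(np.allclose(state, goal, atol=0.05)))
```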
6. Practical Implications and Future Research Directions
The expanding landscape of RL objective function design has several practical implications:
- In non-cumulative and measure-specific tasks (portfolio optimization, network routing, symbolic optimization), mapping or transforming objectives (with memory/state augmentation) is essential for leveraging existing RL algorithms (Nägele et al., 22 May 2024, Cui et al., 2023).
- Policy architectures that accept dynamic preference parameters, vector rewards, or thresholds deliver immediate adaptation to changing operational priorities, crucial for rapidly reconfigurable real-world systems (Chang et al., 2022, Friedman et al., 2018); a minimal weight-conditioned network sketch follows this list.
- Nonlinear and fairness-oriented objectives, including proportional fairness and Nash Social Welfare, improve resource allocation and equity in multi-user or resource-limited environments, albeit at increased computational cost (Fan et al., 2022).
- Expressivity analyses caution that increasing objective flexibility may incur optimization or specification bottlenecks, suggesting a trade-off among ease of optimization, sample complexity, and representational power (Subramani et al., 2023).
- For reward engineering, preference-based and temporal logic–driven frameworks reduce the burden on domain experts, allowing qualitative user inputs, logical goals, or preference queries to drive robust RL policy learning (Zhao et al., 2021, Mu et al., 18 Jul 2025).
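One common realization of such preference-parameterized architectures is to condition the policy (or value) network on the weight vector itself, so a single set of parameters covers a family of scalarized objectives and trade-offs can be changed at inference time without retraining. The PyTorch sketch below is a minimal version of this idea; the layer sizes and normalization choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightConditionedPolicy(nn.Module):
    """Policy network that takes the preference weight vector as an extra input,
    so one network serves many scalarizations of a multi-objective task."""

    def __init__(self, state_dim, num_objectives, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, weights):
        # Normalize preferences onto the simplex so trade-offs are comparable.
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return torch.softmax(self.net(torch.cat([state, weights], dim=-1)), dim=-1)

# Changing operational priorities at inference time requires no retraining.
policy = WeightConditionedPolicy(state_dim=8, num_objectives=2, num_actions=4)
state = torch.randn(1, 8)
print(policy(state, torch.tensor([[0.9, 0.1]])))   # favor objective 1
print(policy(state, torch.tensor([[0.1, 0.9]])))   # favor objective 2
```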
Future research is expected to focus on:
- Robust, sample-efficient algorithms for broader classes of non-Markovian, nonlinear, or trajectory-based objectives.
- Theoretical and empirical characterization of which objective functions admit efficient state or memory augmentation mappings.
- Methods to blend reward learning, active querying (preference elicitation), and logical specification in practical systems, including safety-critical and non-stationary tasks.
- Optimization techniques and surrogates for tractability: handling expressions that exceed the scope of scalar or convex-combinatorial methods, or that require satisfaction of hard constraints in addition to performance.
7. Summary Table: Paradigms of RL Objective Functions
| Formalism/Paradigm | Expressivity | Typical Solution Method |
|---|---|---|
| Scalar Markovian Rewards (MR) | Limited to additive/continuous criteria | Standard RL algorithms |
| Linear/Convex Scalarized Rewards | Convex Pareto front only | Weight-parameterized deep RL |
| Nonlinear/Thresholded Objectives | Non-convex, discontinuous | gTLO, decomposed Q-updates |
| Preference-based Objectives | Potentially maximally expressive | Pb-MORL, reward model learning |
| Omega-Regular/Logic-based | Arbitrary temporal patterns | Automata, product MDP |
| Outcome-driven/Inference-based | Dense, shaped, adaptive rewards | Variational inference, ODAC |
| Non-cumulative (e.g., max/min) | Arbitrary functionals | State augmentation, new Bellman operators |
| Fairness/Welfare (e.g., NSW) | Nonlinear, balanced outcomes | Nonlinear Q-learning |
This diversity of objective function paradigms reflects the increasing generality and sophistication of RL as it is applied across a range of scientific, engineering, and decision-theoretic domains. The ongoing effort is to match the expressivity of objective formalism to the demands of the application, while developing learning algorithms and theory that guarantee robust, efficient policy optimization within that chosen framework.