Reinforcement Learning: Non-Linear Rewards

Updated 2 March 2026

Reinforcement learning for non-linear rewards is a framework that optimizes non-additive, trajectory-level utilities, incorporating multi-objective, risk-sensitive, and fairness considerations.
Advanced methodologies such as state augmentation, generalized Bellman equations, and modified policy gradients are employed to address the challenges posed by non-linear reward functions.
Empirical evaluations report improved sample efficiency and robustness relative to linear-reward baselines, with practical gains in fairness, safe exploration, and sparse-reward settings.

Reinforcement learning (RL) for non-linear rewards encompasses algorithms, theory, and applications in which the optimization criterion is a non-linear function of a policy’s long-term behavior, rather than the classical expectation of linearly additive rewards. This paradigm subsumes multi-objective scalarization, risk-sensitive objectives, fairness preferences, diversity, global (trajectory-level) returns, and non-standard reward structures arising in practical domains. Non-linear RL fundamentally challenges the core architectural and algorithmic principles underlying dynamic programming, value iteration, and traditional policy-gradient approaches predicated on the linearity of the reward functional. Emerging frameworks address these challenges via augmented state representations, geometric perspectives on occupancy measures, submodular optimization, generalized Bellman equations, and sample-efficient function approximation.

1. Problem Formulation and Theoretical Foundations

At the foundation, RL for non-linear rewards generalizes the Markov decision process (MDP; or for multiple objectives, MOMDP) as follows:

Classic MDP: Expected cumulative or discounted sum of per-step rewards is maximized, i.e., $\mathbb{E}_\pi[\sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)]$ .
Non-Linear Reward RL: The objective is $U(\mu^\pi)$ , where $U$ is a non-linear (not necessarily convex/concave) function of the occupancy measure $\mu^\pi$ , the vector of long-term frequencies or return statistics induced by policy $\pi$ .

For multi-objective settings, typical forms include maximizing

$V^\pi(s_1) = \mathbb{E}_{\tau \sim \pi} \left[ W(R(\tau)) \right]$

where $W: \mathbb{R}^d \to \mathbb{R}$ is a non-linear scalarization and $R(\tau)$ is the vector of cumulative rewards along trajectory $\tau$ (Peng et al., 2023). In global RL (“trajectory MDPs”), the reward is a set-function $F(\tau)$ over the entire trajectory, e.g., capturing submodularity or synergy (Santi et al., 2024). Generalized linear MDPs admit a GLM–modelled reward expectation $g(\theta^\top \phi(x, a))$ combined with linear transition features (Zhang et al., 1 Jun 2025).

Key implication: Linearity of expectation and time-decomposition can no longer be exploited. For non-linear $W$ or $F$ , in general $\mathbb{E}[W(R)] \neq W(\mathbb{E}[R])$ , and Bellman’s principle of optimality generically fails without state augmentation.
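
A small numeric check makes the gap between the two quantities concrete. The sketch below assumes a toy distribution over two trajectories and uses the product of objectives (Nash welfare) as the scalarization $W$; the data and function names are illustrative and are not taken from the cited papers.

```python
# Two equally likely trajectories with 2-objective return vectors (assumed toy data).
returns = [(1.0, 0.0), (0.0, 1.0)]
probs = [0.5, 0.5]


def nash_welfare(r):
    """Non-linear scalarization W: product of the objective values."""
    out = 1.0
    for x in r:
        out *= x
    return out


# E[W(R)]: scalarize each trajectory first, then average.
expected_scalarized = sum(p * nash_welfare(r) for p, r in zip(probs, returns))

# W(E[R]): average the return vectors first, then scalarize.
mean_return = tuple(
    sum(p * r[k] for p, r in zip(probs, returns)) for k in range(2)
)
scalarized_expected = nash_welfare(mean_return)

print(expected_scalarized, scalarized_expected)  # 0.0 0.25
```

Each trajectory serves only one objective, so its product is zero, while the averaged return vector looks balanced. An algorithm that optimizes $W(\mathbb{E}[R])$ can therefore be badly misled about $\mathbb{E}[W(R)]$.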

2. Extended Bellman Equations and State Augmentation

For non-linear scalarization (e.g., Nash Welfare, proportional fairness), the “value-to-go” is not a function solely of the current state. It also depends on the discounted reward vector $z$ accumulated so far. Writing $t$ for the number of steps remaining, so that the current time index is $T-t$ , the augmented state-value function becomes

$V^*(s, z, t) = \max_\pi \mathbb{E}\left[ W\left(z + \sum_{k=T-t}^{T-1} \gamma^{k} R(s_k,a_k) \right) \mid s_{T-t} = s \right]$

with boundary condition $V^*(s, z, 0) = W(z)$ . Backward equations for $t > 0$ :

$V^*(s,z,t)=\max_{a \in A(s)} \sum_{s'} P(s'|s,a) V^*(s', z+\gamma^{T-t} R(s,a), t-1)$

(Peng et al., 2023, Agarwal et al., 2019).
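
The backward recursion can be sketched as exact memoized dynamic programming on a tiny problem. This is a minimal illustration under stated assumptions: $t$ is treated as the number of remaining steps (as the boundary condition implies), the one-state MDP, the reward vectors, and the egalitarian scalarization $W = \min$ are all invented for the example, and no discretization is applied.

```python
from functools import lru_cache

T = 2          # horizon (assumed)
GAMMA = 1.0    # discount factor (assumed)
ACTIONS = (0, 1)
# P[s][a] -> list of (next_state, probability); a single-state toy MDP.
P = {0: {0: [(0, 1.0)], 1: [(0, 1.0)]}}
# R[s][a] -> reward vector; each action serves exactly one objective.
R = {0: {0: (1.0, 0.0), 1: (0.0, 1.0)}}


def W(z):
    """Egalitarian scalarization: the worst-off objective."""
    return min(z)


def q_value(s, z, t, a):
    """Expected value of taking a in (s, z) with t steps remaining."""
    discount = GAMMA ** (T - t)
    z_next = tuple(zi + discount * ri for zi, ri in zip(z, R[s][a]))
    return sum(p * value(s2, z_next, t - 1) for s2, p in P[s][a])


@lru_cache(maxsize=None)
def value(s, z, t):
    """V*(s, z, t): boundary W(z) at t == 0, otherwise max over actions."""
    if t == 0:
        return W(z)
    return max(q_value(s, z, t, a) for a in ACTIONS)


def greedy_action(s, z, t):
    return max(ACTIONS, key=lambda a: q_value(s, z, t, a))


print(value(0, (0.0, 0.0), T))           # 1.0
print(greedy_action(0, (1.0, 0.0), 1))   # 1
print(greedy_action(0, (0.0, 1.0), 1))   # 0
```

The environment state is the same in both greedy queries, yet the best action flips with the accumulated vector $z$, which is why a policy depending on $s$ alone cannot be optimal here.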

Significance: Solution procedures must track accumulated rewards as part of the state. Optimal policies generally depend on $(s, z, t)$ rather than on $s$ alone, so they are Markovian only in the augmented state and require explicit memory of reward accumulation.

3. Algorithmic Approaches

3.1 Discretization and Approximate Dynamic Programming

To render the augmented state space tractable, discretization is applied: reward vectors $z$ are quantized on a grid of spacing $\alpha$ , producing a finite grid $G$ for value iteration over $(s,z,t)$ (Peng et al., 2023). The “Reward-Aware Value Iteration” (RAVI) algorithm iteratively backs up the value function over this grid, with approximation guarantees:

For scalarization $W$ that is Lipschitz-continuous and $d$ fixed, $\alpha = \delta_W(\varepsilon)/(T d)$ ensures approximation error $< \varepsilon$ .
Computational complexity is $O(|S|^2 |A| (T/\alpha)^d)$ .
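
To give a feel for the grid and its growth in $d$, the sketch below quantizes an accumulated reward vector and counts grid points. Rounding down to the nearest multiple of $\alpha$ and the per-step reward bound `r_max` are assumptions made for illustration; the source does not specify the rounding rule.

```python
import math


def quantize(z, alpha):
    """Round each component of z down to the nearest multiple of alpha."""
    return tuple(alpha * math.floor(zi / alpha) for zi in z)


def grid_size(T, alpha, d, r_max=1.0):
    """Grid points when each of d components lies in [0, T * r_max]."""
    per_dim = math.floor(T * r_max / alpha) + 1
    return per_dim ** d


print(quantize((0.26, 0.74), 0.25))   # (0.25, 0.5)
print(grid_size(T=10, alpha=0.5, d=2))  # 441
print(grid_size(T=10, alpha=0.5, d=3))  # 9261
```

Adding one objective multiplies the number of grid points by the per-dimension count, which is the $(T/\alpha)^d$ factor in the complexity bound.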

3.2 Off-policy and Policy Gradient Approaches

When $U$ or $W$ is differentiable, policy-gradient algorithms can be generalized. For an arbitrary utility $U(\mu^\pi)$ , the general utility policy gradient is (Kumar et al., 2022, Milosevic et al., 1 Sep 2025):

$\nabla_\theta U(\mu^{\pi_\theta}) = \mathbb{E}_{s \sim d_\pi,\ a \sim \pi_\theta} \left[ Q_U^\pi(s,a) \nabla_\theta \log \pi_\theta(a|s) \right]$

where $Q_U^\pi(s,a)$ is the action-value under the “marginal utility” $R_U^\pi(s,a) = \frac{\partial U}{\partial \mu(s,a)}|_{\mu = \mu^\pi}$ and $d_\pi(s)$ is the discounted visitation distribution (Kumar et al., 2022).
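
As an illustration of the marginal utility, take the entropy of the occupancy measure, $U(\mu) = -\sum_{s,a} \mu(s,a) \log \mu(s,a)$, for which $\partial U / \partial \mu(s,a) = -(\log \mu(s,a) + 1)$. The sketch below is a hedged example: the choice of utility, the toy occupancy values, and the function names are assumptions, and $\mu$ is differentiated as an unconstrained vector.

```python
import math


def entropy_utility(mu):
    """U(mu) = -sum mu log mu over state-action pairs."""
    return -sum(m * math.log(m) for m in mu.values() if m > 0)


def marginal_reward(mu):
    """Pseudo-reward R_U(s, a) = dU/dmu(s, a) = -(log mu(s, a) + 1)."""
    return {sa: -(math.log(m) + 1.0) for sa, m in mu.items()}


# Assumed occupancy measure over (state, action) pairs.
mu = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
pseudo_reward = marginal_reward(mu)

# Finite-difference check of one partial derivative.
eps = 1e-6
bumped = dict(mu)
bumped[(0, 0)] += eps
fd = (entropy_utility(bumped) - entropy_utility(mu)) / eps
print(abs(fd - pseudo_reward[(0, 0)]) < 1e-4)  # True
```

Rarely visited pairs receive the largest pseudo-reward, so a standard policy-gradient step on $R_U^\pi$ pushes the policy toward them. Because $R_U^\pi$ depends on the current $\mu^\pi$, it has to be recomputed as the policy changes.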

For multi-objective and Pareto criteria, methods such as gTLO (generalized thresholded lexicographic ordering) train deep networks to represent vector-valued Q-functions and execute non-linear action selection policies conditioned on threshold hyperparameters (Dornheim, 2022).
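
The non-linear action selection can be illustrated with classic thresholded lexicographic ordering over vector-valued Q-estimates. This sketch shows only the selection rule, not gTLO's network architecture or training procedure, and the Q-values, action names, and thresholds are hypothetical.

```python
def tlo_action(q_vectors, thresholds):
    """Pick an action by thresholded lexicographic ordering.

    q_vectors: {action: (q_1, ..., q_d)} vector-valued Q-estimates.
    thresholds: one threshold per objective except the last.
    Earlier objectives are clipped at their threshold, so actions that
    all satisfy a threshold tie on it and are ranked by later objectives.
    """
    def key(action):
        q = q_vectors[action]
        clipped = tuple(min(qi, ti) for qi, ti in zip(q[:-1], thresholds))
        return clipped + (q[-1],)

    return max(q_vectors, key=key)


q = {"left": (0.9, 1.0), "right": (0.6, 5.0), "stay": (0.4, 9.0)}
print(tlo_action(q, thresholds=(0.5,)))  # right
print(tlo_action(q, thresholds=(0.8,)))  # left
```

Changing the threshold changes which action wins, which is how conditioning on threshold hyperparameters lets one set of Q-estimates express different trade-offs.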

3.3 Model-based Convex Optimization

For stationary average reward objectives $f(\lambda_\pi)$ , where $\lambda_\pi$ is the steady-state reward vector, the model-based approach re-casts the MDP as a flow-constrained convex program over occupancy measures $d$ :

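$\max_{d \ge 0}\ f\Big(\sum_{s,a} r_1(s,a) d(s,a), \dots, \sum_{s,a} r_K(s,a) d(s,a)\Big)\ \ \text{s.t.}\ \sum_{a} d(s',a) = \sum_{s,a} P(s'|s,a) d(s,a)\ \forall s',\ \ \sum_{s,a} d(s,a)=1$

Here $K$ is the number of reward components and the first constraint is the flow-conservation condition on the stationary distribution.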

Policies are then extracted as $d(s,a)/\sum_{a'} d(s,a')$ (Agarwal et al., 2019).
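
The extraction step is simple enough to sketch directly. The occupancy values below are assumed to come from some convex solver that is not shown, and the uniform fallback for states with zero occupancy is an added assumption, since the ratio is undefined there.

```python
def extract_policy(d, states, actions):
    """pi(a | s) = d(s, a) / sum_a' d(s, a') for each state."""
    policy = {}
    for s in states:
        total = sum(d[(s, a)] for a in actions)
        if total > 0:
            policy[s] = {a: d[(s, a)] / total for a in actions}
        else:
            # Never-visited state: the ratio is undefined, so fall back
            # to a uniform distribution (assumption for this sketch).
            policy[s] = {a: 1.0 / len(actions) for a in actions}
    return policy


# Hypothetical solver output over two states and two actions.
d = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.0}
policy = extract_policy(d, states=(0, 1), actions=(0, 1))
print(policy[1])  # {0: 1.0, 1: 0.0}
```

The resulting policy is stationary and, in general, stochastic, since a concave $f$ can be maximized in the interior of the occupancy polytope.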

3.4 Submodular and Global Rewards

For trajectory-level objectives $F(\tau)$ , submodular semi-gradient methods (GTO/GPO) iteratively construct tight modular lower bounds $m$ at the current solution, reduce to an additive MDP, and repeatedly solve with standard RL methods (Santi et al., 2024). Curvature-dependent approximation guarantees relate the suboptimality to the submodular or supermodular curvature of $F$ .
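
One standard way to build a tight modular lower bound of a submodular function is to order the elements of the current solution first and assign each element its marginal gain along that chain. The sketch below applies this to a coverage objective over visited states; treating a trajectory as a set of states and the coverage data are simplifying assumptions, and this is not claimed to be the exact construction used in GTO/GPO.

```python
from itertools import combinations

# Hypothetical coverage sets: visiting a state "covers" these items.
COVER = {"a": {1, 2}, "b": {2, 3}, "c": {3}, "d": {4}}


def F(states):
    """Submodular coverage objective: number of distinct items covered."""
    covered = set()
    for s in states:
        covered |= COVER[s]
    return len(covered)


def modular_lower_bound(X, ground):
    """Per-state weights m with m(Y) <= F(Y) for all Y and m(X) == F(X)."""
    order = sorted(X) + sorted(set(ground) - set(X))  # elements of X first
    weights, prefix, prev = {}, [], 0
    for s in order:
        prefix.append(s)
        cur = F(prefix)
        weights[s] = cur - prev  # marginal gain along the chain
        prev = cur
    return weights


ground = sorted(COVER)
X = {"a", "b"}
m = modular_lower_bound(X, ground)
print(m)  # {'a': 2, 'b': 1, 'c': 0, 'd': 1}

# Check the bound on every subset and tightness at X.
subsets = [set(c) for r in range(len(ground) + 1)
           for c in combinations(ground, r)]
print(all(sum(m[s] for s in Y) <= F(Y) for Y in subsets))  # True
print(sum(m[s] for s in X) == F(X))                        # True
```

The weights act as an additive per-state reward, so the inner problem becomes an ordinary MDP. Re-deriving the bound at each new solution gives the iterative scheme described above.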

3.5 Generalized Linear MDPs (GLMDP)

GLMDPs extend the linear MDP framework to non-linear (e.g., logistic or Poisson) reward expectations through a reward GLM and maintain Bellman completeness by enlarging the Q-function class to $g(\theta^\top \phi_r(x,a)) + \beta^\top \phi_p(x,a)$ . Efficient value-iteration-style algorithms (GPEVI/SS-GPEVI) yield statistical guarantees on suboptimality (Zhang et al., 1 Jun 2025).
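
The enlarged Q-function class can be written out in a few lines. This is a minimal sketch assuming a logistic link for $g$ and hand-picked feature and parameter values; it shows only the functional form, not the GPEVI or SS-GPEVI estimation procedure.

```python
import math


def sigmoid(x):
    """Logistic link, one possible choice of g."""
    return 1.0 / (1.0 + math.exp(-x))


def q_value(theta, beta, phi_r, phi_p, g=sigmoid):
    """Q(x, a) = g(theta^T phi_r(x, a)) + beta^T phi_p(x, a)."""
    reward_part = g(sum(t * f for t, f in zip(theta, phi_r)))
    transition_part = sum(b * f for b, f in zip(beta, phi_p))
    return reward_part + transition_part


# Hypothetical parameters and features for a single (x, a) pair.
print(q_value(theta=[0.0, 0.0], beta=[1.0, 2.0],
              phi_r=[1.0, 1.0], phi_p=[0.5, 0.25]))  # 1.5
```

The reward term passes through the link function while the continuation term stays linear in the transition features, which is what separates this class from the purely linear Q-functions of a linear MDP.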

4. Practical Considerations and Applications

Algorithmic and representational choices impact empirical tractability:

Grid discretization in RAVI is exponential in $d$ , so it is practical only for moderate $d$ and short horizons $T$ (Peng et al., 2023).
Function approximation (deep networks) mitigates combinatorial explosion in state-action-accumulation space, especially in gTLO and global RL variants (Dornheim, 2022, Santi et al., 2024).
Umbrella RL (URL) efficiently handles sparse, multimodal, or otherwise “hard” nonlinear rewards by augmenting reward with ensemble entropy and solves the resulting coupled PDEs with neural nets, yielding robust convergence in high-dimensional, nonconvex reward landscapes (Nuzhin et al., 2024).
Global RL enables formally tractable optimization for set functions exhibiting sub/supermodularity, common in exploration, coverage, and experiment design.

Specific domains:

Multi-objective fairness/efficiency (e.g., taxi assignment, queueing, resource allocation) (Peng et al., 2023, Agarwal et al., 2019).
Diversity, safe exploration, and intrinsic motivation (e.g., entropy-maximization, mutual information) (Milosevic et al., 1 Sep 2025, Kumar et al., 2022, Santi et al., 2024).
Manufacturing process control (e.g., generalized deep RL for preferences over non-convex Pareto sets) (Dornheim, 2022).
Sparse reward and exploration (e.g., umbrella RL in mountain car and over-damped arm) (Nuzhin et al., 2024).

5. Empirical Evaluation and Theoretical Guarantees

Experiments reported across gridworld, continuous control, and real-world manufacturing tasks show:

Substantial gains over linear scalarization or naive per-step surrogates, especially when the non-linear utility is highly nonconvex or requires global reward coordination (Peng et al., 2023, Dornheim, 2022, Santi et al., 2024).
Robustness of value-iteration-based methods to moderate discretization: relatively coarse grids still yield near-optimal performance (Peng et al., 2023).
Sample efficiency of model-based and policy-gradient approaches for risk-sensitive, fairness, and non-additive settings (Agarwal et al., 2019, Zhang et al., 1 Jun 2025, Dornheim, 2022).
Formal regret or suboptimality bounds when the utility function is Lipschitz or possesses bounded curvature (Agarwal et al., 2019, Santi et al., 2024, Zhang et al., 1 Jun 2025).
Scalability and convergence of neural-PDE methods (umbrella RL) in high-dimensional and sparse-reward landscapes (Nuzhin et al., 2024).

6. Open Challenges and Future Directions

Scalability to high-dimensional objectives: All grid-based and semi-gradient MDP reductions are exponential in $d$ ; novel approximations (random projections, subspace embedding) are needed for large-scale applications (Peng et al., 2023).
Function approximation under non-linear objectives: How to ensure stability, representation, and gradient computation for neural approximators of $(s, z)$ -dependent value functions remains an open area, particularly for global set-function rewards (Dornheim, 2022, Santi et al., 2024, Kumar et al., 2022).
Theory of convergence and Bellman operators for non-linear objectives: Formal convergence criteria and error bounds for nonconvex and non-additive utilities are largely unresolved (Milosevic et al., 1 Sep 2025).
Efficient estimation and sample complexity: While semi-supervised methods and pessimistic value iteration in GLMDP show improved sample efficiency when reward observation is costly, analysis under function approximation or partial observability is still in development (Zhang et al., 1 Jun 2025).
Generalization to continuous or infinite state-action spaces: The geometric approach calls for measure-theoretic mirror descent, curved exponential family representations, and principled extensions of occupancy-based policy gradients (Milosevic et al., 1 Sep 2025).

7. Summary Table: Representative Algorithms and Their Properties

Approach	Nonlinearity Type	Policy Stationarity	Approximation Guarantees
Reward-Aware Value Iteration (RAVI)	Nonlinear scalarization ( $W$ )	Non-stationary	$\varepsilon$ -approximation, exponential in $d$ (Peng et al., 2023)
gTLO (deep RL)	Nonlinear lexicographic	Stationary (τ-conditional)	Pareto/threshold coverage, sample-efficient (Dornheim, 2022)
Model-based convex program	Concave global utility ( $f$ )	Stationary	$O(\sqrt{T})$ regret (Agarwal et al., 2019)
Policy-gradient with general $U$	Arbitrary smooth utility	Stationary	Stationary-point for general $U$ , global optimum for concave $U$ (Kumar et al., 2022)
Submodular semi-gradient (GTO/GPO)	Sub/supermodular trajectory	Non-Markov/non-stationary	Curvature-based suboptimality (Santi et al., 2024)
Umbrella RL (URL)	Arbitrary/hard nonlinear	Stationary	Empirically robust to sparse rewards (Nuzhin et al., 2024)
GPEVI/SS-GPEVI (GLMDP)	Generalized linear (GLM)	Stationary	Statistical suboptimality, semi-supervised gains (Zhang et al., 1 Jun 2025)

These methods handle the loss of decomposability introduced by non-linear objectives in one of two ways: by augmenting the state with memory of accumulated rewards, or by reformulating the problem over occupancy measures. Most pair this with function approximation to keep the enlarged problem tractable.

References

(Peng et al., 2023) Multi-objective Reinforcement Learning with Nonlinear Preferences: Provable Approximation for Maximizing Expected Scalarized Return
(Dornheim, 2022) gTLO: A Generalized and Non-linear Multi-Objective Deep Reinforcement Learning Approach
(Zhang et al., 1 Jun 2025) Generalized Linear Markov Decision Process
(Milosevic et al., 1 Sep 2025) The Geometry of Nonlinear Reinforcement Learning
(Agarwal et al., 2019) Reinforcement Learning for Joint Optimization of Multiple Rewards
(Santi et al., 2024) Global Reinforcement Learning: Beyond Linear and Convex Rewards via Submodular Semi-gradient Methods
(Kumar et al., 2022) Policy Gradient for Reinforcement Learning with General Utilities
(Nuzhin et al., 2024) Umbrella Reinforcement Learning -- computationally efficient tool for hard non-linear problems