Reinforcement Learning: Non-Linear Rewards

Updated 2 March 2026
  • Reinforcement learning for non-linear rewards is a framework that optimizes non-additive, trajectory-level utilities, incorporating multi-objective, risk-sensitive, and fairness considerations.
  • Advanced methodologies such as state augmentation, generalized Bellman equations, and modified policy gradients are employed to address the challenges posed by non-linear reward functions.
  • Empirical evaluations demonstrate enhanced sample efficiency, robustness over traditional methods, and practical improvements in applications like fairness, safe exploration, and sparse reward scenarios.

Reinforcement learning (RL) for non-linear rewards encompasses algorithms, theory, and applications in which the optimization criterion is a non-linear function of a policy’s long-term behavior, rather than the classical expectation of linearly additive rewards. This paradigm subsumes multi-objective scalarization, risk-sensitive objectives, fairness preferences, diversity, global (trajectory-level) returns, and non-standard reward structures arising in practical domains. Non-linear RL fundamentally challenges the core architectural and algorithmic principles underlying dynamic programming, value iteration, and traditional policy-gradient approaches predicated on the linearity of the reward functional. Emerging frameworks address these challenges via augmented state representations, geometric perspectives on occupancy measures, submodular optimization, generalized Bellman equations, and sample-efficient function approximation.

1. Problem Formulation and Theoretical Foundations

At the foundation, RL for non-linear rewards generalizes the Markov decision process (MDP; or for multiple objectives, MOMDP) as follows:

  • Classic MDP: Expected cumulative or discounted sum of per-step rewards is maximized, i.e., $\mathbb{E}_\pi\left[\sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)\right]$.
  • Non-Linear Reward RL: The objective is $U(\mu^\pi)$, where $U$ is a non-linear (not necessarily convex/concave) function of the occupancy measure $\mu^\pi$, the vector of long-term frequencies or return statistics induced by policy $\pi$.

For multi-objective settings, typical forms include maximizing

$$V^\pi(s_1) = \mathbb{E}_{\tau \sim \pi} \left[ W(R(\tau)) \right]$$

where $W: \mathbb{R}^d \to \mathbb{R}$ is a non-linear scalarization and $R(\tau)$ is the vector of cumulative rewards along trajectory $\tau$ (Peng et al., 2023). In global RL (“trajectory MDPs”), the reward is a set function $F(\tau)$ over the entire trajectory, e.g., capturing submodularity or synergy (Santi et al., 2024). Generalized linear MDPs admit a GLM-modelled reward expectation $g(\theta^\top \phi(x, a))$ combined with linear transition features (Zhang et al., 1 Jun 2025).

Key implication: The expectation can no longer be moved inside the objective, and the objective no longer decomposes over time steps. For non-linear $W$ or $F$, $\mathbb{E}[W(R)] \neq W(\mathbb{E}[R])$ in general, and Bellman’s principle of optimality generically fails without state augmentation.
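The gap between scalarizing each trajectory and scalarizing the mean return can be seen on a two-outcome toy example. The sketch below is illustrative only: the product (Nash welfare) scalarization and the two reward vectors are assumed values, not taken from the cited papers.

```python
from math import prod

def nash_welfare(returns):
    """Non-linear scalarization W: product of the per-objective returns."""
    return prod(returns)

# Two equally likely trajectories, each with a 2-dimensional cumulative reward.
# (Hypothetical numbers chosen to make the gap obvious.)
outcomes = [((4.0, 0.0), 0.5), ((0.0, 4.0), 0.5)]

# E[W(R)]: scalarize each trajectory first, then average.
expected_welfare = sum(p * nash_welfare(r) for r, p in outcomes)

# W(E[R]): average the reward vectors first, then scalarize.
mean_return = tuple(sum(p * r[k] for r, p in outcomes) for k in range(2))
welfare_of_mean = nash_welfare(mean_return)

print(expected_welfare, welfare_of_mean)  # 0.0 4.0
```

Every individual trajectory leaves one objective at zero, so the expected welfare is 0, while the averaged return vector (2, 2) would score 4. An agent that optimizes the mean return per objective would therefore misjudge its own welfare.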

2. Extended Bellman Equations and State Augmentation

For non-linear scalarization (e.g., Nash Welfare, proportional fairness), the “value-to-go” is not a function solely of the current state $s$. Instead, it also depends on the discounted reward vector $z$ accumulated so far. Writing $t$ for the number of steps remaining (so that $T - t$ steps have elapsed), the augmented state-value function becomes

$$V^*(s, z, t) = \max_\pi \mathbb{E}\left[ W\left(z + \sum_{k=0}^{t-1} \gamma^{T-t+k} R(s_k, a_k) \right) \,\middle|\, s_0 = s \right]$$

where $s_0 = s$ is the current state and $(s_k, a_k)$ are the state-action pairs visited over the remaining $t$ steps, with boundary condition $V^*(s, z, 0) = W(z)$. Backward equations for $t > 0$:

$$V^*(s,z,t)=\max_{a \in A(s)} \sum_{s'} P(s'|s,a)\, V^*\left(s', z+\gamma^{T-t} R(s,a), t-1\right)$$

(Peng et al., 2023, Agarwal et al., 2019).

Significance: Solution procedures must track accumulated rewards as part of the state. Optimal policies are generally non-stationary in $(s, z)$ and may require explicit memory of reward accumulation, breaking Markovian sufficiency as classically defined.
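A minimal sketch of the backward recursion over the augmented state $(s, z, t)$, with $t$ counting the steps remaining, is given below. The one-state MDP, the reward vectors, the horizon, and the egalitarian scalarization $W = \min$ are all assumptions made for illustration; this is exact recursion with memoization, not the discretized algorithm from the cited work.

```python
from functools import lru_cache

GAMMA = 1.0   # discount factor (assumed)
T = 2         # horizon (assumed)

# Hypothetical toy MDP: one state, two actions, 2-dimensional rewards.
ACTIONS = (0, 1)
P = {(0, 0): {0: 1.0}, (0, 1): {0: 1.0}}        # P[(s, a)] -> {s': probability}
R = {(0, 0): (1.0, 0.0), (0, 1): (0.0, 1.0)}    # R[(s, a)] -> reward vector

def W(z):
    """Egalitarian scalarization: value of the worst-off objective."""
    return min(z)

def q_value(s, z, t, a):
    """Value of taking a with accumulated reward z and t steps remaining."""
    z_next = tuple(zi + GAMMA ** (T - t) * ri for zi, ri in zip(z, R[(s, a)]))
    return sum(p * V(s2, z_next, t - 1) for s2, p in P[(s, a)].items())

@lru_cache(maxsize=None)
def V(s, z, t):
    """Optimal augmented value; the boundary case t == 0 returns W(z)."""
    if t == 0:
        return W(z)
    return max(q_value(s, z, t, a) for a in ACTIONS)

def greedy_action(s, z, t):
    return max(ACTIONS, key=lambda a: q_value(s, z, t, a))

print(V(0, (0.0, 0.0), T))              # 1.0
print(greedy_action(0, (1.0, 0.0), 1))  # 1
print(greedy_action(0, (0.0, 1.0), 1))  # 0
```

The environment state is identical in the last two calls, yet the best action differs because the accumulated reward differs. This is the sense in which a policy over $s$ alone is insufficient.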

3. Algorithmic Approaches

3.1 Discretization and Approximate Dynamic Programming

To render the augmented state space tractable, discretization is applied: reward vectors $z$ are quantized on a grid of spacing $\alpha$, producing a finite grid $G$ for value iteration over $(s, z, t)$ (Peng et al., 2023). The “Reward-Aware Value Iteration” (RAVI) algorithm iteratively backs up the value function over this grid, with approximation guarantees:

  • For scalarization $W$ that is Lipschitz-continuous and $d$ fixed, $\alpha = \delta_W(\varepsilon)/(T d)$ ensures approximation error $< \varepsilon$.
  • Computational complexity is $O(|S|^2 |A| (T/\alpha)^d)$.
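The effect of the grid spacing on the size of the augmented state space can be sketched as follows. Rounding down to the grid and bounding each accumulated coordinate by the horizon (per-step rewards in [0, 1]) are assumptions of this sketch, not details confirmed by the source.

```python
import math

def snap_to_grid(z, alpha):
    """Round each coordinate of z down to the nearest multiple of alpha."""
    return tuple(math.floor(zi / alpha) * alpha for zi in z)

def grid_size(horizon, alpha, d):
    """Grid points when each of the d coordinates lies in [0, horizon]."""
    per_axis = math.floor(horizon / alpha) + 1
    return per_axis ** d

print(snap_to_grid((0.8, 1.3), 0.5))  # (0.5, 1.0)
print(grid_size(10, 0.5, 2))          # 441
print(grid_size(10, 0.5, 4))          # 194481
```

Doubling the number of objectives from 2 to 4 squares the number of grid points, which is the exponential dependence on $d$ in the complexity bound.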

3.2 Off-policy and Policy Gradient Approaches

When $U$ or $W$ is differentiable, policy-gradient algorithms can be generalized. For an arbitrary utility $U(\mu^\pi)$, the general utility policy gradient is (Kumar et al., 2022, Milosevic et al., 1 Sep 2025):

$$\nabla_\theta U(\mu^{\pi_\theta}) = \mathbb{E}_{s \sim d_\pi,\ a \sim \pi_\theta} \left[ Q_U^\pi(s,a) \nabla_\theta \log \pi_\theta(a|s) \right]$$

where $Q_U^\pi(s,a)$ is the action-value under the “marginal utility” $R_U^\pi(s,a) = \left.\frac{\partial U}{\partial \mu(s,a)}\right|_{\mu = \mu^\pi}$ and $d_\pi(s)$ is the discounted visitation distribution (Kumar et al., 2022).
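To make the “marginal utility” concrete, the sketch below computes the discounted occupancy measure of a tabular policy and the resulting pseudo-reward for one assumed utility, the entropy $U(\mu) = -\sum_{s,a} \mu(s,a) \log \mu(s,a)$. The MDP, the policies, and the choice of utility are hypothetical. The final lines check numerically that the pseudo-reward is the first-order change of $U$ in the occupancy measure.

```python
import numpy as np

def occupancy(P, pi, rho, gamma):
    """Discounted state-action occupancy mu(s, a) of a tabular policy.

    P: (S, A, S) transition tensor, pi: (S, A) policy, rho: (S,) start distribution.
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    d = (1.0 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return d[:, None] * pi                 # mu(s, a) = d(s) * pi(a | s)

def entropy_utility(mu):
    return float(-np.sum(mu * np.log(mu)))

def marginal_reward(mu):
    """Pseudo-reward dU/dmu(s, a) for the entropy utility."""
    return -(np.log(mu) + 1.0)

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
rho = np.array([1.0, 0.0])
gamma = 0.9

mu = occupancy(P, np.full((2, 2), 0.5), rho, gamma)
r_U = marginal_reward(mu)

# First-order check: U(mu + eps * delta) - U(mu) is close to eps * <r_U, delta>.
pi_other = np.array([[0.8, 0.2], [0.3, 0.7]])
delta = occupancy(P, pi_other, rho, gamma) - mu
eps = 1e-6
finite_diff = (entropy_utility(mu + eps * delta) - entropy_utility(mu)) / eps
linearised = float(np.sum(r_U * delta))
print(finite_diff, linearised)
```

In a policy-gradient loop, `r_U` would be recomputed at the current policy and used as the per-step reward when estimating $Q_U^\pi$. Unlike a fixed reward, it changes whenever the policy changes.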

For multi-objective and Pareto criteria, methods such as gTLO (generalized thresholded lexicographic ordering) train deep networks to represent vector-valued Q-functions and execute non-linear action selection policies conditioned on threshold hyperparameters (Dornheim, 2022).
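The non-linear action selection can be illustrated with plain thresholded lexicographic ordering on given vector-valued Q-estimates. This is a simplified sketch: the action names, Q-values, and thresholds are hypothetical, and the network training and threshold conditioning used by gTLO are not reproduced here.

```python
def tlo_action(q_values, thresholds):
    """Pick an action by thresholded lexicographic ordering.

    q_values: dict mapping action -> tuple of Q-estimates, one per objective.
    thresholds: one threshold for each objective except the last.
    Earlier objectives are clipped at their threshold, so actions that all
    meet a threshold tie on it and are ranked by the next objective.
    """
    def key(action):
        q = q_values[action]
        clipped = tuple(min(qi, ti) for qi, ti in zip(q[:-1], thresholds))
        return clipped + (q[-1],)
    return max(q_values, key=key)

# Objective 0: safety, objective 1: speed (hypothetical values).
q = {"safe": (0.9, 1.0), "fast": (0.6, 5.0), "reckless": (0.2, 9.0)}

print(tlo_action(q, thresholds=(0.5,)))  # fast
print(tlo_action(q, thresholds=(0.8,)))  # safe
```

Raising the safety threshold from 0.5 to 0.8 changes the selected action, a behaviour that no fixed linear weighting of the two objectives reproduces by construction of the clipping step.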

3.3 Model-based Convex Optimization

For stationary average reward objectives $f(\lambda_\pi)$, where $\lambda_\pi$ is the steady-state reward vector, the model-based approach recasts the MDP as a flow-constrained convex program over occupancy measures $d$:

$$\max_{d \ge 0}\ f\Big(\sum_{s,a} r_k(s,a)\, d(s,a)\ \ \forall k \Big)\quad \text{s.t.}\ \text{flow constraints},\ \sum_{s,a} d(s,a)=1$$

Policies are then extracted as $\pi(a|s) = d(s,a)/\sum_{a'} d(s,a')$ (Agarwal et al., 2019).
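The sketch below illustrates only the two ingredients surrounding the convex program, not the optimization itself: the flow constraint $\sum_a d(s',a) = \sum_{s,a} P(s'|s,a)\, d(s,a)$ and the policy-extraction step. The MDP and the reference policy are hypothetical, and the uniform fallback for unvisited states is an assumption.

```python
import numpy as np

def extract_policy(d):
    """pi(a|s) = d(s,a) / sum_a' d(s,a'); uniform where a state has no mass."""
    mass = d.sum(axis=1, keepdims=True)
    safe_mass = np.where(mass > 0, mass, 1.0)   # avoid division by zero
    uniform = np.full_like(d, 1.0 / d.shape[1])
    return np.where(mass > 0, d / safe_mass, uniform)

def flow_residual(d, P):
    """Largest violation of sum_a d(s',a) = sum_{s,a} P(s'|s,a) d(s,a)."""
    inflow = np.einsum("sa,sat->t", d, P)
    return float(np.max(np.abs(d.sum(axis=1) - inflow)))

# Hypothetical 2-state, 2-action MDP and a reference policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])

# Build a feasible occupancy measure from pi by power iteration.
P_pi = np.einsum("sa,sat->st", pi, P)
x = np.full(2, 0.5)
for _ in range(500):
    x = x @ P_pi
d = x[:, None] * pi

print(flow_residual(d, P))   # close to 0
print(extract_policy(d))     # recovers pi
```

Any $d$ returned by a convex solver under these constraints could be passed to `extract_policy` in the same way.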

3.4 Submodular and Global Rewards

For trajectory-level objectives $F(\tau)$, submodular semi-gradient methods (GTO/GPO) iteratively construct tight modular lower bounds $m$ at the current solution, reduce to an additive MDP, and repeatedly solve with standard RL methods (Santi et al., 2024). Curvature-dependent approximation guarantees relate the suboptimality to the submodular or supermodular curvature of $F$.
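One standard way to obtain a modular lower bound that is tight at the current solution is the permutation-based subgradient of a normalized submodular function: order the ground set with the current solution first and assign each element its marginal gain along that order. The sketch below uses a hypothetical coverage function. Whether GTO/GPO use exactly this construction is not confirmed by the source.

```python
from itertools import combinations

# Hypothetical sensing model: each state covers a set of cells.
SENSING = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}, "d": {1, 4}}

def coverage(visited):
    """Submodular set function: number of distinct cells covered."""
    covered = set()
    for s in visited:
        covered |= SENSING[s]
    return len(covered)

def modular_lower_bound(F, ground, current):
    """Per-element weights m with m(Y) <= F(Y) for all Y, tight at `current`.

    Assumes F is submodular with F([]) == 0 and `current` has no duplicates.
    """
    order = list(current) + [x for x in ground if x not in current]
    weights, prefix, prev = {}, [], F([])
    for x in order:
        prefix.append(x)
        value = F(prefix)
        weights[x] = value - prev   # marginal gain along the chosen order
        prev = value
    return weights

ground = list(SENSING)
current = ["a", "b"]
m = modular_lower_bound(coverage, ground, current)

def m_value(subset):
    return sum(m[x] for x in subset)

# Brute-force verification over all subsets of the small ground set.
is_lower_bound = all(
    m_value(Y) <= coverage(Y)
    for k in range(len(ground) + 1)
    for Y in combinations(ground, k)
)
is_tight = m_value(current) == coverage(current)
print(m, is_lower_bound, is_tight)
```

The weights `m` act as an ordinary additive per-state reward, so the surrogate problem can be handed to any standard RL solver before the bound is rebuilt at the new solution.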

3.5 Generalized Linear MDPs (GLMDP)

GLMDPs extend the linear MDP framework to non-linear (e.g., logistic or Poisson) reward expectations through a reward GLM and maintain Bellman completeness by enlarging the Q-function class to $g(\theta^\top \phi_r(x,a)) + \beta^\top \phi_p(x,a)$. Efficient value-iteration-style algorithms (GPEVI/SS-GPEVI) yield statistical guarantees on suboptimality (Zhang et al., 1 Jun 2025).
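The enlarged Q-function class can be written down directly. The sketch below assumes a logistic link for $g$ and uses made-up parameter and feature values. It shows only the functional form, not the GPEVI estimation procedure.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def glm_q(theta, beta, phi_r, phi_p, link=sigmoid):
    """Q(x, a) = g(theta^T phi_r(x, a)) + beta^T phi_p(x, a).

    The first term is the GLM reward expectation; the second is linear in
    the transition features, as in a linear MDP.
    """
    return link(dot(theta, phi_r)) + dot(beta, phi_p)

# Hypothetical parameters and features for a single (x, a) pair.
q = glm_q(theta=[0.0, 0.0], beta=[1.0, 2.0], phi_r=[1.0, 1.0], phi_p=[0.5, 0.25])
print(q)  # 1.5
```

Passing `link=math.exp` instead would give the Poisson-style reward model mentioned above.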

4. Practical Considerations and Applications

Algorithmic and representational choices impact empirical tractability:

  • Grid discretization in RAVI is exponential in $d$ and is therefore practical only for moderate $d$ and small $T$ (Peng et al., 2023).
  • Function approximation (deep networks) mitigates combinatorial explosion in state-action-accumulation space, especially in gTLO and global RL variants (Dornheim, 2022, Santi et al., 2024).
  • Umbrella RL (URL) efficiently handles sparse, multimodal, or otherwise “hard” nonlinear rewards by augmenting reward with ensemble entropy and solves the resulting coupled PDEs with neural nets, yielding robust convergence in high-dimensional, nonconvex reward landscapes (Nuzhin et al., 2024).
  • Global RL enables formally tractable optimization for set functions exhibiting sub/supermodularity, common in exploration, coverage, and experiment design.

Specific domains:

5. Empirical Evaluation and Theoretical Guarantees

Experiments across gridworld, continuous control, and real-world manufacturing tasks consistently show:

6. Open Challenges and Future Directions

  • Scalability to high-dimensional objectives: All grid-based and semi-gradient MDP reductions are exponential in $d$; novel approximations (random projections, subspace embedding) are needed for large-scale applications (Peng et al., 2023).
  • Function approximation under non-linear objectives: How to ensure stability, representation, and gradient computation for neural approximators of $(s, z)$-dependent value functions remains an open area, particularly for global set-function rewards (Dornheim, 2022, Santi et al., 2024, Kumar et al., 2022).
  • Theory of convergence and Bellman operators for non-linear objectives: Formal convergence criteria and error bounds for nonconvex and non-additive utilities are largely unresolved (Milosevic et al., 1 Sep 2025).
  • Efficient estimation and sample complexity: While semi-supervised methods and pessimistic value iteration in GLMDP show improved sample efficiency when reward observation is costly, analysis under function approximation or partial observability is still in development (Zhang et al., 1 Jun 2025).
  • Generalization to continuous or infinite state-action spaces: The geometric approach calls for measure-theoretic mirror descent, curved exponential family representations, and principled extensions of occupancy-based policy gradients (Milosevic et al., 1 Sep 2025).

7. Summary Table: Representative Algorithms and Their Properties

| Approach | Nonlinearity Type | Policy Stationarity | Approximation Guarantees |
|---|---|---|---|
| Reward-Aware Value Iteration (RAVI) | Nonlinear scalarization ($W$) | Non-stationary | $\varepsilon$-approximation, exponential in $d$ (Peng et al., 2023) |
| gTLO (deep RL) | Nonlinear lexicographic | Stationary (τ-conditional) | Pareto/threshold coverage, sample-efficient (Dornheim, 2022) |
| Model-based convex program | Concave global utility ($f$) | Stationary | $O(\sqrt{T})$ regret (Agarwal et al., 2019) |
| Policy-gradient with general $U$ | Arbitrary smooth utility | Stationary | Stationary-point for general $U$, global optimum for concave $U$ (Kumar et al., 2022) |
| Submodular semi-gradient (GTO/GPO) | Sub/supermodular trajectory | Non-Markov/non-stationary | Curvature-based suboptimality (Santi et al., 2024) |
| Umbrella RL (URL) | Arbitrary/hard nonlinear | Stationary | Empirically robust to sparse rewards (Nuzhin et al., 2024) |
| GPEVI/SS-GPEVI (GLMDP) | Generalized linear (GLM) | Stationary | Statistical suboptimality, semi-supervised gains (Zhang et al., 1 Jun 2025) |

All methods fundamentally rely on either augmenting the state (memory) or reparameterizing the occupancy measure, employing advanced function approximation to address the loss of decomposability introduced by non-linear objectives.

