Dynamic Reward Markov Decision Processes

Updated 5 July 2025
  • Dynamic Reward MDPs are extended forms of standard MDPs that incorporate history or context-based rewards to capture temporal patterns and evolving objectives.
  • They leverage temporal logic and automata to specify reward conditions such as event triggers, deadlines, and first-occurrence rewards in a precise manner.
  • Solution methods include state augmentation, dynamic programming, heuristic search, and reinforcement learning to manage the complexity of history-dependent reward structures.

Dynamic Reward Markov Decision Processes (DR-MDPs) extend the standard Markov Decision Process framework to incorporate history-dependent or temporally dynamic reward structures. In DR-MDPs, the immediate or future rewards depend not just on the current state and action but on sequences of events, progressive changes, or the realization of contextual conditions over time. This allows for the modeling of objectives that are naturally expressed in terms of temporal patterns, deadlines, event triggers, or evolving preferences, which are common in many real-world decision-theoretic planning problems.

1. Foundations and Definitions

A classical Markov Decision Process (MDP) is defined by a tuple $(S, A, P, R, \gamma)$, where $S$ is a finite state space, $A$ an action space, $P$ the transition kernel, $R$ the immediate reward function, and $\gamma$ a discount factor. A key property is Markovianity: $R(s, a)$ and $P(s' \mid s, a)$ depend only on the most recent state and action. In contrast, in Dynamic Reward MDPs (or in the closely related Non-Markovian Reward Decision Processes, NMRDPs), the reward function $R$ is allowed to depend on histories: $R: (S \times A)^* \rightarrow \mathbb{R}$, where $(S \times A)^*$ denotes all finite sequences (histories) of state-action pairs.

These reward functions allow specifications such as rewarding only the first occurrence of an event, penalizing repeated failures, or enforcing “response” properties—e.g., giving a reward only if a certain condition is met after a trigger (1109.2355, 1912.02552).
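
To make the history-dependent signature $R: (S \times A)^* \rightarrow \mathbb{R}$ concrete, the minimal Python sketch below contrasts a Markovian reward with a "first occurrence" reward evaluated on a finite history; all state names and helper functions are illustrative rather than taken from the cited papers.

```python
from typing import Sequence, Tuple

State, Action = str, str
History = Sequence[Tuple[State, Action]]

def markovian_reward(s: State, a: Action) -> float:
    """Classical MDP reward: depends only on the current state-action pair."""
    return 1.0 if (s, a) == ("goal", "stay") else 0.0

def first_occurrence_reward(history: History) -> float:
    """History-dependent reward R: (S x A)* -> R.

    Pays 1.0 only at the step where state 'goal' is visited for the
    first time; every later visit earns nothing.
    """
    if not history:
        return 0.0
    current_state, _ = history[-1]
    earlier_states = [s for s, _ in history[:-1]]
    return 1.0 if current_state == "goal" and "goal" not in earlier_states else 0.0

if __name__ == "__main__":
    trace = [("start", "go"), ("goal", "stay"), ("goal", "stay")]
    history_rewards = [first_occurrence_reward(trace[: t + 1]) for t in range(len(trace))]
    markov_rewards = [markovian_reward(s, a) for s, a in trace]
    print(history_rewards)  # [0.0, 1.0, 0.0]: only the first visit to 'goal' pays
    print(markov_rewards)   # [0.0, 1.0, 1.0]: a Markovian reward cannot see the history
```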

DR-MDPs naturally formalize models where rewards are temporal, dynamic, and possibly even adversarially or stochastically changing between episodes or periods (1905.10649, 2110.03743).

2. Formal Specification: Temporal Logic and Automata

Specification of dynamic rewards in DR-MDPs is often achieved via temporal logic or automata-theoretic constructs:

  • Temporal Logic: Compact reward specification is commonly achieved using Past Linear Temporal Logic (PLTL) or Future Linear Temporal Logic (FLTL). For example, a reward can be specified via formulas like $(p \rightarrow \$)$, denoting a reward when $p$ occurs, or $(\neg p \; U \; (p \wedge \$))$, denoting a reward the first time $p$ holds after it was previously false (1109.2355, 1301.0606).
  • Automata and Mealy Machines: DR-MDP rewards can be encoded as regular functions of the history via Deterministic Finite Automata (DFA) or Mealy Machines (Mealy Reward Machines, MRMs), where the automaton augments the state with a reward tracker (1912.02552, 2001.09293, 2009.12600). Each transition in the automaton is paired with a reward output, and the reward at each step is a function of the automaton’s current state and the sequence of past (abstracted) events.

This allows one to reduce the DR-MDP or NMRDP to an equivalent Markovian model over an expanded ("product") state space: $S' = S \times U$, where $U$ is the set of automaton states (1301.0606, 1912.02552, 2001.09293).
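
A rough sketch of this construction is given below: a tiny Mealy-style reward machine for the "reward the first occurrence of $p$" pattern, together with the product-state update on $S \times U$. The class, labeling function, and state names are illustrative simplifications, not an implementation from the cited papers.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

Label = FrozenSet[str]          # set of atomic propositions observed at a step

@dataclass
class MealyRewardMachine:
    """Finite-state reward machine: transitions read an abstracted label and
    emit an immediate reward (Mealy-style output on transitions)."""
    initial: str
    # (machine_state, frozenset of true propositions) -> (next_state, reward)
    delta: Dict[Tuple[str, Label], Tuple[str, float]] = field(default_factory=dict)

    def step(self, u: str, label: Label) -> Tuple[str, float]:
        # Unspecified (state, label) pairs stay put with zero reward.
        return self.delta.get((u, label), (u, 0.0))

# Reward machine for "reward 1.0 the first time proposition p holds".
P, EMPTY = frozenset({"p"}), frozenset()
mrm = MealyRewardMachine(
    initial="waiting",
    delta={
        ("waiting", P): ("done", 1.0),   # first occurrence of p: pay and move on
        ("waiting", EMPTY): ("waiting", 0.0),
        ("done", P): ("done", 0.0),      # later occurrences are worthless
        ("done", EMPTY): ("done", 0.0),
    },
)

def product_step(s_next: str, u: str, labeling) -> Tuple[Tuple[str, str], float]:
    """One step of the product construction S' = S x U: the environment moves to
    s_next, the machine reads the label of s_next and emits the reward."""
    u_next, reward = mrm.step(u, labeling(s_next))
    return (s_next, u_next), reward

if __name__ == "__main__":
    def labeling(state: str) -> Label:
        return P if state == "goal" else EMPTY

    e_state, total = ("start", mrm.initial), 0.0
    for s_next in ["start", "goal", "goal"]:
        e_state, r = product_step(s_next, e_state[1], labeling)
        total += r
    print(e_state, total)   # ('goal', 'done') 1.0
```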

3. Solution Approaches and Algorithms

a. State Augmentation and Translation

The standard approach to solving DR-MDPs is to translate the original problem into an equivalent MDP with an expanded state space encoding both the underlying system state and sufficient history information (the “e-state”). For reward representations specified in logic or automata, this is achieved via “progression” or “synchronization”:

  • For temporal logic, progression functions push the reward specification forward one step at a time as execution proceeds, yielding an updated formula that tracks what still needs to be satisfied for the reward to be earned (1109.2355, 1301.0606); a minimal progression sketch follows this list.
  • For automata, the automaton state is updated based on current observations, and the immediate reward is determined by the automaton's transition output (2009.12600, 2001.09293, 1912.02552).
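
The progression step can be illustrated with a minimal sketch of standard LTL formula progression. The tuple-based formula encoding and the restriction of negation to atoms are simplifying assumptions made here for exposition, not the machinery of the cited planners.

```python
# Minimal LTL progression sketch: formulas are nested tuples, e.g.
#   ("until", ("not", ("atom", "p")), ("atom", "p"))   ~   (not p) U p
# Negation is assumed to be applied to atoms only (negation normal form).

def simplify_and(f, g):
    if f is False or g is False: return False
    if f is True: return g
    if g is True: return f
    return ("and", f, g)

def simplify_or(f, g):
    if f is True or g is True: return True
    if f is False: return g
    if g is False: return f
    return ("or", f, g)

def progress(formula, observation):
    """Push a formula one step forward given the set of propositions true in
    the current state; the result is the residual obligation."""
    if formula in (True, False):
        return formula
    op = formula[0]
    if op == "atom":
        return formula[1] in observation
    if op == "not":                       # atoms only, by assumption
        return formula[1][1] not in observation
    if op == "and":
        return simplify_and(progress(formula[1], observation),
                            progress(formula[2], observation))
    if op == "or":
        return simplify_or(progress(formula[1], observation),
                           progress(formula[2], observation))
    if op == "next":                      # X f: the obligation becomes f
        return formula[1]
    if op == "until":                     # f U g: g now, or f now and (f U g) next
        f, g = formula[1], formula[2]
        return simplify_or(progress(g, observation),
                           simplify_and(progress(f, observation), formula))
    if op == "eventually":                # F f  ==  true U f
        return simplify_or(progress(formula[1], observation), formula)
    if op == "always":                    # G f  ==  f and X G f
        return simplify_and(progress(formula[1], observation), formula)
    raise ValueError(f"unknown operator: {op}")

if __name__ == "__main__":
    spec = ("until", ("not", ("atom", "p")), ("atom", "p"))   # satisfied when p first holds
    obligation = spec
    for obs in [set(), set(), {"p"}]:
        obligation = progress(obligation, obs)
        print(obligation)   # stays (not p) U p, then becomes True once p is observed
```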

b. Solution Methods

Three main classes of solution techniques are prevalent:

| Method | Description | Implementation Notes |
|---|---|---|
| Dynamic Programming | Value iteration or policy iteration applied to the expanded state space (e-states) (1109.2355). | The state space can grow rapidly with the length/complexity of the reward specification. |
| Heuristic (Anytime) Search | Algorithms such as LAO*, RTDP, and LRTDP incrementally build a reachable envelope of the e-state space, improving policies over time (1109.2355, 1301.0606). | Embeds model-checking/progression, explores only relevant parts of the e-state space ("blind minimality"). |
| Structured Methods | Symbolic approaches (e.g., SPUDD, ADDs) representing both the dynamics and temporal variables symbolically (1109.2355). | Effective when the reward and transition system admit concise symbolic representations; can dynamically prune irrelevant structure. |
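
To make the dynamic-programming entry in the table above concrete, the sketch below runs plain value iteration over an expanded e-state space, where each e-state pairs an environment state with a reward-machine state. The two-state environment and the "first visit to the goal" machine are illustrative toys, not benchmarks from the cited work.

```python
import itertools

# Illustrative toy problem: environment states, actions, deterministic dynamics.
ENV_STATES = ["start", "goal"]
ACTIONS = ["go", "stay"]
ENV_NEXT = {("start", "go"): "goal", ("start", "stay"): "start",
            ("goal", "go"): "start", ("goal", "stay"): "goal"}

# Reward-machine states tracking "has the goal been reached yet?"
RM_STATES = ["waiting", "done"]

def rm_step(u, s_next):
    """Machine update and reward: pay 1.0 only on the first arrival at 'goal'."""
    if u == "waiting" and s_next == "goal":
        return "done", 1.0
    return u, 0.0

def value_iteration(gamma=0.9, tol=1e-8):
    """Standard value iteration, but over the product (e-state) space S x U."""
    e_states = list(itertools.product(ENV_STATES, RM_STATES))
    values = {e: 0.0 for e in e_states}
    while True:
        delta = 0.0
        for (s, u) in e_states:
            best = float("-inf")
            for a in ACTIONS:
                s_next = ENV_NEXT[(s, a)]
                u_next, r = rm_step(u, s_next)
                best = max(best, r + gamma * values[(s_next, u_next)])
            delta = max(delta, abs(best - values[(s, u)]))
            values[(s, u)] = best
        if delta < tol:
            return values

if __name__ == "__main__":
    for e_state, v in sorted(value_iteration().items()):
        print(e_state, round(v, 3))
```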

c. Reinforcement Learning with Learned Reward Structure

For unknown reward models, techniques combine classical RL with active automata learning (e.g., using Angluin’s L* algorithm), synchronizing the learned automaton with the MDP to form a Markovian RL problem in the product space (1912.02552, 2009.12600, 2001.09293). RL algorithms such as Q-learning or R-max are then applied to learn optimal policies on this augmented space, with formal convergence guarantees when the learned automaton converges (1912.02552).
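
A corresponding model-free sketch, assuming the reward machine is already known or has already been learned, runs tabular Q-learning on the augmented state (environment state, machine state); the toy environment is illustrative, and the L*-style automaton-learning loop is deliberately omitted.

```python
import random

random.seed(0)

ACTIONS = ["go", "stay"]
ENV_NEXT = {("start", "go"): "goal", ("start", "stay"): "start",
            ("goal", "go"): "start", ("goal", "stay"): "goal"}

def rm_step(u, s_next):
    """(Given or learned) reward machine: pay 1.0 on the first arrival at 'goal'."""
    return ("done", 1.0) if u == "waiting" and s_next == "goal" else (u, 0.0)

def q_learning(episodes=2000, horizon=10, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning on the augmented state (env_state, machine_state)."""
    q = {}
    for _ in range(episodes):
        s, u = "start", "waiting"
        for _ in range(horizon):
            a = (random.choice(ACTIONS) if random.random() < eps
                 else max(ACTIONS, key=lambda b: q.get(((s, u), b), 0.0)))
            s_next = ENV_NEXT[(s, a)]
            u_next, r = rm_step(u, s_next)
            best_next = max(q.get(((s_next, u_next), b), 0.0) for b in ACTIONS)
            old = q.get(((s, u), a), 0.0)
            q[((s, u), a)] = old + alpha * (r + gamma * best_next - old)
            s, u = s_next, u_next
    return q

if __name__ == "__main__":
    q = q_learning()
    greedy = max(ACTIONS, key=lambda b: q.get((("start", "waiting"), b), 0.0))
    print(greedy)   # expected: 'go', since the augmented state makes the first-visit reward Markovian
```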

d. Robust and Large-Scale Methods

When rewards are not only history-dependent but may change adversarially or stochastically over time, robust and online optimization techniques are used:

  • Online convex optimization methods over occupancy measures (e.g., Regularized Follow-the-Leader, RFTL) yield regret-minimizing policies even as rewards change arbitrarily, with bounds scaling as $O(\sqrt{T})$ (1905.10649); a simplified follow-the-leader sketch appears after this list.
  • For high-dimensional or large-scale MDPs, linear architecture approximations drastically reduce per-iteration computation while retaining similar regret guarantees (1905.10649).
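
The follow-the-leader flavor of these online methods can be illustrated with a deliberately simplified sketch: dynamics are known, rewards change arbitrarily between episodes, and each episode the learner plans (by backward induction) against the average of previously observed reward functions. This is an unregularized stand-in for the cited RFTL method, which instead optimizes over occupancy measures; all names below are illustrative.

```python
import itertools
import random

random.seed(1)

# Toy MDP with known dynamics and rewards that change arbitrarily between episodes.
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]
NEXT = {("s0", "a0"): "s0", ("s0", "a1"): "s1",
        ("s1", "a0"): "s0", ("s1", "a1"): "s1"}
HORIZON = 5

def plan(reward):
    """Finite-horizon backward induction against a fixed reward table."""
    value = {s: 0.0 for s in STATES}
    policy = {}
    for h in reversed(range(HORIZON)):
        new_value = {}
        for s in STATES:
            scores = {a: reward[(s, a)] + value[NEXT[(s, a)]] for a in ACTIONS}
            best = max(scores, key=scores.get)
            policy[(h, s)] = best
            new_value[s] = scores[best]
        value = new_value
    return policy

def rollout(policy, reward, start="s0"):
    """Play a policy for one episode and return the reward it actually earns."""
    s, total = start, 0.0
    for h in range(HORIZON):
        a = policy[(h, s)]
        total += reward[(s, a)]
        s = NEXT[(s, a)]
    return total

def follow_the_leader(reward_sequence):
    """Each episode, plan against the average of all previously observed rewards,
    play that policy, and only then observe the episode's true reward function."""
    cumulative = {sa: 0.0 for sa in itertools.product(STATES, ACTIONS)}
    earned = 0.0
    for t, reward in enumerate(reward_sequence, start=1):
        average = {sa: v / max(t - 1, 1) for sa, v in cumulative.items()}
        earned += rollout(plan(average), reward)
        for sa in cumulative:
            cumulative[sa] += reward[sa]
    return earned

if __name__ == "__main__":
    episodes = [{sa: random.random() for sa in itertools.product(STATES, ACTIONS)}
                for _ in range(50)]
    print(round(follow_the_leader(episodes), 2))
```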

4. The Role of Uncertainty and Robustness

Many DR-MDP applications require making decisions under uncertainty about the reward specification or its temporal dynamics:

  • Distributionally Robust Approaches: Reward vectors may be random with only partial information (e.g., via moments, $\phi$-divergence, or Wasserstein distance balls). Distributionally robust chance-constrained MDPs seek policies that maximize reward under the worst case within these uncertainty sets, reformulated as second-order cone programs, copositive programs, or mixed-integer second-order cone programs (MISOCPs) depending on the nature of the uncertainty (2212.08126); a simplified interval-uncertainty sketch appears after this list.
  • Regret-Based Reward Elicitation: Instead of precise reward specification, policies are selected to minimize maximum regret over a feasible set of reward functions, with reward queries focused dynamically on most impactful parameters (1205.2619).
  • Mixture and Switching Models: In contexts where the reward model itself changes between episodes (reward-mixing MDPs), efficient algorithms estimate higher-order correlations to disambiguate which reward model is active, sometimes leveraging lifted MDP representations (2110.03743).
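
A much simpler illustration of this robust idea, using box (interval) reward uncertainty rather than the chance-constrained formulations of the cited work, is sketched below: the robust policy optimizes against the worst-case reward in each interval, while the nominal policy uses the midpoint. All numbers are illustrative.

```python
# Illustrative box-uncertainty sketch: each reward is only known to lie in an
# interval [lo, hi]; the robust policy plans against the worst case (lo).

GAMMA = 0.9
STATES = ["s"]
ACTIONS = ["risky", "safe"]
NEXT = {("s", "risky"): "s", ("s", "safe"): "s"}
REWARD_INTERVALS = {("s", "risky"): (-1.0, 3.0),   # high upside, bad worst case
                    ("s", "safe"): (0.5, 0.7)}     # modest but reliable

def greedy_policy(reward, tol=1e-8):
    """Value iteration for a fixed reward table, returning the greedy policy."""
    values = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(reward[(s, a)] + GAMMA * values[NEXT[(s, a)]] for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)] + GAMMA * values[NEXT[(s, a)]])
            for s in STATES}

if __name__ == "__main__":
    nominal = {k: (lo + hi) / 2 for k, (lo, hi) in REWARD_INTERVALS.items()}
    worst_case = {k: lo for k, (lo, hi) in REWARD_INTERVALS.items()}
    print("nominal policy:", greedy_policy(nominal))     # picks 'risky' (midpoint 1.0)
    print("robust policy: ", greedy_policy(worst_case))  # picks 'safe'  (worst case 0.5 vs -1.0)
```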

5. Expressivity, Objectives, and Practical Implications

  • Expressivity and Multidimensional Rewards: Scalar rewards may be insufficient to characterize complex, temporally extended objectives. Multidimensional rewards, mapping $(S, A)$ to $\mathbb{R}^d$, allow for the separation of policy classes that cannot be distinguished with a single scalar metric. Necessary and sufficient conditions for such representation are given in terms of convex separation of policy visitation vectors (2307.12184).
  • Non-Cumulative and Dynamic Objectives: Some problems require maximizing a non-cumulative functional (e.g., maximum, Sharpe ratio, mean/stdev) over reward trajectories. General mappings have been developed to translate such non-cumulative MDPs (NCMDPs) into standard MDPs with appropriately augmented state and reward representations, enabling application of all standard MDP algorithms (2405.13609); a minimal code sketch for $f = \max$ follows this list. The per-step reward is:

    $r_t = f(\tilde{r}_0, \ldots, \tilde{r}_t) - f(\tilde{r}_0, \ldots, \tilde{r}_{t-1})$

  • Risk and Distributional Properties: For objectives beyond average return (e.g., risk measures, quantiles), only a limited class of functionals, specifically generalized means related to exponential utilities, can be optimized exactly via the Bellman recursion. More general risk-conscious strategies can be approximated using distributional RL, with explicit error bounds (2310.20266).
  • Concentration and Regret: Concentration properties of cumulative reward in MDPs—through asymptotic and finite-time martingale techniques—allow high-probability guarantees for performance and regret in dynamic environments. Rate equivalence of different regret definitions aids the theoretical grounding of learning in DR-MDPs (2411.18551).
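
For the running-maximum objective ($f = \max$), the telescoping construction referenced above can be sketched as follows: the running maximum plays the role of the auxiliary state variable, and the per-step reward is the telescoping difference, so the cumulative sum recovers the non-cumulative objective. The function below is an illustrative sketch, not the cited papers' implementation.

```python
# Telescoping reward r_t = f(r_0..t) - f(r_0..t-1) for f = max, so that the
# cumulative sum of r_t equals the non-cumulative objective max_t r_t.
# The running maximum is the auxiliary state variable carried alongside the MDP state.

def telescoped_rewards(raw_rewards):
    rewards, running_max = [], float("-inf")
    for r in raw_rewards:
        new_max = max(running_max, r)
        rewards.append(new_max - running_max if running_max != float("-inf") else new_max)
        running_max = new_max
    return rewards

if __name__ == "__main__":
    raw = [1.0, 3.0, 2.0, 5.0]
    shaped = telescoped_rewards(raw)
    print(shaped)                  # [1.0, 2.0, 0.0, 2.0]
    print(sum(shaped), max(raw))   # 5.0 5.0: cumulative shaped reward equals the max
```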

6. Applications and Domains

Practical domains motivating DR-MDPs include:

  • Healthcare: Rewarding only the first service of a patient and modeling time-dependent interventions.
  • Autonomic Computing and Resource Allocation: Preferences or “utilities” change over time or as a function of past actions.
  • Robotics and Automation: Correct sequencing of tasks, dynamic goals with conditional triggers.
  • Finance: Maximize risk-adjusted returns (e.g., Sharpe ratio), optimize over maximum drawdown, or similar ratios.
  • Engineering Benchmarks: Domains like the Miconic elevator, where service constraints are temporal and dynamic (1109.2355).
  • Partially Observable and Adversarial Environments: Switches in reward models between episodes, as in reward-mixing DR-MDPs (2110.03743).

Empirical results indicate that appropriately specified DR-MDP methods provide superior performance, require fewer reward queries, and enable near-optimal or robust policies in complex, dynamic environments (1109.2355, 1205.2619, 1905.10649, 1912.02552, 2405.13609).

7. Limitations and Ongoing Challenges

  • The expansion of the state space—either through temporal logic progression or automata synchronization—may be exponential in the size of the reward specification or history, posing computational challenges (1109.2355, 1301.0606).
  • Advantages of structured or anytime state-based approaches are contingent on the ability to dynamically ignore or prune irrelevant history components (1301.0606).
  • Robustness to uncertainty in the reward or transition model may introduce conservatism, especially if uncertainty sets are not tightly specified (2212.08126, 2505.18044).
  • Many theoretical results for risk-sensitive planning show that only a narrow class of objectives can be optimized exactly; approximation and error control strategies must be employed for functionals outside this set (2310.20266).
  • Practical implementation of highly dynamic or non-cumulative objectives requires careful design of auxiliary state variables to enable efficient mapping into Markovian representations (2405.13609).

Dynamic Reward MDPs constitute a rich and expressive formalism for modeling sequential decision-making in environments where the evaluation of behavior depends on history, context, or dynamically evolving reward structures. Recent advances in temporal logic progression, automata-based reward modeling, robust optimization, and reinforcement learning with learned or uncertain reward functions have produced a diverse ecosystem of solution methods, each balancing expressivity, computational tractability, and robustness to uncertainty. These techniques have been tested in a wide array of domains and remain an active area of research, especially as applications demand increasing complexity in reward specification and adaptation to dynamic, unpredictable environments.