State and Reward Modeling in RL

Updated 8 July 2025
  • State and reward modeling is a framework that formalizes environment dynamics and agent incentives, essential for robust decision-making in reinforcement learning.
  • It incorporates non-Markovian and risk-sensitive reward structures using temporal logic and automata to capture historical dependencies.
  • Emerging deep learning and latent representation techniques jointly predict state dynamics and rewards, enhancing sample efficiency and planning.

State and reward modeling in reinforcement learning and decision-theoretic planning concerns the formalization, representation, and computational treatment of how environments are described (state modeling) and how agents receive feedback or incentives (reward modeling). Precise modeling of these elements is critical for defining agent objectives, supporting efficient solution methods, and enabling risk-sensitive or temporally structured behaviors. As the field has evolved, research has addressed both foundational and practical challenges in accurately capturing the complexities of real-world environments, especially where rewards are history-dependent, risk-sensitive, or high-dimensional.

1. Formalizations of State and Reward Functions

At the core of standard Markov decision processes (MDPs), the state is a summary of the environment at each timestep, and the reward function typically provides immediate scalar feedback as a function of the current state and the chosen action. Formally, the reward function is often specified as $R: S \times A \rightarrow \mathbb{R}$, reflecting the Markov property: rewards depend only on the current state and action, not on the entire history.
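
To make the standard formalization concrete, here is a minimal Python sketch of a toy MDP with a Markovian reward function $R(s, a)$ and a simple Bellman backup loop; the states, transition probabilities, and reward values are illustrative assumptions, not taken from any cited paper.

```python
# Minimal sketch of a Markov reward function R: S x A -> R on a toy 3-state MDP.
# All states, transitions, and reward values are illustrative assumptions.

STATES = ["s0", "s1", "goal"]

# P[(s, a)] is a list of (next_state, probability) pairs.
P = {
    ("s0", "right"): [("s1", 0.9), ("s0", 0.1)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("goal", 0.8), ("s1", 0.2)],
    ("s1", "left"):  [("s0", 1.0)],
}

def R(state: str, action: str) -> float:
    """Markovian reward: depends only on the current state and action."""
    if state == "s1" and action == "right":
        return 1.0   # reward for attempting the transition toward the goal
    return 0.0       # all other (state, action) pairs yield no reward

# Repeated Bellman backups for the fixed policy "always go right" (discount 0.95).
gamma, V = 0.95, {s: 0.0 for s in STATES}
for _ in range(100):
    V = {
        s: R(s, "right")
           + gamma * sum(p * V[s2] for s2, p in P.get((s, "right"), [(s, 1.0)]))
        for s in STATES
    }
print(V)
```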

State modeling in these traditional frameworks assumes that all necessary information for future prediction and optimal decision making is encoded in the state variable. However, in many real-world domains—such as planning tasks with temporally extended goals, dialog management, or financial risk management—this assumption does not hold. Non-Markovian rewards, transition-based rewards, latent state abstractions, and history-augmented observation spaces have been introduced to better reflect domain requirements (1301.0606, 1612.02088).

Reward modeling extends beyond immediate feedback: it includes specifying how feedback depends on histories, transitions, or predicted future events; handling sparse, risk-sensitive, or diminishing rewards; and aligning scalar feedback with complex task requirements or human preferences (1612.02088, 2309.03710, 2505.02387).

2. Non-Markovian Rewards and Temporal Logic Representations

Many tasks require representing rewards that depend not just on the current state and action but on entire histories; examples include rewarding an agent only after it achieves a sequence of subgoals or events. Formally, a non-Markovian Reward Decision Process (NMRDP) has a reward function $R: S^* \rightarrow \mathbb{R}$ mapping state sequences to real numbers (1301.0606).

To compactly encode such complex reward behaviors, temporal logics, especially extensions of future linear temporal logic (FLTL), have been used (1301.0606). Reward functions can be specified as sets of pairs $(f : r)$, where $f$ is an FLTL formula (e.g., "reward the first occurrence of $p$") and $r$ is a scalar. The current reward at a history is calculated using minimal "stingy" allocation: rewards are delivered only when strictly necessary for satisfaction of $f$. The logic introduces temporal modalities such as "next" (O) and "until" (U), together with a special reward constant ($) used to precisely time reward allocation. A formula progression algorithm "pushes" reward formulas forward as new states are encountered, embedding temporal model-checking directly into the solution process. This approach allows expressing and efficiently reasoning about a broad class of temporally extended tasks without explicit history enumeration.
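
To make the progression idea concrete, here is a heavily simplified sketch: it handles only a small formula fragment (atoms, "next", and "until" over atomic subformulas) and collapses the stingy allocation semantics into a one-shot reward on first satisfaction, so it illustrates the mechanism rather than the actual $FLTL algorithm of (1301.0606).

```python
# Simplified sketch of formula progression for temporally extended rewards.
# It progresses a small LTL fragment through each observed state and issues
# the reward the first time the formula is satisfied. Formula encoding and
# the one-shot reward rule are illustrative simplifications.

TRUE, FALSE = ("true",), ("false",)

def progress(f, state):
    """Return the formula that must hold from the next step onward."""
    kind = f[0]
    if kind in ("true", "false"):
        return f
    if kind == "atom":                  # ("atom", "p"): p holds in this state?
        return TRUE if f[1] in state else FALSE
    if kind == "next":                  # ("next", g): g must hold next step
        return f[1]
    if kind == "until":                 # ("until", g, h): g U h, restricted to
        g, h = f[1], f[2]               # subformulas that progress to true/false
        if progress(h, state) == TRUE:
            return TRUE
        if progress(g, state) == TRUE:
            return f                    # keep waiting
        return FALSE
    raise ValueError(f"unknown formula {f}")

# "Eventually p" encoded as (true U p); reward 1.0 on first satisfaction.
formula, reward = ("until", TRUE, ("atom", "p")), 1.0
for observed in [{"q"}, {"q"}, {"p"}, {"p"}]:
    if formula not in (TRUE, FALSE):
        formula = progress(formula, observed)
        issued = reward if formula == TRUE else 0.0
    else:
        issued = 0.0                    # already satisfied: reward given once
    print(observed, "->", issued)
```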

Non-Markovian rewards can also be modeled using automata-theoretic approaches—Mealy reward machines synchronize an automaton over observation histories with the MDP, producing an augmented state space over which immediate (Markovian) rewards can be computed and standard solution techniques can be applied (2001.09293, 2009.12600).
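The sketch below illustrates the synchronization idea: a hand-written Mealy reward machine (rewarding the pattern "a then b", an illustrative assumption) is stepped alongside hypothetical environment dynamics (`env_step`) and a labeling function (`labeler`), so the product of environment state and machine state carries an immediate, Markovian reward.

```python
# Sketch of a Mealy reward machine synchronized with an environment: the
# machine reads a label of each transition and emits an immediate reward,
# so the product state (env_state, machine_state) is Markovian again.
# The machine, env_step, and labeler below are illustrative assumptions.

# delta[(u, label)] = (next_machine_state, reward emitted on this transition)
delta = {
    ("u0", "a"): ("u1", 0.0),
    ("u0", "b"): ("u0", 0.0),
    ("u1", "a"): ("u1", 0.0),
    ("u1", "b"): ("u2", 1.0),   # sequence "a then b" completed: emit reward
    ("u2", "a"): ("u2", 0.0),
    ("u2", "b"): ("u2", 0.0),
}

def product_step(env_state, machine_state, action, env_step, labeler):
    """One step of the augmented (product) process."""
    next_env = env_step(env_state, action)          # environment dynamics
    label = labeler(env_state, action, next_env)    # observation the machine reads
    next_machine, reward = delta[(machine_state, label)]
    return (next_env, next_machine), reward

# Toy usage: a trivial environment whose action is used directly as the label.
env_step = lambda s, a: s + 1
labeler = lambda s, a, s2: a
state, total = (0, "u0"), 0.0
for a in ["b", "a", "b"]:
    state, r = product_step(state[0], state[1], a, env_step, labeler)
    total += r
print(total)   # 1.0: the machine rewards the first "a ... b" pattern
```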

3. State and Reward Augmentation Techniques

Mapping non-Markovian or transition-based reward problems to tractable solution spaces often requires state and reward augmentation. A prominent method involves constructing an expanded MDP (XMDP) whose states include both the base environment state and sufficient history information (e.g., a progressed temporal logic reward specification). Each "e-state" is a tuple $(s, \varphi)$, where $s \in S$ and $\varphi$ encodes the current progress through the reward logic or automaton (1301.0606).

For risk-sensitive settings, especially those involving distributional objectives like Value-at-Risk (VaR), reward functions frequently depend on both the current and next state, i.e., $r: S \times A \times S \rightarrow \mathbb{R}$. Simplifying to state-based rewards via $r'(x, a) = \sum_{y} r(x, a, y)\, p(y \mid x, a)$ is valid for expectation objectives, but not for distributional or risk-sensitive objectives: such simplification can alter the reward distribution and lead to suboptimal VaR policies (1612.02088).

Accurate risk evaluation may require a transformation such as lifting the state space to $S^\dagger = S \times S$, so that the reward function can be cast as state-based while the full reward distribution (not just its expectation) is preserved. This state-transition transformation retains the total reward distribution, enabling the application of spectral and central limit theorems for reward distribution approximation (1612.02088).
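
A small numerical sketch of this point, using illustrative transition probabilities and rewards: the expectation-based simplification matches the mean reward but reports a very different 5% Value-at-Risk than the original next-state-dependent reward.

```python
# Why r'(x, a) = sum_y r(x, a, y) p(y | x, a) preserves the mean but not the
# reward distribution. Probabilities and reward values are illustrative.
import random

p = {"good": 0.5, "bad": 0.5}          # p(y | x, a)
r = {"good": 10.0, "bad": -10.0}       # r(x, a, y): depends on the next state

r_simplified = sum(p[y] * r[y] for y in p)   # = 0.0, the expected reward

random.seed(0)
samples = [r[random.choices(list(p), weights=list(p.values()))[0]]
           for _ in range(10000)]
samples.sort()

var_level = 0.05
true_var = samples[int(var_level * len(samples))]   # ~ -10: 5% worst-case reward
print("mean (original)    :", sum(samples) / len(samples))
print("mean (simplified)  :", r_simplified)
print("5% VaR (original)  :", true_var)
print("5% VaR (simplified):", r_simplified)   # 0.0 -- the risk is hidden
```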

Augmentation is often computationally intensive because the expanded state space may scale with $|S|^2$ or more, but it is critical for settings such as risk-sensitive planning, temporal goal satisfaction, and safe RL.

4. Deep Learning and Joint State–Reward Prediction

Environments with high-dimensional or complex observations (such as video games or robotics) demand scalable modeling. Deep learning approaches make it possible to learn environment dynamics and reward structure simultaneously with trainable models (1611.07078).

A joint optimization framework can predict both next-frame states and future rewards, using a compound loss combining frame reconstruction and reward prediction (e.g., squared error for frames and cross-entropy for rewards). Shared architectures extract visual features that are directly relevant to reward, improving parameter efficiency and generalization. Empirical results demonstrate that such models accurately forecast cumulative reward over long horizons, which is crucial for sample-efficient model-based RL and for enabling simulated planning (Dyna, Monte Carlo tree search) without always reconstructing high-dimensional signals.
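
As a rough illustration of this compound loss, the PyTorch sketch below combines frame reconstruction error with reward cross-entropy through a shared encoder; the flattened frames, layer sizes, and discretized reward classes are assumptions for the example, not the architecture of (1611.07078).

```python
# Sketch of a joint state/reward prediction loss: squared error on the
# predicted next frame plus cross-entropy on a discretized reward class,
# with one shared encoder. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, frame_dim=64 * 64, n_actions=4, n_reward_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim + n_actions, 256), nn.ReLU())
        self.frame_head = nn.Linear(256, frame_dim)           # predicts next frame
        self.reward_head = nn.Linear(256, n_reward_classes)   # predicts reward class

    def forward(self, frame, action_onehot):
        h = self.encoder(torch.cat([frame, action_onehot], dim=-1))
        return self.frame_head(h), self.reward_head(h)

def joint_loss(model, frame, action, next_frame, reward_class, reward_weight=1.0):
    """Compound loss: frame reconstruction MSE + reward classification CE."""
    pred_frame, reward_logits = model(frame, action)
    frame_loss = nn.functional.mse_loss(pred_frame, next_frame)
    reward_loss = nn.functional.cross_entropy(reward_logits, reward_class)
    return frame_loss + reward_weight * reward_loss
```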

Other advances include self-supervised reward-predictive state representations, learned by predicting long-term rewards from raw observations, and then repurposing these representations both as RL agent input and in potential-based reward shaping strategies, further improving sample efficiency and convergence (2105.03172).
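
A minimal sketch of the potential-based shaping term mentioned above, assuming some potential function Phi is available (here a hypothetical distance-to-goal placeholder rather than a learned reward-predictive representation):

```python
# Potential-based reward shaping: the shaped reward adds
# F(s, s') = gamma * Phi(s') - Phi(s), which leaves the optimal policy
# set unchanged. Phi below is an illustrative placeholder.

def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Original reward plus the potential-based shaping term."""
    return r + gamma * phi(s_next) - phi(s)

# Illustrative potential: negative distance to an assumed goal state 10.
phi = lambda s: -abs(10 - s)
print(shaped_reward(0.0, s=3, s_next=4, phi=phi))   # positive: moved toward goal
```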

5. Latent and Abstract State Representations Coupled with Reward Prediction

A recent trend involves learning state representations directly optimized for predictive accuracy of rewards, rather than for state reconstruction or maximum likelihood (1912.04201). In these frameworks, an encoder transforms raw environment states $s_t$ to a latent code $z_t$, and a reward-predictive dynamics model propagates $z_t$ forward under the current policy. The system is trained purely to minimize error in multi-step reward prediction, producing a highly compact state abstraction that discards irrelevant environment details.
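
A schematic sketch of this setup, with illustrative layer sizes and a simple one-step latent dynamics model: the only training signal is the multi-step reward prediction error, as described above, though the specific architecture here is an assumption rather than that of (1912.04201).

```python
# Reward-predictive latent model sketch: encode the initial observation,
# roll the latent code forward under executed actions, and supervise only
# with observed rewards (no observation reconstruction). Sizes are assumptions.
import torch
import torch.nn as nn

class RewardPredictiveModel(nn.Module):
    def __init__(self, obs_dim=32, act_dim=4, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)
        self.dynamics = nn.Linear(latent_dim + act_dim, latent_dim)
        self.reward = nn.Linear(latent_dim, 1)

    def multi_step_reward_loss(self, obs0, actions, rewards):
        """obs0: (obs_dim,), actions: (T, act_dim), rewards: (T,)."""
        z = torch.tanh(self.encoder(obs0))
        loss = 0.0
        for t in range(actions.shape[0]):
            pred_r = self.reward(z).squeeze(-1)          # predicted reward at step t
            loss = loss + (pred_r - rewards[t]) ** 2     # reward is the only target
            z = torch.tanh(self.dynamics(torch.cat([z, actions[t]], dim=-1)))
        return loss / actions.shape[0]
```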

Latent models learned in this way are particularly effective at filtering spurious or distractor features, focusing only on components of the state space essential for reward. This approach achieves robust sample efficiency and planning performance, even in complex or partially observed tasks.

6. Automated and Structured Reward Modeling for Complex Tasks

Applications such as dialog management, RL from human feedback (RLHF), and safety-critical control often require sophisticated reward modeling frameworks:

  • In dialog systems, multi-level reward modeling decomposes feedback hierarchically (e.g., domain, act, slot), employing adversarial or inverse RL to disentangle fine-grained reward signals (2104.04748). Sequential propagation mechanisms ensure lower-level decisions are only rewarded when higher-level decisions are also correct, supporting accuracy and interpretability.
  • In LLM alignment, structured reward modeling as a reasoning task (via Reasoning Reward Models or a chain-of-rubrics mechanism) has produced state-of-the-art evaluators that justify reward judgments through explicit, interpretable reasoning chains. This design improves alignment with human preference and enhances transparency (2505.02387).
  • Lightweight reward modeling for best-of-N LLM sampling leverages the high-dimensional internal representations of generative models, producing efficient, accurate scalar rewards by linear combination of hidden states (2505.12225).
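
As a rough sketch of the last idea, a linear probe fit by ridge regression over hidden states can score candidates for best-of-N selection; the random features and labels below are stand-ins, and the exact setup of (2505.12225) may differ.

```python
# Lightweight reward model sketch: fit a linear map from hidden states to a
# scalar quality score, then rescore N candidates and keep the best one.
# Hidden states and labels here are random placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, n_train = 128, 256
H_train = rng.normal(size=(n_train, d))        # hidden states of labeled responses
w_true = rng.normal(size=d)
y_train = H_train @ w_true + 0.1 * rng.normal(size=n_train)   # scalar labels

# Ridge regression: w = (H^T H + lam I)^{-1} H^T y
lam = 1.0
w = np.linalg.solve(H_train.T @ H_train + lam * np.eye(d), H_train.T @ y_train)

# Best-of-N: score the candidates' hidden states and keep the argmax.
H_candidates = rng.normal(size=(8, d))
scores = H_candidates @ w
print("selected candidate:", int(np.argmax(scores)))
```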

7. Challenges, Practical Considerations, and Implications

Contemporary research has uncovered several challenges and best practices in state and reward modeling:

  • State augmentation and automata-based transformations are necessary but computationally costly; they should be used judiciously, especially for large-scale or real-time applications.
  • State representation should not merely reconstruct the environment; when the goal is planning and decision making, it should capture only reward-relevant features. This is particularly critical in high-dimensional or partially observable environments (1912.04201).
  • Reward shaping can improve learning but must preserve the optimal policy set; techniques such as potential-based shaping and predictor-based shaping are effective in practice (2105.03172, 2308.14919).
  • Risk-sensitive and distributional objectives require preserving the full trajectory reward distribution, not just expectations; improper simplification can produce misleading estimates and suboptimal policies (1612.02088).
  • Automated reward modeling—from observational data without access to actions—shows promise for generalizable RL and robotics (1806.01267).
  • Scalability is a persistent challenge; parameter-efficient reward models and efficient implementation of automata-augmented state spaces are important for practical deployment (2505.12225).
  • Safety and multi-objective scenarios necessitate augmenting the MDP with reset dynamics or vector-valued reward models, and require specialized planning and evaluation algorithms (2308.14919).

Conclusion

State and reward modeling is a central aspect of modern reinforcement learning, underpinning both theoretical developments and practical system design. Advances in temporal logic specification, reward automata, deep joint prediction, potential-based shaping, and automated reward inference have expanded the breadth and depth of problems that can be efficiently and accurately solved. As real-world applications demand more complex, risk-sensitive, and sophisticated agent behaviors, effective state and reward modeling methodologies are increasingly critical in aligning RL solutions with domain requirements and operational constraints.