Stochastic Reward Machines (SRMs)

Updated 18 May 2026
  • Stochastic Reward Machines (SRMs) are formal models that encode non-Markovian reward functions using probabilistic state transitions and reward outputs.
  • They extend deterministic reward machines by internalizing noise and providing exponential succinctness, which is crucial for modeling complex reward processes.
  • Learning algorithms for SRMs, including sampling-based and constraint-solving approaches, ensure convergence to models that are expectation-equivalent to the true reward process in noisy environments.

A Stochastic Reward Machine (SRM)—also described in the literature as a Probabilistic Reward Machine (PRM)—is a formalism for representing non-Markovian and stochastic reward functions in reinforcement learning (RL) environments. SRMs generalize deterministic reward machines (RMs), which encode arbitrary history-dependent rewards via Mealy automata, by admitting stochastic transitions and/or outputs, thereby capturing the semantics of noisy or inherently probabilistic reward processes. Precise representation and learning of such machines enable RL algorithms to operate in complex, temporally abstract, and noisy task domains with clear theoretical guarantees.

1. Formal Models of Stochastic Reward Machines

Multiple variants of stochastic reward machines have appeared in the literature, all aiming to augment the classical RM framework by explicitly encoding stochasticity in state transitions or reward outputs. Two principal formalisms are:

  • Probabilistic Reward Machines (PRMs) (Dohmen et al., 2021): Defined as tuples $(AP, \Gamma, Y, y_I, \tau, \varrho)$, where $AP$ is a set of atomic propositions (input alphabet $\Sigma = 2^{AP}$), $\Gamma$ is a finite set of possible reward values, $Y$ is the machine's state set with initial state $y_I$, $\tau: Y \times 2^{AP} \times Y \rightarrow [0,1]$ is the probabilistic transition function (with $\sum_{y' \in Y} \tau(y, \ell, y') = 1$), and $\varrho: Y \times 2^{AP} \times Y \rightarrow \Gamma$ attaches a reward to a given transition. The overall machine models a stochastic process over observations and rewards.
  • Stochastic Reward Machines (SRMs) (Corazza et al., 16 Oct 2025): Defined as tuples $(V, v_0, \Sigma, \delta, O, \tau)$, where $V$ is a set of states, $v_0$ is the initial state, $\Sigma$ is the input alphabet, $\delta$ is a possibly deterministic transition function, $O$ is a set of reward distribution objects (typically CDFs over $\mathbb{R}$), and $\tau$ maps transitions to reward distributions. Here every transition emits a reward sampled from its associated distribution.

For both models, the semantics is as follows: on reading a label sequence, the automaton traverses a sequence of states; each step takes a (possibly stochastic) transition and samples a reward from the corresponding distribution. Overall, the machine induces a joint distribution over observed reward sequences.
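The semantics above can be made concrete with a minimal sketch. It assumes deterministic transitions and represents each reward distribution as a sampling function rather than a CDF; the class `SRM`, the helpers `dirac` and `uniform`, and the labels `empty`, `gold`, and `depot` are hypothetical names chosen for illustration, not taken from the cited papers.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sampler = Callable[[random.Random], float]  # draws one reward from a distribution


def dirac(value: float) -> Sampler:
    """Point-mass distribution: the deterministic-RM special case."""
    return lambda rng: value


def uniform(lo: float, hi: float) -> Sampler:
    """Uniform reward distribution over [lo, hi]."""
    return lambda rng: rng.uniform(lo, hi)


@dataclass
class SRM:
    """Illustrative SRM with deterministic transitions (an assumption)."""
    initial: str
    delta: Dict[Tuple[str, str], str]        # (state, label) -> next state
    output: Dict[Tuple[str, str], Sampler]   # (state, label) -> reward distribution

    def run(self, labels: List[str], rng: random.Random) -> List[float]:
        """Read a label sequence and sample one reward per transition."""
        state, rewards = self.initial, []
        for label in labels:
            rewards.append(self.output[(state, label)](rng))
            state = self.delta[(state, label)]
        return rewards


# Hypothetical two-state machine: pick up gold (v0 -> v1), then deliver it.
LABELS = ("empty", "gold", "depot")
delta = {(v, l): v for v in ("v0", "v1") for l in LABELS}  # self-loops by default
delta[("v0", "gold")] = "v1"
delta[("v1", "depot")] = "v0"
output = {key: dirac(0.0) for key in delta}
output[("v1", "depot")] = uniform(0.8, 1.2)   # noisy delivery reward
srm = SRM("v0", delta, output)

rewards = srm.run(["empty", "gold", "empty", "depot"], random.Random(0))
```

Only the final delivery step draws from a non-degenerate distribution; replacing `uniform(0.8, 1.2)` with `dirac(1.0)` turns the same object into an ordinary deterministic reward machine.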

A key point is that classical deterministic RMs are special cases of SRMs/PRMs: they correspond to machines in which every transition is deterministic and every reward distribution is a Dirac delta measure.

2. Extension over Deterministic Reward Machines

SRMs and PRMs extend the expressive power of classical reward machines in several fundamental ways:

  • Internalizing Noise: Whereas deterministic RMs require all reward randomization (e.g., environmental stochasticity, measurement noise) to be encoded within the environment, SRMs attach stochasticity directly to reward transitions. This yields a more compact and transparent representation for reward functions that are stochastic but structure-preserving (Dohmen et al., 2021, Corazza et al., 16 Oct 2025).
  • Succinctness: Theoretical results show that reward-deterministic PRMs with $n$ states may require exponentially larger deterministic RMs to capture all possible label–reward traces, demonstrating the exponential succinctness advantage of the stochastic encoding (Dohmen et al., 2021, Theorem 5.3).
  • Expectation Equivalence: Two SRMs $\mathcal{A}$ and $\mathcal{B}$ are equivalent in expectation if, for every label trace, their expected rewards agree at every position. This property, shown to suffice for Q-learning optimality, lets learning algorithms match expected values rather than entire reward distributions (Corazza et al., 16 Oct 2025).
  • Noise Generalization: SRMs can explicitly model bounded or zero-mean noise on transitions, supporting rich real-world RL domains with imperfect observations or inherently variable rewards (Corazza et al., 16 Oct 2025).
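Expectation equivalence can be checked mechanically when transitions are deterministic and the mean of each reward distribution is known. The sketch below makes both assumptions; the dictionary layout and the function name `equivalent_in_expectation` are hypothetical. It explores the synchronous product of the two machines: because the state reached after any prefix is unique, the machines agree on every trace exactly when their mean rewards agree on every reachable pair of states.

```python
from collections import deque


def equivalent_in_expectation(m1, m2, labels, tol=1e-9):
    """Compare mean rewards on every reachable state pair of the product."""
    start = (m1["initial"], m2["initial"])
    seen, queue = {start}, deque([start])
    while queue:
        u, v = queue.popleft()
        for label in labels:
            # Expected rewards must match for the same label at this point.
            if abs(m1["mean"][(u, label)] - m2["mean"][(v, label)]) > tol:
                return False
            nxt = (m1["delta"][(u, label)], m2["delta"][(v, label)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


LABELS = ("h", "n")  # hypothetical labels, e.g. "harvest" and "nothing"

# One state; "h" pays a noisy reward with mean 1.0 (e.g. uniform on [0, 2]).
noisy = {"initial": "a",
         "delta": {("a", "h"): "a", ("a", "n"): "a"},
         "mean": {("a", "h"): 1.0, ("a", "n"): 0.0}}

# Two redundant states; "h" pays exactly 1.0 (Dirac).
exact = {"initial": "p",
         "delta": {("p", "h"): "q", ("p", "n"): "p",
                   ("q", "h"): "p", ("q", "n"): "q"},
         "mean": {("p", "h"): 1.0, ("p", "n"): 0.0,
                  ("q", "h"): 1.0, ("q", "n"): 0.0}}

# Same structure as `noisy`, but a different mean on "h".
biased = {"initial": "a",
          "delta": dict(noisy["delta"]),
          "mean": {("a", "h"): 0.5, ("a", "n"): 0.0}}

same = equivalent_in_expectation(noisy, exact, LABELS)
different = equivalent_in_expectation(noisy, biased, LABELS)
```

Here `noisy` and `exact` differ both in structure and in reward distributions, yet they are equivalent in expectation, which is the sense in which a learner may return a machine other than the ground truth.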

3. Learning Algorithms for SRMs/PRMs

Learning stochastic reward machines from interaction data necessitates new algorithmic approaches:

Sampling-based $L^*$-Style Learning

Adapted from Angluin’s $L^*$ automata learning, the PRM algorithm (Dohmen et al., 2021) maintains a sampling observation table over label–reward sequences and detects statistical equivalence between sample traces using methods such as the Hoeffding test. Key steps:

  • Construct observation sets $S$ (prefix-closed label–reward words) and $E$ (suffix-closed experiments).
  • For each pair of samples, check empirically if their trace distributions are statistically different.
  • Learn a hypothesis PRM by normalizing observed frequencies for transitions and reward assignments.
  • Use RL-based equivalence queries to detect counterexamples, iteratively refining the hypothesis.
  • Continue until the model is closed and consistent and no counterexample is found (see the detailed pseudocode in Dohmen et al., 2021).

This methodology guarantees, given sufficient exploration, almost sure convergence to a PRM that encodes the true reward process or to a machine equivalent to it over observable sequences, provided episodes are long enough relative to the number of machine states (Theorem 5.2).
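The statistical comparison of sample traces can be illustrated with a Hoeffding-bound test. The text above does not give the exact statistic used by the PRM algorithm, so the sketch below assumes the form familiar from the Alergia family of automata learners: two empirical frequencies are declared different when their gap exceeds $\sqrt{\tfrac{1}{2}\ln\tfrac{2}{\alpha}}\,(1/\sqrt{n_1} + 1/\sqrt{n_2})$. The function names and the count-table layout are hypothetical.

```python
import math


def hoeffding_different(f1, n1, f2, n2, alpha=0.05):
    """True if frequencies f1/n1 and f2/n2 differ beyond the Hoeffding bound."""
    if n1 == 0 or n2 == 0:
        return False  # no evidence either way
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) > bound


def compatible(counts_a, counts_b, alpha=0.05):
    """Two outcome-count tables are compatible if no outcome frequency differs."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    outcomes = set(counts_a) | set(counts_b)
    return not any(
        hoeffding_different(counts_a.get(o, 0), n_a, counts_b.get(o, 0), n_b, alpha)
        for o in outcomes
    )


# Hypothetical outcome counts observed after two different prefixes.
clearly_apart = compatible({"r1": 90, "r0": 10}, {"r1": 10, "r0": 90})
close = compatible({"r1": 52, "r0": 48}, {"r1": 48, "r0": 52})
too_few_samples = compatible({"r1": 9, "r0": 1}, {"r1": 1, "r0": 9})
```

The third case has the same frequency gap as the first but only ten samples per table, so the test cannot separate the two. This is one reason purely data-driven distinction of states can be slow when informative events are rare.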

Constraint-Solving with SRMI

The SRMI (Stochastic Reward Machine Inference) algorithm (Corazza et al., 16 Oct 2025) frames learning as a sequence of constraint satisfaction problems:

  • Interleave RL episodes (Q-learning with the current SRM hypothesis), trace collection, detection of traces that are inconsistent with the hypothesis, and repair via an SMT encoding.
  • Each repair either adjusts distribution means or discovers a new automaton structure with increased state space, maintaining consistency with all gathered counterexamples.
  • Final estimates are computed by midrange aggregation of observed rewards.
  • The approach is proven to converge (under non-containment and $\epsilon$-greedy exploration) to a minimal SRM equivalent in expectation to the true environment.

A distinctive feature is the explicit use of SMT solvers to discover the minimal SRM structure consistent with observed history-rich data.
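The midrange aggregation step is simple enough to sketch directly; the SMT-based structural repair is not reproduced here. The code assumes a hypothesis with deterministic transitions and traces given as lists of (label, reward) pairs. The helper names and the noise bound `eps` are hypothetical, and the spread check is only one plausible way to flag a transition that no single mean can explain under bounded noise.

```python
from collections import defaultdict


def rewards_per_transition(traces, delta, initial):
    """Group observed rewards by the hypothesis transition that emitted them."""
    observed = defaultdict(list)
    for trace in traces:
        state = initial
        for label, reward in trace:
            observed[(state, label)].append(reward)
            state = delta[(state, label)]
    return observed


def midrange(samples):
    """Midpoint between the smallest and largest observed reward."""
    return (min(samples) + max(samples)) / 2.0


def too_spread(observed, eps):
    """Transitions whose rewards cannot all lie within eps of one mean."""
    return [t for t, rs in observed.items() if max(rs) - min(rs) > 2 * eps]


# Hypothetical two-state hypothesis and two collected traces.
delta = {("v0", "gold"): "v1", ("v1", "depot"): "v0"}
traces = [[("gold", 0.0), ("depot", 0.9)],
          [("gold", 0.0), ("depot", 1.1)]]

observed = rewards_per_transition(traces, delta, "v0")
estimates = {t: midrange(rs) for t, rs in observed.items()}
flagged_tight = too_spread(observed, eps=0.05)
flagged_loose = too_spread(observed, eps=0.15)
```

For noise that is bounded and symmetric around the mean, the midrange depends only on the extreme observations, which makes it a natural estimator in this setting. A transition flagged by `too_spread` would be a candidate for structural repair rather than a mean adjustment.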

4. Theoretical Properties and Guarantees

Theoretical results underpin the use of SRMs/PRMs in reinforcement learning:

  • Product Construction Correctness: The cross-product of an underlying MDP and an SRM (or PRM) yields an augmented MDP whose reward process is Markovian and whose joint distribution over label–reward traces matches the original process (Dohmen et al., 2021).
  • Convergence: Both the sampling-based and constraint-based algorithms converge to machines that are either exactly or expectation-equivalent (in the sense discussed above) to the true underlying SRM, provided sufficient exploration and well-posedness (non-containment, bounded noise) (Dohmen et al., 2021, Corazza et al., 16 Oct 2025).
  • Policy Optimality: For any RL method such as Q-learning, if the reward structure provided by the SRM/PRM matches the true reward process in expectation, learned optimal policies are also optimal for the ground-truth environment (cf. Lemma 1 in (Corazza et al., 16 Oct 2025)).
  • Sample Complexity and Efficiency: Empirical studies show that learning with SRMs, versus classical RMs or baseline methods (e.g., “replay counterexamples and average”), can yield faster and more reliable convergence in the presence of reward noise, and avoids machine size explosion when fitting noisy data (Corazza et al., 16 Oct 2025).
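The product construction can be written out for the SRM tuple $(V, v_0, \Sigma, \delta, O, \tau)$ under two assumptions that the text above does not fix: $\delta$ is deterministic, and the environment is an MDP $(S, s_0, A, P)$ equipped with a labelling function $L: S \times A \times S \rightarrow \Sigma$. The symbols $S$, $s_0$, $A$, $P$, and $L$ are introduced here purely for illustration.

```latex
\begin{align*}
S' &= S \times V, \qquad s'_0 = (s_0, v_0), \\
P'\big((s', v') \mid (s, v), a\big) &= P(s' \mid s, a)\cdot
    \mathbb{1}\big[\, v' = \delta\big(v, L(s, a, s')\big) \big], \\
R'\big((s, v), a, (s', v')\big) &\sim \tau\big(v, L(s, a, s')\big).
\end{align*}
```

In this form the reward drawn at each step depends only on the current product state, the action, and the next product state, which is what makes the augmented process Markovian. Replacing the sampled reward by its mean, $\mathbb{E}_{r \sim \tau(v, L(s,a,s'))}[r]$, gives the expected-reward function on which expectation equivalence is defined.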

5. Applications, Empirical Results, and Case Studies

SRMs/PRMs have been validated on several benchmark RL domains with structured, non-Markovian, and noisy rewards:

Mining Example (Corazza et al., 16 Oct 2025): SRMs model final rewards that fluctuate with ore quality and market prices, with the relevant transitions emitting rewards sampled uniformly from intervals. The SRMI algorithm rapidly discovers the minimal-state SRM and achieves optimal average reward, outperforming deterministic baselines.

Harvesting Example (Corazza et al., 16 Oct 2025): In weather-influenced environments, reward on certain transitions (e.g., Harvest) is contingent on an exogenous stochastic process (Good/Med/Bad weather). SRMI reconstructs the optimal SRM, while baselines either time out or require extensive replay.

Office Gridworld Coffee Task (Dohmen et al., 2021): A PRM captures stochastic failure in the coffee machine, succinctly encoding the process in a minimal 4-state automaton and discovering it in under a minute of RL simulation.

Stochastic Games (Hu et al., 2023): Multi-agent RL with non-Markovian rewards encoded as RMs shows convergence to Nash equilibria in stochastic grid games. The QRM-SG algorithm, operating in the augmented (state, RM-state) space, achieves equilibrium sample efficiency unattainable by generic Nash Q-learning or MADDPG, especially as task complexity or noise increases.

These examples substantiate the empirical benefits of exposing reward machine structure (and stochasticity) for RL agents.

6. Integration with RL Algorithms and Broader Implications

SRMs/PRMs directly extend the toolkit available for RL in environments with:

  • Sparse and temporally extended rewards: Tasks requiring global constraints or temporal specifications become amenable to compact automata-theoretic reward encoding.
  • Noisy real-world domains: SRMs can explicitly model measurement error or environmental randomness in rewards.
  • Non-Markovian dependencies: SRMs augment the state of MDPs, converting non-Markovian RL problems into Markovian ones in the product space, with guaranteed preservation of reward semantics.
  • Multi-agent settings: Augmenting the agent-state space with SRM/RM state enables joint learning over complex tasks, including Nash equilibria in stochastic games (Hu et al., 2023).

The learning and product construction routines for SRMs can be paired with standard RL algorithms (e.g., Q-learning), requiring only minor modifications to account for augmented states and reward sampling.
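As a rough illustration of how small those modifications are, the sketch below runs tabular Q-learning on a three-cell corridor paired with a two-state SRM. Everything in it is an assumption made for the example: the corridor, the labels, the reward interval, the hyperparameters, and the choice to end an episode on delivery. The only changes relative to ordinary Q-learning are that the table is indexed by (environment state, machine state) and that the reward is sampled from the machine.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)                              # move left / right
LABELS = {0: "depot", 1: "empty", 2: "gold"}    # hypothetical cell labels

# Hypothetical SRM: collect gold (v0 -> v1), then deliver it to the depot.
DELTA = {("v0", "gold"): "v1", ("v0", "empty"): "v0", ("v0", "depot"): "v0",
         ("v1", "gold"): "v1", ("v1", "empty"): "v1", ("v1", "depot"): "v0"}


def sample_reward(v, label, rng):
    """Noisy reward on delivery, zero on every other transition."""
    return rng.uniform(0.9, 1.1) if (v, label) == ("v1", "depot") else 0.0


def step(pos, v, action, rng):
    """One step of the product: move, read the label, advance the machine."""
    nxt = min(2, max(0, pos + action))
    label = LABELS[nxt]
    reward = sample_reward(v, label, rng)
    done = (v, label) == ("v1", "depot")        # episode ends on delivery
    return nxt, DELTA[(v, label)], reward, done


def greedy(q, state, rng):
    """Greedy action with random tie-breaking."""
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def q_learning(episodes=3000, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)                      # keyed by ((pos, v), action)
    for _ in range(episodes):
        pos, v = 1, "v0"
        for _ in range(20):                     # step limit per episode
            state = (pos, v)
            explore = rng.random() < eps
            a = rng.choice(ACTIONS) if explore else greedy(q, state, rng)
            pos, v, r, done = step(pos, v, a, rng)
            future = 0.0 if done else gamma * max(q[((pos, v), b)] for b in ACTIONS)
            q[(state, a)] += alpha * (r + future - q[(state, a)])
            if done:
                break
    return q


q = q_learning()
```

In the middle cell the learned preference depends on the machine state: without gold the agent should head right, with gold it should head left. A policy over corridor positions alone could not express this.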

7. Open Problems and Limitations

While SRMs/PRMs provide a principled formalism for noisy and non-Markovian rewards, several limitations and research directions remain:

  • Statistical test sensitivity: Purely data-driven distinction of equivalence classes (e.g., via Hoeffding tests) can slow learning in rare-event settings; availability of an oracle for language recognition could accelerate convergence (Dohmen et al., 2021).
  • Scalability: Extending algorithms from toy domains to large-scale RL requires sophisticated exploration schedules and potentially hybridization with symbolic methods (Dohmen et al., 2021, Corazza et al., 16 Oct 2025).
  • Alternative similarity notions: Investigating different metrics for “difference” between empirical distributions or compatibility classes could yield improved empirical results (Dohmen et al., 2021).
  • SMT solving bottlenecks: Constraint-based learning relies on efficient SMT formulations; performance may degrade as problem size increases (Corazza et al., 16 Oct 2025).
  • Multi-agent SRMs: While the RM formalism is applied in multi-agent settings, generalization to multi-agent stochastic reward machines remains an active research direction.

SRMs/PRMs constitute a robust, theoretically grounded extension of reward machines for RL, supporting policy optimality and empirical tractability for a broad class of complex, stochastic, and non-Markovian tasks (Dohmen et al., 2021, Corazza et al., 16 Oct 2025, Hu et al., 2023).
