Probabilistic Reward Machine OmegaPRM

Updated 13 April 2026

Probabilistic Reward Machine OmegaPRM is a framework for modeling structured, stochastic, non-Markovian rewards through a probabilistic automaton with active structure learning.
It employs a product MDP construction to reduce non-Markovian reward processes to a Markovian setting, enabling efficient policy optimization and sample-efficient learning.
OmegaPRM utilizes an L*-style active learning algorithm to infer both the structure and parameters of the reward machine, providing strong theoretical and empirical guarantees.

A Probabilistic Reward Machine (PRM), and its actively learned variant OmegaPRM, is a framework for representing and learning structured, stochastic, non-Markovian reward processes in sequential decision making and reinforcement learning. PRMs generalize classic deterministic reward machines by permitting stochastic automaton transitions, thereby capturing complex stochastic dependencies in reward signals. This framework enables the reduction of non-Markovian reward problems to Markovian ones via product MDP constructions, supporting both efficient policy optimization and sample-efficient, provably correct structure learning. OmegaPRM denotes both the class of PRM models (occasionally written ΩPRM in the literature) and the L*-style active learning algorithm for inferring a PRM's structure and parameters from environment interactions, with strong theoretical and empirical guarantees in both standard RL and reasoning-over-sequences contexts.

1. Formal Definition of Probabilistic Reward Machines

A Probabilistic Reward Machine is a tuple

$R = (U, u_0, F, \Sigma, \delta, \rho)$

where:

$U$ is a finite set of automaton (memory) states;
$u_0 \in U$ is the distinguished initial state;
$F \subseteq U$ is a set of accepting states (present in the canonical definition, but often $F = \emptyset$ in continuous or average-reward settings);
$\Sigma$ is a finite alphabet of event labels (typically $2^\mathcal{P}$ for a proposition set $\mathcal{P}$ in MDPs);
$\delta: U \times \Sigma \to \mathcal{D}(U)$ is a probabilistic transition function, with $\mathcal{D}(U)$ denoting the set of all distributions over $U$;
$\rho: U \times \Sigma \times U \to \mathbb{R}$ is a deterministic reward function (alternatively, $\rho$ may map to a reward distribution, but in standard PRMs, each transition has a single reward).

At each step, the PRM, in state $u$ and observing symbol $\sigma$, samples its next state $u'$ from $\delta(u, \sigma)$ and emits reward $\rho(u, \sigma, u')$. Reward-determinism requires that distinct transitions from the same state and symbol $(u, \sigma)$ do not share the same reward unless they lead to the same successor $u'$.

This formalism generalizes deterministic reward machines (DRMs), in which every $\delta(u, \sigma)$ is a Dirac (point-mass) distribution, so all transitions are deterministic (Dohmen et al., 2021, Lin et al., 2024, Bourel et al., 2024).
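The definition can be made concrete with a minimal Python sketch. The class name `PRM`, the dictionary encoding of $\delta$ and $\rho$, and the coffee-machine task are all illustrative assumptions, not taken from the cited papers; accepting states are omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class PRM:
    """Toy probabilistic reward machine (hypothetical encoding)."""
    u0: str
    delta: dict  # delta[(u, sigma)] -> {next_state: probability}
    rho: dict    # rho[(u, sigma, next_state)] -> reward

    def step(self, u, sigma, rng):
        """Sample u' from delta(u, sigma) and return (u', rho(u, sigma, u'))."""
        dist = self.delta[(u, sigma)]
        states = list(dist)
        u_next = rng.choices(states, weights=[dist[s] for s in states])[0]
        return u_next, self.rho[(u, sigma, u_next)]

    def is_deterministic(self):
        """True when every delta(u, sigma) is a point mass (the DRM case)."""
        return all(max(d.values()) == 1.0 for d in self.delta.values())


# Hypothetical task: pressing the button yields coffee with probability 0.9.
coffee = PRM(
    u0="idle",
    delta={
        ("idle", "press"): {"has_coffee": 0.9, "idle": 0.1},
        ("idle", "wait"): {"idle": 1.0},
        ("has_coffee", "press"): {"has_coffee": 1.0},
        ("has_coffee", "wait"): {"has_coffee": 1.0},
    },
    rho={
        ("idle", "press", "has_coffee"): 1.0,
        ("idle", "press", "idle"): 0.0,
        ("idle", "wait", "idle"): 0.0,
        ("has_coffee", "press", "has_coffee"): 0.0,
        ("has_coffee", "wait", "has_coffee"): 0.0,
    },
)


def run(prm, labels, seed=0):
    """Feed a label sequence to the machine; return final state and rewards."""
    rng = random.Random(seed)
    u, rewards = prm.u0, []
    for sigma in labels:
        u, r = prm.step(u, sigma, rng)
        rewards.append(r)
    return u, rewards
```

In this toy machine the two outcomes of `("idle", "press")` carry different rewards, so the emitted reward identifies the successor state, which is the reward-determinism condition described above.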

2. Semantics: Markovianizing Stochastic Non-Markovian Rewards

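A PRM encodes a non-Markovian, possibly stochastic reward signal: for every observed label sequence $\sigma_1 \dots \sigma_n$ and reward sequence $r_1 \dots r_n$, the PRM induces the joint probability:

$\Pr(r_1 \dots r_n \mid \sigma_1 \dots \sigma_n) = \sum_{u_1, \dots, u_n \in U} \prod_{t=1}^{n} \delta(u_{t-1}, \sigma_t)(u_t) \, \mathbb{1}[\rho(u_{t-1}, \sigma_t, u_t) = r_t]$

where the initial state $u_0$ is fixed. This is constructed to match the product of history-conditioned reward distributions:

$\prod_{t=1}^{n} \Pr(r_t \mid \sigma_1 \dots \sigma_t, r_1 \dots r_{t-1})$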

Thus, PRMs can represent any (reward-deterministic) non-Markovian, stochastic reward process that admits this causal factorization.
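The probability that a PRM assigns to a reward sequence given a label sequence can be computed with a forward pass over automaton states. The sketch below assumes the dictionary encoding `delta[(u, sigma)] -> {u': prob}` and `rho[(u, sigma, u')] -> reward`; the function name and the toy machine are hypothetical.

```python
def trace_probability(u0, delta, rho, labels, rewards):
    """Probability of observing `rewards` given `labels`, starting from u0."""
    belief = {u0: 1.0}  # unnormalized mass over automaton states
    for sigma, r in zip(labels, rewards):
        nxt = {}
        for u, p in belief.items():
            for u_next, q in delta.get((u, sigma), {}).items():
                # Keep only transitions whose emitted reward matches r.
                if rho[(u, sigma, u_next)] == r:
                    nxt[u_next] = nxt.get(u_next, 0.0) + p * q
        belief = nxt
    return sum(belief.values())


# Hypothetical machine: "press" yields reward 1 with probability 0.9.
delta = {
    ("idle", "press"): {"has_coffee": 0.9, "idle": 0.1},
    ("has_coffee", "press"): {"has_coffee": 1.0},
}
rho = {
    ("idle", "press", "has_coffee"): 1.0,
    ("idle", "press", "idle"): 0.0,
    ("has_coffee", "press", "has_coffee"): 0.0,
}

p_fail_then_succeed = trace_probability(
    "idle", delta, rho, ["press", "press"], [0.0, 1.0]
)
```

Here the only path consistent with rewards `[0.0, 1.0]` fails once (probability 0.1) and then succeeds (probability 0.9). Summing over all reward sequences of a fixed length gives total mass 1.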

The crucial consequence is that, by augmenting the MDP state with the PRM state, the combined process is Markovian over the cross-product $S \times U$ of the MDP state space $S$ and the automaton state set $U$. In this way, model-free and model-based RL methods for Markovian environments can be leveraged directly (Dohmen et al., 2021).

3. Product MDP Construction and Policy Optimization

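For an MDP $\mathcal{M} = (S, A, P, s_0)$ with transition kernel $P(s' \mid s, a)$ and labeling function $L: S \times A \times S \to \Sigma$, and a PRM $R = (U, u_0, F, \Sigma, \delta, \rho)$, the product MDP $\mathcal{M} \otimes R$ is defined by:

$S^{\otimes} = S \times U$, $s_0^{\otimes} = (s_0, u_0)$,
$P^{\otimes}((s', u') \mid (s, u), a) = P(s' \mid s, a) \, \delta(u, L(s, a, s'))(u')$,
$r^{\otimes}((s, u), a, (s', u')) = \rho(u, L(s, a, s'), u')$.

The optimal $Q$-function on $\mathcal{M} \otimes R$ satisfies the Bellman equation:

$Q^*((s, u), a) = \sum_{s' \in S} \sum_{u' \in U} P(s' \mid s, a) \, \delta(u, \sigma)(u') \left[ \rho(u, \sigma, u') + \gamma \max_{a' \in A} Q^*((s', u'), a') \right]$

where $\sigma = L(s, a, s')$ and $\gamma \in [0, 1)$ is the discount factor (Dohmen et al., 2021).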

When the reward function is non-Markovian but specified by a (possibly unknown) PRM, this construction permits the use of any dynamic programming or RL algorithm in the product space. In practical contexts, the explicit cross-product can be large, but the PRM structure often yields factorizations or variance reductions in planning and exploration (Lin et al., 2024, Bourel et al., 2024).
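As a minimal sketch of planning in the product space, the following runs tabular Q-value iteration over pairs of environment state and automaton state. It assumes a known MDP and PRM in dictionary form, a labeling function of the form `L(s, a, s')`, and a discount factor of 0.9; the two-room world and all names are hypothetical.

```python
def product_q_iteration(S, A, P, L, U, delta, rho, gamma=0.9, iters=200):
    """Q-value iteration on the product state space S x U."""
    Q = {((s, u), a): 0.0 for s in S for u in U for a in A}
    for _ in range(iters):
        new_Q = {}
        for s in S:
            for u in U:
                for a in A:
                    total = 0.0
                    for s2, p in P[(s, a)].items():
                        sigma = L(s, a, s2)  # label emitted by the environment
                        for u2, q in delta[(u, sigma)].items():
                            best = max(Q[((s2, u2), a2)] for a2 in A)
                            total += p * q * (rho[(u, sigma, u2)] + gamma * best)
                    new_Q[((s, u), a)] = total
        Q = new_Q
    return Q


# Hypothetical two-room world: arriving in the kitchen presses the button.
S, A, U = ["hall", "kitchen"], ["stay", "move"], ["idle", "has_coffee"]
other = {"hall": "kitchen", "kitchen": "hall"}
P = {(s, "stay"): {s: 1.0} for s in S}
P.update({(s, "move"): {other[s]: 1.0} for s in S})


def L(s, a, s2):
    return "press" if s2 == "kitchen" else "wait"


delta = {
    ("idle", "press"): {"has_coffee": 0.9, "idle": 0.1},
    ("idle", "wait"): {"idle": 1.0},
    ("has_coffee", "press"): {"has_coffee": 1.0},
    ("has_coffee", "wait"): {"has_coffee": 1.0},
}
rho = {
    ("idle", "press", "has_coffee"): 1.0,
    ("idle", "press", "idle"): 0.0,
    ("idle", "wait", "idle"): 0.0,
    ("has_coffee", "press", "has_coffee"): 0.0,
    ("has_coffee", "wait", "has_coffee"): 0.0,
}

Q = product_q_iteration(S, A, P, L, U, delta, rho)
```

The same action in the same room has different values depending on the automaton state: once the machine is in `has_coffee`, no further reward is available, so every Q-value there is zero.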

4. OmegaPRM: Active Inference of PRM Structure and Parameters

OmegaPRM refers to an active, sampling-based algorithm for learning the structure and parameters of an unknown PRM from interaction data (Dohmen et al., 2021). The approach adapts Angluin’s L* automata inference to the stochastic setting and interleaves three phases:

Structure Learning: An observation table maintains prefix samples and suffix experiments, tallying empirical reward traces. Closedness requires that extending any prefix by a new (symbol, reward) yields a row statistically equivalent to one already present, and consistency ensures similar prefixes behave similarly under all experiments. Statistically indistinguishable rows are merged, while distinguishable ones prompt new experiments.
Parameter Estimation: Once a transition component has been observed at least a fixed threshold number of times, its probability and reward are estimated as relative frequencies. Otherwise, the transition is directed to a sink state.
Query Answering via RL:
- Membership queries ask for the reward distribution along a specific string; these are answered via model-free RL on the product MDP.
- Equivalence queries simulate RL on the current PRM hypothesis; rewards absent from the hypothesis but realized in the environment are used as counterexamples for refining the hypothesis.
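The parameter-estimation phase can be sketched as follows. This is an illustration under stated assumptions, not the procedure from Dohmen et al. (2021): the sample format `(u, sigma, reward, u_next)`, the per-`(u, sigma)` threshold `min_count`, and the name `SINK` are all hypothetical.

```python
from collections import Counter, defaultdict

SINK = "sink"  # hypothetical name for the fallback state


def estimate_parameters(samples, min_count):
    """Relative-frequency estimates of delta and rho from observed transitions.

    samples: iterable of (u, sigma, reward, u_next) tuples.
    Pairs (u, sigma) seen fewer than min_count times are sent to SINK.
    """
    counts = defaultdict(Counter)
    for u, sigma, r, u_next in samples:
        counts[(u, sigma)][(r, u_next)] += 1

    delta_hat, rho_hat = {}, {}
    for (u, sigma), outcomes in counts.items():
        total = sum(outcomes.values())
        if total < min_count:
            delta_hat[(u, sigma)] = {SINK: 1.0}
            continue
        per_state = Counter()
        for (r, u_next), n in outcomes.items():
            per_state[u_next] += n
            # Assumes reward-determinism: one reward per (u, sigma, u_next).
            rho_hat[(u, sigma, u_next)] = r
        delta_hat[(u, sigma)] = {v: n / total for v, n in per_state.items()}
    return delta_hat, rho_hat


samples = (
    [("idle", "press", 1.0, "has_coffee")] * 9
    + [("idle", "press", 0.0, "idle")] * 1
    + [("idle", "wait", 0.0, "idle")] * 2
)
delta_hat, rho_hat = estimate_parameters(samples, min_count=5)
```

With ten observations of `("idle", "press")` the estimate is kept, while the two observations of `("idle", "wait")` fall below the threshold and are routed to the sink.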

Convergence is almost sure: under sufficient exploration (all label sequences in the true PRM's support appear infinitely often), and with query episodes long enough to contain a distinguishing test for an $n$-state target PRM, OmegaPRM's inferred model almost surely converges to the true PRM (up to isomorphism) on the language of the underlying MDP. The key structural lemma is that two distinct PRMs with at most $n$ states can be distinguished by a test whose length is bounded in terms of $n$ (Dohmen et al., 2021).

OmegaPRM's sample and computational complexity depend polynomially on $1/\epsilon$ and $\log(1/\delta)$ for precision $\epsilon$ and confidence $1 - \delta$ (here $\delta$ denotes a failure probability, not the transition function), but exponentially on the true PRM size in the worst case.

Empirical results (probabilistic gridworlds with 5×5 state spaces and moderate machine failure probabilities) show sub-minute convergence and runtimes increasing linearly with the number of nonzero PRM transitions and exponentially with the number of PRM states (Dohmen et al., 2021).

5. Regret and Sample Complexity in RL with ΩPRMs

In model-based reinforcement learning with known or unknown ΩPRM structure, regret is a central metric for measuring sample efficiency (Bourel et al., 2024, Lin et al., 2024). A product MDP between the (possibly unknown) environment and known/learned ΩPRM is constructed to facilitate planning and exploration. The following are salient points:

Regret Bounds:

With known ΩPRM and unknown environment, extended value iteration (EVI)-based algorithms can achieve a regret bound expressed in terms of the number of observation states, the action set, the automaton states, and the label alphabet. This is provably better than the generic upper bound one would incur by treating the product MDP as a monolithic MDP without exploiting its structure (Bourel et al., 2024).

Efficient Algorithms:

UCBVI-PRM builds on standard UCBVI but leverages disjoint empirical counts for the environment and PRM components, and incorporates Bernstein-style confidence sets for tight exploration bonuses. The leading term in the regret is $\tilde{O}(\sqrt{HOAT})$, where $H$ is the episode horizon, $O$ the number of observations, $A$ the number of actions, and $T$ the total number of time steps (Lin et al., 2024). This matches the lower bound for deterministic reward machines up to polylogarithmic factors and extends to PRMs, whose automaton transitions are stochastic.

Exploitability of PRM Structure:

These results follow from decoupling uncertainty over the environment's transitions and the PRM's automaton transitions, rather than treating their cross-product monolithically (Bourel et al., 2024, Lin et al., 2024). In deterministic PRMs, further refinements can lower constants.
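A rough way to see why decoupling helps is to compare how many transition probabilities must be estimated in each case, and to pool environment counts across automaton states. The sketch below is only a parameter-counting illustration, not a regret bound or an algorithm from the cited papers; both function names are hypothetical.

```python
from collections import Counter, defaultdict


def count_parameters(num_obs, num_actions, num_machine_states, num_labels):
    """Transition entries to estimate: monolithic product vs. factored model."""
    product_states = num_obs * num_machine_states
    monolithic = product_states ** 2 * num_actions
    factored = num_obs ** 2 * num_actions + num_machine_states ** 2 * num_labels
    return monolithic, factored


def pooled_environment_estimate(transitions):
    """Estimate P(o' | o, a), ignoring the automaton component.

    transitions: iterable of ((o, u), a, (o2, u2)) product-space samples.
    """
    counts = defaultdict(Counter)
    for (o, _u), a, (o2, _u2) in transitions:
        counts[(o, a)][o2] += 1
    return {
        key: {o2: n / sum(c.values()) for o2, n in c.items()}
        for key, c in counts.items()
    }


monolithic, factored = count_parameters(25, 4, 3, 4)

transitions = [
    (("o1", "u0"), "a", ("o2", "u0")),
    (("o1", "u1"), "a", ("o2", "u1")),
    (("o1", "u1"), "a", ("o1", "u1")),
    (("o1", "u0"), "a", ("o2", "u1")),
]
P_hat = pooled_environment_estimate(transitions)
```

Samples gathered in different automaton states all contribute to the same environment estimate, so each environment transition is learned from more data than it would be in the monolithic product.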

Simulation Lemma for Non-Markovian Rewards:

Generalization of classic simulation lemmas to non-Markovian settings provides value-approximation guarantees for arbitrary history-dependent reward processes and enables reward-free exploration with sufficient state-covering (Lin et al., 2024).

Empirical evaluations confirm that algorithms exploiting the PRM factorization outperform baselines on grid-world "patrol" and stochastic warehouse pickup–delivery tasks, especially as horizon and observation set size increase (Lin et al., 2024).

6. OmegaPRM in Step-Level Process Supervision for LLM Reasoning

In LLM reasoning, OmegaPRM has been adapted as a fully automated MCTS-based process supervision pipeline for fine-grained, step-aware reward assignment in multi-step reasoning tasks (Luo et al., 2024). Key components include:

Process Reward Models (PRMs):

A process reward model assigns a per-step correctness probability $p_t$ to each step $t$ in a chain-of-thought (CoT). This supports reranking or filtering CoTs via an aggregate score such as the product $\prod_t p_t$.
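Reranking by an aggregate step score can be sketched in a few lines. The product of per-step probabilities is used here as one plausible aggregate, which is an assumption; the function names and candidate data are hypothetical.

```python
import math


def aggregate_score(step_probs):
    """Product of per-step correctness probabilities."""
    return math.prod(step_probs)


def rerank(candidates):
    """Sort (text, step_probs) pairs by aggregate score, best first."""
    return sorted(candidates, key=lambda c: aggregate_score(c[1]), reverse=True)


candidates = [
    ("solution A", [0.9, 0.5]),  # one weak step drags the product down
    ("solution B", [0.8, 0.8]),
]
ranked = rerank(candidates)
```

Under a product aggregate, a single low-probability step penalizes the whole chain, so a uniformly solid solution can outrank one with a confident start and a doubtful step.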

Automated Data Generation via OmegaPRM:

OmegaPRM builds a state-action tree over CoT prefixes, uses Monte Carlo rollouts to estimate per-node correctness, and employs divide-and-conquer binary search to efficiently identify the first step irreparably damaging the answer. This procedure yields over 1.5 million per-step supervision labels with zero human annotation cost, balancing positive and negative steps via PUCT-based exploration-value prioritization.
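The binary-search step can be sketched as follows. The sketch assumes that the empty prefix is recoverable and that once a prefix has a zero rollout success rate, every longer prefix does too; `rollout_success_rate` is a hypothetical callable standing in for Monte Carlo rollouts from a language model, and the tree search and PUCT prioritization are not modeled.

```python
def first_bad_step(steps, rollout_success_rate):
    """Index of the first step after which no rollout reaches the answer.

    rollout_success_rate(prefix) returns the fraction of sampled completions
    of `prefix` that end in the correct answer. Returns None if the full
    solution is still recoverable.
    """
    if not steps or rollout_success_rate(steps) > 0:
        return None
    # Invariant: prefix of length lo is recoverable, prefix of length hi is not.
    lo, hi = 0, len(steps)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if rollout_success_rate(steps[:mid]) > 0:
            lo = mid
        else:
            hi = mid
    return hi - 1


# Hypothetical oracle: the step at index 3 makes the answer unreachable.
calls = []


def fake_rollouts(prefix):
    calls.append(len(prefix))
    return 0.0 if len(prefix) > 3 else 0.5


steps = ["s0", "s1", "s2", "s3", "s4"]
bad = first_bad_step(steps, fake_rollouts)
```

The search evaluates a logarithmic number of prefixes rather than one per step, which is where the efficiency of the divide-and-conquer approach comes from.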

Outcomes:

This automated process supervision and PRM training yields significant improvements (e.g., 51% to 69.4% on MATH500) in LLM mathematical reasoning accuracy, outperforming prior process supervision datasets and continuing to improve as more supervision data are collected (Luo et al., 2024).

7. Extensions, Limitations, and Future Directions

ΩPRMs provide a flexible foundation for structured non-Markovian RL, but several open problems and extensions remain:

Partial Observability:

If the PRM state is unobserved, the problem becomes a partially observable MDP (POMDP), increasing algorithmic complexity (Bourel et al., 2024).

Hierarchical and Compositional Models:

Hierarchical ΩPRMs, constructed by composing multiple machines, may support recursive regret analyses and enable modular representations for complex tasks (Bourel et al., 2024).

Support for rich reward distributions:

While classical PRMs are reward-deterministic, extensions to reward-distribution-valued transitions are possible.

Sample and computational complexity:

While sample complexity is polynomial in $1/\epsilon$ and logarithmic in $1/\delta$ (precision $\epsilon$, failure probability $\delta$), structure learning can be exponential in PRM size and the alphabet when distinguishing suffixes are numerous (Dohmen et al., 2021).

The PRM and OmegaPRM frameworks unify, under one formalism, recent advances in automata-theoretic RL, structure inference for non-Markovian rewards, and automated process supervision for reasoning tasks, thereby serving as a central tool in the design and analysis of modern sequential decision systems (Dohmen et al., 2021, Bourel et al., 2024, Lin et al., 2024, Luo et al., 2024).


References (4)

Inferring Probabilistic Reward Machines from Non-Markovian Reward Processes for Reinforcement Learning (2021)

Efficient Reinforcement Learning in Probabilistic Reward Machines (2024)

Provably Efficient Exploration in Reward Machines with Low Regret (2024)

Improve Mathematical Reasoning in Language Models by Automated Process Supervision (2024)
