Probabilistic Reward Machine OmegaPRM

Updated 13 April 2026
  • Probabilistic Reward Machine OmegaPRM is a framework for modeling structured, stochastic, non-Markovian rewards through a probabilistic automaton with active structure learning.
  • It employs a product MDP construction to reduce non-Markovian reward processes to a Markovian setting, enabling efficient policy optimization and sample-efficient learning.
  • OmegaPRM utilizes an L*-style active learning algorithm to infer both the structure and parameters of the reward machine, providing strong theoretical and empirical guarantees.

A Probabilistic Reward Machine (PRM), and its actively learned variant OmegaPRM, is a framework for representing and learning structured, stochastic, non-Markovian reward processes in sequential decision making and reinforcement learning. PRMs generalize classic deterministic reward machines by permitting stochastic automaton transitions, thereby capturing complex stochastic dependencies in reward signals. This framework enables the reduction of non-Markovian reward problems to Markovian ones via product MDP constructions, supporting both efficient policy optimization and sample-efficient, provably correct structure learning. OmegaPRM denotes both the class of PRM models (occasionally written ΩPRM in the literature) and the L*-style active learning algorithm for inferring a PRM's structure and parameters from environment interactions, with strong theoretical and empirical guarantees in both standard RL and reasoning-over-sequences contexts.

1. Formal Definition of Probabilistic Reward Machines

A Probabilistic Reward Machine is a tuple

$$R = (U, u_0, F, \Sigma, \delta, \rho)$$

where:

  • $U$ is a finite set of automaton (memory) states;
  • $u_0 \in U$ is the distinguished initial state;
  • $F \subseteq U$ is a set of accepting states (present in the canonical definition, but often $F = \emptyset$ in continuous or average-reward settings);
  • $\Sigma$ is a finite alphabet of event labels (typically $2^{\mathcal{P}}$ for a proposition set $\mathcal{P}$ in MDPs);
  • $\delta: U \times \Sigma \to \mathcal{D}(U)$ is a probabilistic transition function, with $\mathcal{D}(U)$ denoting the set of all distributions over $U$;
  • $\rho: U \times \Sigma \times U \to \mathbb{R}$ is a deterministic reward function (alternatively, $\rho$ may map to a reward distribution, but in standard PRMs each transition has a single reward).

At each step, the PRM, in state $u$ and observing symbol $\sigma$, samples its next state $u'$ from $\delta(u, \sigma)$ and emits reward $\rho(u, \sigma, u')$. Reward-determinism requires that, from a given state $u$ on a given symbol $\sigma$, two transitions carry the same reward only if they lead to the same successor state $u'$. The observed reward therefore identifies which transition was taken.

This formalism generalizes deterministic reward machines (DRMs), in which every $\delta(u, \sigma)$ is a Dirac (point mass) distribution, so all transitions are deterministic (Dohmen et al., 2021, Lin et al., 2024, Bourel et al., 2024).
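The definition above can be made concrete with a small data structure. The following is a minimal sketch, not an implementation from the cited papers: the class name `PRM`, the dictionary encoding of $\delta$ and $\rho$, and the two-state "machine may break" example are all assumptions made for illustration, and rewards are taken to be attached to transitions $(u, \sigma, u')$.

```python
import random
from dataclasses import dataclass


@dataclass
class PRM:
    """Toy probabilistic reward machine (hypothetical encoding for illustration)."""
    initial: str
    delta: dict  # delta[(u, sigma)] -> {u_next: probability}
    rho: dict    # rho[(u, sigma, u_next)] -> reward

    def step(self, u, sigma, rng):
        """Sample u' from delta(u, sigma) and return (u', rho(u, sigma, u'))."""
        dist = self.delta[(u, sigma)]
        states = list(dist)
        u_next = rng.choices(states, weights=[dist[s] for s in states])[0]
        return u_next, self.rho[(u, sigma, u_next)]

    def is_deterministic(self):
        """True when every delta(u, sigma) is a point mass, i.e. the machine is a DRM."""
        return all(max(d.values()) == 1.0 for d in self.delta.values())


# Assumed example: pressing a button pays 1 unless the machine breaks (probability 0.1).
prm = PRM(
    initial="ok",
    delta={
        ("ok", "press"): {"ok": 0.9, "broken": 0.1},
        ("ok", "idle"): {"ok": 1.0},
        ("broken", "press"): {"broken": 1.0},
        ("broken", "idle"): {"broken": 1.0},
    },
    rho={
        ("ok", "press", "ok"): 1.0,
        ("ok", "press", "broken"): 0.0,
        ("ok", "idle", "ok"): 0.0,
        ("broken", "press", "broken"): 0.0,
        ("broken", "idle", "broken"): 0.0,
    },
)

rng = random.Random(0)
u = prm.initial
trace = []
for sigma in ["press", "idle", "press"]:
    u, r = prm.step(u, sigma, rng)
    trace.append((sigma, u, r))
```

In this example the two `press` transitions out of `ok` carry different rewards, so the observed reward reveals the successor state, which is the reward-determinism property described above.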

2. Semantics: Markovianizing Stochastic Non-Markovian Rewards

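A PRM encodes a non-Markovian, possibly stochastic reward signal: for every observed label sequence $\sigma_1 \cdots \sigma_n \in \Sigma^*$ and reward sequence $r_1 \cdots r_n$, the PRM induces the joint probability

$$\Pr(r_1 \cdots r_n \mid \sigma_1 \cdots \sigma_n) = \sum_{u_1, \ldots, u_n \in U} \prod_{i=1}^{n} \delta(u_{i-1}, \sigma_i)(u_i) \, \mathbb{1}[\rho(u_{i-1}, \sigma_i, u_i) = r_i]$$

where the initial state $u_0$ is fixed. This is constructed to match the product of history-conditioned reward distributions:

$$\Pr(r_1 \cdots r_n \mid \sigma_1 \cdots \sigma_n) = \prod_{i=1}^{n} \Pr(r_i \mid \sigma_1 \cdots \sigma_i, \, r_1 \cdots r_{i-1})$$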

Thus, PRMs can represent any (reward-deterministic) non-Markovian, stochastic reward process that admits this causal factorization.

The crucial consequence is that, by augmenting the MDP state with the PRM state, the combined process is Markovian over the cross-product $S \times U$ of the MDP state set $S$ and the PRM state set $U$. In this way, model-free and model-based RL methods for Markovian environments can be applied directly (Dohmen et al., 2021).
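The joint probability of a reward sequence given a label sequence can be computed with a forward pass over automaton states, much like the forward algorithm for hidden Markov models. The sketch below assumes rewards are attached to transitions $(u, \sigma, u')$; the function name `trace_probability` and the toy machine are hypothetical.

```python
def trace_probability(delta, rho, u0, labels, rewards):
    """Probability that the PRM emits `rewards` when fed `labels`, starting in u0."""
    belief = {u0: 1.0}  # unnormalized mass on each automaton state
    for sigma, r in zip(labels, rewards):
        nxt = {}
        for u, p in belief.items():
            for u_next, q in delta[(u, sigma)].items():
                # Keep only transitions whose reward matches the observed one.
                if rho[(u, sigma, u_next)] == r:
                    nxt[u_next] = nxt.get(u_next, 0.0) + p * q
        belief = nxt
    return sum(belief.values())


# Assumed toy machine: "press" pays 1 and keeps the machine ok with probability 0.9.
delta = {
    ("ok", "press"): {"ok": 0.9, "broken": 0.1},
    ("broken", "press"): {"broken": 1.0},
}
rho = {
    ("ok", "press", "ok"): 1.0,
    ("ok", "press", "broken"): 0.0,
    ("broken", "press", "broken"): 0.0,
}

labels = ["press", "press"]
p_11 = trace_probability(delta, rho, "ok", labels, [1.0, 1.0])  # 0.9 * 0.9
p_10 = trace_probability(delta, rho, "ok", labels, [1.0, 0.0])  # 0.9 * 0.1
p_00 = trace_probability(delta, rho, "ok", labels, [0.0, 0.0])  # 0.1 * 1.0
p_01 = trace_probability(delta, rho, "ok", labels, [0.0, 1.0])  # impossible once broken
```

The reward at the second step depends on whether the machine broke at the first step, which is exactly the history dependence that the automaton state summarizes.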

3. Product MDP Construction and Policy Optimization

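For an MDP $M = (S, A, P, s_0)$ with transition kernel $P(s' \mid s, a)$ and labeling function $L: S \times A \times S \to \Sigma$, and a PRM $R = (U, u_0, F, \Sigma, \delta, \rho)$, the product MDP $M \otimes R$ is defined by:

  • state space $S \times U$ with initial state $(s_0, u_0)$, and action set $A$,
  • transition kernel $P^{\otimes}((s', u') \mid (s, u), a) = P(s' \mid s, a) \, \delta(u, L(s, a, s'))(u')$,
  • reward $r^{\otimes}((s, u), a, (s', u')) = \rho(u, L(s, a, s'), u')$.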

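The optimal $Q$-function on the product MDP $M \otimes R$ satisfies the Bellman equation:

$$Q^*((s, u), a) = \sum_{s' \in S} P(s' \mid s, a) \sum_{u' \in U} \delta(u, \ell)(u') \left[ \rho(u, \ell, u') + \gamma \max_{a' \in A} Q^*((s', u'), a') \right]$$

where $P$ is the MDP transition kernel, $\ell = L(s, a, s')$ is the label emitted by the labeling function $L$, and $\gamma \in [0, 1)$ is the discount factor (Dohmen et al., 2021).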

When the reward function is non-Markovian but specified by a (possibly unknown) PRM, this construction permits the use of any dynamic programming or RL algorithm in the product space. In practical contexts, the explicit cross-product can be large, but the PRM structure often yields factorizations or variance reductions in planning and exploration (Lin et al., 2024, Bourel et al., 2024).
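As an illustration of running dynamic programming in the product space, the sketch below performs synchronous Q-value iteration over pairs (MDP state, PRM state). It assumes a tabular MDP, a labeling function on transitions, and rewards attached to automaton transitions; the function name `product_q_iteration` and the two-room example are hypothetical.

```python
def product_q_iteration(S, A, U, P, label, delta, rho, gamma=0.9, iters=200):
    """Q-value iteration on the product of an MDP and a PRM (tabular sketch)."""
    Q = {((s, u), a): 0.0 for s in S for u in U for a in A}
    for _ in range(iters):
        new_Q = {}
        for (s, u), a in Q:
            total = 0.0
            for s_next, p in P[(s, a)].items():
                sigma = label(s, a, s_next)  # label observed on this MDP transition
                for u_next, q in delta[(u, sigma)].items():
                    best = max(Q[((s_next, u_next), a2)] for a2 in A)
                    total += p * q * (rho[(u, sigma, u_next)] + gamma * best)
            new_Q[((s, u), a)] = total
        Q = new_Q
    return Q


# Assumed toy MDP: two rooms; "go" switches rooms, "stay" does not.
S, A, U = ["home", "button"], ["stay", "go"], ["ok", "broken"]
other = {"home": "button", "button": "home"}
P = {(s, a): {(s if a == "stay" else other[s]): 1.0} for s in S for a in A}


def label(s, a, s_next):
    """Entering or remaining in the button room counts as a press."""
    return "press" if s_next == "button" else "idle"


# Assumed toy PRM: a press pays 1 unless the machine breaks (probability 0.1).
delta = {
    ("ok", "press"): {"ok": 0.9, "broken": 0.1},
    ("ok", "idle"): {"ok": 1.0},
    ("broken", "press"): {"broken": 1.0},
    ("broken", "idle"): {"broken": 1.0},
}
rho = {
    ("ok", "press", "ok"): 1.0,
    ("ok", "press", "broken"): 0.0,
    ("ok", "idle", "ok"): 0.0,
    ("broken", "press", "broken"): 0.0,
    ("broken", "idle", "broken"): 0.0,
}

Q = product_q_iteration(S, A, U, P, label, delta, rho)
```

With $\gamma = 0.9$, the value of staying in the button room while the machine is `ok` solves $V = 0.9(1 + 0.9V)$, giving $V = 0.9 / 0.19$, and every state paired with `broken` has value zero. The same physical room therefore has different values depending on the automaton state, which a policy over MDP states alone could not represent.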

4. OmegaPRM: Active Inference of PRM Structure and Parameters

OmegaPRM refers to an active, sampling-based algorithm for learning the structure and parameters of an unknown PRM from interaction data (Dohmen et al., 2021). The approach adapts Angluin’s L* automata inference to the stochastic setting and interleaves three phases:

  1. Structure Learning: An observation table maintains prefix samples and suffix experiments, tallying empirical reward traces. Closedness requires that extending any prefix by a new (symbol, reward) yields a row statistically equivalent to one already present, and consistency ensures similar prefixes behave similarly under all experiments. Statistically indistinguishable rows are merged, while distinguishable ones prompt new experiments.
  2. Parameter Estimation: Once a transition component has been observed at least a prescribed minimum number of times, its probability and reward are estimated as relative frequencies. Otherwise, the transition is directed to a sink state.
  3. Query Answering via RL:
    • Membership queries ask for the reward distribution along a specific string; these are answered via model-free RL on the product MDP.
    • Equivalence queries simulate RL on the current PRM hypothesis; rewards absent from the hypothesis but realized in the environment are used as counterexamples for refining the hypothesis.
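The parameter-estimation phase can be sketched as frequency counting with a minimum-count threshold. This is an illustrative reading of the description above, not the published procedure: the function name `estimate_transitions`, the sample format, the name `SINK`, and the zero reward on sink transitions are all assumptions.

```python
from collections import Counter, defaultdict

SINK = "sink"  # hypothetical name for the fallback state


def estimate_transitions(samples, min_count):
    """Estimate delta and rho from (u, sigma, u_next, reward) samples.

    State-symbol pairs seen fewer than `min_count` times are routed to SINK.
    """
    counts = defaultdict(Counter)
    for u, sigma, u_next, r in samples:
        counts[(u, sigma)][(u_next, r)] += 1

    delta_hat, rho_hat = {}, {}
    for key, outcomes in counts.items():
        n = sum(outcomes.values())
        if n < min_count:
            delta_hat[key] = {SINK: 1.0}
            rho_hat[key + (SINK,)] = 0.0  # assumed placeholder reward
            continue
        delta_hat[key] = {}
        for (u_next, r), c in outcomes.items():
            # Relative frequency of reaching u_next from (u, sigma).
            delta_hat[key][u_next] = delta_hat[key].get(u_next, 0.0) + c / n
            rho_hat[key + (u_next,)] = r
    return delta_hat, rho_hat


# Assumed data: ten observations of ("ok", "press"), two of ("ok", "idle").
samples = (
    [("ok", "press", "ok", 1.0)] * 9
    + [("ok", "press", "broken", 0.0)]
    + [("ok", "idle", "ok", 0.0)] * 2
)
delta_hat, rho_hat = estimate_transitions(samples, min_count=5)
```

Here the well-sampled pair `("ok", "press")` receives frequency estimates, while the under-sampled pair `("ok", "idle")` falls back to the sink until more data arrives.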

Convergence is almost sure: under sufficient exploration (all label sequences in the true PRM's support appear infinitely often), and with query episodes that are long enough relative to the number of states $n$ of the target PRM, OmegaPRM's inferred model almost surely converges to the true PRM (up to isomorphism) on the language of the underlying MDP. The key structural lemma is that two distinct PRMs with at most $n$ states can be distinguished by a test whose length is bounded in terms of $n$ (Dohmen et al., 2021).

OmegaPRM's sample and computational complexity depend polynomially on $1/\epsilon$ and $\log(1/\delta)$ for precision $\epsilon$ and confidence $1 - \delta$ (here $\delta$ is a failure probability, distinct from the transition function), but exponentially on the true PRM size in the worst case.

Empirical results (probabilistic gridworlds with 5×5 state spaces and moderate machine failure probabilities) show sub-minute convergence and runtimes increasing linearly with the number of nonzero PRM transitions and exponentially with the number of PRM states (Dohmen et al., 2021).

5. Regret and Sample Complexity in RL with ΩPRMs

In model-based reinforcement learning with known or unknown ΩPRM structure, regret is a central metric for measuring sample efficiency (Bourel et al., 2024, Lin et al., 2024). A product MDP between the (possibly unknown) environment and known/learned ΩPRM is constructed to facilitate planning and exploration. The following are salient points:

  • Regret Bounds:

With a known ΩPRM and an unknown environment, extended value iteration (EVI)-based algorithms can achieve a regret bound expressed in terms of the number of observation states, the action set, the automaton states, and the label alphabet. This bound is provably better than the generic upper bound one would incur by treating the product MDP as unstructured (Bourel et al., 2024).

  • Efficient Algorithms:

UCBVI-PRM builds on standard UCBVI but keeps separate empirical counts for the environment and PRM components, and incorporates Bernstein-style confidence sets for tight exploration bonuses. Its leading regret term is stated in terms of the episode horizon $H$ (Lin et al., 2024). This term matches the lower bound for deterministic reward machines up to polylogarithmic factors, and the bound holds for PRMs because the analysis covers stochastic automaton transitions.

  • Exploitability of PRM Structure:

These results follow from decoupling uncertainty over the environment's transitions and the PRM's automaton transitions, rather than treating their cross-product monolithically (Bourel et al., 2024, Lin et al., 2024). In deterministic PRMs, further refinements can lower constants.

  • Simulation Lemma for Non-Markovian Rewards:

Generalization of classic simulation lemmas to non-Markovian settings provides value-approximation guarantees for arbitrary history-dependent reward processes and enables reward-free exploration with sufficient state-covering (Lin et al., 2024).

Empirical evaluations confirm that algorithms exploiting the PRM factorization outperform baselines on grid-world "patrol" and stochastic warehouse pickup–delivery tasks, especially as horizon and observation set size increase (Lin et al., 2024).

6. OmegaPRM in Step-Level Process Supervision for LLM Reasoning

In LLM reasoning, OmegaPRM has been adapted as a fully automated MCTS-based process supervision pipeline for fine-grained, step-aware reward assignment in multi-step reasoning tasks (Luo et al., 2024). Key components include:

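  • Process Reward Model:

In this setting, PRM denotes a process reward model rather than a probabilistic reward machine. It assigns a per-step correctness probability $p_t$ to each step $t$ of a chain-of-thought (CoT), which supports reranking or filtering CoTs via an aggregate of the per-step scores.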
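Reranking by an aggregate of per-step scores can be sketched as follows. The aggregation rule is an assumption here: the product and the minimum of the step probabilities are two common choices, and the candidate names and scores are made up for illustration.

```python
import math


def aggregate(step_probs, how="product"):
    """Collapse per-step correctness probabilities into one solution-level score."""
    if how == "product":
        return math.prod(step_probs)
    if how == "min":
        return min(step_probs)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical per-step scores for three candidate chains of thought.
candidates = {
    "cot_a": [0.9, 0.8, 0.95],
    "cot_b": [0.99, 0.3, 0.99],
    "cot_c": [0.7, 0.7, 0.7],
}

ranked = sorted(candidates, key=lambda name: aggregate(candidates[name]), reverse=True)
best = ranked[0]
```

Both rules penalize `cot_b` for its single weak step even though its other steps score highest, which is the behavior that step-level supervision is meant to provide.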

  • Automated Data Generation via OmegaPRM:

OmegaPRM builds a state-action tree over CoT prefixes, uses Monte Carlo rollouts to estimate per-node correctness, and employs divide-and-conquer binary search to efficiently identify the first step irreparably damaging the answer. This procedure yields over 1.5 million per-step supervision labels with zero human annotation cost, balancing positive and negative steps via PUCT-based exploration-value prioritization.
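The binary search for the first damaging step can be sketched as below. This is a simplified illustration that omits the tree and the PUCT-based selection: the function name `first_error_step` and the `rollout_success` callback are hypothetical, a prefix is treated as recoverable when at least one rollout from it reaches the correct answer, and the empty prefix is assumed to be recoverable.

```python
def first_error_step(steps, rollout_success, num_rollouts=8):
    """Return the index of the first step after which no rollout succeeds.

    Returns None when the full solution is still judged recoverable.
    """
    def mc_value(k):
        # Fraction of rollouts continuing from steps[:k] that reach the correct answer.
        hits = sum(bool(rollout_success(steps[:k])) for _ in range(num_rollouts))
        return hits / num_rollouts

    lo, hi = 0, len(steps)
    if mc_value(hi) > 0:
        return None
    # Invariant: the prefix of length lo is recoverable, the prefix of length hi is not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mc_value(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi - 1


# Deterministic stand-in for a model rollout: a prefix can still succeed
# only if it does not contain the flawed step.
def toy_rollout(prefix):
    return "wrong" not in prefix


solution = ["s1", "s2", "wrong", "s4", "s5"]
bad_index = first_error_step(solution, toy_rollout)
clean_index = first_error_step(["s1", "s2", "s3"], toy_rollout)
```

The search evaluates a logarithmic number of prefixes instead of every step, which is where the efficiency gain over step-by-step rollouts comes from. With a real language model the rollouts are stochastic, so the recoverability estimate is noisy.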

  • Outcomes:

This automated process supervision and PRM training yields significant improvements (e.g., 51% to 69.4% on MATH500) in LLM mathematical reasoning accuracy, outperforming prior process supervision datasets and continuing to improve as more supervision data are collected (Luo et al., 2024).

7. Extensions, Limitations, and Future Directions

ΩPRMs provide a flexible foundation for structured non-Markovian RL, but several open problems and extensions remain:

  • Partial Observability:

If the PRM state is unobserved, the problem reduces to a partially observable MDP (POMDP), increasing algorithmic complexity (Bourel et al., 2024).

  • Hierarchical and Compositional Models:

Hierarchical ΩPRMs, constructed by composing multiple machines, may support recursive regret analyses and enable modular representations for complex tasks (Bourel et al., 2024).

  • Support for Rich Reward Distributions:

While classical PRMs are reward-deterministic, extensions to reward-distribution-valued transitions are possible.

  • Sample and Computational Complexity:

While polynomial in $1/\epsilon$ and logarithmic in $1/\delta$ (for precision $\epsilon$ and failure probability $\delta$), structure learning can be exponential in the PRM size and the alphabet when distinguishing suffixes are numerous (Dohmen et al., 2021).

The PRM and OmegaPRM frameworks unify, under one formalism, recent advances in automata-theoretic RL, structure inference for non-Markovian rewards, and automated process supervision for reasoning tasks, thereby serving as a central tool in the design and analysis of modern sequential decision systems (Dohmen et al., 2021, Bourel et al., 2024, Lin et al., 2024, Luo et al., 2024).
