Mealy Reward Machines

Updated 28 February 2026

Mealy Reward Machines are finite-state automata that encode non-Markovian reward functions and sequence-dependent task requirements in reinforcement learning.
They enable efficient subproblem decomposition by synchronizing RM states with environment states to restore the Markov property.
Active and passive learning methods infer and minimize RMs from experience, accelerating policy convergence in complex reinforcement learning scenarios.

Mealy Reward Machines (RMs) are formal automata-based structures designed to express non-Markovian reward functions in reinforcement learning (RL) and decision processes. They generalize the reward function beyond traditional MDP formulations by capturing dependencies on sequences of observations, events, or actions—thus naturally encoding tasks where the reward depends on history or high-level event traces. As deterministic finite-state transducers of the Mealy type, RMs have emerged as a prominent tool to imbue RL agents with structured, explicit task memory, support efficient subproblem decomposition, and facilitate learning policies in partially observable or temporally complex environments.

1. Formal Definition and Semantics

A Mealy Reward Machine is a tuple

$\mathcal{RM} = (U, u_0, \mathcal{I}, \mathcal{O}, \delta_U, \delta_R)$

where:

$U$ is a finite set of RM states.
$u_0$ is the initial RM state.
$\mathcal{I}$ is the input alphabet, typically composed of atomic propositions ($\mathrm{AP}$), low-level observations ($O$), and actions ($A$), often formalized as $\mathcal{I} = 2^{\mathrm{AP} \times O \times A}$.
$\mathcal{O} = \mathbb{R}$ is the output or reward alphabet.
$\delta_U: U \times \mathcal{I} \to U$ is the state-transition function.
$\delta_R: U \times \mathcal{I} \to \mathcal{O}$ is the reward output function (Wu et al., 3 Aug 2025, Icarte et al., 2021, Xu et al., 2019, Rens et al., 2020, Rens et al., 2020).

Given an input sequence $w = (i_1, i_2, \ldots, i_t) \in \mathcal{I}^t$, the RM produces a reward at each step $k = 1, \ldots, t$ as follows: $u_{k} = \delta_U(u_{k-1}, i_k), \quad r_k = \delta_R(u_{k-1}, i_k)$, with $u_0$ as the starting state. Because $u_{k-1}$ summarizes the inputs $i_1 \ldots i_{k-1}$, the reward at step $k$ is a function of the entire prefix $i_1 \ldots i_k$. This allows RMs to model non-Markovian reward dependencies: the reward at time $t$ may depend on events arbitrarily far in the agent’s past.
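The run semantics above can be illustrated with a minimal Python sketch. The class name `MealyRewardMachine`, the string-valued input symbols, and the "key before exit" task are illustrative assumptions, as is the convention that undefined (state, input) pairs act as zero-reward self-loops.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, List, Tuple

State = Hashable
Symbol = Hashable


@dataclass
class MealyRewardMachine:
    """Deterministic Mealy machine with real-valued outputs (illustrative)."""
    initial: State                                # u_0
    delta_u: Dict[Tuple[State, Symbol], State]    # state-transition function
    delta_r: Dict[Tuple[State, Symbol], float]    # reward-output function

    def step(self, u: State, i: Symbol) -> Tuple[State, float]:
        # Assumption: a missing (u, i) pair is a self-loop with reward 0.
        return self.delta_u.get((u, i), u), self.delta_r.get((u, i), 0.0)

    def run(self, word: Iterable[Symbol]) -> List[float]:
        """Return r_1, ..., r_t for the input word i_1, ..., i_t."""
        u, rewards = self.initial, []
        for i in word:
            u, r = self.step(u, i)
            rewards.append(r)
        return rewards


# Hypothetical task: "exit" is rewarded only after "key" has been seen.
rm = MealyRewardMachine(
    initial="u0",
    delta_u={("u0", "key"): "u1", ("u1", "exit"): "u2"},
    delta_r={("u1", "exit"): 1.0},
)
print(rm.run(["exit", "key", "exit"]))
```

In this sketch the same input `"exit"` yields different rewards depending on whether `"key"` occurred earlier, which is exactly the history dependence the RM state is meant to capture.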

2. Non-Markovianity and Product-State Construction

Traditional MDPs are intrinsically Markovian; standard reward functions depend only on the current state and action. In practice, many tasks (e.g., “collect all keys before exiting”) are inherently non-Markovian. RMs address this by providing explicit history-tracking via their state space.

For learning and planning, RMs are often synchronized (via the product construction) with an MDP, yielding a product MDP $P = M \otimes_\ell \mathcal{RM}$ with states of the form $(s, u)$ , where

$s \in S$ is the environment state,
$u \in U$ is the RM state,
and transitions and rewards are defined so that the environment and the RM advance in synchrony: the labeling function $\ell$ maps each environment step to an RM input, and the RM's transition and output functions then determine the next RM state and the reward (Rens et al., 2020, Icarte et al., 2021, Rens et al., 2020).

This product construction restores the Markov property at the product level: $\Pr(o_{t+1}, r_t~|~o_0, a_0, \ldots, o_t, a_t) = \Pr(o_{t+1}, r_t~|~u_t, o_t, a_t)$, so standard RL algorithms can be applied directly to the product states.
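A minimal sketch of the synchronized product follows, assuming a hypothetical environment interface (`reset()` returning a state, `step(action)` returning the next state) and a hypothetical labeling function `label(s, a, s_next)`. The toy `Corridor` environment and all names are assumptions made for illustration, not an API from the cited works.

```python
class Corridor:
    """Toy environment (hypothetical): positions 0..2, actions -1 or +1."""

    def reset(self):
        self.pos = 1
        return self.pos

    def step(self, action):
        self.pos = min(2, max(0, self.pos + action))
        return self.pos


def label(s, a, s_next):
    """Labeling function: maps one environment step to an RM input symbol."""
    if s_next == 2:
        return "key"
    if s_next == 0:
        return "exit"
    return "none"


class ProductEnv:
    """Pairs the environment state s with the RM state u."""

    def __init__(self, env, label, u0, delta_u, delta_r):
        self.env, self.label = env, label
        self.u0, self.delta_u, self.delta_r = u0, delta_u, delta_r

    def reset(self):
        self.s, self.u = self.env.reset(), self.u0
        return (self.s, self.u)

    def step(self, action):
        s_next = self.env.step(action)
        i = self.label(self.s, action, s_next)
        # The reward is read from the RM state *before* the transition (Mealy output).
        reward = self.delta_r.get((self.u, i), 0.0)
        # Assumption: undefined (u, i) pairs leave the RM state unchanged.
        self.u = self.delta_u.get((self.u, i), self.u)
        self.s = s_next
        return (self.s, self.u), reward


DELTA_U = {("u0", "key"): "u1", ("u1", "exit"): "u2"}
DELTA_R = {("u1", "exit"): 1.0}
product = ProductEnv(Corridor(), label, "u0", DELTA_U, DELTA_R)
```

Reaching position 0 pays nothing while the RM is in `u0` and pays 1 once it is in `u1`, so the reward is a function of the product state even though it is not a function of the environment state alone.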

3. Learning Reward Machines: Passive and Active Methods

Passive Learning

Passive RM inference is based on accumulated experience—observed traces and associated rewards. Algorithms such as RPNI (adapted to Mealy-transducers) build prefix-tree transducers from sample data and iteratively merge states consistent with observed outputs, yielding minimal RMs consistent with the data (Xu et al., 2019).
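The first phase of such state-merging methods can be sketched as follows. The code builds a prefix-tree transducer from (input word, reward sequence) samples and provides a local output-compatibility test between two states. It deliberately omits the merge-and-fold loop of RPNI itself, and the function names and sample format are assumptions made for illustration.

```python
def build_prefix_tree(samples):
    """Build a prefix-tree Mealy transducer.

    samples: iterable of (inputs, rewards) pairs of equal length.
    Returns (delta_u, delta_r) over integer states; state 0 is the root.
    """
    delta_u, delta_r = {}, {}
    next_id = 1
    for inputs, rewards in samples:
        u = 0
        for i, r in zip(inputs, rewards):
            if (u, i) in delta_r and delta_r[(u, i)] != r:
                # The same prefix produced two different rewards.
                raise ValueError("samples are inconsistent with a deterministic Mealy machine")
            if (u, i) not in delta_u:
                delta_u[(u, i)] = next_id
                next_id += 1
            delta_r[(u, i)] = r
            u = delta_u[(u, i)]
    return delta_u, delta_r


def locally_compatible(u, v, delta_r):
    """True if u and v agree on the reward of every input both have seen."""
    out_u = {i: r for (q, i), r in delta_r.items() if q == u}
    out_v = {i: r for (q, i), r in delta_r.items() if q == v}
    return all(out_u[i] == out_v[i] for i in out_u.keys() & out_v.keys())


samples = [
    (["key", "exit"], [0.0, 1.0]),
    (["exit"], [0.0]),
    (["key", "key"], [0.0, 0.0]),
]
tree_u, tree_r = build_prefix_tree(samples)
```

In this example the root and the state reached after `"key"` disagree on the reward for `"exit"`, so a merging procedure must keep them apart. Local compatibility is only a necessary condition: a full merge also has to check the successors recursively.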

DB-RPNI extends this to a unified setting subsuming RMs and Transition Machines (TMs), operating directly on samples structured over two disjoint input alphabets ($\alpha$, $\beta$). DB-RPNI merges states based on local $\alpha$-input compatibility and propagates merges via $\beta$-transitions, yielding minimal Dual-Behavior Mealy Machines efficiently (Wu et al., 3 Aug 2025). Under structure-completeness (coverage of all states and transitions, conflict on distinguishing $\alpha$-inputs), DB-RPNI is guaranteed to find the unique minimal RM consistent with the samples.

Active Learning

Active learning invokes Angluin's L* algorithm, adapted for Mealy machines, which actively constructs observation tables using membership and equivalence queries. In the RL context, membership queries are answered by synthesizing environment traces that realize a prescribed sequence of high-level events, while equivalence is approximated via conformance testing against observed rewards. This approach enables the agent to systematically refine its RM hypothesis until agreement is achieved with the real underlying structure (Rens et al., 2020, Rens et al., 2020).
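The approximate equivalence query can be sketched as a conformance test: the hypothesis RM is replayed on logged traces, and the shortest disagreeing prefix of the first failing trace is returned as a counterexample. This is a simplified illustration rather than the procedure of the cited works. The trace format, the function names, and the zero-reward self-loop default are assumptions.

```python
def run_rm(u0, delta_u, delta_r, word):
    """Rewards produced by the RM (u0, delta_u, delta_r) on an input word."""
    u, rewards = u0, []
    for i in word:
        rewards.append(delta_r.get((u, i), 0.0))  # assumption: default reward 0
        u = delta_u.get((u, i), u)                # assumption: default self-loop
    return rewards


def find_counterexample(hypothesis, traces):
    """Return the shortest prefix on which the hypothesis mispredicts a reward.

    hypothesis: (u0, delta_u, delta_r); traces: iterable of (inputs, rewards).
    Returns None if the hypothesis agrees with every logged trace.
    """
    u0, delta_u, delta_r = hypothesis
    for inputs, rewards in traces:
        predicted = run_rm(u0, delta_u, delta_r, inputs)
        for k, (p, r) in enumerate(zip(predicted, rewards)):
            if p != r:
                return list(inputs[: k + 1])
    return None


logged = [(["key", "exit"], [0.0, 1.0]), (["exit"], [0.0])]
# A one-state hypothesis that always rewards "exit" is refuted by the second trace.
too_simple = ("u0", {}, {("u0", "exit"): 1.0})
refined = ("u0", {("u0", "key"): "u1"}, {("u1", "exit"): 1.0})
```

A returned counterexample would be fed back into the observation table. A result of `None` only means that no disagreement was found on the available traces, not that the hypothesis is exactly correct.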

Both passive and active inference guarantee, under reasonable completeness conditions, convergence to the minimal RM that faithfully models the observed reward structure, provided the task is realizable by a finite-state Mealy machine.

4. Integration into Reinforcement Learning Frameworks

Once an RM is available, RL proceeds in the product-state space—augmented with the RM state—to accommodate the task memory encoded by the RM:

Tabular and Deep Q-Learning: Value or Q-functions are indexed by RM state and environment observation, i.e., $Q_u(o,a)$ , supporting decomposition into RM-indexed subproblems and parallel learning per RM state (Icarte et al., 2021).
Policy Gradient Methods: RM-augmentation enables policies conditioned on the RM state, which can be integrated into neural policy architectures.
Decentralized Multi-Agent RL: A team-level RM can be decomposed by projecting it onto each agent's local events (a quotient construction), which defines agent-specific sub-tasks while keeping team-level progress synchronized (Neary et al., 2020).
Off-Policy RL Acceleration: QRM-style updates reuse each environment transition for every RM state, using the RM to “imagine” the reward and next RM state that would have resulted, so a single experience updates all RM-indexed Q-functions at once.
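The RM-indexed Q-functions and the QRM-style reuse of experience can be sketched in tabular form. The function name `qrm_style_update`, the dictionary-based Q-table keyed by `(s, u, a)`, and the hyperparameter values are assumptions made for illustration. Terminal-state handling is omitted for brevity.

```python
from collections import defaultdict


def qrm_style_update(Q, rm_states, delta_u, delta_r, s, a, s_next, event,
                     actions, alpha=0.1, gamma=0.9):
    """Replay one environment transition (s, a, s_next) for every RM state u.

    Q maps (s, u, a) to a value; event is the RM input for this step.
    """
    for u in rm_states:
        # Reward and successor the RM would have produced had it been in state u.
        r = delta_r.get((u, event), 0.0)
        u_next = delta_u.get((u, event), u)
        best_next = max(Q[(s_next, u_next, b)] for b in actions)
        Q[(s, u, a)] += alpha * (r + gamma * best_next - Q[(s, u, a)])


Q = defaultdict(float)
qrm_style_update(
    Q,
    rm_states=["u0", "u1", "u2"],
    delta_u={("u0", "key"): "u1", ("u1", "exit"): "u2"},
    delta_r={("u1", "exit"): 1.0},
    s=1, a=-1, s_next=0, event="exit",
    actions=[-1, 1], alpha=0.5, gamma=0.9,
)
```

A single observed step thus updates the entry for `u1`, where `"exit"` is rewarded, even if the agent was actually in `u0` when it took the step. This is valid off-policy because the environment dynamics do not depend on the RM state.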

Table: RM Learning Paradigms and Algorithms

Paradigm	Algorithm	Guarantees
Passive	RPNI/DB-RPNI	Finds the minimal consistent RM if the samples are structure-complete
Active	L*	Correct RM after finitely many queries
RL Integration	QRM, Product	Markovianization, global policy recovery

5. Beyond Standard RMs: Timed Reward Machines and Transition Machines

Timed Reward Machines (TRMs)

TRMs extend the Mealy RM formalism with real-valued clocks and clock guards to express timing constraints—delays, deadlines, and time-sensitive rewards. Formally, transitions are guarded by clock formulas and may reset specified clocks; transition and state-based (sojourn) rewards depend on clock values and time spent in each state. TRMs are interpreted either in digital or dense-time semantics, with product-state RL adapted accordingly. "Counterfactual-imagining" heuristics exploit the TRM structure to manufacture auxiliary training data, accelerating convergence (Majumdar et al., 19 Dec 2025).
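A heavily simplified sketch of the digital-time reading follows, assuming a single integer clock that advances by one tick per step. The transition encoding, the names `trm_step` and `run_trm`, and the "exit within 3 ticks of key" task are assumptions made for illustration and are not taken from the cited work. Sojourn rewards are omitted.

```python
def trm_step(u, clock, event, transitions):
    """Advance a single-clock timed RM by one tick.

    transitions: (u, event) -> list of (guard, u_next, reward, reset), where
    guard is a predicate on the clock value and reset says whether the clock
    is set back to 0 after the transition fires.
    """
    clock += 1  # assumption: digital semantics, one tick per environment step
    for guard, u_next, reward, reset in transitions.get((u, event), []):
        if guard(clock):
            return u_next, (0 if reset else clock), reward
    return u, clock, 0.0  # no enabled transition: stay put, zero reward


def run_trm(events, transitions, u0="u0"):
    u, clock, rewards = u0, 0, []
    for e in events:
        u, clock, r = trm_step(u, clock, e, transitions)
        rewards.append(r)
    return rewards


# Hypothetical deadline task: "exit" pays 1 only within 3 ticks of "key".
TIMED = {
    ("u0", "key"): [(lambda c: True, "u1", 0.0, True)],
    ("u1", "exit"): [
        (lambda c: c <= 3, "u2", 1.0, False),
        (lambda c: c > 3, "u2", 0.0, False),
    ],
}
```

In the product construction for this sketch, the clock value would join the RM state in the augmented state, since the reward for `"exit"` depends on both.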

Transition Machines (TMs) and Dual-Behavior Mealy Machines (DBMMs)

RM expressiveness is limited to reward-based non-Markovianity: RMs do not encode hidden-state effects on future observable transitions (e.g., POMDP tasks where “having the key” affects both future observations and rewards). Transition Machines, as Mealy-style automata predicting future observations, fill this gap. The unifying Dual-Behavior Mealy Machine framework parameterizes both reward and transition dependencies within a single automaton (Wu et al., 3 Aug 2025).

6. Empirical Results and Practical Impact

Empirical studies demonstrate RMs’ effectiveness in both single-agent and multi-agent partially observable domains:

In MiniGrid-like tasks, RM-augmented agents (with RM learned via local search or tabular optimization) achieve near-optimal policies rapidly—surpassing LSTM-based or pure history-augmented RL baselines, which typically fail in sparse-reward or high-variance inference contexts (Icarte et al., 2021).
Joint iterative inference of RMs and policies accelerates convergence in tasks with sparse or non-Markovian rewards, including vehicle control, office delivery, and Minecraft-style crafting (Xu et al., 2019).
In multi-agent settings, RM decomposition enables team policy learning orders of magnitude faster than centralized RL; decomposability is formalized, and value bounds are established (Neary et al., 2020).
Timed RM-based RL achieves success in timing-sensitive benchmarks, enforcing nontrivial time-based reward structures not capturable by standard (untimed) RMs (Majumdar et al., 19 Dec 2025).
Recent advances in automata inference such as DB-RPNI yield up to three orders of magnitude speedup over HMM or ILP-based methods on RL benchmark domains—critical for applicability as task complexity increases (Wu et al., 3 Aug 2025).

7. Limitations, Extensions, and Research Directions

While RMs and their extensions restore Markovianity for reward and support efficient decomposition, several open challenges remain:

RMs alone do not suffice when both reward and transition non-Markovianity coexist—necessitating broader automata such as TMs or unified DBMMs.
Scalability is bottlenecked by the potential size of the RM/TM state space for complex specifications. Structure-complete data and efficient minimization algorithms (e.g., DB-RPNI) partially address this, but inference may still be NP-hard in the worst case (Wu et al., 3 Aug 2025, Xu et al., 2019).
Extending RMs for continuous or hybrid-state environments, handling stochastic or noisy event detection, and integrating with deep RL frameworks pose ongoing research problems.
Timed and hierarchical extensions (as in TRMs or compositional RMs) broaden expressive power but introduce additional algorithmic and computational complexity.

The framework of Mealy Reward Machines thus constitutes a foundational and practically impactful tool at the intersection of automata theory and reinforcement learning, with ongoing progress on both the algorithmic and application fronts.