Reward Machines in Reinforcement Learning

Updated 6 October 2025
  • Reward Machines are automata-based models that encode temporally extended reward functions in reinforcement learning through explicit state representations, enabling effective task decomposition.
  • They facilitate decentralized learning in multi-agent environments by projecting global RM structures onto agent-specific models and synchronizing high-level events.
  • Recent methodologies such as counterfactual experience generation, hierarchical RL, and automated reward shaping leverage RM structures to enhance policy efficiency and scalability.

Reward Machines (RMs) are formal automata-based representations designed to explicitly encode complex, temporally extended reward functions for reinforcement learning (RL). RMs expose task structure by defining states corresponding to stages of task progress and transitions triggered by high-level, abstract environmental events. This abstraction facilitates both the specification and exploitation of non-Markovian rewards—those depending on event history rather than the current environment state alone. RMs are utilized in a wide range of settings, including single-agent RL, cooperative and competitive multi-agent RL, task transfer, hierarchical task design, and robust learning under partial observability and noise.

1. Formal Definition and Automata Structure

An RM is specified as a tuple

$$R = \langle U, u_0, \Sigma, \delta, r, F \rangle$$

where:

  • $U$ is the finite set of RM states (automaton nodes, each encoding progress through the task),
  • $u_0 \in U$ is the initial state,
  • $\Sigma$ is a finite set of high-level events (abstract environmental changes, e.g., "button pressed"),
  • $\delta: U \times \Sigma \to U$ is the (partial) deterministic transition function,
  • $r: U \times U \to \mathbb{R}$ is the output function, assigning rewards to transitions,
  • $F \subseteq U$ is the set of final (or "reward") states.

Transitioning in the RM is driven by high-level events output by a labeling function, which abstracts the environment's raw sensory state into symbolic propositions. The RM "remembers" history by maintaining its automaton state alongside the environment state, converting otherwise non-Markovian reward dependencies into a Markovian product space $(s, u)$ (Neary et al., 2020, Icarte et al., 2020).
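To make the formalism concrete, the following minimal Python sketch (illustrative interface and names, not a published implementation) encodes the tuple above and the product construction: a labeling function abstracts each raw environment transition into a high-level event, the RM advances on that event and emits the reward, and the Markovian state exposed to the learner is the pair $(s, u)$.

```python
class RewardMachine:
    """Minimal reward machine R = <U, u0, Sigma, delta, r, F> (illustrative sketch)."""

    def __init__(self, states, initial, events, delta, rewards, final_states):
        self.states = states              # U: finite set of RM states
        self.u0 = initial                 # u0: initial RM state
        self.events = events              # Sigma: high-level events
        self.delta = delta                # dict: (u, event) -> u'  (partial transition function)
        self.rewards = rewards            # dict: (u, u') -> reward assigned to the transition
        self.final_states = final_states  # F: final ("reward") states

    def step(self, u, event):
        """Advance on a high-level event; remain in place if no transition is defined."""
        u_next = self.delta.get((u, event), u)
        return u_next, self.rewards.get((u, u_next), 0.0)

    def is_terminal(self, u):
        return u in self.final_states


# A two-stage task: first press the button (event "b"), then reach the goal (event "g").
rm = RewardMachine(
    states={"u0", "u1", "u2"},
    initial="u0",
    events={"b", "g"},
    delta={("u0", "b"): "u1", ("u1", "g"): "u2"},
    rewards={("u1", "u2"): 1.0},
    final_states={"u2"},
)


def product_step(env, rm, label_fn, s, u, a):
    """One step in the product space (s, u), assuming a Gym-style environment."""
    s_next, _, done, _ = env.step(a)    # environment reward is ignored; the RM supplies it
    event = label_fn(s, a, s_next)      # abstract the raw transition into a symbolic event
    u_next, reward = rm.step(u, event)  # RM transition encodes the non-Markovian reward
    return (s_next, u_next), reward, done or rm.is_terminal(u_next)
```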

2. Task Decomposition and Decentralization in Multi-Agent RL

In cooperative multi-agent RL, a global team-level RM is used to encode the temporal dependencies and subtask structure required to accomplish a joint objective. To enable decentralized learning, the team RM is projected onto each agent's local event set $\Sigma_i \subset \Sigma$ by "merging" RM states connected by events not observable by agent $i$. This projection yields agent-specific RMs

$$R_i = \langle U_i, u_{0,i}, \Sigma_i, \delta_i, r_i, F_i \rangle$$

where $U_i$ are equivalence classes resulting from the projection. Each local RM describes only those transitions and rewards relevant to its agent (Neary et al., 2020, Ardon et al., 2023).
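The sketch below (reusing the illustrative RewardMachine class from Section 1) computes such a projection by quotienting the team RM: states linked by transitions whose events lie outside the agent's local event set are merged into equivalence classes with a union-find structure, and the remaining transitions are re-expressed over those classes. The published construction imposes additional conditions (for example, to keep the result deterministic) that this simplified version elides.

```python
def project_rm(team_rm, local_events):
    """Project a team RM onto one agent's local event set (simplified illustrative sketch)."""
    parent = {u: u for u in team_rm.states}

    def find(u):                                   # union-find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    # Merge states connected by events the agent cannot observe.
    for (u, event), u_next in team_rm.delta.items():
        if event not in local_events:
            union(u, u_next)

    # Rebuild transitions and rewards over the resulting equivalence classes.
    states = {find(u) for u in team_rm.states}
    delta, rewards = {}, {}
    for (u, event), u_next in team_rm.delta.items():
        if event in local_events:
            cu, cn = find(u), find(u_next)
            delta[(cu, event)] = cn
            rewards[(cu, cn)] = team_rm.rewards.get((u, u_next), 0.0)

    return RewardMachine(
        states=states,
        initial=find(team_rm.u0),
        events=set(local_events),
        delta=delta,
        rewards=rewards,
        final_states={find(u) for u in team_rm.final_states},
    )
```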

A crucial algorithmic condition for sound decomposition is bisimilarity: the parallel composition of projected RMs ($R_1 \parallel R_2 \parallel \dots \parallel R_N$) must be bisimilar to the original team RM. This ensures that if each agent completes its subtask, the global task is completed. Local labeling functions $L_i$ must also synchronize correctly on shared high-level events (Neary et al., 2020, Zheng et al., 8 Mar 2024).

3. Methodologies Leveraging RM Structure

Several algorithmic frameworks exploit the explicit structure of RMs to improve policy learning:

  • Counterfactual Experience Generation (CRM): When collecting real experiences, the agent also performs off-policy updates for every RM state, simulating rewards and transitions as if the agent had begun from alternative RM states (Icarte et al., 2020). This enables substantial experience sharing and accelerates learning (a minimal sketch follows this list).
  • Hierarchical RL via RMs (HRM): Each transition in the RM is regarded as an option (temporally extended action); lower-level controllers are trained to realize these transitions, while a high-level policy schedules them (Icarte et al., 2020, Furelos-Blanco et al., 2022).
  • Decentralized Q-Learning with Projected RMs (DQPRM): Each agent maintains Q-values over its local (environment, RM) state product, learns independently, and synchronizes only on shared events. This eliminates the curse of dimensionality in the joint action/state space (Neary et al., 2020).
  • Automated Reward Shaping: The RM is treated as a deterministic MDP; optimal value functions $v^*(u)$ over RM states are computed and used for potential-based reward shaping: $r'(s,a,s') = r(s,a,s') + \gamma\,\Phi(s',u') - \Phi(s,u)$ with $\Phi(s,u) = -v^*(u)$ (Icarte et al., 2020, Camacho et al., 2020).
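As a concrete illustration of the CRM idea referenced above, the following sketch applies one tabular Q-learning update for every RM state after each real environment transition. It assumes the illustrative RewardMachine interface from Section 1; the names and update rule are a simplified reading of the published algorithm, not its reference implementation.

```python
from collections import defaultdict


def crm_update(Q, rm, label_fn, actions, s, a, s_next, alpha=0.1, gamma=0.9):
    """One counterfactual sweep: update Q(s, u, a) for EVERY RM state u.

    Q:        defaultdict(float) keyed by (env_state, rm_state, action)
    rm:       reward machine exposing .states, .step(u, event), .is_terminal(u)
    label_fn: maps the raw transition (s, a, s_next) to a high-level event (or None)
    actions:  iterable of environment actions, used for the bootstrap max
    """
    event = label_fn(s, a, s_next)                     # abstract the transition once
    for u in rm.states:                                # counterfactual: "as if" the RM were in u
        u_next, r = rm.step(u, event)                  # simulated RM transition and reward
        if rm.is_terminal(u_next):
            target = r                                 # do not bootstrap past task completion
        else:
            target = r + gamma * max(Q[(s_next, u_next, b)] for b in actions)
        Q[(s, u, a)] += alpha * (target - Q[(s, u, a)])


# Hypothetical wiring inside an environment loop:
# Q = defaultdict(float)
# crm_update(Q, rm, label_fn, env_actions, s, a, s_next)
```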

4. RM Decomposition: Algorithmic Guarantees and Value Function Bounds

The decomposition of a team RM into agent-specific projections enables robust decentralized learning under certain verifiable criteria. For undiscounted settings, the true value function for the team task $V(s)$ is bounded as follows (Neary et al., 2020):

$$\max\Big\{0,\ \sum_{i=1}^N V_i(s) - (N-1)\Big\} \leq V(s) \leq \min\big\{V_1(s), \dots, V_N(s)\big\}$$

Here, $V_i(s)$ is the probability that agent $i$ completes its local subtask from state $s$. This bound connects local learning progress to the global objective and underpins theoretical guarantees for convergence and correctness.
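The bound has the form of a Fréchet-style inequality on the joint success of all agents; a trivial sketch of the arithmetic (function name illustrative):

```python
def team_value_bounds(local_values):
    """Bounds on the team value V(s) given the agents' local success probabilities V_i(s)."""
    n = len(local_values)
    lower = max(0.0, sum(local_values) - (n - 1))
    upper = min(local_values)
    return lower, upper


# Example: three agents with local completion probabilities 0.9, 0.8, and 0.95
# yield bounds of approximately (0.65, 0.8) on the team task value.
print(team_value_bounds([0.9, 0.8, 0.95]))
```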

5. Empirical and Theoretical Impact

Empirical results across a variety of domains demonstrate significant improvements in sample efficiency, policy quality, and scalability when leveraging RM structure:

  • In multi-agent "buttons" and rendezvous domains, decentralized learning with projected RMs converges to effective team policies an order of magnitude faster than centralized methods and hierarchical-independent Q-learning approaches (Neary et al., 2020).
  • In gridworlds and continuous domains, RM-based agents with reward shaping or CRM (counterfactual) learning reach optimal policies with far fewer interactions than tabular or deep RL agents lacking RM structure (Icarte et al., 2020).
  • In graph-based MARL with coupled agent dynamics, the decentralized graph-based RM (DGRM) framework delivers performance improvements over baselines, with local information sufficient for agents to accomplish complex, temporally extended tasks (Hu et al., 2021).

These results are supported by theoretical analysis ensuring the preservation of the global task semantics under RM projection (bisimilarity), as well as bounds on function approximation errors that decay exponentially with neighborhood radius in multi-agent graphs (Hu et al., 2021).

6. Extensions: Hierarchical, Transferable, and Robust RMs

Recent research extends the basic RM concept along several axes:

  • Hierarchical RMs (HRMs): RMs may "call" other RMs as options, enabling modular decomposition of long-horizon or sparse tasks. HRMs support curriculum-based and multi-task learning, and their modular structure mitigates the state explosion observed in flat RM representations (Furelos-Blanco et al., 2022, Zheng et al., 8 Mar 2024).
  • Transfer and Symbolic Abstraction: RMs provide symbolic task abstractions that, when shared across tasks, enhance transfer and few-shot learning in deep RL. The use of desired transition labels and potential-based shaping facilitates exploitation of prior knowledge in new settings (Azran et al., 2023).
  • Learning RMs from Experience: Multiple algorithms have been proposed for learning RMs from demonstration traces (including noisy traces), using optimization, SAT/ILP-based synthesis, inductive logic programming, and abstraction from visual input (Icarte et al., 2021, Baert et al., 13 Dec 2024, Parac et al., 27 Aug 2024, Ardon et al., 31 Dec 2024).
  • Security and Robustness: The addition of symbolic state via RMs introduces new adversarial vulnerabilities, particularly through attacks that tamper with labeling outputs and desynchronize internal task progress. Identifying, quantifying, and mitigating these vulnerabilities is an active area (Nodari, 2023).
  • Expressivity Beyond Regular Languages: New automata such as pushdown reward machines (pdRMs) generalize RMs to deterministic context-free languages using stack memory, enabling RL for hierarchical or recursive behaviors not representable within finite-state memory (Varricchione et al., 9 Aug 2025).

7. Limitations and Future Directions

While RMs provide a flexible and principled means for encoding non-Markovian reward structure, several fundamental and practical challenges remain:

  • Dependency on High-Level Events and Propositions: RM effectiveness depends on the design or learning of informative labeling functions; poorly chosen or noisy abstractions can hinder learning and generalization (Camacho et al., 2020, Icarte et al., 2021, Li et al., 31 May 2024).
  • Automated RM Discovery: Fully automated, scalable synthesis of minimal or expressive RMs from real-world data or partially observed policies is an ongoing research question, with recent progress in SAT/ILP-based and logical inference methods (Shehab et al., 6 Feb 2025, Baert et al., 13 Dec 2024, Parac et al., 27 Aug 2024).
  • Extensions to Richer Task Classes: The expressiveness of standard RMs is limited to regular languages; extensions to context-free and richer formal classes require careful design to balance expressivity, tractability, and policy realizability (Varricchione et al., 9 Aug 2025).
  • Robustness to Adversarial Manipulation and Sensor Noise: The explicit symbolic state of RMs increases vulnerability to labeling attacks and noisy measurements, requiring algorithmic countermeasures and robust learning protocols (Nodari, 2023, Li et al., 31 May 2024, Parac et al., 27 Aug 2024).

In summary, Reward Machines provide a unified, automata-theoretic foundation for specifying, decomposing, and harnessing the structure of temporally extended tasks in RL. Their formalism supports principled task decomposition, leverages rich temporal specifications, and underpins modern advances in interpretable, efficient, and robust RL across single and multi-agent domains.
