Reward Machine in Reinforcement Learning
- Reward Machines are finite-state automata that represent temporally extended, history-dependent rewards, allowing complex tasks to be decomposed into simpler subtasks.
- Algorithms that exploit RM structure, such as reward shaping, counterfactual reasoning, and hierarchical RL, boost sample efficiency and convergence during learning.
- Their applications span gridworlds, robotics, multi-agent systems, and cybersecurity, making them a versatile tool for modern reinforcement learning challenges.
Reward Machines (RMs) are a class of automaton-based structures introduced to provide a high-level, formal representation of temporally extended reward functions in reinforcement learning (RL). By exposing the logical and temporal structure behind reward assignments, reward machines enable RL agents to solve complex tasks more efficiently: they decompose tasks into subproblems, make history dependencies explicit, and support more sample-efficient learning. The reward machine paradigm is especially impactful in settings involving non-Markovian rewards, multi-agent cooperation, transfer learning, and practical domains such as robotics and cyber-physical systems.
1. Formal Definition and Expressive Properties
A reward machine is a finite-state machine, generally formulated as a Mealy machine, that maps sequences of high-level environment events (propositions) to abstract reward delivery. Formally, a typical reward machine is defined as a tuple
$$\mathcal{R} = \langle U, u_0, \Sigma, \delta, \sigma, F \rangle,$$
where
- $U$ is a finite set of RM states, each representing a particular “stage” of task progress,
- $u_0 \in U$ is the initial state,
- $\Sigma$ is a finite set of high-level environment events (e.g., propositional symbols or event labels),
- $\delta : U \times \Sigma \to U$ is the state-transition function,
- $\sigma : U \times \Sigma \to \mathbb{R}$ (or $\sigma : U \times \Sigma \to [S \times A \times S \to \mathbb{R}]$ in some variants) is the output (reward) function,
- $F \subseteq U$ is the set of final (reward-accepting) states.
Transitions in the RM are triggered by observable (sometimes abstract) events in the agent’s environment, which are provided by a labeling function $L : S \times A \times S \to \Sigma$ that maps low-level environment transitions to high-level events. The RM’s output function delivers rewards on state transitions, allowing designers to specify temporally extended, history-dependent rewards in a structured and interpretable manner. RMs have the expressive power of regular languages and can encode any Markovian or regular non-Markovian reward function, supporting complex task compositions (such as loops, sequences, and conditionals), and can be interpreted as finite automata over high-level events (Icarte et al., 2020).
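As a concrete illustration, the following Python sketch encodes this formalism for a toy sequential task. The class, field names, and the coffee-delivery example are illustrative assumptions rather than an established library API.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set, Tuple

Event = FrozenSet[str]  # the set of propositions the labeling function reports as true


@dataclass
class RewardMachine:
    """Minimal Mealy-machine view of a reward machine (illustrative sketch)."""
    states: Set[str]
    initial_state: str
    final_states: Set[str]
    delta: Dict[Tuple[str, Event], str]     # (rm_state, event) -> next rm_state
    sigma: Dict[Tuple[str, Event], float]   # (rm_state, event) -> reward emitted

    def step(self, u: str, event: Event) -> Tuple[str, float]:
        """Advance the RM on one high-level event; unlisted events self-loop with zero reward."""
        key = (u, event)
        if key not in self.delta:
            return u, 0.0
        return self.delta[key], self.sigma.get(key, 0.0)

    def is_terminal(self, u: str) -> bool:
        return u in self.final_states


# Toy task "get coffee, then deliver it to the office" as a two-stage RM.
coffee_then_office = RewardMachine(
    states={"u0", "u1", "u_acc"},
    initial_state="u0",
    final_states={"u_acc"},
    delta={("u0", frozenset({"coffee"})): "u1",
           ("u1", frozenset({"office"})): "u_acc"},
    sigma={("u0", frozenset({"coffee"})): 0.0,
           ("u1", frozenset({"office"})): 1.0},
)
```

The later sketches in this article reuse this `RewardMachine` class and its imports.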
2. Exploiting RM Structure: Learning Algorithms and Task Decomposition
Reward machines support several classes of learning algorithms that exploit their structure:
Automated Reward Shaping
A reward potential function $\Phi$ can be defined over RM states. The shaped reward is given by
$$r'(s, u, a, s', u') = r(s, u, a, s', u') + \gamma\,\Phi(u') - \Phi(u),$$
where $u$ and $u'$ are the RM components of the current and next states in the product MDP. This ensures that the agent is incentivized to follow progress along the RM structure, improving exploration without altering the optimal policy set (Icarte et al., 2020).
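A minimal sketch of this shaping term, continuing the `RewardMachine` example above; the potential table `phi` and its values are purely illustrative.

```python
def shaped_reward(r: float, phi: Dict[str, float], u: str, u_next: str,
                  gamma: float = 0.99) -> float:
    """Potential-based shaping over RM states: r' = r + gamma * phi(u') - phi(u).

    Because the shaping term telescopes along any trajectory of the product MDP,
    the set of optimal policies is left unchanged.
    """
    return r + gamma * phi[u_next] - phi[u]


# Illustrative potentials: states closer to the accepting state get higher
# potential, so the agent is rewarded for making progress through the RM.
phi = {"u0": 0.0, "u1": 0.5, "u_acc": 1.0}
```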
Counterfactual Reasoning
Each recorded environmental transition is “replayed” from the perspective of all (or a subset of) RM states. This simulates what would have happened if the agent were in any RM state at that timestep, providing counterfactual experiences that can be used to update the Q-functions associated with each RM state. Such data augmentation accelerates learning and increases sample efficiency, especially in sparse reward settings (Icarte et al., 2020).
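A sketch of this counterfactual augmentation, continuing the `RewardMachine` sketch above; `label_fn` is an assumed labeling function returning the high-level event for a low-level transition.

```python
def counterfactual_experiences(rm: RewardMachine, s, a, s_next, label_fn):
    """Replay one environment transition from the perspective of every RM state.

    Returns one (s, u, a, r, s_next, u_next, done) tuple per RM state, all of
    which can be added to the replay data used to update the Q-function tied
    to that RM state (CRM-style data augmentation, illustrative only).
    """
    event = label_fn(s, a, s_next)           # high-level event for this step
    experiences = []
    for u in rm.states:
        u_next, r = rm.step(u, event)        # what the RM would have done from u
        experiences.append((s, u, a, r, s_next, u_next, rm.is_terminal(u_next)))
    return experiences
```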
Decomposition via Hierarchical RL
Because each RM state corresponds to a subtask, RM-based task decomposition enables hierarchical RL. This is typically realized through the “options” framework, where each option targets achieving a specific RM transition. The agent can maintain separate policies for each RM subproblem, and a high-level policy orchestrates the selection of options, resulting in improved convergence, as empirically demonstrated in gridworld and continuous control domains (Icarte et al., 2020).
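The option-based decomposition can be sketched as follows; `q_high` and `option_policies` are hypothetical containers (a high-level value table keyed by RM transitions, and one low-level policy per targeted transition), not part of any specific published implementation.

```python
import random


def hierarchical_step(rm: RewardMachine, u: str, q_high: dict, option_policies: dict,
                      s, epsilon: float = 0.1):
    """Choose an outgoing RM transition as the current option, then act with its policy.

    Each option is a low-level policy trained to trigger one RM transition from
    state u; the high-level policy picks among them epsilon-greedily.
    Assumes u is non-terminal, i.e., it has at least one outgoing transition.
    """
    candidates = [key for key in rm.delta if key[0] == u]     # outgoing transitions of u
    if random.random() < epsilon:
        option = random.choice(candidates)
    else:
        option = max(candidates, key=lambda k: q_high.get(k, 0.0))
    action = option_policies[option](s)      # low-level policy acts toward this subgoal
    return option, action
```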
Product MDPs
By constructing a product MDP that augments the environment state $s$ with the RM state $u$, the non-Markovian reward in the original environment is recast as a Markovian reward over the product state space $S \times U$. This “re-markovizes” the reward and allows for the application of standard RL algorithms such as Q-learning and actor-critic methods with little overhead (Hu et al., 2021, Neary et al., 2020).
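A minimal product-MDP wrapper, assuming a Gymnasium-style environment interface and the `RewardMachine` sketch above; the class name and the choice to discard the environment reward in favor of the RM reward are illustrative assumptions.

```python
class ProductMDPEnv:
    """Expose (env_state, rm_state) pairs so the reward becomes Markovian."""

    def __init__(self, env, rm: RewardMachine, label_fn):
        self.env, self.rm, self.label_fn = env, rm, label_fn
        self.s, self.u = None, rm.initial_state

    def reset(self):
        self.s, _ = self.env.reset()              # Gymnasium-style reset -> (obs, info)
        self.u = self.rm.initial_state
        return (self.s, self.u)

    def step(self, a):
        s_next, _, terminated, truncated, info = self.env.step(a)
        event = self.label_fn(self.s, a, s_next)
        self.u, r = self.rm.step(self.u, event)   # reward comes from the RM, not the env
        done = terminated or truncated or self.rm.is_terminal(self.u)
        self.s = s_next
        return (s_next, self.u), r, done, info
```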
3. Multi-Agent and Decentralized Learning with Reward Machines
Reward machines are particularly well suited to cooperative multi-agent RL (MARL):
- The global team task RM can be decomposed onto local RMs for each agent by projecting the team RM onto the events observable to individual agents. This yields decentralized subproblems, with each agent learning with respect to its projection (Neary et al., 2020, Ardon et al., 2023).
- Theorem 1 in (Neary et al., 2020) establishes that distributed completion (each agent achieving its subtask) is equivalent to global task completion when the decomposition is valid, i.e., when the parallel composition of the projected RMs is bisimilar to the original team RM.
- Decentralized Q-learning with projected RMs (DQPRM) allows each agent to update Q-functions associated with the subtask RM state it currently occupies, using only local observations and (optionally) synchronized event signals with teammates. This is shown to improve scalability, address non-stationarity, and dramatically boost sample efficiency in both rendezvous and sequential coordination tasks (Neary et al., 2020, Ardon et al., 2023, Hu et al., 2021).
- Value function bounds for the global team task are established using the Fréchet conjunction inequality
$$\max\Big\{0,\ \sum_{i=1}^{N} p_i - (N-1)\Big\} \ \le\ p\ \le\ \min_i p_i,$$
where $p$ is the probability of completing the team task and $p_i$ is the probability that agent $i$ completes its subtask (Neary et al., 2020); a worked instance of these bounds appears in the sketch after this list.
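A worked instance of the bounds above: a plain computation of the Fréchet conjunction interval from per-agent subtask success probabilities (the numbers are illustrative).

```python
def team_task_probability_bounds(agent_probs):
    """Fréchet bounds on the probability that all N subtasks (and hence the team
    task) are completed, given each agent's subtask success probability and
    without assuming independence between agents."""
    n = len(agent_probs)
    lower = max(0.0, sum(agent_probs) - (n - 1))
    upper = min(agent_probs)
    return lower, upper


# Three agents, each completing its subtask with probability 0.9:
print(team_task_probability_bounds([0.9, 0.9, 0.9]))   # ~ (0.7, 0.9)
```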
4. Extensions: Robustness, Hierarchies, Numeric RMs, Automated Inference
Reward Machine Learning and Robustness
RMs can be learned directly from execution traces using forms of discrete optimization or inductive logic programming (e.g., MILP, CP, ILASP), with specific techniques designed to handle noisy or partial-label observation settings (Icarte et al., 2021, Parac et al., 27 Aug 2024, Baert et al., 13 Dec 2024). RM inference is interleaved with policy learning, updating the RM as new trajectories suggest modifications. Bayesian updating and probabilistic reward-shaping methods have been developed to maintain robustness under sensor noise or partial observability, incorporating beliefs over RM states and updating rewards accordingly (Parac et al., 27 Aug 2024, Li et al., 31 May 2024).
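A core primitive shared by these inference approaches is checking whether a candidate RM reproduces the rewards observed along collected traces. The following is a minimal sketch of that check, continuing the `RewardMachine` class above, with `trace` assumed to be a list of (event, reward) pairs produced by the labeling function.

```python
def consistent_with_trace(rm: RewardMachine, trace, tol: float = 1e-6) -> bool:
    """Return True if the candidate RM emits the observed reward at every step
    of one labeled trace.

    Inference methods (MILP/CP/ILASP-style) search for a small machine passing
    this check on all traces; noisy-label variants relax the exact-match
    condition. Illustrative sketch only.
    """
    u = rm.initial_state
    for event, observed_reward in trace:
        u, predicted_reward = rm.step(u, event)
        if abs(predicted_reward - observed_reward) > tol:
            return False
    return True
```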
Adversarial Attacks and Security
Reward machine-based agents can be vulnerable to adversarial manipulation, particularly via blinding attacks that tamper with the labeling function and desynchronize the RM from the true task progress. The success of such attacks, and the agent’s robustness to them, depend on the complexity and ambiguity of the high-level event vocabulary, as well as on the use of auxiliary reward shaping (Nodari, 2023).
Numeric Reward Machines
Classical RMs operate only on Boolean features, but recent work has extended this to numeric RMs, where transitions and reward outputs can depend directly on real-valued features (e.g., distance-to-goal), enabling their application to inherently quantitative tasks. Both direct numeric rewards and numeric-Boolean emulations are possible, with rewards expressed as functions of a distance feature $d$ and calibrated so that optimal solutions are favored (Levina et al., 30 Apr 2024).
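A toy sketch of a numeric RM transition function, where guards compare a real-valued distance feature against thresholds and the emitted reward depends on that feature; the thresholds and reward form here are illustrative assumptions, not the exact construction from the cited work.

```python
def numeric_rm_step(u: str, distance: float):
    """One step of a toy numeric reward machine over a distance-to-goal feature."""
    if u == "far" and distance < 1.0:
        return "near", -distance      # guard on a numeric feature; reward depends on it
    if u == "near" and distance < 0.1:
        return "done", 0.0            # goal reached
    return u, -distance               # self-loop with dense, numeric reward
```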
Hierarchical and Maximally Permissive RMs
Recent work has introduced hierarchical reward machines (which allow composition through subroutine invocation) and maximally permissive reward machines (which systematically allow all valid plans, not just a single prescribed sequence) (Furelos-Blanco et al., 2022, Varricchione et al., 15 Aug 2024). Maximally permissive RMs synthesized from all partial-order plans attain or surpass the optimal value of single-sequence RMs, with formal guarantees.
Automated Inference from Unstructured Data
State-of-the-art methods can now infer RM structures from trajectories—either via logic programming (when event detectors are available) or directly from high-dimensional raw observations, as in robotic manipulation with vision-based LfD (Learning from Demonstration), where cluster analysis in latent space yields subgoal states and transitions, enabling effective potential-based reward shaping for RL agents (Baert et al., 13 Dec 2024).
5. Applications and Empirical Performance
Reward machines have demonstrated efficacy in a range of RL domains:
- Gridworlds and Continuous Control: Experiments in office and Minecraft-like worlds, WaterWorld, and robotic benchmarks demonstrate substantial improvements in sample efficiency and policy quality. Algorithms exploiting RM structure, such as counterfactual reasoning and hierarchical RL, consistently outperform standard cross-product baselines (Icarte et al., 2020).
- Multi-Agent Systems: RMs facilitate explicit coordination, task decomposition, and decentralized policy learning in cooperative tasks with temporal dependencies, showing faster convergence compared to centralized or hierarchical approaches in multi-agent rendezvous and button-press environments (Neary et al., 2020, Ardon et al., 2023, Hu et al., 2021).
- Robotic Manipulation: RM-augmented DQN agents using reward shaping and abstract state channels significantly accelerate learning in robot stacking and kitting tasks (Camacho et al., 2020, Baert et al., 13 Dec 2024).
- Cybersecurity and Penetration Testing: In AutoPT, RMs encode expert knowledge (e.g., MITRE ATT&CK patterns) as subtask transitions, enabling deep Q-learning with explicit task decomposition and improved interpretability. More detailed RMs further enhance learning efficiency and policy quality (Li et al., 24 May 2024).
- Transfer and Generalization: RM abstractions enable transferable representations that support few-shot learning in new tasks, with context-sensitized pre-planning and subgoal-level reward shaping improving transfer sharpness and sample efficiency—including empirically measured time-to-threshold and jumpstart boost in new environments (Azran et al., 2023).
6. Limitations, Open Problems, and Future Directions
While reward machines have proven beneficial, several challenges and open issues remain:
- Expressivity Boundaries: The formalism is limited to regular properties and cannot express context-sensitive or unbounded counting tasks. Extending RMs to richer language classes (e.g., context-free grammars) remains an ongoing research area (Icarte et al., 2020).
- Labeling Function and Abstraction Quality: Reliable mapping from environment states to high-level propositions is critical—errors or ambiguities in these detectors can lead to failures, and learning robust abstractions under noise is an active area of development (Icarte et al., 2021, Li et al., 31 May 2024).
- Security and Robustness: Adversarial attacks can exploit the abstraction interface, and defense mechanisms (such as robust RM inference, probabilistic belief tracking, or adversarial training) are subjects of emerging work (Nodari, 2023, Parac et al., 27 Aug 2024).
- Automated Inference: Efficiently learning RM structures—including event detectors and transitions—from demonstrations or experience in large, partially observable domains remains a challenging area, especially when demonstration quality varies or when true subgoal boundaries are ambiguous (Baert et al., 13 Dec 2024, Furelos-Blanco et al., 2022).
- Hierarchical RM Design and Synthesis: Algorithms for learning and exploiting HRMs (hierarchies of reward machines) offer significant performance benefits but require scalable grammar induction and curriculum strategies (Furelos-Blanco et al., 2022).
- Integration with LLMs and Planning: Recent research shows promise in using LLMs to automate and expedite the specification of RM automata from natural language task descriptions, as well as in synthesizing maximally permissive RMs from (AI) planners, broadening the practical adoption of RM-based RL (Alsadat et al., 11 Feb 2024, Varricchione et al., 15 Aug 2024).
7. Summary Table: Key Aspects and Variants
| Aspect | Description | Notable References |
|---|---|---|
| RM Formalism | Finite-state Mealy machine over event labels | (Icarte et al., 2020, Neary et al., 2020) |
| Learning Algorithms | Reward shaping, CRM, HRM, DQPRM, belief-based shaping | (Icarte et al., 2020, Neary et al., 2020, Parac et al., 27 Aug 2024) |
| Multi-Agent Support | Task decomposition via projection, with bisimulation guarantees | (Neary et al., 2020, Ardon et al., 2023, Hu et al., 2021) |
| Numeric Extensions | Numeric and numeric-Boolean RMs for quantitative tasks | (Levina et al., 30 Apr 2024) |
| Hierarchical/Permissive | HRMs, maximally permissive RMs | (Furelos-Blanco et al., 2022, Varricchione et al., 15 Aug 2024) |
| Robustness | Learning from noisy labels, adversarial blinding | (Parac et al., 27 Aug 2024, Nodari, 2023, Li et al., 31 May 2024) |
| Automated Inference | RM induction from demonstrations, LLM-guided automata | (Baert et al., 13 Dec 2024, Alsadat et al., 11 Feb 2024) |
References to Key Papers
- Reward Machine formalism and structure: (Icarte et al., 2020)
- RM-based decentralized MARL: (Neary et al., 2020, Hu et al., 2021, Ardon et al., 2023)
- Learning and robustifying RMs: (Icarte et al., 2021, Parac et al., 27 Aug 2024, Li et al., 31 May 2024, Baert et al., 13 Dec 2024)
- Numeric extensions: (Levina et al., 30 Apr 2024)
- Hierarchical/maximally permissive constructions: (Furelos-Blanco et al., 2022, Varricchione et al., 15 Aug 2024)
- Application to cybersecurity: (Li et al., 24 May 2024)
- Automated RM generation via LLMs: (Alsadat et al., 11 Feb 2024)
Reward machines constitute a principled and modular framework for specifying, learning, and exploiting temporally extended reward structures in a wide range of RL problems. Through continued development—especially in robustness, abstraction learning, hierarchical composition, and integration with symbolic planning—RMs are positioned as a cornerstone for scalable, interpretable, and sample-efficient RL.