
Reward Machines in MiniGrid Levels

Updated 23 November 2025
  • Reward Machines are automata-based formalisms that decompose sparse, long-horizon reinforcement learning tasks into explicit sub-goals with event-triggered rewards.
  • Automatic synthesis via foundation models and passive trace-based inference enables precise RM construction, reducing sample complexity in MiniGrid environments.
  • Embedding natural language instructions in RM states allows for zero-shot transfer and structured policy decomposition, enhancing learning in complex tasks.

Reward Machines (RMs) offer an automata-based formalism for structuring and specifying reward functions in reinforcement learning, particularly in sequential and compositional tasks such as those encountered in the MiniGrid suite. By decomposing sparse long-horizon objectives into explicit sub-goals and transitions over well-defined event predicates, RMs endow the underlying decision process, which is often partially observable and whose reward is non-Markovian over raw observations, with a reward structure that becomes Markovian over the augmented state representation, enabling scalable and sample-efficient learning. Recent advances leverage foundation models and passive automata inference to acquire RMs automatically, yielding substantial improvements in multiple MiniGrid environments and supporting robust zero-shot generalization, memoryless policy decomposition, and sample complexity reductions (Castanyer et al., 16 Oct 2025, Icarte et al., 2021, Wu et al., 3 Aug 2025).

1. Formal Definition of Reward Machines

An RM is defined as a tuple $M = (U, u_0, \Sigma, \delta, R, F, L)$, with:

  • $U$: finite set of automaton (RM) states, each representing a unique sub-goal.
  • $u_0 \in U$: unique initial state.
  • $\Sigma$: finite alphabet of Boolean event symbols (e.g., $\mathrm{has\_key}$, $\mathrm{door\_opened}$).
  • $\delta: U \times \Sigma \to U$: deterministic transition function.
  • $R: U \times \Sigma \to \mathbb{R}$: scalar reward function associated with each transition.
  • $F \subseteq U$: set of accepting/final states (absorbing).
  • $L: \mathcal{S}_\mathrm{meta} \times A \to \Sigma$: labeling function mapping MDP transitions to event symbols.

For each event $\sigma \in \Sigma$ and state $u \in U$, a transition $(u, \sigma)$ yields the next state $\delta(u, \sigma)$ and a reward $R(u, \sigma)$. Unspecified transitions default to self-loops with zero reward. In some formulations, $\delta$ and $R$ may accept sets $2^\Sigma$ or $2^{\mathcal{P}}$ of simultaneously holding predicates, but MiniGrid RMs typically process a single Boolean event per step for practical clarity (Castanyer et al., 16 Oct 2025, Icarte et al., 2021, Wu et al., 3 Aug 2025).
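
The definition above maps directly onto a small data structure. The following is a minimal sketch, not drawn from any of the cited implementations, of a deterministic RM with self-loop defaults; the class and method names are illustrative.

    # Minimal illustrative reward machine; the class layout is an assumption,
    # not the API of any cited implementation.
    class RewardMachine:
        def __init__(self, u0, delta, rewards, final_states):
            self.u0 = u0                      # initial RM state
            self.delta = delta                # dict: (u, sigma) -> next RM state
            self.rewards = rewards            # dict: (u, sigma) -> scalar reward
            self.final_states = set(final_states)
            self.u = u0                       # current RM state

        def reset(self):
            self.u = self.u0

        def step(self, sigma):
            """Advance on event sigma; unspecified transitions self-loop with zero reward."""
            key = (self.u, sigma)
            next_u = self.delta.get(key, self.u)
            reward = self.rewards.get(key, 0.0)
            self.u = next_u
            return next_u, reward, next_u in self.final_states

    # Example: a two-sub-goal RM in the spirit of DoorKey.
    rm = RewardMachine(
        u0="u0",
        delta={("u0", "has_key"): "u1", ("u1", "door_opened"): "u2"},
        rewards={("u0", "has_key"): 0.3, ("u1", "door_opened"): 1.0},
        final_states={"u2"},
    )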

2. Automatic Synthesis and Inference of Reward Machines

Automated RM synthesis addresses the challenge of precise reward specification. Two principal methodologies are prominent:

  • Foundation Model-Aided Synthesis (ARM-FM):
    • A generator FM is prompted with a natural-language mission, the MiniGrid environment API, and an explicit RM template. The FM returns a succinct RM specification: $U$, $u_0$, $\Sigma$, $\delta$, $R$.
    • A parser ingests the FM's output, constructing a programmatic representation and deriving executable labeling functions $L$ via code-specialized FMs.
    • Generator and critic FMs interact in a refinement loop in which the critic checks predicate correctness, minimality, coverage, and consistent reward allocation, driving iterative improvement until the RM is certified logically correct.
    • This process enables RM construction directly from intuitive task descriptions and symbol detectors (Castanyer et al., 16 Oct 2025); a parsing/validation sketch appears after this list.
  • Passive Trace-Based Inference (DB-RPNI for DBMM):
    • The Dual-Behavior Mealy Machine (DBMM) formalism generalizes RMs for both reward- and transition-based abstractions. The DB-RPNI algorithm infers minimal DBMMs through a two-phase process: sample-set construction (from labeled MiniGrid trajectories) and state merging over a prefix-tree structure, subject to local output compatibility.
    • For MiniGrid, the atomic proposition set is constructed from events such as $\{\mathrm{picked\_key}, \mathrm{opened\_door}, \mathrm{reached\_goal}\}$. Labeled event trajectories are collected (typically 1,000–10,000 traces), preprocessed to remove redundant or trivial events, and converted into RM sample sequences. The algorithm iteratively merges states with compatible output histories, yielding a compact RM that encapsulates all necessary event/reward dynamics (Wu et al., 3 Aug 2025); a simplified inference sketch appears after this list.
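
To make the ARM-FM pipeline concrete, the sketch below illustrates only the parser/validator stage under an assumed dictionary-shaped RM specification (states, initial state, alphabet, transitions, rewards, final states). The actual ARM-FM prompt templates and output schema are not reproduced here, and the checks shown correspond only loosely to the critic's coverage and consistency criteria.

    # Hypothetical RM specification, as a parser might receive it once the
    # generator FM's textual output has been converted to structured data.
    spec = {
        "states": ["u0", "u1", "u2"],
        "initial": "u0",
        "alphabet": ["has_key", "door_opened"],
        "transitions": {("u0", "has_key"): "u1", ("u1", "door_opened"): "u2"},
        "rewards": {("u0", "has_key"): 0.3, ("u1", "door_opened"): 1.0},
        "final": ["u2"],
    }

    def validate_rm_spec(spec):
        """Structural checks a critic-style validator might apply (illustrative only)."""
        states, alphabet = set(spec["states"]), set(spec["alphabet"])
        assert spec["initial"] in states, "initial state must be declared"
        for (u, sigma), v in spec["transitions"].items():
            assert u in states and v in states, f"undeclared state in {(u, sigma)}"
            assert sigma in alphabet, f"undeclared event symbol {sigma}"
        for key in spec["rewards"]:
            assert key in spec["transitions"], f"reward attached to missing transition {key}"
        # Every final state should be reachable from the initial state.
        reachable, frontier = {spec["initial"]}, [spec["initial"]]
        while frontier:
            u = frontier.pop()
            for (src, _sigma), dst in spec["transitions"].items():
                if src == u and dst not in reachable:
                    reachable.add(dst)
                    frontier.append(dst)
        assert set(spec["final"]) <= reachable, "unreachable final state"

    validate_rm_spec(spec)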
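
For trace-based inference, the sketch below illustrates the two-phase idea on toy traces: build a prefix-tree transducer from (event, reward) sequences, then merge states. To stay short, merging here uses exact equality of observed future behavior rather than DB-RPNI's local output-compatibility test, so it is a conservative simplification of the algorithm of Wu et al.; all names and the trace format are assumptions.

    from collections import defaultdict

    # Toy labeled traces: lists of (event, reward) pairs (format assumed).
    traces = [
        [("picked_key", 0.0), ("opened_door", 0.0), ("reached_goal", 1.0)],
        [("picked_key", 0.0), ("opened_door", 0.0), ("reached_goal", 1.0)],
        [("picked_key", 0.0), ("opened_door", 0.0)],   # truncated episode
    ]

    def build_prefix_tree(traces):
        """Phase 1: prefix-tree transducer; nodes are event prefixes (assumes
        rewards are consistent across traces sharing a prefix)."""
        children = defaultdict(dict)            # node -> {event: (reward, child)}
        for trace in traces:
            node = ()
            for event, reward in trace:
                child = node + (event,)
                children[node][event] = (reward, child)
                node = child
        return children

    def signature(children, node, memo):
        """Canonical description of a node's observed future behavior."""
        if node not in memo:
            memo[node] = tuple(sorted(
                (event, reward, signature(children, child, memo))
                for event, (reward, child) in children[node].items()
            ))
        return memo[node]

    def merge_identical(children):
        """Phase 2 (simplified): merge nodes whose behavior signatures coincide."""
        all_nodes = set(children) | {child for edges in children.values()
                                     for (_r, child) in edges.values()}
        memo, rep, state_of = {}, {}, {}
        for node in all_nodes:
            state_of[node] = rep.setdefault(signature(children, node, memo), len(rep))
        delta, rewards = {}, {}
        for node, edges in list(children.items()):
            for event, (reward, child) in edges.items():
                delta[(state_of[node], event)] = state_of[child]
                rewards[(state_of[node], event)] = reward
        return state_of[()], delta, rewards

    u0, delta, rewards = merge_identical(build_prefix_tree(traces))
    print(u0, delta, rewards)   # a 4-state chain for these traces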

3. Embedding Event Abstraction and Language Alignment

To support generalization and subpolicy composition, each RM state $u \in U$ is augmented with an FM-generated English instruction $\mathrm{Instr}(u)$. This instruction is mapped to an embedding $z_u \in \mathbb{R}^d$ via a pretrained text encoder (e.g., from the Qwen/Mistral families). During RL, the policy $\pi(a \mid s, z_u)$ is conditioned on these semantically aligned embeddings, facilitating:

  • Zero-shot transfer to structurally similar unseen RMs by leveraging clusterings of embeddings for semantically related instructions (e.g., “pick up blue key” and “pick up red key”).
  • Faster convergence on related subtasks in procedurally generated and held-out MiniGrid environments (Castanyer et al., 16 Oct 2025).

Empirical results show that instruction embeddings for start, middle, and end sub-tasks form natural clusters in embedding space, with semantically similar sub-tasks grouped together, supporting efficient policy re-use.
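
As a small illustration of this clustering effect, the sketch below encodes a handful of sub-task instructions with the sentence-transformers library as a stand-in encoder (the cited work uses Qwen/Mistral-family encoders) and prints their cosine similarities; the model name and instruction strings are assumptions.

    # pip install sentence-transformers numpy
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Stand-in text encoder; ARM-FM conditions on Qwen/Mistral-family embeddings instead.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    instructions = [
        "Pick up the blue key.",         # start-of-task instructions
        "Pick up the red key.",
        "Open the red door.",            # middle-of-task instruction
        "Go to the green goal square.",  # end-of-task instruction
    ]
    z = encoder.encode(instructions, normalize_embeddings=True)   # unit-norm rows

    # Cosine similarities; the two key-pickup instructions should score
    # noticeably higher with each other than with the unrelated instructions.
    print(np.round(np.asarray(z) @ np.asarray(z).T, 2))

A policy $\pi(a \mid s, z_u)$ conditioned on such embeddings can then re-use subpolicies across clustered instructions, which is the mechanism behind the zero-shot transfer described above.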

4. RM-Based Decomposition and Policy Learning

Recasting RL with RMs entails augmenting observations with the current RM state $u$, creating a Markov process over $(o, u)$. Each RM state defines a memoryless subpolicy:

$$\pi(a \mid o, u) = \pi_u(a \mid o)$$

and a corresponding Q-function $Q_u(o, a)$. At each experience $(o, a, o')$ with abstract event $\sigma = L(o, a, o')$,

$$Q_u(o, a) \leftarrow Q_u(o, a) + \alpha \left[ R(u, \sigma) + \gamma \max_{a'} Q_{\delta(u, \sigma)}(o', a') - Q_u(o, a) \right]$$

Thus, complex long-horizon tasks are decomposed into structured subtasks, each directly shaped by intermediate reward signals. This dramatically improves sample efficiency compared to flat or extrinsic-only rewards, which are often sparse and delayed in MiniGrid domains (Icarte et al., 2021).
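
A minimal tabular sketch of this update is given below; observations are assumed hashable (e.g., discretized or tuple-encoded), accepting RM states are treated as terminal for bootstrapping, and all names are illustrative rather than the QRM reference implementation.

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.99
    N_ACTIONS = 7   # MiniGrid's discrete action set size

    # One Q-table per RM state u: Q[u][(obs, action)] -> value.
    Q = defaultdict(lambda: defaultdict(float))

    def best_value(u, obs):
        return max(Q[u][(obs, a)] for a in range(N_ACTIONS))

    def qrm_update(u, obs, action, sigma, next_obs, delta, rewards, final_states):
        """One RM-augmented Q-learning step for the pair (obs, u)."""
        next_u = delta.get((u, sigma), u)        # unspecified events self-loop
        r = rewards.get((u, sigma), 0.0)
        target = r if next_u in final_states else r + GAMMA * best_value(next_u, next_obs)
        Q[u][(obs, action)] += ALPHA * (target - Q[u][(obs, action)])
        return next_u                            # caller tracks the current RM state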

5. Practical Application in MiniGrid Environments

RMs are applied to standard MiniGrid levels such as DoorKey, BlockedUnlockPickup, UnlockToUnlock, KeyCorridor, MultiRoom, and ObstructedMaze. The process involves:

  • The mission description and environment API details are provided to an FM, which synthesizes an RM with labeled event symbols and reward shaping for intermediate sub-goals.
  • Labeling functions are implemented efficiently (typically <10 lines each), mapping environment transitions to symbols for event detection.
  • The RL agent (DQN+RM) operates over the joint state $(o, u)$, or, in the case of QRM, maintains a separate Q-function for each RM state.
  • Benchmarks consistently show that DQN+RM or LRM+DDQN surpass vanilla RL methods, ICM, LLM-policy, and CLIP-reward baselines, reaching high rewards by several hundred thousand steps—even in procedurally generated or long-horizon levels where all baselines fail to learn (Castanyer et al., 16 Oct 2025).

Feature extraction (e.g., from $7 \times 7$ partial field-of-view images with object and location channel encoding) is coupled with RM-state indicators, yielding networks that abstract over events rather than raw grid positions.
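
One common way to realize the joint $(o, u)$ input is to flatten the symbolic partial view and concatenate a one-hot indicator of the current RM state, as sketched below; the array shapes follow MiniGrid's default $7 \times 7 \times 3$ observation, and the function name is illustrative.

    import numpy as np

    def augment_observation(obs_image, u_index, n_rm_states):
        """Concatenate the flattened 7x7x3 symbolic view with a one-hot RM-state indicator."""
        flat = np.asarray(obs_image, dtype=np.float32).reshape(-1)   # 147 features
        rm_onehot = np.zeros(n_rm_states, dtype=np.float32)
        rm_onehot[u_index] = 1.0
        return np.concatenate([flat, rm_onehot])

    # Example: three RM states, agent currently in u_1.
    dummy_view = np.zeros((7, 7, 3), dtype=np.uint8)
    print(augment_observation(dummy_view, u_index=1, n_rm_states=3).shape)   # (150,)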

Example: “Pick up the red key and then open the door”

Given the mission, “Pick up the red key and then open the door,” the synthesized RM for MiniGrid can be specified as:

  • $\Sigma = \{\mathrm{has\_red\_key}, \mathrm{door\_opened\_red}\}$
  • $U = \{u_0, u_1, u_2\}$; instructions: “Pick up the red key.”, “Open the red door.”, “Done.”
  • State transitions:
    • $(u_0, \mathrm{has\_red\_key}) \rightarrow u_1$, reward $+0.3$
    • $(u_1, \mathrm{door\_opened\_red}) \rightarrow u_2$, reward $+1.0$
    • Else, self-loops with zero reward
  • Labeling functions (in Python):

    def has_red_key(env):
        # True once the agent is carrying a red key.
        return env.carrying is not None and env.carrying.type == "key" and env.carrying.color == "red"

    def door_opened_red(env):
        # env.grid.grid is the flat list of cells (WorldObj instances or None); True once a red door is open.
        return any(o is not None and o.type == "door" and o.color == "red" and o.is_open for o in env.grid.grid)
  • Transition and reward functions:

$$\delta(u, \sigma) = \begin{cases} u_1 & \text{if } u = u_0 \wedge \sigma = \mathrm{has\_red\_key} \\ u_2 & \text{if } u = u_1 \wedge \sigma = \mathrm{door\_opened\_red} \\ u & \text{otherwise} \end{cases}$$

$$R(u, \sigma) = \begin{cases} 0.3 & \text{if } u = u_0,\ \sigma = \mathrm{has\_red\_key} \\ 1.0 & \text{if } u = u_1,\ \sigma = \mathrm{door\_opened\_red} \\ 0 & \text{otherwise} \end{cases}$$

This RM provides dense reward shaping, improving credit assignment by splitting the overall task into short-horizon subgoals and enabling the agent to learn key pickup and door opening orders of magnitude faster than under sparse goal-only rewards (Castanyer et al., 16 Oct 2025).
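
The sketch below wires detectors of this kind into an episode loop on the standard DoorKey level. Because DoorKey's single key and door are yellow by default, the detectors are written color-agnostically here; the environment id and unwrapped-attribute access are standard gymnasium/minigrid usage, while the RM tables mirror the example above and a random policy stands in for the learned agent.

    # pip install gymnasium minigrid
    import gymnasium as gym
    import minigrid  # noqa: F401  (registers the MiniGrid-* environment ids)

    def has_key(base):
        # True once the agent carries a key (DoorKey has a single, yellow key).
        return base.carrying is not None and base.carrying.type == "key"

    def door_opened(base):
        # base.grid.grid is the flat list of cells (WorldObj instances or None).
        return any(o is not None and o.type == "door" and o.is_open for o in base.grid.grid)

    # RM tables mirroring the worked example, with color-agnostic events.
    delta = {("u0", "has_key"): "u1", ("u1", "door_opened"): "u2"}
    rewards = {("u0", "has_key"): 0.3, ("u1", "door_opened"): 1.0}

    env = gym.make("MiniGrid-DoorKey-8x8-v0")
    obs, info = env.reset(seed=0)
    base, u, shaped_return = env.unwrapped, "u0", 0.0

    for _ in range(200):
        action = env.action_space.sample()        # stand-in for the learned policy
        obs, extrinsic_r, terminated, truncated, info = env.step(action)
        # Labeling step: emit at most one Boolean event per transition.
        sigma = None
        if u == "u0" and has_key(base):
            sigma = "has_key"
        elif u == "u1" and door_opened(base):
            sigma = "door_opened"
        if sigma is not None:
            shaped_return += rewards[(u, sigma)]
            u = delta[(u, sigma)]
        if terminated or truncated:
            break

    print("final RM state:", u, "shaped return:", shaped_return)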

6. Inference Algorithms and Computational Considerations

Passive state-merging inference for RMs, as in the DB-RPNI algorithm, is both efficient and provably correct under structure completeness. The computational complexity to obtain a minimal correct automaton is $O(|U| \cdot |L| \cdot T \cdot F)$, and empirical results show that RMs for MiniGrid (typically 4–8 states) can be inferred in minutes on CPU hardware (Wu et al., 3 Aug 2025).

Algorithmic steps for MiniGrid RM inference comprise collecting a sufficiently diverse set of labeled traces (covering all relevant event sequences), constructing prefix-tree transducers from symbol-labeled trajectories, and merging states subject to local compatibility checks over output histories. Event detector specification and preprocessing (e.g., compressing runs of identical symbols) are critical for sample efficiency and accuracy, while hyperparameter choices (confidence threshold $\delta$, maximum automaton size) directly influence automaton minimality and merging fidelity.
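
The run-compression preprocessing mentioned above can be implemented as a small utility; the sketch below assumes traces are lists of (event, reward) pairs and that the reward associated with an event arrives with its first detection.

    from itertools import groupby

    def compress_runs(trace):
        """Collapse consecutive repeats of the same event symbol, keeping each
        run's first (event, reward) pair (assumes rewards arrive on first detection)."""
        return [next(group) for _event, group in groupby(trace, key=lambda step: step[0])]

    raw = [("picked_key", 0.0), ("picked_key", 0.0), ("opened_door", 0.0),
           ("opened_door", 0.0), ("reached_goal", 1.0)]
    print(compress_runs(raw))
    # [('picked_key', 0.0), ('opened_door', 0.0), ('reached_goal', 1.0)]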

7. Limitations, Extensions, and Impact

Key limitations for RM approaches in MiniGrid include the requirement for a well-specified detector set $\mathcal{P}$, possible imperfections of these detectors in highly stochastic or information-lossy environments, and the need for sufficient coverage in observed traces to guarantee structure completeness (Icarte et al., 2021). Extensions proposed in the literature include combining RM inference with intrinsic exploration, on-the-fly automata learning, and interactive or expert-driven event set reduction.

The impact is marked: experimental findings demonstrate that RM-augmented RL achieves near-optimal sample efficiency and task completion rates on MiniGrid levels where sparse extrinsic reward or memory-augmented baselines stagnate. Structured, compositional reward design thus enables the systematic transformation of intractable long-horizon or procedural tasks into learnable curricula, establishing RMs as critical instruments for reinforcement learning in abstract, partially observable, and compositional domains (Castanyer et al., 16 Oct 2025, Icarte et al., 2021, Wu et al., 3 Aug 2025).
