- The paper introduces ARM-FM, which leverages foundation models to generate reward machines from natural language task descriptions for compositional RL.
- The framework employs Language-Aligned Reward Machines, which pair each automaton state with a natural language instruction and its embedding to provide dense, interpretable feedback.
- Empirical evaluations across diverse environments demonstrate that ARM-FM outperforms baselines in sparse-reward tasks and enables zero-shot task generalization.
Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
Introduction and Motivation
The specification of reward functions remains a central bottleneck in scaling reinforcement learning (RL) to complex, long-horizon tasks. Sparse rewards impede credit assignment, while dense, hand-crafted rewards are prone to reward hacking and require significant domain expertise. The ARM-FM framework addresses this by leveraging the high-level reasoning and code-generation capabilities of foundation models (FMs) to automate the construction of reward machines (RMs)—finite-state automata that encode compositional, temporally extended reward structures. This approach enables RL agents to receive structured, interpretable, and dense reward signals derived from natural language task descriptions, facilitating both efficient learning and generalization.
Figure 1: An overview of the ARM-FM framework (left) and its empirical results in a complex sparse-reward environment (right).
ARM-FM Framework
Language-Aligned Reward Machines (LARMs)
ARM-FM introduces Language-Aligned Reward Machines (LARMs), which extend standard RMs by associating each automaton state with a natural language instruction and a corresponding embedding. The FM is prompted with a high-level task description and a visual observation, and it generates:
- The RM structure (states, transitions, events)
- Executable Python labeling functions for event detection
- Natural language instructions for each RM state
This process is iteratively refined using a generator-critic FM loop, optionally incorporating human verification for correctness and compactness. The resulting LARM provides a compositional decomposition of the task, with each subgoal semantically grounded in language.
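To make the generated structure concrete, below is a minimal sketch of how a LARM could be represented in Python. The class and field names (`LARMState`, `event_fn`, etc.) are illustrative assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

import numpy as np


@dataclass
class LARMState:
    name: str               # automaton state identifier, e.g. "u_get_key"
    instruction: str        # natural language subgoal, e.g. "pick up the key"
    embedding: np.ndarray   # embedding of the instruction from a pretrained LM


@dataclass
class LARMTransition:
    source: str                             # source automaton state
    target: str                             # target automaton state
    event_fn: Callable[[np.ndarray], bool]  # FM-generated labeling function over observations
    reward: float                           # dense reward emitted when the transition fires


@dataclass
class LARM:
    states: Dict[str, LARMState]
    transitions: List[LARMTransition]
    initial: str
    terminal: List[str] = field(default_factory=list)

    def step(self, current: str, obs: np.ndarray) -> Tuple[str, float]:
        """Advance the automaton on a new observation, returning (next state, LARM reward)."""
        for t in self.transitions:
            if t.source == current and t.event_fn(obs):
                return t.target, t.reward
        return current, 0.0
```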
Policy Conditioning and Skill Space
During RL training, the agent's policy is conditioned on both the environment state and the embedding of the current LARM state. This enables the agent to recognize and reuse skills across tasks with semantically similar subgoals, as the embedding space clusters related instructions. The reward at each timestep is the sum of the environment reward and the LARM reward, with the latter providing dense feedback for subgoal completion.
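As a rough sketch of this conditioning, reusing the hypothetical `LARM` container above and assuming a Gymnasium-style environment with vector observations:

```python
import numpy as np


def rollout_step(env, policy, larm, larm_state, obs):
    """One interaction step with a LARM-conditioned policy (illustrative sketch).

    The policy sees the environment observation concatenated with the embedding
    of the current LARM state; the scalar reward handed to the RL algorithm is
    the sum of the environment reward and the LARM reward.
    """
    conditioned_obs = np.concatenate([obs, larm.states[larm_state].embedding])
    action = policy(conditioned_obs)

    next_obs, env_reward, terminated, truncated, _ = env.step(action)

    # Advance the automaton; a fired transition yields a dense subgoal-completion reward.
    next_larm_state, larm_reward = larm.step(larm_state, next_obs)
    total_reward = env_reward + larm_reward

    return next_obs, next_larm_state, total_reward, terminated or truncated
```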
Empirical Evaluation
ARM-FM is evaluated across a diverse suite of environments, including MiniGrid (sparse-reward gridworlds), Craftium (3D Minecraft-like resource gathering), Meta-World (robotic manipulation with continuous control), and XLand-MiniGrid (procedural meta-RL). The framework is compared against strong baselines, including intrinsic motivation (ICM, RND), LLM-as-agent, and VLM-based reward models.
Sparse-Reward Gridworlds
On MiniGrid DoorKey and long-horizon BabyAI tasks, ARM-FM consistently outperforms all baselines, solving tasks that are intractable for standard RL and intrinsic motivation methods.
Figure 2: Performance on MiniGrid-DoorKey environments of increasing size. ARM-FM achieves higher rewards across all tasks.
Figure 3: Performance on complex, long-horizon MiniGrid tasks. Only ARM-FM solves all three tasks.
3D Procedural Environments
In Craftium, a procedurally generated Minecraft-like environment, ARM-FM enables PPO agents to solve resource-gathering tasks requiring long action sequences, where baseline agents fail due to the extreme sparsity of the reward.
Figure 1: (Right) ARM-FM enables efficient learning in a complex, sparse-reward 3D environment.
Robotic Manipulation
On Meta-World, ARM-FM automates reward engineering for continuous control tasks, providing dense supervision and enabling higher success rates than sparse reward agents. The LARM can be combined with intrinsic exploration bonuses (e.g., RND) for further gains.
Figure 4: Performance on Meta-World manipulation tasks. ARM-FM achieves high success rates compared to sparse reward agents.
Generalization and Zero-Shot Transfer
Ablation studies in XLand-MiniGrid demonstrate that both the structured rewards and the language-based state embeddings are essential for robust multi-task learning. Conditioning the policy on LARM embeddings enables zero-shot generalization: agents can solve novel composite tasks composed of familiar subgoals without additional training.
Figure 5: Demonstration of zero-shot generalization. The agent reuses learned skills to solve an unseen composite task.
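A rough illustration of what such zero-shot evaluation could look like, assuming the hypothetical `rollout_step` sketch above: a new LARM is assembled from subgoal instructions seen during training, and a frozen policy is rolled out without any further updates.

```python
def evaluate_zero_shot(env, frozen_policy, composite_larm, max_steps=500):
    """Roll out a frozen policy on a novel composite LARM (illustrative sketch)."""
    obs, _ = env.reset()
    u = composite_larm.initial
    for _ in range(max_steps):
        obs, u, _, done = rollout_step(env, frozen_policy, composite_larm, u, obs)
        if done or u in composite_larm.terminal:
            break
    return u in composite_larm.terminal
```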
Analysis of FM Generation and Embedding Space
Larger FMs (e.g., Qwen3-32B) generate syntactically and semantically correct RMs and labeling code with higher reliability. The embedding space of state instructions, visualized via PCA, exhibits clear semantic structure, with start, middle, and end states forming distinct clusters. This structure underpins the agent's ability to generalize and transfer skills across tasks.

Figure 6: FM generation correctness scales with model size.
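The projection used for this analysis is standard; a minimal sketch is given below, where the choice of `embed_fn` (any sentence-embedding model mapping a string to a fixed-size vector) is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA


def project_instruction_embeddings(instructions, embed_fn, n_components=2):
    """Project LARM instruction embeddings to 2D to inspect their semantic structure."""
    X = np.stack([embed_fn(text) for text in instructions])
    return PCA(n_components=n_components).fit_transform(X)
```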
Implementation Considerations
Training Procedure
The RL agent is trained using standard algorithms (DQN, PPO, SAC, Rainbow) with the following modifications:
- The state input is augmented with the LARM state embedding.
- The reward is the sum of the environment and LARM rewards.
- The experience replay buffer stores transitions over the joint state space (s_t, u_t), where s_t is the environment state and u_t the current LARM state.
Detailed pseudocode for DQN with LARMs is provided in the paper's appendix; a rough sketch of the transition construction is given below.
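This sketch is not the paper's pseudocode; it only illustrates, under the assumptions of the earlier hypothetical `LARM` container, how a replay transition over the joint state (s_t, u_t) might be formed.

```python
import numpy as np


def build_dqn_transition(obs, larm_state, action, env_reward, next_obs, done, larm):
    """Form a replay-buffer transition over the joint state (s_t, u_t) (illustrative sketch)."""
    next_larm_state, larm_reward = larm.step(larm_state, next_obs)

    # Augment both observations with their LARM state embeddings.
    s = np.concatenate([obs, larm.states[larm_state].embedding])
    s_next = np.concatenate([next_obs, larm.states[next_larm_state].embedding])

    # Stored reward is the sum of environment and LARM rewards.
    r = env_reward + larm_reward

    # Episode ends when either the environment or the automaton terminates.
    terminal = done or next_larm_state in larm.terminal
    return (s, action, r, s_next, terminal), next_larm_state
```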
Resource Requirements
- FM inference for RM and labeling function generation is performed offline and can be parallelized.
- The additional computational overhead during RL training is minimal, as LARM state transitions and reward computations are lightweight.
- The embedding function can be implemented using any pretrained LLM (e.g., Qwen, Llama, Gemma); a minimal sketch follows this list.
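One possible embedding function, using mean-pooled hidden states of a pretrained LM via Hugging Face transformers; the model ID below is an illustrative choice, not one prescribed by the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; any pretrained LM with accessible hidden states works.
_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
_model = AutoModel.from_pretrained("Qwen/Qwen2.5-0.5B")


@torch.no_grad()
def embed_instruction(text: str) -> torch.Tensor:
    """Mean-pool the LM's last hidden states as a simple instruction embedding."""
    inputs = _tokenizer(text, return_tensors="pt")
    hidden = _model(**inputs).last_hidden_state   # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)          # (d_model,)
```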
Limitations and Trade-offs
- Human verification of FM-generated RMs can improve quality but is not strictly required; automated self-correction via formal verification is a promising direction.
- The approach assumes that the FM can generate correct and executable labeling functions for the environment API.
- The semantic quality of the embedding space is critical for generalization; poor embeddings may limit transfer.
Implications and Future Directions
ARM-FM demonstrates that FMs can bridge the gap between high-level task specification and low-level RL reward shaping, automating a process that previously required significant manual effort. The compositional, language-aligned structure of LARMs enables interpretable, modular, and generalizable RL agents. This paradigm supports human-in-the-loop refinement, facilitates policy transfer, and provides a foundation for hierarchical RL.
Future work may focus on:
- Fully automating the RM verification and refinement process via formal methods.
- Extending the approach to richer, more expressive automata (e.g., pushdown reward machines).
- Integrating multimodal FMs for environments with complex visual and symbolic state spaces.
- Scaling to real-world robotics and multi-agent systems.
Conclusion
ARM-FM provides a principled and practical framework for automated, compositional reward design in RL, leveraging the reasoning and code-generation capabilities of foundation models. By translating natural language task descriptions into structured, language-aligned reward machines, ARM-FM enables RL agents to efficiently learn and generalize across complex, long-horizon tasks. The empirical results establish the effectiveness of this approach across diverse domains, and the modular, interpretable structure of LARMs paves the way for scalable, human-aligned RL systems.