Generalized Reward Matching in RL
- GRM is a comprehensive framework for reward shaping that preserves optimal policies by integrating arbitrary, non-Markovian intrinsic rewards.
- It employs matching functions to adjust raw intrinsic rewards, ensuring that the altered rewards do not change the set of optimal actions.
- GRM generalizes potential-based reward shaping and finds applications in reinforcement learning, generative reward modeling for LLMs, and robotics.
Generalized Reward Matching (GRM) is a general framework for constructing reward-shaping procedures in reinforcement learning (RL) that provably preserve optimal policies, regardless of the complexity or non-Markovian nature of the intrinsic reward source. It allows arbitrary intrinsic (exploration- or curiosity-driven) rewards to be incorporated in a policy-invariant manner, avoiding the performance pitfalls and reward hacking typically induced by naive reward augmentation. The term "GRM" also appears in unrelated domains, e.g., as a generative reward modeling paradigm for LLM-based reward models and for dynamic resource matching; the formal GRM theory in reinforcement learning, however, descends from the potential-based reward shaping (PBRS) lineage and generalizes it (Forbes et al., 2024).
1. Formal Definition and Mathematical Structure
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ be an episodic Markov Decision Process (MDP) with extrinsic reward $r$. In the presence of an intrinsic motivation (IM) signal, the total reward is often naively written as $r_t + \lambda r^{i}_t$, where $r^{i}$ is any intrinsic reward and $\lambda > 0$ is a scalar coefficient. However, this sum can alter the set of optimal policies. GRM instead defines a shaping reward $F_t$ that preserves the optimal-policy set,
with the core update given by
$$F_k = \lambda\left( r^{i}_k - \sum_{t \le k} m(t, k)\, \gamma^{t-k}\, r^{i}_t \right),$$
where $m(t, k) \ge 0$ is a matching function with two requirements:
- Full-matching: $\sum_{k} m(t, k) = 1$ for every step $t$, so each intrinsic reward is eventually matched (cancelled) in full within the episode.
- Future-agnostic: for $k < t$, $m(t, k) = 0$; a reward received at step $t$ may only be matched at step $t$ or later, and the matching weights applied at step $k$ may not depend on the trajectory beyond step $k$.
GRM generalizes all potential-based reward shaping (PBRS) methods: there always exists a (history-dependent) potential function $\Phi$ such that
$$F_t = \gamma\,\Phi_{t+1} - \Phi_t,$$
and, reciprocally, for any potential-based shaping a corresponding matching function exists that recovers it within GRM (Forbes et al., 2024).
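A minimal sketch of these definitions for a single episode, assuming the core-update formula as written above (the function names and the one-step-delay matching function are illustrative, not from the paper):

```python
import numpy as np

def grm_shaping(intrinsic, matching, lam=0.1, gamma=0.99):
    """Compute GRM shaping rewards F_k for one episode.

    intrinsic: raw intrinsic rewards r^i_t for t = 0..T-1.
    matching:  matching(t, k) -> weight in [0, 1]; for each t the weights
               must sum to 1 over k (full-matching) and be zero for
               k < t (future-agnostic).
    """
    T = len(intrinsic)
    F = np.zeros(T)
    for k in range(T):
        # discounted repayment of earlier intrinsic rewards matched at step k
        matched = sum(matching(t, k) * gamma ** (t - k) * intrinsic[t]
                      for t in range(k + 1))  # only t <= k can contribute
        F[k] = lam * (intrinsic[k] - matched)
    return F

# Example matching function: repay each reward one step later; the final
# reward has no later step, so it matches itself at the terminal step.
T = 5
def one_step_delay(t, k):
    if t < T - 1:
        return 1.0 if k == t + 1 else 0.0
    return 1.0 if k == T - 1 else 0.0  # terminal correction

F = grm_shaping(np.ones(T), one_step_delay)
```

Because of full-matching, the discounted sum of the shaped rewards over the episode is zero: every $+\lambda\gamma^t r^{i}_t$ is exactly cancelled by its discounted repayments, which is why shaped and unshaped returns rank policies identically.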
2. Policy Invariance and Theoretical Guarantees
The primary theoretical property of GRM is policy invariance: the set of optimal policies for the shaped MDP with reward $r + F$ is identical to that of the original MDP with reward $r$, provided mild boundary conditions hold (the expectation of terminal potentials is action-independent). The proof shows that the added shaping alters the $Q$-values only by an action-independent offset, leaving $\arg\max_a Q^*(s, a)$ unchanged. This holds even for non-Markovian, history-dependent shaping rewards as long as the matching function constraints are met. This guarantee subsumes all prior PBRS guarantees—GRM is provably as general as possible under additive, potential-based schemes (Forbes et al., 2024).
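In the PBRS special case $F_t = \gamma\Phi(s_{t+1}) - \Phi(s_t)$, the offset argument is a one-line telescoping computation:
$$Q^{\pi}_{F}(s_0, a_0) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\bigl(r_t + \gamma\Phi(s_{t+1}) - \Phi(s_t)\bigr)\right] = Q^{\pi}(s_0, a_0) + \mathbb{E}\bigl[\gamma^{T}\Phi(s_T)\bigr] - \Phi(s_0).$$
The intermediate potentials cancel, leaving only the boundary terms; when the expected terminal potential is action-independent, the shaped and unshaped $Q$-functions differ by a quantity independent of $a_0$, so their greedy action sets coincide.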
3. Implementation in Reinforcement Learning Algorithms
In practice, GRM is incorporated into RL through episodic bookkeeping. At each step, the agent:
- Computes and records $r^{i}_t$ (the raw intrinsic reward).
- Computes the shaping reward $F_t$ using the matching function via the formula above.
- Passes $r_t + F_t$ to the RL agent.
A standard variant, D-Delay matching, is defined by the matching function $m(t, k) = 1$ if $k = t + D$, or (for terminal corrections) if $t + D > T$ and $k = T$; zero otherwise. For stochastic policy optimization (e.g., PPO), an additional reward-matching loss term can be introduced, such as
$$\mathcal{L}_{\mathrm{match}} = \mathbb{E}_t\!\left[\bigl(\hat{R}_t - R^{\mathrm{tgt}}_t\bigr)^2\right],$$
where $R^{\mathrm{tgt}}_t$ is a target return estimate (Villalobos-Arias et al., 26 Jul 2025). The shaping machinery can be integrated in architectures with multiple critics, each estimating different components and collaborating to minimize the matching objective.
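The episodic bookkeeping for D-Delay matching can be sketched as a streaming computation; the structure and names below are illustrative assumptions rather than the published implementation:

```python
from collections import deque

def ddelay_shaped_rewards(intrinsic, D=2, lam=0.1, gamma=0.99):
    """Stream shaping rewards F_t for one episode under D-Delay matching:
    each intrinsic reward is repaid (discounted) D steps later; anything
    still unmatched at episode end is repaid at the terminal step."""
    T = len(intrinsic)
    shaped = []
    pending = deque()  # (step_received, raw_intrinsic), oldest first
    for t, r_i in enumerate(intrinsic):
        repay = 0.0
        # repay intrinsic rewards received exactly D steps ago
        while pending and pending[0][0] == t - D:
            s, r = pending.popleft()
            repay += gamma ** (s - t) * r
        if t == T - 1:  # terminal correction: flush everything left
            repay += r_i  # the final reward matches itself immediately
            for s, r in pending:
                repay += gamma ** (s - t) * r
            pending.clear()
            shaped.append(lam * (r_i - repay))
            break
        pending.append((t, r_i))
        shaped.append(lam * (r_i - repay))
    return shaped
```

As with any full matching, the discounted sum of the shaped rewards over an episode is zero, so the shaping contributes no net return to any policy.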
4. GRM Beyond Reward Shaping: LLMs and Robotics
The term "GRM" is also used in other contexts:
- Generative Reward Modeling (LLMs): Here, GRM denotes models where rewards for rationale evaluation are generated textually and then mapped to scalars via critique extraction and aggregation (Liu et al., 3 Apr 2025). These models, e.g., DeepSeek-GRM, utilize inference-time parallel sampling and meta-reward models for voting, with learning driven by Self-Principled Critique Tuning. This use of GRM is distinct in methodology and objective from GRM in RL reward shaping.
- Robotic Manipulation: In the context of process reward modeling, GRM refers to a vision-LLM for dense reward assignment to task progress in robotic episodes. The shaping for policy invariance in Dopamine-RL (built upon this GRM) uses the telescoping potential-based update $r'_t = r_t + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)$, guaranteeing the same set of optimal policies and avoiding pitfalls like the semantic trap (Tan et al., 29 Dec 2025).
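The telescoping cancellation behind such potential-based updates can be checked numerically; the reward and potential values below are arbitrary placeholders:

```python
# Potential-based shaping: r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t).
# Over an episode the potentials telescope, so the shaped return differs
# from the original return only by the (action-independent) boundary term.
gamma = 0.9
rewards = [0.0, 0.0, 1.0]    # extrinsic rewards r_t
phi = [0.2, 0.5, 0.8, 0.0]   # Phi(s_0)..Phi(s_3); terminal potential 0

shaped = [r + gamma * phi[t + 1] - phi[t] for t, r in enumerate(rewards)]

ret = sum(gamma ** t * r for t, r in enumerate(rewards))
shaped_ret = sum(gamma ** t * r for t, r in enumerate(shaped))
# shaped_ret equals ret - phi[0]: a constant offset that cannot change
# which policy maximizes the return.
```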
5. Empirical Results and Applications
Extensive experimentation validates GRM across multiple domains:
- In MiniGrid and Cliff Walking, GRM (with tabular count or RND intrinsic reward) avoids reward gaming and procrastination seen with naive IM; convergence to optimal policies is preserved and, frequently, achieved faster (Forbes et al., 2024).
- In MiniGrid behavioral studies, GRM mitigates reward hacking by aligning agent behavior to extrinsic-reward baselines, even with strong exploration bonuses (Villalobos-Arias et al., 26 Jul 2025).
- In high-precision robotic manipulation, a multi-view, hop-based GRM improves reward accuracy and enables sample-efficient learning; ablating any shaping or fusion part degrades performance (Tan et al., 29 Dec 2025).
- In LLM reward modeling, sampling-based GRM and meta-aggregation (e.g., SPCT-GRM-27B with voting) can outperform much larger scalar reward models on several preference and reasoning benchmarks (Liu et al., 3 Apr 2025).
| Domain | Role of GRM | Key Property / Outcome |
|---|---|---|
| RL Shaping | Converts arbitrary IM into policy-invariant form | Guarantees optimality preservation |
| LLM Reward | Textual, generative RM with critique sampling | Scalable, error-reduced with inference-time voting |
| Robotics | Vision-language process reward, shaped policy | Dense, step-aware, policy-invariant rewards |
6. Limitations, Open Problems, and Extensions
While GRM is theoretically general for episodic, additive shaping, several challenges remain:
- Finite Budget: Policy invariance is strictly realized only in the infinite-sample, perfect-approximation limit; in practice, approximation and sample errors can induce discrepancies between shaped and unshaped policies (Villalobos-Arias et al., 26 Jul 2025).
- Parameter Tuning: Selection of the matching kernel $m(t, k)$ and delay schedule is often environment-specific and presents a trade-off between effective guidance and bias suppression.
- Scalability: In large-scale RL with function approximation, the interaction between shaping and deep network approximation may break certain theoretical assumptions (e.g., constancy of offset in Q-values) (Forbes et al., 2024).
- Meta-Learning and Adaptivity: Automatic learning of the matching kernel, adaptation to non-episodic or multi-agent tasks, and integration with large, structured action spaces remain open research avenues.
7. Relationship to Previous and Alternative Methods
GRM strictly generalizes PBRS and all preceding optimality-preserving reward shaping approaches. Specific choices of the matching kernel $m$ recover known schemes: state-only potentials, state-action/time-dependent potentials, or episodic truncation. Unlike naive IM, which can alter optimality or induce pathological exploration, and unlike methods tied to explicit Markovian structure, GRM's flexible matching enables safe exploitation of arbitrary history-dependent reward signals (Forbes et al., 2024). In applications such as LLMs and high-precision robotics, GRM's potential-based schemes support sample-efficient and reliable policy optimization that can scale with compute or data, provided policy invariance is respected.
References:
- (Forbes et al., 2024) Forbes et al., "Potential-Based Intrinsic Motivation: Preserving Optimality With Complex, Non-Markovian Shaping Rewards"
- (Villalobos-Arias et al., 26 Jul 2025) "Minding Motivation: The Effect of Intrinsic Motivation on Agent Behaviors"
- (Liu et al., 3 Apr 2025) "Inference-Time Scaling for Generalist Reward Modeling"
- (Tan et al., 29 Dec 2025) "Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation"