
Reinforcement Learning with Transferable Reward

Updated 10 February 2026
  • RLTR is a framework that decouples dynamics from rewards, enabling policies to rapidly adapt to new reward functions.
  • It leverages latent representation learning and meta-reward shaping for efficient zero-shot, few-shot, and rapid policy transfer.
  • Empirical results show significant improvements in steps to optimality and task performance across robotics, control, and vision domains.

Reinforcement Learning with Transferable Reward (RLTR) refers to a family of methods, algorithmic frameworks, and theoretical analyses that enable efficient adaptation to new reward functions or tasks, leveraging prior experience or learned representations, while typically holding core environment dynamics fixed or partially shared. RLTR is motivated by limitations observed in conventional reinforcement learning (RL) and transfer learning, where policies or value functions learned for a specific reward or task often fail to generalize or efficiently adapt when the reward specification changes, even with shared dynamics or structure across tasks. RLTR aims to enable zero-shot, few-shot, or rapid transfer to new rewards, with applications ranging from robotics and control to LLM reasoning and object localization.

1. Formal Problem Setting and Core Principles

The RLTR framework is typically formalized over a family of Markov Decision Processes (MDPs) that share a state space $\mathcal{S}$, an action space $\mathcal{A}$, an unknown but fixed transition kernel $\mathcal{P}$, and a discount factor $\gamma \in [0, 1)$. Each task $\mathcal{M}_i$ is defined by its own reward function $r_i: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$:

$$\mathcal{M}_i = \bigl(\mathcal{S}, \mathcal{A}, \mathcal{P}, r_i, \gamma, \mu\bigr)$$

where $\mu$ is the initial-state distribution. Given a batch dataset $D = \{(s_h^m, a_h^m, s_{h+1}^m)\}_{m, h}$ with no reward labels, the central RLTR objective is: once presented with a new reward $r^*$, how can one leverage $D$ ("train once, use many") to efficiently construct a near-optimal policy for the new reward, avoiding explicit dynamics modeling or extensive data collection (Chen et al., 2023)? The key technical ideas across RLTR methods are:

  • Decoupling dynamics from reward, so that changing the reward does not necessitate retraining the dynamics model or the entire RL stack (Zhang et al., 2018, Chen et al., 2023).
  • Latent/implicit representation learning: building abstracted, dynamics-invariant representations that support recombination across novel rewards (Chen et al., 2023, Zhang et al., 2018).
  • Transferable reward learning: learning representations—random feature Q-cumulants, metric-learned rewards, meta-reward shaping—which support efficient reward adaptation without reoptimization of core features (Li et al., 2022, Hua et al., 2020, Azran et al., 2023).
  • Planning or regression-based rapid adaptation: after pretraining, typically only a small-scale regression or online planning step is necessary to adapt to a new reward (Chen et al., 2023, Tang et al., 2022).
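To make the shared-dynamics, per-reward structure concrete, the following is a minimal tabular sketch (all quantities are synthetic, not drawn from any cited paper): two tasks share the same transition tensor $P$ but define different rewards, and value iteration under each reward yields task-specific values and policies without touching the dynamics.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, iters=500):
    """Optimal values and greedy policy for a tabular MDP with P[s, a, s']."""
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = r + gamma * (P @ v)     # Q-values, shape (S, A)
        v = q.max(axis=1)
    return v, q.argmax(axis=1)

rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # shared, fixed dynamics

r1 = rng.standard_normal((S, A))             # task M_1's reward
r2 = -r1                                     # task M_2: same P, different reward

v1, pi1 = value_iteration(P, r1)
v2, pi2 = value_iteration(P, r2)
```

Only the reward argument changes between tasks; RLTR methods aim to amortize everything that depends on `P` alone.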

2. Methodologies for Transferable Reward Learning

A variety of methodological approaches have emerged for RLTR, including:

Random Feature Model-Free Planning: The RaMP method (Chen et al., 2023) pretrains an ensemble of $K$ Q-cumulant networks using randomly generated features $\{\phi_k\}_{k=1}^{K}$, which serve as pseudo-rewards for each offline trajectory. At test time, the downstream reward $r^*$ is expressed as a linear combination of these features, the new Q-function is efficiently constructed via ridge regression, and actions are selected via model-predictive control (MPC) for rapid policy planning.
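The recipe rests on the fact that, for a fixed policy, the Q-function is linear in the reward. The sketch below illustrates this in a toy tabular setting, substituting exact policy evaluation for RaMP's Monte Carlo Q-cumulant estimates; all names and sizes are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, K, gamma = 5, 3, 40, 0.9
n = S * A

# fixed dynamics and a fixed behavior policy (uniform), tabular for clarity
P = rng.dirichlet(np.ones(S), size=(S, A)).reshape(n, S)   # P[(s,a), s']
pi = np.full((S, A), 1.0 / A)
P_pi = (P[:, :, None] * pi[None, :, :]).reshape(n, n)      # (s,a) -> (s',a')

# "pretraining": Q-cumulants for K random pseudo-rewards phi_k (one column each)
Phi = rng.standard_normal((n, K))
resolvent = np.linalg.inv(np.eye(n) - gamma * P_pi)        # exact evaluation
Q_cumulants = resolvent @ Phi

# test time: a new reward r*; ridge-regress it onto the random features
r_star = rng.standard_normal(n)
lam = 1e-6
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ r_star)

# linearity of Q in the reward lets the pretrained cumulants recombine directly
Q_star_hat = Q_cumulants @ w
```

Because $K > |\mathcal{S}||\mathcal{A}|$ here, the random features span the reward space and the recombined Q-function matches the one computed from scratch; in function approximation the match is only approximate.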

Meta-Reward Shaping and Embedding: HMRL (Hua et al., 2020) learns a common meta-state embedding $\phi$ and potential function $\psi$ for reward shaping across environments with heterogeneous states and sparse rewards. This framework leverages MAML-style meta-learning to enable zero-shot transfer and rapid fine-tuning in new environments via embedding-shared potential-based rewards.

Metric Learning for Reward Construction: In object localization, RLTR is realized by learning an ordinal embedding function $f_\theta$ such that the reward is the reduction in distance to a prototype:

$$R(s_{t-1}, s_t) = \| f_\theta(b_{t-1}) - c \|_2 - \| f_\theta(b_t) - c \|_2$$

where $c$ is a prototype embedding computed from an exemplar set and $b_t$ is the predicted box at step $t$. This reward is fully determined by the new target class or object and is directly reusable or adaptable to new domains and label sets (Li et al., 2022).
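A minimal sketch of this reward, with the learned embedding $f_\theta$ replaced by an identity placeholder (all values synthetic): swapping in a new prototype $c$ retargets the reward without retraining anything.

```python
import numpy as np

def ordinal_reward(emb_prev, emb_curr, prototype):
    """Reward = reduction in embedding distance to the class prototype c."""
    return (np.linalg.norm(emb_prev - prototype)
            - np.linalg.norm(emb_curr - prototype))

# stand-in for f_theta(b_t): here the "embedding" is just the raw coordinates
c = np.array([1.0, 0.0])                      # prototype from an exemplar set
r_toward = ordinal_reward(np.array([3.0, 0.0]), np.array([2.0, 0.0]), c)
r_away = ordinal_reward(np.array([2.0, 0.0]), np.array([3.0, 0.0]), c)
```

Moving toward the prototype yields positive reward, moving away negative, so the agent is driven to shrink the embedding distance step by step.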

Potential-Based Reward Shaping: RLTR often leverages potential-based shaping, where the environment's true reward $r$ is supplemented with a shaped term $F(s, a, s') = \gamma \phi(s') - \phi(s)$, ensuring policy invariance and facilitating efficient exploration and transfer (Hua et al., 2020, Ferret et al., 2019).
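The policy-invariance property follows from a telescoping sum: along any trajectory, the shaped return differs from the plain return only by terms that do not depend on the actions taken. A short numeric check, using an arbitrary potential:

```python
import numpy as np

def shaped_return(rewards, states, phi, gamma=0.9):
    """Discounted return with potential-based shaping F = gamma*phi(s') - phi(s)."""
    g = 0.0
    for t, r in enumerate(rewards):
        F = gamma * phi(states[t + 1]) - phi(states[t])
        g += gamma**t * (r + F)
    return g

rng = np.random.default_rng(2)
T, gamma = 6, 0.9
states = rng.standard_normal(T + 1)   # s_0, ..., s_T
rewards = rng.standard_normal(T)
phi = lambda s: 3.0 * s**2            # arbitrary potential function

plain = sum(gamma**t * r for t, r in enumerate(rewards))
shaped = shaped_return(rewards, states, phi, gamma)
# telescoping: shaped - plain = gamma**T * phi(s_T) - phi(s_0)
```

Since the difference depends only on the first and last states (and vanishes as the horizon grows or if the potential is zero at termination), ranking over policies is preserved, which is exactly why shaped rewards transfer safely.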

Operator-Based and Functional Generalization: Operator Deep Q-Learning (Tang et al., 2022) constructs neural operator architectures $\mathcal{G}_\theta$ that approximate the entire mapping from arbitrary reward functions $r$ to value functions $q_{*, r}$, enabling zero-shot evaluation and policy transfer for any new $r$ via a single inference pass.
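For the policy-evaluation case the true reward-to-value operator is linear (the Bellman resolvent), so a toy version of operator learning can be shown with plain least squares; the actual method uses neural operators to handle the nonlinear optimal-value map, and everything below is a synthetic illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
S, gamma = 6, 0.9
P = rng.dirichlet(np.ones(S), size=S)             # fixed-policy transition matrix
resolvent = np.linalg.inv(np.eye(S) - gamma * P)  # true reward -> value operator

# "pretraining": sample reward functions and their exact value functions
R_train = rng.standard_normal((50, S))
V_train = R_train @ resolvent.T

# fit a linear operator G approximating the map r -> v
G, *_ = np.linalg.lstsq(R_train, V_train, rcond=None)

r_new = rng.standard_normal(S)                    # unseen reward at test time
v_zero_shot = r_new @ G                           # single forward pass, no RL
```

With enough training rewards the fitted operator coincides with the resolvent, so any new reward is evaluated in one pass; neural operators extend this idea beyond the tabular, linear regime.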

3. Adaptation and Optimization Procedures

The transfer procedure in RLTR is characterized by (1) decoupled self-supervised or meta-pretraining, and (2) lightweight adaptation for new tasks or rewards:

  • Offline Pretraining: Embedding networks, reward shaping potentials, random feature families, or Q-operator networks are trained on large, unrewarded (or multi-reward) datasets. No knowledge of target rewards is required.
  • Online Adaptation: Upon observing $r^*$, adaptation consists of a small-batch regression (e.g., fitting the weight vector $w^*$ in RaMP (Chen et al., 2023)) or short policy-gradient updates using the task-specific shaped reward or new embedding.
  • Planning: Model-predictive control (MPC) or exhaustive action-sequence evaluation is often used on top of the constructed Q-function to select actions; this bypasses the need to learn model-based transition rollouts (Chen et al., 2023).
  • Zero-Shot/One-Shot Transfer: In operator and embedding-based methods (Tang et al., 2022, Hua et al., 2020), adaptation to a new reward is sometimes purely in the forward pass—no further gradient steps are needed.
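The planning step can be as simple as random shooting: sample candidate actions, score them with the freshly adapted Q-function, and act greedily, with no transition rollouts. A minimal sketch (the quadratic `q_fn` is a synthetic stand-in for an adapted Q-estimate):

```python
import numpy as np

def plan_action(q_fn, state, action_dim, n_candidates=256, rng=None):
    """One MPC-style step: sample candidate actions, score with the adapted Q,
    and act greedily; no dynamics model is rolled out."""
    if rng is None:
        rng = np.random.default_rng()
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, action_dim))
    scores = np.array([q_fn(state, a) for a in candidates])
    return candidates[scores.argmax()]

# toy adapted Q: high value for actions that cancel out the current state
q_fn = lambda s, a: -np.linalg.norm(s + a)
a_best = plan_action(q_fn, np.array([0.5, -0.5]), action_dim=2,
                     rng=np.random.default_rng(4))
```

Richer variants score short action sequences rather than single actions, but the interface is the same: the only task-specific input is the adapted Q-function.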

4. Empirical Results and Cross-Domain Impact

Extensive experiments across RL domains confirm the strong practical benefits of RLTR approaches:

| Domain / Benchmark | Transfer Methods | Key Metrics | Representative Result / Advantage |
|---|---|---|---|
| Meta-World, D'Claw, Hopper | RaMP, SF, MBPO | Steps to near-optimality, cumulative reward | RaMP: ~2–3k steps vs. MBPO/PETS: ≥10k; higher final return (Chen et al., 2023) |
| Multiworld Navigation, Mazes | HMRL, MAML, NoRML | Task steps, zero-shot transfer | HMRL: 30% fewer steps, closes "hard case" gap (Hua et al., 2020) |
| Corrupted MNIST, CUB-200, COCO | Ordinal-RLTR, TFA | CorLoc, OrdAcc | RLTR adaptation: +5–39% CorLoc improvement (Li et al., 2022) |
| Pendulum, HalfCheetah, Ant | Operator-DQ, SF | MSE, offline episodic return | Operator-DQ: 30–50% lower MSE, sometimes beats online expert (Tang et al., 2022) |
| LLM reasoning (MATH500, GSM8K) | RLTR vs. RLVR | Maj@64, sample efficiency | RLTR: +3.6 points Maj@64, 2.5× fewer steps (Lee et al., 9 Feb 2026) |

These gains persist even for high-dimensional continuous action spaces, vision-based settings (pixel inputs), and meta-RL transfer scenarios. RLTR approaches generally outperform or match reward-conditioned offline RL and model-based RL (MBRL) when test rewards are out-of-distribution (OOD) or unlabeled.

5. Theoretical Foundations and Guarantees

RLTR is supported by several theoretical analyses:

  • Expressivity of Q-cumulant bases: For sufficiently large $K$ and expressive random feature maps, any Q-function for a novel reward can be approximated as a linear combination of the pre-trained cumulant Q-functions (Chen et al., 2023), supporting universality claims.
  • Potential-based shaping preservation: Reward shaping via potentials (e.g., $F(s, a, s') = \gamma \phi(s') - \phi(s)$) guarantees policy invariance, entailing that transfer does not introduce new local optima (Hua et al., 2020, Ferret et al., 2019).
  • Operator consistency: Neural operator architectures are designed to match the true Bellman resolvent, yielding strong monotonicity and transfer properties (Tang et al., 2022).
  • Transfer error bounds: For IRL-based recovery of reward functions from multiple experts with diverse transition laws, transferability suffices if the “principal angle” between potential-shaping subspaces is bounded away from zero. This yields quantitative bounds on sub-optimality in new environments (Schlaginhaufen et al., 2024).

Notably, recent work has extended transferability theory to situations where only expert demonstrations (not policies) are available, using principal angle and PAC-style algorithms to guarantee recovery of approximately optimal transferable rewards (Schlaginhaufen et al., 2024). In operator-based (Tang et al., 2022) and decoupled architectures (Zhang et al., 2018), invariance is established via architecture design and empirical validation.

6. Limitations, Open Problems, and Research Directions

RLTR, despite strong advances, faces several challenges and open questions:

  • Variance and Model Complexity: Monte Carlo-based cumulant estimation may introduce high variance; a large $K$ (number of random features) increases representation power but can slow adaptation (Chen et al., 2023).
  • Reward Class Coverage: Operator-based transfer requires an adequate sampling of reward distributions during pretraining; for underrepresented or misaligned rewards, transfer will underperform (Tang et al., 2022).
  • Irreducible Transfer Gaps: For IRL, complete transfer requires nontrivial geometric alignment between demonstration-generated potential-shaping subspaces; in practice, small angles or reward class mismatches introduce residual sub-optimality (Schlaginhaufen et al., 2024).
  • Multi-objective and Multi-agent Extensions: Extending RLTR frameworks to simultaneously handle multi-objective, multi-agent, or dynamically-varying environments remains an important direction (Ardon et al., 2024).
  • Pixel-level Transfer: Methods based on random features or embedding often degrade under noisy visual input, as the “signal” in random features is suppressed (Chen et al., 2023).

Ongoing areas of research include dynamically adaptive random feature selection, joint end-to-end embedding-policy training, multi-head operator learning, improved reward-shaping and meta-regularization, and scalable symbolic abstraction discovery (Azran et al., 2023, Ardon et al., 2024). For LLMs, transfer-based reasoning rewards are enabling improved robustness, sample efficiency, and process-level alignment (Lee et al., 9 Feb 2026).

7. Representative RLTR Algorithms and Prototypical Procedures

Selected RLTR methods exemplify the general paradigm through distinctive workflows:

  • RaMP (Chen et al., 2023): Offline pretrain Q-cumulant networks over random features; at adaptation, perform ridge regression to fit a new reward as a linear combination of features; plan using MPC atop the implicit model.
  • Operator Deep Q-Learning (Tang et al., 2022): Train a neural operator to approximate the value mapping for arbitrary rewards; at test time, input a new reward vector and act greedily via the predicted $Q_{*, r^*}$, with no gradient steps needed.
  • HMRL (Hua et al., 2020): Meta-learn a cross-task embedding and potential, meta-optimize policy using shaped returns, and fine-tune (or even zero-shot transfer) to novel sparse-reward tasks.
  • Ordinal-RLTR (Li et al., 2022): Learn metric-aligned reward via triplet embedding; at test time, re-compute the prototype on a small support set, and adapt policy via policy-gradient steps.
  • Potential-based reward shaping with credit assignment (Ferret et al., 2019): Offline supervised learning of attention-based credit assignment; at transfer, use computed potentials for policy-invariant reward reshaping.
  • Ensemble-based IRL (DARL) (Luo et al., 2022): Dynamics-agnostic reward learning ensures policy and transition-invariant reward recovery, suitable for imitation and robust transfer.

References

  • "Self-Supervised Reinforcement Learning that Transfers using Random Features" (Chen et al., 2023)
  • "HMRL: Hyper-Meta Learning for Sparse Reward Reinforcement Learning Problem" (Hua et al., 2020)
  • "Learning Transferable Reward for Query Object Localization with Policy Adaptation" (Li et al., 2022)
  • "Student/Teacher Advising through Reward Augmentation" (Reid, 2020)
  • "Rethinking Adversarial Inverse Reinforcement Learning: Policy Imitation, Transferable Reward Recovery and Algebraic Equilibrium Proof" (Zhang et al., 2024)
  • "Contextual Pre-planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning" (Azran et al., 2023)
  • "Operator Deep Q-Learning: Zero-Shot Reward Transferring in Reinforcement Learning" (Tang et al., 2022)
  • "Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning" (Schlaginhaufen et al., 2024)
  • "Decoupling Dynamics and Reward for Transfer Learning" (Zhang et al., 2018)
  • "Deep Transfer Q-Learning for Offline Non-Stationary Reinforcement Learning" (Chai et al., 8 Jan 2025)
  • "FORM: Learning Expressive and Transferable First-Order Logic Reward Machines" (Ardon et al., 2024)
  • "Self-Attentional Credit Assignment for Transfer in Reinforcement Learning" (Ferret et al., 2019)
  • "Learning Multi-Task Transferable Rewards via Variational Inverse Reinforcement Learning" (Yoo et al., 2022)
  • "Beyond Correctness: Learning Robust Reasoning via Transfer" (Lee et al., 9 Feb 2026)
  • "Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble" (Luo et al., 2022)
  • "Centralized Reward Agent for Knowledge Sharing and Transfer in Multi-Task Reinforcement Learning" (Ma et al., 2024)