
Reward-Conditioned Policies

Updated 4 February 2026
  • Reward-conditioned policies are supervised approaches in RL that explicitly condition actions on reward signals such as return-to-go or advantage.
  • They leverage data tuples (state, action, reward) to perform maximum likelihood estimation, smoothly integrating reinforcement and imitation strategies.
  • These policies enable scalable tuning across domains from robotics to language model tool use, while addressing challenges like reward design sensitivity.

Reward-conditioned policies (RCPs) define a supervised learning paradigm within reinforcement learning (RL) wherein the policy is explicitly conditioned on past, target, or desired reward signals. Rather than optimizing for maximal returns via classical value learning or policy gradients, RCPs leverage data tuples of trajectory, state, or action paired with reward (or advantage) labels to learn a direct mapping from context and return to behavior. This framework smoothly interpolates between reinforcement and imitation learning and forms the basis for scalable and tunable policy classes in both standard Markov decision processes (MDPs) and settings such as LLM tool use, bandits, and robotics (Kumar et al., 2019, Xu et al., 2024, Wei et al., 2021, Akbulut et al., 2020, Zhong et al., 3 Feb 2026).

1. Formal Definition and Mathematical Framework

The canonical objective in reward-conditioned policy learning is to parameterize a policy $\pi_\theta(a \mid s, Z)$, where $a$ is an action, $s$ is a state (or context), and $Z$ is a reward-related conditioning variable, such as return-to-go, total trajectory reward, or advantage. At training time, rollout- or batch-based data $\mathcal{D} = \{(s, a, Z)\}$ are used to fit $\theta$ by standard negative log-likelihood minimization:

L(θ)=E(s,a,Z)D[logπθ(as,Z)]L(\theta) = -\mathbb{E}_{(s,a,Z)\sim \mathcal{D}} [\log \pi_\theta(a \mid s, Z)]

This is in contrast to policy gradient or Q-learning, which optimize RL-specific objectives using value functions or Monte Carlo estimates. For simple bandit cases, this reduces to $\pi_\theta(a \mid r)$, i.e., learning the conditional distribution of actions given observed reward $r$ (Xu et al., 2024).
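For a tabular policy, this maximum-likelihood fit has a closed form: the empirical conditional action frequencies. A minimal sketch on synthetic bandit data (the arm payoffs and uniform behavior policy are illustrative assumptions, not from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed bandit with binary rewards; arm 2 pays off most often.
true_p = np.array([0.2, 0.5, 0.8])
actions = rng.integers(0, 3, size=5000)          # behavior policy: uniform
rewards = rng.binomial(1, true_p[actions])       # observed rewards r in {0, 1}

# Tabular MLE for pi_theta(a | r): for a categorical policy, minimizing the
# NLL in closed form yields exactly the per-reward action frequencies.
counts = np.zeros((2, 3))
np.add.at(counts, (rewards, actions), 1)
pi = counts / counts.sum(axis=1, keepdims=True)

# Conditioning on high reward (r = 1) should favor the high-paying arm.
print(pi[1])  # action distribution given r = 1; arm 2 carries the largest mass
```

Conditioning on $r = 1$ recovers a policy tilted toward the best arm without any value function or policy gradient, which is the core mechanic of the RCP objective.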

The theoretical justification arises from constrained policy improvement. For instance, conditioning on $Z$ is equivalent to solving:

$$\max_{\pi}\; \mathbb{E}_{\tau, Z \sim p_\pi}[Z] \quad \text{s.t.} \quad \mathrm{KL}\big(p_\pi(\tau, Z) \,\|\, p_\mu(\tau, Z)\big) \leq \epsilon$$

The Lagrangian solution, with $p_\mu$ the trajectory distribution of the behavior (data-collecting) policy and $\beta$ a temperature determined by the constraint, yields an optimal joint distribution with exponential reweighting over $Z$:

$$p_{\pi^*}(\tau, Z) \propto p_\mu(\tau, Z)\, \exp(Z / \beta)$$

(Kumar et al., 2019).
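The exponential form follows in one step from the Lagrangian of the KL-constrained objective; a sketch, suppressing normalization constants:

```latex
% Lagrangian over the joint distribution p = p_\pi(\tau, Z), multiplier \beta \ge 0:
\mathcal{L}(p, \beta) = \mathbb{E}_{p}[Z] - \beta \big( \mathrm{KL}(p \,\|\, p_\mu) - \epsilon \big)
% Stationarity in p (with an additional multiplier enforcing normalization) gives
% \log p(\tau, Z) = \log p_\mu(\tau, Z) + Z / \beta + \text{const}, hence
p_{\pi^*}(\tau, Z) \propto p_\mu(\tau, Z) \, \exp(Z / \beta)
```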

2. Variants and Inference Procedures

Reward-conditioned policies support several instantiations and extensions. In MDP settings, variants include:

  • Return-conditioned (RCP-R): $Z$ is the return-to-go at timestep $t$.
  • Advantage-conditioned (RCP-A): $Z$ is the advantage $Q(s,a) - V(s)$.
  • Discrete reward tokens: instructing LLMs with special tokens such as <|high_reward|> or <|low_reward|>, enabling trajectory generation at distinct quality levels (Zhong et al., 3 Feb 2026).

For multi-armed bandits (MABs), inference in RCPs typically requires marginalization over $r$: $\pi^\dagger(a) = \sum_r \pi_\theta(a \mid r)\, q(r)$, where $q(r)$ is a target reward distribution. More advanced approaches instead learn a signed weight function $w_\phi(r)$ with $\sum_r w_\phi(r) = 1$; allowing negative weights subtracts out low-reward behaviors and yields more distinct and effective exploitation policies (Xu et al., 2024).
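A small numeric sketch of both marginalization modes. The conditional table and the clip-and-renormalize projection for signed mixtures are illustrative assumptions, not the cited paper's exact procedure:

```python
import numpy as np

# Assumed conditional policy pi_theta(a | r) for a 3-armed bandit with
# binary rewards r in {0, 1} (illustrative numbers only).
pi_given_r = np.array([
    [0.5, 0.3, 0.2],   # pi(a | r = 0): arms chosen when reward was low
    [0.2, 0.3, 0.5],   # pi(a | r = 1): arms chosen when reward was high
])

def marginalize(weights):
    """pi_dagger(a) = sum_r w(r) pi(a | r); negative weights subtract out
    low-reward behavior. Clipping and renormalizing to a valid distribution
    is one simple projection choice, assumed here for illustration."""
    mix = weights @ pi_given_r
    mix = np.clip(mix, 0.0, None)
    return mix / mix.sum()

# Plain marginalization against a target reward distribution q(r).
plain = marginalize(np.array([0.1, 0.9]))
print(plain)

# Signed weights summing to 1: subtracting the low-reward conditional
# sharpens the policy toward the best arm.
signed = marginalize(np.array([-1.0, 2.0]))
print(signed)
```

The signed mixture concentrates strictly more probability on the best arm than any nonnegative mixture of the two conditionals can, which is the geometric intuition behind signed-weight marginalization.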

In reward-parameterized RL, policies $\pi_\phi(a \mid s, \theta)$ condition on a vector describing each axis of the reward function, normalized as $c = M(\theta_{-1}) \in [-1,1]^{k-1}$; the agent receives $[o_t, c]$ as input (Wei et al., 2021). This parameterization generalizes quickly across reward configurations.
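As a concrete (assumed) instance of this conditioning interface, with max-abs normalization standing in for $M$ and the last reward axis fixed as the normalization anchor:

```python
import numpy as np

def condition_vector(theta):
    """Map k reward-function weights to c in [-1, 1]^{k-1}.
    Dropping the last axis and max-abs scaling is one choice of M,
    assumed here for illustration."""
    t = np.asarray(theta, dtype=float)[:-1]
    return t / np.abs(t).max()

obs = np.array([0.3, -1.2, 0.7])             # o_t, observation at time t
c = condition_vector([2.0, -1.0, 0.5, 1.0])  # reward weights theta, k = 4
policy_input = np.concatenate([obs, c])      # agent receives [o_t, c]
print(policy_input.shape)                    # (6,) = obs dim 3 + (k - 1)
```

At test time, sweeping $c$ lets a single trained network be queried under many reward configurations without retraining.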

For population-based trajectory optimization, RC-NMP (Akbulut et al., 2020) uses latent variable models to generate trajectories given context and reward, combined with evolutionary sampling and crossover in latent space.

3. Algorithmic Frameworks

The generalized RCP algorithmic workflow involves:

  • Data relabeling: Each trajectory (or action selection) is relabeled with observed or target rewards/advantages.
  • Conditional supervised learning: maximum likelihood estimation for $\pi_\theta(a \mid s, Z)$ or $\pi_\theta(a \mid r)$.
  • Target value/model update: track empirical distributions for $Z$, re-weighted toward higher returns, e.g., via $q(Z) \propto p(Z)\exp(Z/\beta)$.
  • Inference marginalization: at test time, select or average over the desired target return $Z$ or reward $r$.
  • Optional decoupled inference policy: learn or optimize weights $w_\phi(r)$ for compositional marginalization (Xu et al., 2024).
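The steps above can be sketched end-to-end on a toy single-state problem. The tabular parameterization, binarized returns, and all constants are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(1)

N_ACTIONS, BETA, EPOCHS = 4, 0.5, 30
true_mean = np.array([0.1, 0.4, 0.6, 0.9])     # hidden per-arm success rates
counts = np.ones((2, N_ACTIONS))               # tabular pi_theta(a | Z), Laplace-smoothed

for _ in range(EPOCHS):
    # 1. Rollout: sample actions conditioned on a target Z.
    z_target = rng.integers(0, 2, size=200)
    probs = counts / counts.sum(axis=1, keepdims=True)
    a = np.array([rng.choice(N_ACTIONS, p=probs[z]) for z in z_target])
    # 2. Relabel each action with its *observed* binarized return Z.
    z_obs = rng.binomial(1, true_mean[a])
    # 3. Conditional supervised update (tabular MLE = count increments).
    np.add.at(counts, (z_obs, a), 1)

# 4. Target-distribution update: reweight the empirical p(Z) toward high
#    returns, q(Z) proportional to p(Z) * exp(Z / beta).
p_z = np.bincount(z_obs, minlength=2) / len(z_obs)
q_z = p_z * np.exp(np.arange(2) / BETA)
q_z /= q_z.sum()

# 5. Inference: marginalize the conditional policy against q(Z).
pi_final = q_z @ (counts / counts.sum(axis=1, keepdims=True))
print(pi_final.argmax())  # the best arm (index 3) should dominate
```

Note that step 3 never uses a value function or gradient estimator; improvement comes entirely from relabeling and the reweighted target distribution in step 4.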

In domains with sparse, multimodal, or non-stationary rewards, population-based RCP strategies initialize with demonstrations (or random exploration), progressively increase the conditioned target reward $R^*$, and generate diverse trajectory sets via stochastic latent sampling and evolutionary operators (Akbulut et al., 2020).
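A compact population-style sketch of this loop. The toy latent "decoder" and the specific selection, crossover, and mutation operators are assumptions for illustration, not RC-NMP itself:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for a latent trajectory decoder: reward peaks at z = 1.5.
def decode_reward(z):
    return -np.sum((z - 1.5) ** 2, axis=1)

pop = rng.normal(size=(64, 4))                        # latent population
for _ in range(50):
    fitness = decode_reward(pop)
    r_star = np.quantile(fitness, 0.75)               # progressively higher target R*
    elite = pop[fitness >= r_star]                    # latents meeting the target
    parents = elite[rng.integers(0, len(elite), (64, 2))]
    mask = rng.random((64, 4)) < 0.5                  # uniform crossover in latent space
    pop = np.where(mask, parents[:, 0], parents[:, 1])
    pop += 0.1 * rng.normal(size=pop.shape)           # stochastic latent sampling

print(decode_reward(pop).max())  # approaches the toy optimum of 0
```

Raising $R^*$ as a population quantile keeps the elite set non-empty while steadily tightening the reward condition, which is the curriculum effect the text describes.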

For LLMs in tool-using benchmarks, the RC-GRPO algorithm (Zhong et al., 3 Feb 2026) combines reward-conditioned trajectory pretraining with group-variance preserving policy optimization, maintaining high within-group reward variance by actively sampling diverse reward tokens at rollout time, thus mitigating the gradient collapse observed in standard relative policy optimization.

4. Empirical Evaluations

Reward-conditioned policies have been evaluated in a range of domains:

  • Continuous control (MuJoCo tasks): RCP-A matches or outperforms TRPO/PPO and batch-AWR baselines in online/offline settings, and generalizes well when conditioned on out-of-distribution returns, though it still lags soft actor-critic (SAC) in final performance (Kumar et al., 2019).
  • Multi-armed bandits: RCPs using “SubMax” (compositional marginalization via positive differences only) outperform classical algorithms such as UCB1 and Thompson sampling for large K or sparse, delayed rewards. Generalized marginalization further improves cumulative regret and adapts to combinatorial action spaces (Xu et al., 2024).
  • Multi-turn LLM agents: RC-GRPO yields substantial accuracy gains on the BFCLv4 benchmark (e.g., 85% for Qwen2.5-7B-Instruct with RCTP-FT+RC-GRPO vs. 48.75% with SFT+GRPO and 61.25% for the best closed API model), and preserves within-group advantage variance at low entropy, confirming that reward diversity is injected through token conditioning, not stochastic sampling (Zhong et al., 3 Feb 2026).
  • Robotic movement generation: RC-NMP achieves order-of-magnitude sample efficiency gains, multi-modal trajectory diversity, and robust real-world execution in manipulation and obstacle avoidance, compared to movement-primitive RL baselines (Akbulut et al., 2020).

5. Theoretical Analysis and Guarantees

Reward conditioning, by construction, prevents the collapse of the learning signal in both supervised and RL settings with sparse or binary rewards. For example, in RC-GRPO the within-group variance is analytically lower-bounded whenever the expected rewards under distinct conditioning tokens differ by at least $\epsilon > 0$:

E[σg2](G1)1p(1p)ϵ2>0\mathbb{E}[\sigma_g^2] \geq (G-1)^{-1}p(1-p)\epsilon^2 > 0

This mechanism maintains non-vanishing policy gradients even when policies become sharply peaked on optimal demonstrations (Zhong et al., 3 Feb 2026). In bandits, the use of signed weight functions for marginalization allows the RCP to explicitly “subtract” low-reward behaviors, geometrically forming distinct optimal mixtures (Xu et al., 2024).
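The bound can be checked numerically in the simplest setting consistent with it. The setup below is my assumed interpretation: $G$ is the rollout group size, $p$ the probability of sampling the high-reward token, and rewards are deterministic given the token, so the mean reward gap is exactly $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(2)

G, p, eps, trials = 8, 0.5, 1.0, 20000
tokens = rng.random((trials, G)) < p          # high-reward token sampled w.p. p
rewards = tokens * eps                        # reward gap between tokens = eps
sigma2 = rewards.var(axis=1, ddof=1)          # within-group sample variance
bound = p * (1 - p) * eps**2 / (G - 1)

# Empirical mean within-group variance comfortably clears the lower bound.
print(sigma2.mean() >= bound)  # True
```

In this deterministic-reward case the expected sample variance is $p(1-p)\epsilon^2$, a factor of $G-1$ above the bound, so the bound is loose but strictly positive, which is all the non-vanishing-gradient argument requires.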

In RL, reward-conditioned policies correspond to a supervised maximum-likelihood projection of the exponentiated-reward reweighted trajectory distribution under a KL constraint, formalizing their connection to generalized advantage-weighted regression (Kumar et al., 2019).

6. Extensions, Limitations, and Future Directions

Reward conditioning offers a conceptual and practical alternative to both classic RL and imitation learning, but presents unique limitations:

  • Conditioning collapse: when conditioning on out-of-support rewards, policy behavior can become unstable; extrapolation beyond observed $Z$ or $r$ requires either richer data or careful weight learning (Kumar et al., 2019, Akbulut et al., 2020).
  • Reward design sensitivity: The utility of RCPs depends on informative, well-distributed rewards, particularly with direct reward tokens or parameter vectors (Wei et al., 2021, Zhong et al., 3 Feb 2026).
  • Scaling: For high-dimensional observations and very long horizons, reward-conditioned approaches may underperform methods optimized for value-based credit assignment, especially in the absence of effective target distribution adaptation (Kumar et al., 2019).
  • Inference-time optimization: Decoupling inference and training, as in generalized marginalization or reward parameter search, allows post-hoc tailoring, but full theoretical regret guarantees remain an open area (Xu et al., 2024).

Recent work posits natural extensions to continuous-action spaces, hierarchical reward structures, and integrating reward-conditioning with RL-specific variance reduction techniques and exploration regularizers. Population-based and evolutionary variants exploit RC policy classes for highly multimodal and sparse-reward robotic or language-generation tasks (Akbulut et al., 2020, Zhong et al., 3 Feb 2026).

7. Summary Table: Representative RCP Methods and Domains

Method/Domain | Conditioning Mechanism | Notable Features
RCP-R/A (Kumar et al., 2019) | Return/advantage-to-go | Supervised MLE, competitive on RL tasks
RC-GRPO (Zhong et al., 3 Feb 2026) | Discrete reward tokens | Prevents gradient collapse, LLM RL
Generalized marginalization (Xu et al., 2024) | Signed weights on rewards | Improved MAB regret, compositional
RC-NMP (Akbulut et al., 2020) | Reward-latent conditioning | Evolutionary trajectory optimization
cDRL (Wei et al., 2021) | Reward-parameter vectors | Hindsight tweaking with fast search

Reward-conditioned policies constitute a flexible supervised RL paradigm, subsuming and extending standard RL, bandit, imitation, and population-based learning with explicit return- or reward-variable control and robustness to reward structure and data modality.
