Randomized Reward Conditioning

Updated 23 March 2026

Randomized reward conditioning is a reinforcement learning paradigm that conditions policies on randomly sampled reward parameters to induce behavioral diversity.
It employs techniques such as reward-parameterized policies, noise injection, and replay buffer relabeling to enhance exploration, stability, and sample efficiency.
The approach improves zero-shot adaptation and multi-objective balance, with significant applications in control, path planning, and multi-agent systems.

Randomized reward conditioning refers to a class of reinforcement learning, preference optimization, and control methodologies in which agents, policies, or models are explicitly conditioned on parameters of a randomly sampled or otherwise varying reward specification during training. This paradigm induces diversity of behavior, improves generalization, and enables robust adaptation to new or changing reward objectives using a single parameterized agent or model. Randomized reward conditioning is realized in diverse algorithmic forms: as direct input conditioning on reward targets or parameters, random perturbation of rewards, sampling over reward family distributions, or randomized latent preference dimensions. Across canonical RL, diffusion model alignment, optimal stopping, and model-based planning, this approach now underpins multiple state-of-the-art systems.

1. Formal Definitions and Core Principles

Let an environment be specified as a Markov Decision Process (MDP) or a related control setting. The central idea is to treat the reward function, or its parameters, as a variable that is fed to the policy or model as a conditioning input. The reward input is sampled, randomized, or otherwise varied during training, yielding a distribution of supervision targets. Formal instantiations include:

Reward Parameterization: Let $r_\psi(s,a)$ denote a scalar or vector-valued reward indexed by a vector $\psi$. During each training update, $\psi$ is sampled from a distribution $p(\psi)$, typically over linear weights, mixture coefficients of reward components, goal variables, or even perturbation noise (Nauman et al., 5 Mar 2026).
Conditional Policy/Model: A policy or model is parameterized as $\pi_\theta(a|s,\psi)$ or $f_\theta(x|\psi)$, where $\psi$ is concatenated or injected via architectural mechanisms (e.g., FiLM, cross-attention) throughout the network (Kumar et al., 2019, Jang et al., 11 Dec 2025).
Supervised and RL Objectives: Data $(s,a,\psi)$ are constructed by recomputing rewards, trajectory returns, or preference vectors under the randomized $\psi$. For supervised approaches, the likelihood objective conditions on the sampled $\psi$; for RL/PPO/SAC/DDPG, standard value or policy losses are applied with rewards/targets constructed for the current $\psi$ (Nauman et al., 5 Mar 2026, Kumar et al., 2019).
Distributional Range: Sampling may be from distributions covering feasible task parameters, composite objectives, Gaussian perturbations (with or without annealing), categorical indices, or outcome vectors reflecting multi-axis human preferences (Jang et al., 11 Dec 2025).
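
The definitions above can be illustrated with a minimal sketch. It assumes a linear reward $r_\psi(s,a) = \psi^\top \phi(s,a)$ over precomputed reward components and a flat Dirichlet for $p(\psi)$; both choices, and all function names, are illustrative assumptions rather than the formulation of any cited paper.

```python
import numpy as np

def sample_reward_weights(rng, num_components):
    """Draw psi ~ p(psi); a flat Dirichlet is assumed here for illustration."""
    return rng.dirichlet(np.ones(num_components))

def reward(psi, features):
    """Linear reward r_psi(s, a) = <psi, phi(s, a)> (assumed parameterization)."""
    return float(np.dot(psi, features))

def conditioned_input(state, psi):
    """Concatenate psi to the state so a policy can model pi(a | s, psi)."""
    return np.concatenate([state, psi])

rng = np.random.default_rng(0)
psi = sample_reward_weights(rng, num_components=3)
state = np.zeros(4)                      # hypothetical 4-dimensional state
phi = np.array([1.0, -0.5, 2.0])         # hypothetical reward components phi(s, a)
x = conditioned_input(state, psi)        # network input of size 4 + 3
r = reward(psi, phi)                     # scalar reward under the sampled psi
```

Because the Dirichlet sample lies on the simplex, the resulting reward is a convex combination of the components; other choices of $p(\psi)$ would give different ranges.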

2. Algorithmic Implementations and Variants

Randomized reward conditioning is instantiated via several algorithmic templates, differing in the structure of the reward parameterization, the nature of conditioning, and the surrounding update scheme:

Reward-Conditioned Policies (RCP, RCP-A): Each trajectory is associated with its empirical return $R(\tau)$; the policy $\pi_\theta(a|s,R)$ is trained to reproduce the observed action conditional on $R$, with higher $R$ values sampled during training to encourage generalization (Kumar et al., 2019). The advantage-conditioned variant computes $R_t = (\sum_{t'=t}^T r_{t'}) - \hat V(s_t)$.
Reward-Conditioned RL (RCRL): Experience is collected under a nominal reward, but training repeatedly recomputes rewards for off-policy data under randomized $\psi \sim p(\psi)$, optimizing $\pi_\theta(a|s,\psi)$ and $Q_\theta(s,a,\psi)$ over this distribution. Conditioning is implemented by concatenating $\psi$ or a learned embedding $e_\psi$ to the state input (Nauman et al., 5 Mar 2026).
Reward Randomization for Exploration (RRP, PlanEx): In single-agent RL, additive zero-mean Gaussian or bounded noise is injected into the reward, $r_t^\mathrm{RRP} = r_t + \xi_t$, or parametric reward features are perturbed during planning (PlanEx) (Ma et al., 10 Jun 2025, Wang et al., 2023). The noise scale is annealed over training. Model-based approaches sample a random reward perturbation at each episode and solve the resulting planning task.
Multi-Agent Reward Randomization (RPG): In multi-agent Markov games, a vector of reward weights $z$ is sampled across episodes. Strategic diversity is discovered by learning joint policies $\pi_\theta(\cdot|\cdot,z)$ that absorb a spectrum of Nash equilibria over random perturbations of the reward structure (Tang et al., 2021).
Preference-Conditioned Diffusion Model Alignment (MCDPO): In diffusion model RLHF, discrete preference outcome vectors $c \in \{-1,0,+1\}^K$ (reflecting “which sample wins on each axis”) are injected into the model’s attention stacks. Randomized “reward dropout” further masks axes at each step for balancing (Jang et al., 11 Dec 2025).
Randomized Optimal Stopping: The optimal stopping problem is reframed with a randomized (entropy-regularized) stopping intensity $\pi_t$, resulting in a control problem in which the agent's action is the rate of stopping and a temperature parameter governs the trade-off between exploration and bias (Dong, 2022).
Path Planning with Randomized Goal-Conditioned Rewards: In end-to-end path and locomotion learning, waypoints or goals are random variables; both rewards and policy inputs are conditioned on current and future goals (Blum et al., 2019).
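
The advantage-conditioned target from the RCP-A item above, $R_t = (\sum_{t'=t}^T r_{t'}) - \hat V(s_t)$, can be computed for a whole trajectory in a few lines. This is a minimal sketch: the function name is hypothetical, and the value estimates are assumed to come from some separately trained critic.

```python
import numpy as np

def advantage_targets(rewards, values):
    """Return R_t = (sum_{t'=t}^{T} r_{t'}) - V_hat(s_t) for every step t."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # A reversed cumulative sum gives the undiscounted reward-to-go.
    reward_to_go = np.cumsum(rewards[::-1])[::-1]
    return reward_to_go - values

rewards = [1.0, 0.0, 2.0]
values = [2.5, 1.5, 1.0]      # hypothetical critic estimates V_hat(s_t)
targets = advantage_targets(rewards, values)
# reward-to-go is [3, 2, 2], so targets is [0.5, 0.5, 1.0]
```

Each `targets[t]` would then be supplied as the conditioning value paired with the action taken at step `t`.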

3. Theoretical Properties and Analysis

Randomized reward conditioning has been subject to both algorithmic ablation and formal analysis:

Exploration Guarantees: Random reward perturbation increases output/model variance and thus function and trajectory diversity. Lemma 2.2 in (Ma et al., 10 Jun 2025) shows that training on perturbed targets expands the variance of the agent’s outputs, yielding provable coverage of the state-action space and improved exploration. In model-based PlanEx, reward randomization constructed with confidence scaling provably yields at least a constant fraction of “optimistic” episodes (partial optimism lemma), ensuring $\tilde O(\sqrt K)$ regret without the need for intractable inner maximization over model sets (Wang et al., 2023).
Generalization and Zero-Shot Adaptation: By training on data recomputed or relabeled for a wide range of reward specifications, a single network can represent behaviors for multiple rewards. In RCRL, the agent demonstrates both zero-shot deployment (immediately adapting to a new $\psi$) and rapid few-shot fine-tuning, converging to target behaviors after only a small number of environment steps (Nauman et al., 5 Mar 2026).
Preference Alignment and Conflict Resolution: In multi-dimensional RLHF, conflation of multiple reward axes via a scalarization introduces “reward conflict,” where gradient updates on joint preference pairs degrade performance along certain axes. By explicitly conditioning on outcome vectors and applying reward-dimension dropout, each axis can be optimized in its local direction, producing balanced multi-objective solutions and enabling post-hoc axis selection at inference (Jang et al., 11 Dec 2025).
Sample Efficiency and Stability: Empirical and theoretical results show improved sample efficiency under randomized reward conditioning, especially in scenarios of reward misspecification, sparse rewards, or non-stationarity (Kumar et al., 2019, Nauman et al., 5 Mar 2026, Ma et al., 10 Jun 2025).
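
As a simple illustration of the variance argument in the exploration item above (an elementary identity, not the statement of Lemma 2.2), suppose the perturbation $\xi_t$ has zero mean and variance $\sigma_t^2$ and is assumed independent of the reward $r_t$. Then

$$\mathbb{E}[r_t + \xi_t] = \mathbb{E}[r_t], \qquad \mathrm{Var}[r_t + \xi_t] = \mathrm{Var}[r_t] + \sigma_t^2 .$$

Under these assumptions the perturbed target is unbiased but more dispersed, and the extra term vanishes as $\sigma_t \to 0$.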

4. Practical Architectures and Training Procedures

Architectural conditioning and integration of randomized rewards into training pipelines adopt several systematic components:

Input Conditioning: Reward parameters $\psi$, return values $R$, or outcome vectors $c$ are concatenated to state inputs or embedded via learned projections. FiLM-style multiplicative conditioning (i.e., scaling hidden units by reward-dependent gates) is empirically superior to simple concatenation for scalar-valued conditioning (Kumar et al., 2019).
Cross-Attention Modulation: For diffusion models and image-generation RLHF, outcome-dependent embeddings are injected via parallel or adapter cross-attention branches (Jang et al., 11 Dec 2025).
Replay Buffer Relabeling: Experience is collected under one reward or task, but replayed and recomputed under many sampled reward parameters. This is crucial for off-policy multi-reward learning in RCRL and related frameworks (Nauman et al., 5 Mar 2026).
Randomization Schedules: Noise amplitude or dropout rate may be annealed over training. In RRP annealing, $\sigma_t$ decays to 0, eventually recovering the true reward and ensuring final convergence (Ma et al., 10 Jun 2025).
Training Algorithms: Standard RL backbones (SAC, PPO, DDPG, AWR) and actor-critic with advantage normalization are adapted to support reward conditioning. In preference learning, the objective is symmetrized DPO with outcome masking (Jang et al., 11 Dec 2025).
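
Replay buffer relabeling and noise annealing, as described above, might be sketched as follows. The sketch assumes that reward components $\phi(s,a)$ are stored with each transition so that rewards can be recomputed as $\phi^\top \psi$, and it uses a linear decay for the noise scale; the storage layout, the schedule, and all names are illustrative assumptions, not details taken from the cited papers.

```python
import numpy as np

def relabel_batch(features, psi):
    """Recompute rewards for stored transitions under a newly sampled psi.

    features: array of shape (batch, k) with reward components phi(s, a),
    assumed to have been logged when the transitions were collected.
    """
    return features @ psi

def annealed_sigma(step, total_steps, sigma0):
    """Linearly decay the noise scale from sigma0 to 0 (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return sigma0 * (1.0 - frac)

def perturbed_rewards(rewards, sigma, rng):
    """Add zero-mean Gaussian noise: r + xi with xi ~ N(0, sigma^2)."""
    return rewards + sigma * rng.standard_normal(rewards.shape)

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))       # hypothetical replay-buffer batch
psi = rng.dirichlet(np.ones(3))          # freshly sampled reward weights
rewards = relabel_batch(features, psi)   # rewards under the sampled psi
noisy_start = perturbed_rewards(rewards, annealed_sigma(0, 100, 0.5), rng)
noisy_end = perturbed_rewards(rewards, annealed_sigma(100, 100, 0.5), rng)
```

At the end of the schedule the noise scale is zero, so `noisy_end` equals the unperturbed rewards, mirroring the recovery of the true reward once annealing completes.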

5. Empirical Results and Application Domains

Randomized reward conditioning demonstrates efficacy and generality across domains:

Control and Locomotion: On DeepMind Control Suite, HumanoidBench, and MuJoCo tasks, reward-conditioned agents outperform baselines in sample efficiency, robustness, and transferability. Vision-based variants (DrQv2) also benefit from this approach (Nauman et al., 5 Mar 2026).
Multi-Agent Strategic Diversity: RPG discovers multiple human-interpretable equilibria in temporal trust grid-worlds and real-time games (Monster-Hunt, Agar.io), surpassing policy-gradient baselines and unearthing modes such as “corner-wait”, “joint-hunt”, “sacrifice,” and “perpetual cycles” (Tang et al., 2021).
Path Planning: Random waypoint conditioning enables a single deep neural policy to generalize motion control and path following across arbitrary goal sequences, with performance drops observed when goal randomization or observability is ablated (Blum et al., 2019).
Preference Optimization and Alignment: MCDPO achieves superior performance on multi-objective image alignment (e.g., SD1.5, SDXL), with reward-dropout mediating balance across CLIP, aesthetic, human, and other preference scores. Ablation confirms the necessity of reward conditioning for avoiding axis collapse (Jang et al., 11 Dec 2025).
Optimal Stopping with Exploration: Entropy-regularized, randomized stopping yields “soft” threshold policies and efficient policy iteration convergence, interpolating between exploration and deterministic stopping as the temperature parameter $\lambda$ is annealed (Dong, 2022).

6. Limitations, Extensions, and Theoretical Insights

Recognized limitations and possible extensions include:

Pure Randomization vs. Guided Intrinsic Bonuses: For extremely sparse or challenging tasks, purely random reward perturbation may be insufficient; performance may be further improved by hybridizing with structured intrinsic bonuses or novelty estimation (Ma et al., 10 Jun 2025).
Scalability to Very High-Dimensional Reward Spaces: The practical efficacy of conditioning and dropout mechanisms as the reward space grows remains to be fully mapped. MCDPO demonstrates that reward dropout mitigates domination by “easy” axes, but overabundant conflicting signals could require more advanced disentangling architectures (Jang et al., 11 Dec 2025).
Planning with Partial Optimism: PlanEx achieves near-optimal regret without explicit optimism over model sets, but the scaling with horizon $H$ and dimension $d_\phi$ is suboptimal; tightening these bounds and extending results to heavy-tailed noise models is an open direction (Wang et al., 2023).
Richness of Supervision from Conflict Pairs: In multi-dimensional preference RL, retaining conflict pairs as training input (instead of filtering or scalarizing them) is more beneficial under randomized conditioning, as these provide unique local gradient signals that promote axis-level balance (Jang et al., 11 Dec 2025).

A plausible implication is that randomized reward conditioning will underpin future developments in multi-behavioral, preference-aligned, and transfer-robust agents, especially as environments, tasks, or human-aligned objectives become increasingly compositional and high-dimensional.