Multi-Reward Reinforcement Learning (MRRL)
- MRRL is a framework that extends traditional RL by using vector rewards to handle multiple, often conflicting, objectives in a single decision process.
- It employs various scalarization techniques—linear, non-linear, and dynamic—to convert multi-dimensional rewards into a single actionable signal, enabling adaptive policy control.
- MRRL methods have shown enhanced sample efficiency and performance improvements in diverse domains, from continuous control to dialogue systems and language generation.
Multi-Reward Reinforcement Learning (MRRL) is a branch of reinforcement learning (RL) focusing on optimization, modeling, and control when the agent’s objective incorporates multiple, often conflicting, reward signals. Unlike standard RL, which targets a single scalar reward, MRRL explicitly formulates tasks with vector-valued, decomposable, or joint reward structures. This class encompasses linearly-weighted multi-objective RL, distributional methods for multi-dimensional returns, non-linear reward aggregation, and dynamic reward-weight adaptation.
1. Formal Problem Definitions and Core MRRL Frameworks
In MRRL, the Markov Decision Process is extended so that the reward function returns a vector of distinct components rather than a scalar. Each component typically represents a separate criterion (e.g., task success, time cost, safety constraint, fairness):
- Vector reward: $\mathbf{r}(s, a) = \big(r_1(s, a), \ldots, r_K(s, a)\big) \in \mathbb{R}^K$
- Policy: $\pi(a \mid s)$ (or, in parameterized MRRL, $\pi(a \mid s, \mathbf{w})$ for a weight vector $\mathbf{w}$)
- Long-term return vector: $\mathbf{J}^{\pi} = \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, \mathbf{r}(s_t, a_t)\right]$

Common MRRL approaches transform this vector reward into a scalar via a scalarization function, most often a linear combination $r_{\mathbf{w}}(s, a) = \mathbf{w}^{\top} \mathbf{r}(s, a)$, where $\mathbf{w} \in \Delta^{K-1}$ (the probability simplex), allowing dynamic trade-offs between objectives.
Alternatively, the objective may be a non-linear function of the expected long-term returns, or one may seek to model the joint distribution of returns across multiple reward channels.
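As a concrete illustration, a minimal sketch (with made-up action rewards) of how linear scalarization turns one vector-reward problem into different scalar problems as the weight vector varies:

```python
import numpy as np

# Hypothetical 3-action example: each action yields a 2-component reward
# (task progress, fuel cost). The numbers are illustrative only.
rewards = np.array([
    [1.0, -0.8],   # aggressive: fast but fuel-hungry
    [0.6, -0.3],   # moderate
    [0.2, -0.05],  # conservative: slow but cheap
])

def scalarize(r_vec, w):
    """Linear scalarization r_w = w . r with w on the probability simplex."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return r_vec @ w

# A progress-focused weighting prefers the aggressive action...
print(np.argmax(scalarize(rewards, [0.9, 0.1])))  # -> 0
# ...while a fuel-focused weighting prefers the conservative one.
print(np.argmax(scalarize(rewards, [0.1, 0.9])))  # -> 2
```

Changing $\mathbf{w}$ changes the greedy action without changing the environment, which is exactly the trade-off knob scalarized MRRL exposes.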
2. Algorithmic Approaches and Scalarization Strategies
Linearly-Scalarized Policy Learning
Methods such as the extended HER-based MRRL (Friedman et al., 2018) and MO-GPSARSA (Ultes et al., 2017) employ a linearly-weighted combination of vector rewards. The key innovations are:
- Treating reward weights as inputs to the policy and/or value function.
- Data augmentation via sampling auxiliary weights per transition and storing multiple versions of experiences in the replay buffer.
- Off-policy learning (e.g., DDPG-style) with modified Bellman updates in which the target is evaluated under the sampled weight vector: $Q(s, a, \mathbf{w}) \leftarrow r_{\mathbf{w}}(s, a) + \gamma\, Q\big(s', \pi(s', \mathbf{w}), \mathbf{w}\big)$
- Networks are conditioned on both the state $s$ and the weight vector $\mathbf{w}$.
This paradigm enables a single policy to generalize over all possible trade-offs between reward components, allowing for real-time adjustment at deployment without retraining (Friedman et al., 2018).
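The weight-augmentation step can be sketched as follows; the function name and the Dirichlet proposal for auxiliary weights are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_transition(s, a, r_vec, s_next, w_behavior, n_aux=4):
    """Store one environment transition under its behavior weights plus
    n_aux auxiliary weight vectors sampled uniformly from the simplex
    (HER-style reuse). The scalar training reward w . r_vec is recomputed
    per stored copy, so each copy teaches the weight-conditioned critic a
    different trade-off from the same experience."""
    k = len(r_vec)
    weights = [np.asarray(w_behavior)] + list(rng.dirichlet(np.ones(k), size=n_aux))
    return [(s, a, float(np.dot(w, r_vec)), s_next, w) for w in weights]

batch = augment_transition(s=0, a=1, r_vec=np.array([1.0, -0.5]),
                           s_next=2, w_behavior=np.array([0.5, 0.5]))
# One original + four auxiliary copies of the same transition
print(len(batch))  # -> 5
```

Each tuple would then be pushed to the replay buffer, and the critic update above consumes the stored $\mathbf{w}$ alongside the state.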
Non-linear and Arbitrary Aggregation
MRRL also covers cases where the objective is an arbitrary (often concave) function $f(\mathbf{J}^{\pi})$ of the long-term average reward vector. In such settings, classical dynamic programming is not directly applicable due to non-additivity. (Agarwal et al., 2019) formulates the problem as a convex program over steady-state occupancy measures and presents both model-based (posterior sampling + convex optimization) and model-free (joint policy-gradient) algorithms with theoretical regret bounds, specifically for optimizing arbitrary concave, $L$-Lipschitz functions.
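A toy example of why concave aggregation breaks additivity: under the proportional-fairness aggregate $f(\mathbf{J}) = \sum_k \log J_k$, a randomized policy can strictly beat every deterministic one (single-state setting, illustrative numbers):

```python
import numpy as np

# Two actions with vector rewards; each action favors one objective.
r = np.array([[1.0, 0.1],   # action 0 favors objective 1
              [0.1, 1.0]])  # action 1 favors objective 2

def f(theta):
    """Concave aggregate sum_k log J_k of the average reward vector J
    under the policy that picks action 0 with probability theta."""
    J = theta * r[0] + (1 - theta) * r[1]
    return float(np.sum(np.log(J)))

# Both deterministic policies score log(1.0) + log(0.1):
print(round(f(0.0), 3), round(f(1.0), 3))  # -> -2.303 -2.303
# An evenly mixed policy is strictly better, so the optimum is randomized
# and greedy, additive dynamic programming cannot recover it:
print(f(0.5) > f(0.0))  # -> True
```

This is the non-additivity that motivates optimizing over occupancy measures rather than per-step greedy choices.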
Dynamic Reward Mixture Adjustment
For settings where the optimal mixing weights between reward components are unknown and may need to change during training, bandit-driven adaptive mixtures have been proposed. Methods such as DynaOpt and C-DynaOpt (Min et al., 2024) employ non-contextual (Exp3) and contextual (Online Cover) bandits to adapt reward weights online:
- The combined reward for a sample is $r = \sum_k w_k\, r_k$, with the weight vector chosen by the bandit.
- Weight updates are guided by bandit feedback, modulating the inner RL policy's updates.
- Bandit selection may be either stateless (Exp3) or context-sensitive (conditioning on the generation context and recent reward statistics via Online Cover).
This dynamic adjustment systematically outperforms static uniform or round-robin weighting in language generation tasks with multiple objectives (Min et al., 2024).
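A minimal Exp3-style weight selector in the spirit of the non-contextual variant; the candidate menu, exploration rate, and toy feedback signal are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

class Exp3WeightSelector:
    """Exp3 bandit over a discrete menu of reward-weight vectors: sample a
    mixture, observe a scalar metric gain, and reweight the menu online."""
    def __init__(self, candidates, gamma=0.1, seed=0):
        self.candidates = candidates          # list of weight vectors
        self.gamma = gamma                    # exploration rate
        self.log_w = np.zeros(len(candidates))
        self.rng = np.random.default_rng(seed)

    def probs(self):
        w = np.exp(self.log_w - self.log_w.max())
        return (1 - self.gamma) * w / w.sum() + self.gamma / len(w)

    def select(self):
        p = self.probs()
        i = self.rng.choice(len(p), p=p)
        return i, self.candidates[i]

    def update(self, i, reward):
        # Importance-weighted reward estimate keeps the update unbiased.
        p = self.probs()
        self.log_w[i] += self.gamma * (reward / p[i]) / len(self.log_w)

menu = [np.array([0.8, 0.2]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
bandit = Exp3WeightSelector(menu)
for _ in range(500):
    i, w = bandit.select()
    # Toy feedback: pretend mixtures weighting objective 2 give larger gains.
    bandit.update(i, reward=float(w[1]))
print(np.argmax(bandit.probs()))  # -> 2
```

In an actual training loop, the `reward` fed to `update` would be the observed change in the target metric after an RL update under the selected mixture.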
3. Distributional Methods and Multi-Reward Decomposition
Distributional MRRL approaches retain the return distribution structure of each reward signal, enabling richer modeling of joint reward statistics and credit assignment.
- Multi-Dimensional Distributional DQN (MD3QN) (Zhang et al., 2021) models the joint distribution over long-term returns for all reward sources. The empirical loss uses Maximum Mean Discrepancy (MMD) between sampled return vectors and Bellman targets. Contraction of the joint distributional Bellman operator in Wasserstein distance is established, and learning matches or outperforms scalar and per-head baselines when rewards exhibit complex dependencies.
- Distributional Reward Decomposition for RL (DRDRL) (Lin et al., 2019) learns parallel return distributions (one per hypothesized reward channel) and reconstructs the full return via convolution. A KL-based disentanglement penalty encourages networks to specialize per sub-channel while avoiding trivial decompositions.
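The MMD matching used for joint-return distributions can be sketched generically; the Gaussian kernel and fixed bandwidth here are illustrative choices, not the paper's exact estimator:

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Squared Maximum Mean Discrepancy between two sets of sampled return
    vectors (one row per sample), using a Gaussian kernel. Small values mean
    the two sample sets are distributionally close."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
pred = rng.normal(0.0, 1.0, size=(128, 2))     # predicted joint-return samples
target = rng.normal(0.0, 1.0, size=(128, 2))   # Bellman-target samples
shifted = target + np.array([2.0, 0.0])        # a mismatched distribution

# Matching distributions score lower than mismatched ones:
print(mmd2(pred, target) < mmd2(pred, shifted))  # -> True
```

Minimizing such a discrepancy between predicted and target return samples is what drives the joint-distribution critic toward Bellman consistency.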
The following table summarizes several MRRL algorithmic paradigms:
| Approach | Reward Handling | Policy Input |
|---|---|---|
| Scalarized DDPG (MRRL) (Friedman et al., 2018) | Linear weighted sum | $(s, \mathbf{w})$ |
| MO-GPSARSA (Ultes et al., 2017) | Linear weighted sum | $(s, \mathbf{w})$ |
| MD3QN (Zhang et al., 2021) | Joint return distribution | $s$ |
| DRDRL (Lin et al., 2019) | Parallel distributional (by channel) | $s$ |
| DynaOpt/C-DynaOpt (Min et al., 2024) | Dynamic linear mixture (language generation) | $s$ |
| Nonlinear joint optimization (Agarwal et al., 2019) | General concave $f$ of average returns | $s$ |
4. Experimental Domains, Empirical Findings, and Metrics
- Continuous Control: Double-integrator problems (1D/2D) with time and fuel rewards. The MRRL policy smoothly tracks analytic Pareto fronts, interpolating between time- and fuel-optimal behavior (Friedman et al., 2018).
- Spoken Dialogue Systems: Balancing dialogue success and length in statistical SDS across six domains. MO-GPSARSA is sample-efficient, matching single-objective RL with roughly 3x fewer samples, and its performance curves vary smoothly as a function of the reward weight (Ultes et al., 2017).
- Fairness and Resource Allocation: Scheduling and queueing environments with up to 8 reward channels. Model-based and policy-gradient MRRL are superior to classical DQN and vanilla policy gradients for proportional-fairness and $\alpha$-fairness objectives (Agarwal et al., 2019).
- Atari and Pixel-based Tasks: MD3QN and DRDRL improve both representation of reward dependencies and control performance over scalar and independent RL baselines, particularly when joint statistics or reward decomposition informs strategy (Zhang et al., 2021, Lin et al., 2019).
- Language Generation: DynaOpt/C-DynaOpt increase target metrics (e.g., reflection quality) while maintaining other qualities (fluency, coherence) relative to uniform or alternate reward training (Min et al., 2024).
Metrics vary by domain: task-specific returns, Pareto distance to analytic solutions, joint distribution quality (MMD error), dialogue success rate, task length, sample efficiency, and human evaluation scores.
5. Practical Considerations, Insights, and Limitations
Several implementation insights and limitations have emerged:
- A proper data-augmentation rate (the number of auxiliary weight vectors sampled per transition) is crucial; too little harms generalization, too much slows learning (Friedman et al., 2018).
- Uniform sampling of weights in the simplex achieves broad coverage; for scalarization-based methods, post-training test-time adaptation is possible with a universal policy (Friedman et al., 2018, Ultes et al., 2017).
- Bandit-based dynamic mixtures shift the reward emphasis as training progresses, avoiding manual tuning (Min et al., 2024).
- Factored distributional models (DRDRL) can outperform joint models for practical sample complexity reasons if reward channels are only weakly coupled (Lin et al., 2019).
- For arbitrary non-linear reward aggregation, convex optimization over steady-state occupancy enables tractable solutions and theoretical regret analysis, but scaling to large state-action spaces may require function approximation and loses regret guarantees (Agarwal et al., 2019).
- There is no single best MRRL paradigm; the structure of reward correlations, the functional form of the objective, and application constraints determine the most suitable approach.
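The uniform simplex sampling mentioned above corresponds to a Dirichlet(1, ..., 1) draw; a quick check:

```python
import numpy as np

# Dirichlet with all-ones concentration is the uniform distribution over
# the probability simplex: nonnegative weights summing to 1.
rng = np.random.default_rng(0)
K = 3  # number of reward components (illustrative)
w = rng.dirichlet(np.ones(K), size=10000)

print(np.allclose(w.sum(axis=1), 1.0))                   # -> True
print(np.allclose(w.mean(axis=0), 1.0 / K, atol=0.02))   # -> True
```

Naively sampling each component from Uniform(0, 1) and normalizing would instead concentrate mass toward the simplex center, biasing coverage of extreme trade-offs.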
Open problems include robust generalization in high reward dimensions, reducing the computational overhead of sample-based and kernel methods, principled exploration in multi-reward regimes, and formally linking dynamic bandit-based mixture policies to theoretical optimality guarantees.
6. Broader Interpretations and Future Directions
MRRL subsumes multi-objective RL, reward decomposition, joint distributional modeling, and dynamic reward fusion strategies. The principal advantages are:
- Enabling real-time adaptation to user preferences without retraining (Friedman et al., 2018).
- Efficient exploitation of structure in multi-channel environments (credit assignment, exploration, constraint satisfaction) (Zhang et al., 2021, Lin et al., 2019).
- Sample-efficient optimization of complex, possibly non-linear, metrics (fairness, diversity, safety) (Agarwal et al., 2019).
- Algorithm-independent mechanisms (e.g., bandit-guided reward mixing) for adaptive control over learning trade-offs in generative tasks (Min et al., 2024).
Promising directions include improved adaptive reward mixing via meta-bandits or Bayesian optimization, scalable joint distributional estimation, function approximation for occupancy-measure planning, and generalization to broader classes of non-Markovian or context-dependent objectives.
7. Key References
- "Generalizing Across Multi-Objective Reward Functions in Deep Reinforcement Learning" (Friedman et al., 2018)
- "Reward-Balancing for Statistical Spoken Dialogue Systems using Multi-objective Reinforcement Learning" (Ultes et al., 2017)
- "Distributional Reinforcement Learning for Multi-Dimensional Reward Functions" (Zhang et al., 2021)
- "Distributional Reward Decomposition for Reinforcement Learning" (Lin et al., 2019)
- "Reinforcement Learning for Joint Optimization of Multiple Rewards" (Agarwal et al., 2019)
- "Dynamic Reward Adjustment in Multi-Reward Reinforcement Learning for Counselor Reflection Generation" (Min et al., 2024)