Multi-Reward Reinforcement Learning
- Multi-reward reinforcement learning is a paradigm using multiple reward signals to guide agents in balancing diverse objectives and complex task requirements.
- It employs methods like scalarization, reward decomposition, and bandit-driven adaptation to address non-linear and multi-dimensional reward challenges.
- Empirical implementations demonstrate improvements in fairness, transfer learning, and language generation by leveraging intrinsic motivation and structured reward decomposition.
Multi-reward reinforcement learning (MRRL) encompasses frameworks where an agent’s performance is evaluated or optimized using multiple reward signals—either simultaneously or as a latent structure embedded in the environment. This paradigm subsumes classic multi-objective RL, reward decomposition, joint optimization of stochastic and non-linear reward functionals, episodic or task-specific reward variation, and learning from human or bandit feedback that encodes composite preferences. Recent progress in MRRL has been driven by needs in fairness optimization, transfer and multi-task RL, multi-criteria language generation, distributional control, and sample-efficient learning under sparse, structured, or non-Markovian feedback.
1. Formal Models and Taxonomy of Multi-Reward Scenarios
MRRL scenarios can be systematically categorized according to the way multiple rewards enter the agent’s objective:
- Vector-valued reward MDPs: The environment exposes a reward vector $\mathbf{r}(s,a) \in \mathbb{R}^K$ at each transition. The agent’s goal may be to maximize a scalarization $f(\mathbf{J}^\pi)$ of the vector of expected per-objective returns, often subject to concavity, interpretability, or user directives (Agarwal et al., 2019, Friedman et al., 2018).
- Reward decomposition: The true environment reward is assumed to be a sum of latent components, $r(s,a) = \sum_{i} r_i(s,a)$, and the agent may attempt to learn both the decomposition and the control policy, as in distributional or disentangled RL frameworks (Lin et al., 2019, Zhang et al., 2021).
- Reward mixing and partial observability: An unobserved reward function is drawn per episode from a known or unknown mixture (reward-mixing MDPs), resulting in non-Markovian latent dynamics and necessitating specific exploration and identification strategies (Kwon et al., 2021).
- Intrinsically and extrinsically motivated control: Agents simultaneously use extrinsic rewards and independent curiosity-driven or empowerment-based intrinsic signals, often with dynamic or fixed combination rules (Li et al., 2023, Yoo et al., 2022).
- Multi-task and transfer settings: The agent must discover or optimize task-conditional reward functions, policies, and latent intention representations that generalize across tasks or reward decompositions (Yoo et al., 2022).
- Scalarization, constraint, and adaptation approaches: The scalar objective may be a fixed or adaptive linear combination $\sum_i w_i J_i$ (as in bandit-weighted RL (Min et al., 20 Mar 2024)), an explicit non-linear function $f(\mathbf{J}^\pi)$ (e.g., $\alpha$-fairness (Agarwal et al., 2019)), or defined via constraints (e.g., joint threshold satisfaction); a minimal scalarization sketch follows this list.
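To make the vector-reward and scalarization cases concrete, here is a minimal sketch assuming a gymnasium-style environment whose `step` returns a reward *vector*; the environment interface, the fixed weights, and the proportional-fairness utility are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def linear_scalarization(reward_vec, weights):
    """Fixed linear scalarization: sum_i w_i * r_i."""
    return float(np.dot(weights, reward_vec))

def proportional_fairness(returns, eps=1e-8):
    """Concave, non-linear utility over per-objective *returns*: sum_i log J_i.
    Applied to accumulated returns rather than per-step rewards, since the
    objective is non-linear in the averages (returns clamped at eps)."""
    return float(np.sum(np.log(np.maximum(returns, eps))))

def run_episode(env, policy, weights):
    """Roll out one episode, accumulating the vector return and a
    linearly scalarized return usable as a standard scalar RL signal.
    Assumes env.step returns a reward vector (gymnasium-style 5-tuple)."""
    obs, _ = env.reset()
    vector_return = None
    scalar_return = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward_vec, terminated, truncated, _ = env.step(action)
        reward_vec = np.asarray(reward_vec, dtype=float)
        vector_return = reward_vec if vector_return is None else vector_return + reward_vec
        scalar_return += linear_scalarization(reward_vec, weights)
        done = terminated or truncated
    return vector_return, scalar_return
```

The split mirrors the taxonomy: linear scalarization can be applied per step, whereas a non-linear utility such as `proportional_fairness` is only well defined on episode-level (or occupancy-level) returns.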
2. Methodologies for Learning with Multiple Rewards
A variety of methodologies have been developed to address MRRL, influenced by the reward structure and the application domain:
- Weighted linear scalarization: The most basic approach is to maximize a linear combination $\sum_i w_i r_i$, where the weights $w_i$ are fixed or learned (Min et al., 20 Mar 2024, Friedman et al., 2018). Dynamic adjustment of the $w_i$ (via contextual/non-contextual bandits) provides adaptation in non-stationary or data-driven contexts (Min et al., 20 Mar 2024); see the Exp3 sketch after this list.
- Policy gradient for non-linear objectives: For a non-linear objective $f(\mathbf{J}^\pi)$, gradients are computed using the chain rule and Monte-Carlo estimates of per-objective returns, facilitating model-free optimization (Agarwal et al., 2019).
- Reward decomposition and distributional RL: Sub-reward “channels” are encoded with dedicated network heads and disentanglement regularizers, enabling decomposition of aggregate rewards and specialized sub-policies (Lin et al., 2019). Distributional approaches such as MD3QN model the joint return distribution, capturing both the marginal distributions and higher-order correlations (Zhang et al., 2021).
- Pure exploration and moment-matching: When reward models are latent and randomly drawn per episode (reward-mixing MDPs), joint moments of observed rewards are estimated using augmented MDPs and method-of-moments (including LP and SAT solvers) to identify the latent reward parameters, followed by optimized planning (Kwon et al., 2021).
- Bandit-driven adaptation: Bandit algorithms (Exp3, contextual bandits) guide adaptive reward-weight selection in settings where the “importance” of sub-rewards may shift through training. This mitigates the challenge of manual hyperparameter selection and enables low-regret optimization relative to the best static or dynamic linear weighting (Min et al., 20 Mar 2024).
- Curiosity and intrinsic motivation: Intrinsic Curiosity Modules (ICM) and Go-Explore phases supplement environment rewards with intrinsic bonuses for exploration, improving sample efficiency especially under sparse extrinsic signals (Li et al., 2023).
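As a hedged illustration of bandit-driven weight adaptation, in the spirit of the bandit-weighted scheme of (Min et al., 20 Mar 2024) but not a reproduction of it, the sketch below runs standard Exp3 over a discrete menu of candidate weight vectors; the candidate weightings, the $[0,1]$ payoff normalization, and the `evaluate_rollout` hook are illustrative assumptions.

```python
import numpy as np

class Exp3WeightSelector:
    """Exp3 over a discrete menu of candidate reward weightings.

    Each arm is a candidate weight vector; the observed payoff is the
    (normalized to [0, 1]) performance of a rollout under that weighting."""

    def __init__(self, candidate_weights, gamma=0.1):
        self.candidates = [np.asarray(w, dtype=float) for w in candidate_weights]
        self.gamma = gamma                              # exploration rate
        self.log_w = np.zeros(len(self.candidates))     # log arm weights

    def _probs(self):
        w = np.exp(self.log_w - self.log_w.max())
        return (1 - self.gamma) * w / w.sum() + self.gamma / len(w)

    def select(self, rng):
        p = self._probs()
        arm = rng.choice(len(p), p=p)
        return arm, self.candidates[arm]

    def update(self, arm, payoff):
        """Importance-weighted Exp3 update; payoff must lie in [0, 1]."""
        p = self._probs()
        self.log_w[arm] += self.gamma * payoff / (p[arm] * len(p))


def train_with_adaptive_weights(evaluate_rollout, candidate_weights, steps=1000, seed=0):
    """evaluate_rollout(weights) -> payoff in [0, 1]; a hypothetical hook
    standing in for one RL training/evaluation step under those weights."""
    rng = np.random.default_rng(seed)
    selector = Exp3WeightSelector(candidate_weights)
    for _ in range(steps):
        arm, weights = selector.select(rng)
        payoff = float(np.clip(evaluate_rollout(weights), 0.0, 1.0))
        selector.update(arm, payoff)
    return selector
```

Clipping payoffs to $[0,1]$ before the update is not cosmetic: the standard Exp3 regret analysis assumes bounded rewards, which is why the normalization matters.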
3. Theoretical Properties and Guarantees
The theoretical landscape of MRRL is shaped by the complexity of inter-reward dependencies, reward observability, and the chosen optimization paradigm:
- Regret bounds for multi-level feedback: When episodic, multi-level feedback is used and modeled via a categorical (softmax) reward model, UCB-style algorithms achieve sublinear regret (Elahi et al., 20 Apr 2025).
- Joint optimization of non-linear objectives: Model-based approaches solve convex programming relaxations and yield sublinear regret that scales with the Lipschitz constant of the scalarization $f$ and with the MDP diameter. Model-free policy gradient lacks regret guarantees but can maximize non-linear functions in large-scale or continuous domains (Agarwal et al., 2019).
- Sample complexity for reward-mixing MDPs: The identification and planning procedure in 2-component RM-MDPs requires a number of episodes polynomial in the problem parameters and in $1/\epsilon$ to reach $\epsilon$-optimality, with matching lower bounds in certain cases (Kwon et al., 2021).
- Distributional contraction and disentanglement: Distributional Bellman operators remain contractive (in Wasserstein or MMD metrics) when modeling either decomposed or joint reward distributions; disentanglement regularizers are designed to preserve fixed points while promoting specialization (Lin et al., 2019, Zhang et al., 2021).
- Bandit adaptation regret: Bandit-weighted reward adaptation with Exp3 achieves sublinear cumulative regret relative to the best-in-hindsight static weighting, of order $O(\sqrt{TK\log K})$ over $K$ candidate weightings, with further empirical reductions by using context (Min et al., 20 Mar 2024); the guarantee is written out below.
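To spell out the last item, the regret notion takes the generic form of the classical Exp3 guarantee over $K$ candidate weightings and $T$ rounds; the payoff notation $u_t(w)$ (normalized to $[0,1]$) is introduced here, and the exact constants and statement in (Min et al., 20 Mar 2024) may differ.

```latex
% Regret of the adaptively chosen weightings w_t versus the best fixed
% candidate weighting, with per-round payoffs u_t(w) normalized to [0, 1].
\mathrm{Regret}(T)
  \;=\; \max_{w \in \{w^{(1)},\dots,w^{(K)}\}} \sum_{t=1}^{T} u_t(w)
        \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} u_t(w_t)\right]
  \;=\; O\!\left(\sqrt{T K \log K}\right).
```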
4. Empirical Results and Practical Implementations
Methodologies in MRRL have demonstrated empirical gains in diverse domains:
- Language generation: Dynamic reward adaptation (DynaOpt, C-DynaOpt) in RL fine-tuning of LLMs for counseling reflection improves all primary reward metrics (reflection, fluency, coherence) over fixed or statically alternated baselines, as demonstrated both by automatic and human evaluation (Min et al., 20 Mar 2024).
- Fairness in resource allocation: Optimization of non-linear fairness metrics (proportional fairness, $\alpha$-fairness) in cellular scheduling and queueing outperforms DQN, SARSA, and classic heuristics when using model-based or model-free MRRL (Agarwal et al., 2019).
- Multi-agent and curiosity-driven exploration: I-Go-Explore substantially increases sample efficiency and solution quality in sparse-reward, multi-agent competitive tasks compared to pure ICM or standard MADDPG, addressing the detachment problem (Li et al., 2023).
- Distributional reward decomposition: DRDRL and MD3QN outperform Rainbow and HRA on multi-channel Atari and maze benchmarks, learning interpretable and specialized sub-policies that capture both marginal and joint reward structures (Lin et al., 2019, Zhang et al., 2021).
- Transfer and meta-learning: Variational IRL with empowerment-based regularization (SEAIRL) discovers sub-task-reward decompositions that are robust to dynamics variation and enable fast adaptation to unseen environments (Yoo et al., 2022).
- Learning from human feedback: Multi-level feedback models (e.g., with $6$ grades) accelerate convergence in grid-world control compared to binary feedback, especially under realistic human noise (Elahi et al., 20 Apr 2025).
Implementation of these methods often involves architectural innovations (distributional heads, conditional policies, bandit wrappers), non-trivial regularization, and careful scrutiny of how well the reward formulation supports downstream transferability or compositionality; a multi-head decomposition sketch follows.
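As an architectural illustration of the dedicated-head-per-sub-reward idea, in the spirit of HRA/DRDRL-style decomposition rather than a faithful reimplementation of either, the PyTorch sketch below predicts one Q-value per reward channel and sums the heads into the aggregate Q used for action selection; the class names, layer sizes, and the per-channel TD loss are illustrative.

```python
import torch
import torch.nn as nn

class DecomposedQNetwork(nn.Module):
    """Q-network with one head per reward channel.

    Each head estimates Q_i(s, a) for sub-reward r_i; the aggregate
    Q(s, a) = sum_i Q_i(s, a) drives greedy action selection, while
    per-channel TD losses provide richer credit assignment."""

    def __init__(self, obs_dim, n_actions, n_channels, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One linear head of size n_actions per reward channel.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_channels)]
        )

    def forward(self, obs):
        """Per-channel Q-values, shape (batch, n_channels, n_actions)."""
        z = self.trunk(obs)
        return torch.stack([head(z) for head in self.heads], dim=1)

    def aggregate_q(self, obs):
        """Aggregate Q(s, a) = sum over channels, shape (batch, n_actions)."""
        return self.forward(obs).sum(dim=1)


def per_channel_td_loss(net, target_net, obs, actions, reward_vec,
                        next_obs, done, gamma=0.99):
    """actions: LongTensor (batch,); reward_vec: (batch, n_channels);
    done: float tensor (batch,). The greedy next action comes from the
    aggregate target Q; each channel then bootstraps on that action."""
    q_all = net(obs)                                        # (B, C, A)
    idx = actions.view(-1, 1, 1).expand(-1, q_all.size(1), 1)
    q_taken = q_all.gather(2, idx).squeeze(2)               # (B, C)
    with torch.no_grad():
        next_a = target_net.aggregate_q(next_obs).argmax(dim=1)
        next_idx = next_a.view(-1, 1, 1).expand(-1, q_all.size(1), 1)
        next_q = target_net(next_obs).gather(2, next_idx).squeeze(2)
        target = reward_vec + gamma * (1.0 - done).unsqueeze(1) * next_q
    return nn.functional.mse_loss(q_taken, target)
```

Summing heads for action selection while training each head on its own channel is the basic design choice that the cited decomposition methods refine with distributional critics and disentanglement regularizers.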
5. Challenges and Open Problems
MRRL exposes critical challenges:
- Non-Markovianity: Many scalarizations (e.g., nonlinear-in-averages) and multi-level feedback schemes induce non-Markovian reward or value functions. This defeats classic Bellman recursion and necessitates occupancy-measure or history-based approaches (Agarwal et al., 2019, Elahi et al., 20 Apr 2025); a worked gradient identity follows this list.
- Partial observability: Mixture or latent reward models require exploration and statistical identification strategies that go beyond standard exploration bonuses or posterior sampling (Kwon et al., 2021).
- Scaling and expressivity: Reward decomposition is computationally expensive and combinatorially complex as the number of channels and total reward dimensions grow (Lin et al., 2019).
- Adaptation to non-stationary user preferences: Fixed-weight scalarizations are often misaligned with true, time-varying importance, motivating ongoing research in automated, low-regret weight adaptation in both RL and bandit frameworks (Min et al., 20 Mar 2024).
- Transfer, generalization, and disentanglement: Discovering transferable, robust decompositions or task-conditioned reward structures remains an open research area, with connections to causality, empowerment, and hierarchical RL (Yoo et al., 2022).
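To connect the non-Markovianity point above with the model-free route of (Agarwal et al., 2019): although a non-linear scalarization $f$ of per-objective returns admits no per-step Bellman recursion, its policy gradient still decomposes via the chain rule over the per-objective returns $J_i(\theta)$, each of which has a standard score-function (REINFORCE) estimator. A sketch, where the notation $G_i^{(m)}$ for the empirical return of objective $i$ in rollout $m$ is introduced here:

```latex
\nabla_\theta\, f\big(J_1(\theta),\dots,J_K(\theta)\big)
   \;=\; \sum_{i=1}^{K} \frac{\partial f}{\partial J_i}\Big|_{\mathbf{J}(\theta)}
          \,\nabla_\theta J_i(\theta),
\qquad
\nabla_\theta J_i(\theta)
   \;\approx\; \frac{1}{M}\sum_{m=1}^{M}
        G_i^{(m)} \sum_{t} \nabla_\theta \log \pi_\theta\!\big(a_t^{(m)} \mid s_t^{(m)}\big).
```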
6. Connections to Broader Research and Extensions
MRRL connects to, and extends, several related paradigms and methods:
- Multi-objective, constraint, and robust control: MRRL generalizes classical multi-objective RL, incorporating constraints, CVaR or risk-sensitive objectives, and providing distributional or multi-dimensional policy representations (Zhang et al., 2021).
- Hierarchical RL and successor features: Reward decomposition naturally aligns with skill-discovery and sub-policy learning; distributional heads or disentanglement losses can enable representation transfer (Lin et al., 2019).
- Exploration and auxiliary feedback: Techniques from intrinsic motivation, curiosity, and empowerment play a key role in MRRL, especially when environmental feedback is sparse, noisy, or unreliable (Li et al., 2023, Yoo et al., 2022).
- Method-of-moments and latent variable learning: The method-of-moments approach to latent reward mixtures is representative of a growing trend toward using weak supervision signals (human feedback, context-mixtures, unlabeled tasks) to learn reward and value representations (Kwon et al., 2021).
- Sample efficiency and prior knowledge: Distributional and decomposition-based methods empirically confer faster learning via better credit assignment and richer shaping signals (Lin et al., 2019, Zhang et al., 2021).
Further extensions include risk-sensitive objective modeling, decentralized/distributed variants, a spectrum of reward-channel dependencies (from independence to full coupling), and application to domains involving hard alignment, resource constraints, and transfer learning.
MRRL provides a unified foundation for a spectrum of problems where agents must arbitrate among, synthesize, or disentangle rewards from multiple, potentially structured or partially observed sources. Current approaches leverage advances in convex programming, policy-gradient estimation, distributional modeling, bandit adaptation, and latent variable inference to provide both sample-efficient learning and rigorous theoretical guarantees, with open problems centering on scalability, robustness, and cross-domain transfer.