MORSE: Multi-Objective Reward Shaping

Updated 24 December 2025
  • The paper introduces MORSE, which integrates multi-objective reward shaping with explicit exploration to enhance RL agents’ performance across varied objectives.
  • It employs bi-level optimization and dynamic reward weighting to adaptively balance exploitation and exploration in complex environments.
  • Empirical results demonstrate that MORSE improves robustness and sample efficiency in continuous control, robotics, and safety-critical applications.

Multi-Objective Reward Shaping with Exploration (MORSE) refers to a family of principled reinforcement learning (RL) frameworks that combine multi-objective reward shaping with explicit exploration mechanisms. These approaches enable RL agents to efficiently balance and optimize multiple objectives—such as task completion, safety, efficiency, and information acquisition—while systematically structuring both reward signals and exploration incentives. MORSE techniques use bi-level or dynamic optimization of reward aggregation, integrate stochasticity and novelty into reward-space search, and provide provable guarantees under certain Bayesian or potential-based formulations. Recent theoretical and empirical work demonstrates MORSE-based algorithms achieving superior robustness and sample efficiency across continuous control, robotic, and preference-alignment domains (Xie et al., 17 Dec 2025, Lu et al., 14 Sep 2025, Lidayan et al., 9 Sep 2024).

1. Mathematical Foundations of MORSE

In multi-objective RL, the agent operates in a Markov decision process (MDP) or Bayes-Adaptive MDP (BAMDP) characterized by a reward vector $R(s,a) \in \mathbb{R}^K$ for $K$ objectives. Classic scalarization schemes form $r^w_t = \sum_{i=1}^K w_i r^i_t$ for $w \in \Delta^K$. Fixed-weight scalarization cannot capture non-convex Pareto fronts, prompting the need for more adaptive reward weighting and shaping (Lu et al., 14 Sep 2025).
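As a concrete illustration of fixed-weight scalarization, the following minimal Python snippet computes $r^w_t$ for a three-objective reward vector; the objective names and numerical values are invented for illustration only.

```python
import numpy as np

# Hypothetical 3-objective reward vector at one timestep:
# [task progress, negative control cost, negative safety penalty].
r_t = np.array([1.0, -0.2, -0.05])

# Fixed weights w on the simplex Delta^3.
w = np.array([0.6, 0.3, 0.1])

# Scalarized reward r^w_t = sum_i w_i * r^i_t.
r_scalar = float(w @ r_t)
print(r_scalar)  # 0.6*1.0 + 0.3*(-0.2) + 0.1*(-0.05) = 0.535
```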

A key insight central to MORSE research is the aggregation of extrinsic and intrinsic rewards to drive both exploitation (optimizing for task-specific objectives) and exploration (maximizing information gain or novelty). In particular, Bayes-Adaptive MDP potential-based shaping functions (BAMPFs) formalize pseudo-rewards of the form:

$$F\big((b,s), (b',s')\big) = \gamma\, W(b',s') - W(b,s)$$

where $W(\cdot)$ is a multi-objective potential function decomposed as $W(b,s) = \alpha_{\mathrm{phys}} W_{\mathrm{phys}}(b,s) + \alpha_{\mathrm{info}} W_{\mathrm{info}}(b,s)$. The shaped reward at each transition is:

$$R_{\mathrm{shaped}} = R(s,a) + \gamma\, W(b',s') - W(b,s)$$

This construction provably preserves Bayes-optimality and corrects for learner misestimation in both exploitation and exploration terms (Lidayan et al., 9 Sep 2024).
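A minimal sketch of this shaped-reward computation, assuming a discrete belief vector and hand-specified potential components; the goal, potentials, and coefficients below are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def shaped_reward(r_ext, s, s_next, b, b_next,
                  alpha_phys=1.0, alpha_info=0.1, gamma=0.99):
    """Shaped reward R(s,a) + gamma * W(b', s') - W(b, s) with a
    two-component potential W = alpha_phys * W_phys + alpha_info * W_info."""
    W = lambda b_, s_: (alpha_phys * W_phys(b_, s_)
                        + alpha_info * W_info(b_, s_))
    return r_ext + gamma * W(b_next, s_next) - W(b, s)

# Illustrative potentials: physical progress toward a goal state, and an
# information potential equal to the negative entropy of a discrete belief.
goal = np.array([1.0, 0.0])
W_phys = lambda b, s: -np.linalg.norm(s - goal)            # closer is better
W_info = lambda b, s: float((b * np.log(b + 1e-8)).sum())  # -H(belief)

s, s_next = np.array([0.0, 0.0]), np.array([0.2, 0.0])
b, b_next = np.array([0.5, 0.5]), np.array([0.8, 0.2])     # belief sharpens
print(shaped_reward(0.0, s, s_next, b, b_next))            # > 0: progress + info gain
```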

2. MORSE as Bi-Level and Dynamic Optimization

Contemporary MORSE algorithms formulate reward shaping as a bi-level optimization problem. The inner loop trains a policy πθ\pi_\theta using the current shaped reward, while the outer loop adapts the shaping parameters (often the reward weights) to optimize the true task objective. This two-timescale structure is formalized as:

  • Inner: $\theta^*(\phi) = \arg\max_\theta \mathbb{E}_{\pi_\theta}\big[\sum_t \gamma^t R_\phi(s_t,a_t)\big]$
  • Outer: $\phi^* = \arg\max_\phi \mathbb{E}_{\pi_{\theta^*(\phi)}}\big[\sum_t \gamma^t R_{\mathrm{task}}(s_t,a_t)\big]$

where $R_\phi(s,a) = w_\phi(s)^\top R(s,a)$. Gradients for the outer loop are computed via implicit differentiation or Neumann-series approximations of the Hessian inverse (Xie et al., 17 Dec 2025).
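The implicit-differentiation step can be sketched generically with Hessian-vector products and a truncated Neumann series; this is an illustration of the technique rather than the cited authors' implementation, and the toy check at the end uses an analytically solvable inner problem.

```python
import torch

def neumann_hypergrad(inner_loss, outer_loss, theta, phi, K=20, alpha=0.1):
    """Approximate d(outer)/d(phi) at theta = theta*(phi) via implicit
    differentiation, with a truncated Neumann series standing in for the
    inverse Hessian of the inner loss:
        H^{-1} v  ~=  alpha * sum_{k=0}^{K} (I - alpha*H)^k v.
    All second-order quantities are obtained through Hessian-vector products."""
    # v = d(outer)/d(theta), treated as a constant vector below.
    v = torch.autograd.grad(outer_loss, theta, retain_graph=True)[0].detach()

    # d(inner)/d(theta), with the graph kept for Hessian-vector products.
    g_in = torch.autograd.grad(inner_loss, theta, create_graph=True)[0]

    # Neumann iteration: p accumulates H^{-1} v, q holds (I - alpha*H)^k v.
    p, q = v.clone(), v.clone()
    for _ in range(K):
        hvp = torch.autograd.grad(g_in, theta, grad_outputs=q, retain_graph=True)[0]
        q = q - alpha * hvp
        p = p + q
    p = alpha * p

    # Indirect term: - d^2(inner)/(d phi d theta) applied to H^{-1} v.
    mixed = torch.autograd.grad(g_in, phi, grad_outputs=p, retain_graph=True)[0]
    return -mixed

# Toy check: inner = 0.5*theta^2 - phi*theta  =>  theta*(phi) = phi, H = 1;
# outer = 0.5*(theta - 1)^2  =>  true hypergradient at theta = phi is (phi - 1).
theta = torch.tensor([0.3], requires_grad=True)  # pretend the inner loop converged
phi = torch.tensor([0.3], requires_grad=True)
inner = (0.5 * theta**2 - phi * theta).sum()
outer = (0.5 * (theta - 1.0)**2).sum()
print(neumann_hypergrad(inner, outer, theta, phi, K=50, alpha=0.5))  # ~ -0.7
```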

To address local minima and improve adaptability, dynamic reward weighting mechanisms are integrated. These can be hypervolume-guided—maximizing the increase in covered Pareto front volume as new policies are discovered—or gradient-based, using meta-objectives to tune weight importance based on gradient influence signals. The latter adjusts $w$ online through entropy-regularized mirror descent, allocating learning effort toward objectives showing maximal potential improvement (Lu et al., 14 Sep 2025).
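A minimal sketch of an entropy-regularized mirror-descent (exponentiated-gradient) update on the simplex; the per-objective "improvement signal" fed to it is a placeholder for whatever gradient-influence or hypervolume statistic a particular method computes.

```python
import numpy as np

def mirror_descent_weight_update(w, improvement_signals, eta=0.1):
    """Exponentiated-gradient step: w_i <- w_i * exp(eta * g_i) / Z, i.e.
    mirror descent with the entropy regularizer, keeping w on the simplex.
    Objectives whose signal indicates more potential improvement gain weight."""
    logits = np.log(w + 1e-12) + eta * np.asarray(improvement_signals)
    w_new = np.exp(logits - logits.max())   # subtract max for numerical stability
    return w_new / w_new.sum()

w = np.array([1 / 3, 1 / 3, 1 / 3])
# e.g., objective 0 currently shows the largest measured improvement signal
print(mirror_descent_weight_update(w, [0.8, 0.1, -0.2], eta=1.0))
```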

3. Exploration Strategies in MORSE Frameworks

MORSE frameworks introduce exploration not solely in the state-action space but also in the reward-weight parameter space. Techniques for reward-space exploration include:

  • Stochastic reward perturbation: Adding zero-mean Gaussian noise to shaped rewards during policy rollouts, with the variance often adaptive to novelty.
  • Random Network Distillation (RND) for weight novelty: Maintaining target and predictor networks over $w$ to score the novelty of candidate reward-weight vectors. New shaping weights are sampled with probability proportional to their predicted novelty, encouraging coverage of under-explored regions (Xie et al., 17 Dec 2025); a minimal sketch appears after this list.
  • Performance-gated resets: Triggering re-sampling of shaping parameters when empirical task success shows no improvement over a sliding window, or with probability governed by recent success rates.
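As referenced in the RND bullet above, the following is a minimal sketch of RND-based novelty scoring over candidate weight vectors with novelty-proportional sampling; the network sizes, Dirichlet proposal distribution, and update schedule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WeightSpaceRND(nn.Module):
    """RND novelty over reward-weight vectors: novelty is the prediction error
    of a trained predictor against a fixed, randomly initialized target net."""
    def __init__(self, k_objectives, dim=32):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(k_objectives, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))
        self.predictor = nn.Sequential(nn.Linear(k_objectives, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))
        for p in self.target.parameters():      # target stays fixed
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-3)

    def novelty(self, w):                       # higher = less visited
        return ((self.predictor(w) - self.target(w)) ** 2).mean(dim=-1)

    def update(self, w_visited):                # fit predictor on visited weights
        loss = self.novelty(w_visited).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

def sample_new_weights(rnd, candidates):
    """Pick a candidate weight vector with probability proportional to novelty."""
    with torch.no_grad():
        scores = rnd.novelty(candidates)
    idx = torch.multinomial(scores / scores.sum(), 1).item()
    return candidates[idx]

# Usage: propose candidates on the simplex, pick a novel one, train RND on it.
rnd = WeightSpaceRND(k_objectives=3)
cands = torch.distributions.Dirichlet(torch.ones(3)).sample((64,))
w_next = sample_new_weights(rnd, cands)
rnd.update(w_next.unsqueeze(0))
```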

This approach enables MORSE algorithms to escape poor local optima in non-convex reward landscapes and to adaptively discover high-performing weight configurations, which static or purely periodic exploration fails to achieve.

4. Integration of Human Heuristics and Safety Objectives

MORSE directly incorporates multiple human-designed heuristic objectives by stacking dense auxiliary signals (e.g., control costs, kinematic penalties) along with sparse task-completion rewards. A state-conditioned linear or non-linear weighting function aggregates these into a unified shaped reward, typically implemented as a small neural network. The approach requires minimal manual tuning and can efficiently adapt weightings to the evolving demands of the learning task (Xie et al., 17 Dec 2025).
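A minimal sketch of such a state-conditioned weighting network; the softmax parameterization, layer sizes, and example signals are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StateConditionedWeighting(nn.Module):
    """Small network mapping a state to a weight vector on the simplex,
    used to aggregate stacked heuristic reward signals into one shaped reward."""
    def __init__(self, state_dim, k_objectives, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, k_objectives))

    def forward(self, state, reward_vector):
        w = torch.softmax(self.net(state), dim=-1)   # state-conditioned weights
        return (w * reward_vector).sum(dim=-1)       # shaped scalar reward

# Example: 3 stacked signals = [sparse task success, -control cost, -joint-limit penalty]
shaper = StateConditionedWeighting(state_dim=8, k_objectives=3)
s = torch.randn(8)
r_vec = torch.tensor([0.0, -0.15, -0.02])
print(shaper(s, r_vec))
```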

In settings where safety constraints are critical, related frameworks integrate multi-objective shaping with action-space reshaping. Here, exploration is constrained by learned or hard-coded safety functions, enforcing that policy proposals reside within admissible regions. Such constrained exploration maintains sample efficiency and prevents undesired behaviors without manual penalty balancing (Pham et al., 2018).
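A much-simplified sketch of action-space reshaping via a state-dependent admissible region; practical constrained-exploration systems use learned safety functions or optimization-based projections rather than the illustrative box clipping and hypothetical `admissible_bounds` rule below.

```python
import numpy as np

def reshape_action(a_proposed, s, safety_margin=0.1):
    """Project a proposed action into the admissible region for state s,
    so exploration never leaves the safe set (box-clipping sketch)."""
    low, high = admissible_bounds(s, safety_margin)
    return np.clip(a_proposed, low, high)

def admissible_bounds(s, margin):
    # Illustrative rule: shrink a nominal action box as the state nears a limit.
    nominal = 1.0 - margin * np.abs(s).max()
    return -nominal, nominal

a_safe = reshape_action(np.array([1.5, -0.3]), s=np.array([0.4, 0.9]))
print(a_safe)   # first component clipped into the admissible box
```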

5. Theoretical Guarantees and Robustness

Potential-based shaping in the BAMDP framework ensures that the optimal policy of the original MDP is also optimal under the shaped reward. The BAMDP Potential-Based Shaping Theorem provides that a shaping function FF preserves Bayes-optimality if and only if FF is a BAMPF. For suboptimal learners, shaping significantly speeds convergence without risk of reward-hacking—convergence to behaviors maximizing only the composite reward at the expense of the true objective. Finite-horizon regret bounds quantify potential increases in suboptimality due to terminal potential misestimation, but these remain dominated by the problem’s intrinsic horizon (Lidayan et al., 9 Sep 2024).
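The invariance can be seen in one line via the standard telescoping argument for potential-based shaping, stated here in belief-state form and assuming the potential vanishes at termination: for any policy $\pi$,

$$Q^{\pi}_{\mathrm{shaped}}(b,s,a) = Q^{\pi}(b,s,a) - W(b,s),$$

so every action's value at $(b,s)$ is shifted by the same constant and the greedy (Bayes-optimal) policy is unchanged; this is the "constant shift" noted in the component table later in this article.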

Dynamic reward weighting and exploration in reward space do not formally guarantee global optimality, owing to inherent non-convexity, but empirical results and local convergence statements under standard smoothness and strong convexity conditions are reported (Xie et al., 17 Dec 2025, Lu et al., 14 Sep 2025).

6. Practical Implementation and Empirical Results

MORSE algorithms are compatible with standard on-policy RL algorithms (e.g., PPO, GRPO, REINFORCE, RLOO), requiring minor modifications: tracking per-objective returns, parameterizing the weight update mechanism, and resetting or perturbing reward parameters as exploration dictates. Key hyperparameters include learning rates for policy and weight updates, Neumann approximation order, exploration interval, and novelty metric architecture.
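A skeleton of these modifications layered on an arbitrary on-policy learner; every callable argument and the `set_weights` method are placeholders standing in for the user's own environment, policy-update, weight-update, and novelty-sampling code.

```python
import numpy as np

def train_morse(collect_rollouts, policy_update, weight_update, sample_novel_weights,
                shaper, iters=1000, explore_interval=50, patience=10):
    """Skeleton of the modifications listed above (placeholder callables)."""
    success_history = []
    for it in range(iters):
        # 1. Rollouts under the current shaped reward; per-objective returns
        #    are tracked separately so the outer loop can see each signal.
        batch, per_objective_returns, success_rate = collect_rollouts(shaper)
        success_history.append(success_rate)

        # 2. Inner loop: ordinary on-policy update (PPO, GRPO, ...) on the batch.
        policy_update(batch)

        # 3. Outer loop: adapt the shaping weights toward the true task return.
        weight_update(shaper, per_objective_returns)

        # 4. Weight-space exploration: periodic or performance-gated resets.
        stalled = (len(success_history) >= 2 * patience and
                   np.mean(success_history[-patience:])
                   <= np.mean(success_history[-2 * patience:-patience]))
        if (it + 1) % explore_interval == 0 or stalled:
            shaper.set_weights(sample_novel_weights())
```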

Empirical evaluations demonstrate MORSE’s superiority or parity with expert-tuned baselines on synthetic optimization landscapes, continuous-control MuJoCo tasks, and Isaac Sim robotics environments. Specifically:

  • In MuJoCo “hard” tasks, MORSE achieves success rates of $0.90$–$0.98$ after training, outperforming vanilla bi-level gradient approaches ($0.60$–$0.75$) and matching oracle baselines ($0.95$–$0.99$).
  • On synthetic test functions, RND-guided MORSE outperforms random sampling, CEM, and CMA for finding high-reward regions in weight space under tight sample budgets.
  • In BAMDP settings, MORSE reduces steps to solution (e.g., Mountain Car solved in $\sim 30$k steps vs. $100$k without shaping), and in Bernoulli bandits, achieves $O(\log T)$ regret with low constants, matching Bayes-optimality (Lidayan et al., 9 Sep 2024, Xie et al., 17 Dec 2025).

A summary table of core MORSE algorithmic components is provided:

| Component | Mechanism | Role |
|---|---|---|
| Reward aggregation | State-conditioned linear or neural $w_\phi$ | Multi-objective shaping |
| Exploration in $w$-space | RND, stochastic resets, performance gating | Escape local minima |
| Bi-level optimization | Inner RL under $R_\phi$, outer hypergradient on $\phi$ | Task-driven adaptation |
| Dynamic weighting | Hypervolume/gradient-based $w_t$ updates | Pareto front coverage |
| Theoretical guarantee | BAMDP/BAMPF preserves Bayes-optimality (constant shift) | Robustness |

7. Insights, Limitations, and Recommendations

Empirical ablations confirm the necessity of gradient-based outer loops, novelty-driven exploration, performance gating, and policy resets: removing any of these components degrades sample efficiency or leaves the learner stuck in local optima. Sample complexity remains an issue, since each reward-weight candidate entails a (partial) policy retraining. In high-dimensional weight spaces, coverage remains a challenge; manifold-based search and parametric priors are suggested as potential improvements.

Practical guidelines include ensuring inner-loop convergence before outer updates, slow outer learning rates, moderate exploration intervals, and bounded weight search. For sim-to-real transfer, fixing weights after initial randomization enhances robustness. Successor-feature architectures may further reduce performance drops when policy resets follow weight changes (Xie et al., 17 Dec 2025, Lu et al., 14 Sep 2025).

MORSE, through structured reward shaping and dynamic exploration in reward and policy space, constitutes a general and extensible paradigm for multi-objective RL across domains that require efficient trade-off management, robustness to heuristic misspecification, and theoretical soundness.
