Mixed On-Policy Reinforcement Learning (MORL)

Updated 7 June 2026

MORL is a hybrid on-policy RL framework that combines gradient-based policy updates with symbolic representations and constrained optimization to enhance interpretability and safety.
It augments traditional RL by incorporating program synthesis, multi-objective scalarization, and variance reduction techniques to improve sample efficiency and robustness.
The iterative process alternates between policy gradient updates and external interventions, so that multiple objectives and high-level constraints are addressed alongside reward maximization.

Mixed On-Policy Reinforcement Learning (MORL) encompasses a class of algorithms that extend on-policy reinforcement learning by combining standard policy gradient updates with alternative optimization or representation mechanisms, such as program synthesis, constrained optimization, multi-objective scalarization, and variance reduction. These frameworks are motivated by the need to address interpretability, safety, multiple objectives, and sample efficiency, which are inadequately handled by black-box policy networks trained solely with classical RL protocols.

1. Formulation and Core Objectives

Mixed On-Policy Reinforcement Learning is defined by its iterative optimization of policies in an on-policy manner (typically via TRPO, PPO, or actor–critic updates), augmented by auxiliary mechanisms that address one or more of the following objectives:

Enforcement of high-level constraints (such as safety, worst-case guarantees, or other human-driven performance criteria),
Interpretability improvement by leveraging symbolic, often programmatic or explicit, policy representations,
Sample efficiency increase through alternative optimization steps such as program repair, constraint optimization, or variance reduction,
Multi-objective optimization for policies aligned with diverse, often conflicting, objectives and user preferences.

The canonical MORL framework maintains dual (or mixed) representations and optimization channels, generally alternating between gradient-based policy learning in parameter space and external interventions—most commonly program repair, constrained optimization, scalarization updates, or synthesis steps (Bhupatiraju et al., 2018).

2. Iterative Mixed Optimization and Algorithmic Schemes

The original instantiation of MORL implements the following high-level loop (Bhupatiraju et al., 2018):

Policy Learning (On-Policy Gradient Update): Policy parameters $\theta$ are updated via standard on-policy trajectory sampling and policy gradient loss minimization, with optional regularization.
Policy Extraction or Synthesis: The policy $\pi_\theta$ is mapped to a symbolic representation $P$ (typically in a domain-specific language). This extraction is performed using program synthesis via imitation learning or distillation.
Repair or Intervention: The symbolic policy $P$ is modified, either automatically (using constraint solvers, SMT, CSP, genetic repair) or manually by human experts, to satisfy specified constraints, yielding a repaired program $P'$.
Return to Policy Space (Imitation): The modified program is behavior cloned back into a policy $\phi$, typically by supervised learning minimizing log-loss between $\phi$ and $P'$. The new parameters are set as the working policy for the next iteration.

This process is iterated until performance or constraint satisfaction plateaus. Below is a table summarizing the representations and their roles:

Representation	Domain	Purpose
$\pi_\theta$ (neural)	Parameter space	On-policy learning (gradient updates)
$P$ (symbolic)	Program/DSL space	Inspection, interpretability, repair
$\phi$ (behavior clone)	Parameter space	Policy after back-distillation
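
The four-step loop above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration rather than the setup of Bhupatiraju et al. (2018): the state is a single scalar, the neural policy is a two-parameter logistic model, the symbolic program is the rule "take action 1 iff x > t", the constraint is an upper bound `t_max` on that threshold, and the on-policy gradient update is passed in as a hypothetical callable `policy_gradient_step`.

```python
import math


def neural_policy(theta, x):
    """Probability of action 1 under a logistic policy with theta = (a, b)."""
    a, b = theta
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


def extract_program(theta, states):
    """Step 2: distill the greedy policy into the rule 'act 1 iff x > t'."""
    labels = [neural_policy(theta, x) > 0.5 for x in states]

    def disagreement(t):
        return sum((x > t) != y for x, y in zip(states, labels))

    # Search candidate thresholds taken from the sampled states.
    return min(sorted(states), key=disagreement)


def repair_program(t, t_max):
    """Step 3: enforce the (assumed) constraint that the threshold is at most t_max."""
    return min(t, t_max)


def imitate(t, states, theta, lr=0.5, epochs=2000):
    """Step 4: behavior cloning by gradient descent on the log-loss."""
    a, b = theta
    n = len(states)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x in states:
            target = 1.0 if x > t else 0.0
            err = neural_policy((a, b), x) - target
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return (a, b)


def morl_iteration(theta, states, policy_gradient_step, t_max):
    """One pass through learn -> extract -> repair -> imitate."""
    theta = policy_gradient_step(theta)        # step 1 (supplied by the caller)
    t = extract_program(theta, states)         # step 2
    t_repaired = repair_program(t, t_max)      # step 3
    phi = imitate(t_repaired, states, theta)   # step 4
    return phi, t_repaired


# Usage: the starting policy acts only for x > 1; the repair lowers this to x > 0.
states = [i / 10 for i in range(-20, 21)]
phi, t_repaired = morl_iteration((4.0, -4.0), states, lambda th: th, t_max=0.0)
```

In this toy setting the repaired threshold is transferred back into the parameters, so the next on-policy update would start from a policy that already respects the constraint on the sampled states. A realistic system would replace each helper with a far richer component (a DSL synthesizer, a constraint solver, and a neural behavior-cloning step).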

The procedure generalizes to mixed-optimization schemes in other domains, such as mixed-integer MPC (Gros et al., 2020), C-MORL (Liu et al., 2024), and various multi-objective or constrained MORL frameworks.

3. Methods for Handling Multiple Objectives and Constraints

Several MORL variants directly address mixed-objective optimization:

Linear Scalarization: Combines multiple objectives into a weighted sum, optimizing $J_w(\pi_\theta) = \sum_i w_i J_i(\pi_\theta)$, where $J_i$ is the expected return for objective $i$ and the $w_i$ are nonnegative weights. This is widely used but cannot recover concave regions of the Pareto front and is sensitive to the relative scaling of the objectives (Abdolmaleki et al., 2021, Terekhov et al., 2024).
Pareto Ascent Directional Decomposition (PA2D-MORL): Computes a Pareto ascent direction by solving for a convex combination of policy gradients that improves all objectives simultaneously. This direction is used as the optimization step in on-policy updates, ensuring joint improvement across objectives and more uniform coverage of the Pareto front (Hu et al., 20 Mar 2026).
Constrained Optimization: C-MORL initializes multiple single-objective policies and employs constrained optimization steps to efficiently extend the discovered front, solving subproblems of the form $\max_\theta J_i(\pi_\theta)$ subject to $J_j(\pi_\theta) \geq c_j$ for all $j \neq i$, where $J_i$ is the expected return for objective $i$ and $c_j$ is a threshold on objective $j$, and leveraging primal–dual or interior-point Lagrangian methods (Liu et al., 2024).
Mixture-of-Experts Distillation (DiME): Iteratively solves for optimal policies for each individual objective before distilling their mixture into a unified policy via (weighted) KL minimization, thus overcoming the scale-sensitivity and front-concavity limitations of scalarization (Abdolmaleki et al., 2021).
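
As an illustration of the idea of a convex combination of gradients that improves all objectives, the sketch below computes the minimum-norm point in the convex hull of two gradient vectors, which has a closed form in the two-objective case. This is the classical min-norm construction used in multiple-gradient methods and is shown here as an assumption about the general technique, not as the exact procedure of PA2D-MORL; the function names are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pareto_ascent_direction(g1, g2):
    """Return d = alpha*g1 + (1-alpha)*g2 with minimum norm, alpha in [0, 1].

    For this d, dot(d, g1) >= |d|^2 and dot(d, g2) >= |d|^2, so a small step
    along d does not decrease either objective to first order. If d is the
    zero vector, no common ascent direction exists (Pareto stationarity).
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return list(g1)  # identical gradients: either one is the answer
    # Unconstrained minimizer of |g2 + alpha*(g1 - g2)|^2, then clipped to [0, 1].
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = max(0.0, min(1.0, alpha))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


# Usage: two partially conflicting gradients.
g1, g2 = [1.0, 0.0], [-1.0, 1.0]
d = pareto_ascent_direction(g1, g2)
```

For the two gradients in the usage lines, the resulting direction has a positive inner product with both, whereas following either gradient alone would be neutral or harmful for the other objective. With more than two objectives the same problem becomes a small quadratic program over the simplex.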

4. Practical Variants: Architectures, Losses, and Variance Reduction

Recent advances address stability, scalability, and robustness in on-policy MORL through architectural and algorithmic innovations:

Preference-Conditioned Networks: Model-free approaches such as MOPPO use parameterized policies and value functions, $\pi_\theta(a \mid s, w)$ and $V(s, w)$, that take the scalarization weights $w$ as input, allowing a single trained policy to represent the entire weight-conditioned Pareto set (Terekhov et al., 2024).
Adaptive Preference Sampling and Reward Normalization: Training procedures sample preference weights per rollout, support reward normalization (e.g., via PopArt), and control exploration entropy through dynamically-modulated penalization.
Variance-Reduced Policy Gradients: MO-TSIVR-PG extends the classic variance-reduction framework to multi-objective on-policy gradients by leveraging control variates, anchor policies, and importance weighting. This reduces gradient estimator variance, yielding improved sample efficiency and a milder dependence of the sample complexity on the accuracy parameter $\epsilon$ (Guidobene et al., 14 Aug 2025).
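
The preference-conditioning and per-rollout sampling ideas above can be sketched in a few lines. The sketch assumes weights drawn uniformly from the probability simplex and a policy input formed by concatenating state and weights; both are illustrative choices, and the helper names are hypothetical rather than taken from MOPPO.

```python
import random


def sample_preference(num_objectives, rng):
    """Draw a weight vector uniformly from the probability simplex.

    Normalizing independent Exp(1) draws yields a Dirichlet(1, ..., 1) sample.
    """
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def conditioned_input(state, weights):
    """Input for a preference-conditioned network: state features then weights."""
    return list(state) + list(weights)


def scalarize(reward_vector, weights):
    """Weighted-sum reward used to train under the sampled preference."""
    return sum(w * r for w, r in zip(weights, reward_vector))


# Usage: one preference per rollout, reused for every step of that rollout.
rng = random.Random(0)
w = sample_preference(3, rng)
x = conditioned_input([0.1, -0.3], w)
r = scalarize([1.0, 0.0, 2.0], w)
```

Holding the sampled weights fixed for a whole rollout keeps the trajectory on-policy with respect to a single preference, which is what lets one network be queried later with any weight vector.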

5. Empirical Evaluation and Benchmarks

Empirical results on classic and modern multi-objective benchmarks consistently demonstrate the efficacy of mixed on-policy optimization. Notable highlights across studies:

Efficiency and Coverage: C-MORL achieves leading hypervolume (HV) and expected utility (EU) measures across both discrete and continuous environments, including high-dimensional objective spaces (up to nine objectives) (Liu et al., 2024).
Front Density and Stability: PA2D-MORL achieves both higher HV and lower sparsity (SP) compared to evolutionary baselines, with smaller cross-seed variance (Hu et al., 20 Mar 2026).
Interpretability and Constraint Satisfaction: Program-synthesis-based MORL rapidly recovers interpretable, safe CartPole policies with faster convergence than direct neural policy optimization (Bhupatiraju et al., 2018).
Robust Preference Generalization: MORL-FB, integrating reward-free auxiliary objectives and preference-guided exploration, yields substantial gains in utility and hypervolume, excelling in worst-case preference (CVaR) tests (Chen et al., 27 Apr 2026).
Sample Complexity: MO-TSIVR-PG attains a faster convergence rate in the accuracy parameter $\epsilon$ than classical policy gradient methods, together with an empirically measured order of scaling with respect to the number of objectives (Guidobene et al., 14 Aug 2025).
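
For readers unfamiliar with the metrics named above, the following sketch computes hypervolume for a two-objective maximization problem and a sparsity score for a front. The sparsity formula (average squared gap between neighboring solutions, summed over objectives) is one commonly used definition and is assumed here; the cited papers may differ in normalization or in the choice of reference point.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    rx, ry = ref
    # Keep points that strictly improve on the reference, sweep from largest x.
    pts = sorted(
        (p for p in points if p[0] > rx and p[1] > ry),
        key=lambda p: (-p[0], -p[1]),
    )
    hv, best_y = 0.0, ry
    for x, y in pts:
        if y > best_y:  # dominated points add no new area
            hv += (x - rx) * (y - best_y)
            best_y = y
    return hv


def sparsity(front):
    """Average squared distance between neighbors, summed over objectives."""
    n = len(front)
    if n < 2:
        return 0.0
    total = 0.0
    for j in range(len(front[0])):
        values = sorted(p[j] for p in front)
        total += sum((b - a) ** 2 for a, b in zip(values, values[1:]))
    return total / (n - 1)


# Usage on a three-point front with reference point (0, 0).
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
hv = hypervolume_2d(front, (0.0, 0.0))   # 6.0
sp = sparsity(front)                     # 2.0
```

Higher hypervolume indicates a front that dominates more of the objective space, while lower sparsity indicates more evenly spaced solutions, which is why the two are usually reported together.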

6. Theoretical Properties, Scalability, and Limitations

Optimization Guarantees: Strong duality and convergence are established for the constrained subproblems in C-MORL under Slater conditions; interior-point formulations yield explicit approximation error bounds (Liu et al., 2024). DiME’s EM-based updates improve a lower bound on the multi-objective likelihood at each iteration (Abdolmaleki et al., 2021).
Scalability: C-MORL’s complexity is linear in the number of objectives owing to its crowding-distance-based extension and parallel policy updating (Liu et al., 2024). Architectural approaches avoid explicit preference conditioning to mitigate curse-of-dimensionality effects (Liu et al., 2024).
Limitations: Scalarization-based approaches cannot recover non-convex portions of the Pareto front; methods relying on fixed scalarization functions are limited in generalizing over arbitrary user preferences. Variance reduction methods may require careful hyperparameter tuning.
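
The first limitation can be checked numerically on a toy front. In the sketch below, the three return vectors are made-up values chosen so that the middle point is Pareto-optimal but lies in a concave region; sweeping the weight of a linear scalarization over a grid never selects it.

```python
def scalarized_argmax(front, w):
    """Index of the point maximizing w*f1 + (1-w)*f2 (first index on ties)."""
    scores = [w * f1 + (1.0 - w) * f2 for f1, f2 in front]
    return scores.index(max(scores))


# Hypothetical two-objective front; (0.4, 0.4) is not dominated by either
# extreme point, but it lies below the line segment joining them.
front = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]

# Sweep the weight on the first objective from 0 to 1 in steps of 0.01.
selected = {scalarized_argmax(front, k / 100) for k in range(101)}
# selected == {0, 2}: the middle point is never chosen.
```

The middle point scores 0.4 for every weight, while the better of the two extremes always scores at least 0.5, so no weighted sum can prefer it. Methods that do not rely on a fixed linear scalarization, such as constrained or distillation-based approaches, are not subject to this particular failure.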

7. Extensions and Future Directions

Hybrid Symbolic–Neural Methods: Chains of synthesis, repair, and policy space optimization may be further combined with multi-objective and constrained frameworks to leverage the respective strengths of interpretability and Pareto coverage.
Reward-Free Auxiliary Objectives: Reward-free RL paradigms supplement multi-objective learning, yielding enhanced generalization and robustness in high-objective regimes (Chen et al., 27 Apr 2026).
Adaptive and Nonlinear Scalarization: Emerging work motivates online adaptation of scalarization weights, mixture-based front recovery, and direct preference elicitation to better characterize the trade-off surface beyond what linear scalarization captures (Abdolmaleki et al., 2021, Hu et al., 20 Mar 2026).
Off-Policy Integration and Reuse: Extending variance reduction and constraint methods to off-policy settings, possibly combined with replay buffer techniques and trust-region regularization, remains an open research direction (Guidobene et al., 14 Aug 2025).

Mixed On-Policy Reinforcement Learning thus spans a spectrum of methodologies built atop on-policy RL, augmented by auxiliary optimization, programmatic repair, multi-objective or constraint-driven updates, and explicit variance control. These hybridized approaches substantially expand the applicability of RL in domains demanding interpretable, preference-aligned, robust, and sample-efficient decision making.