
Pareto Set Learning (PSL-MORL)

Updated 9 March 2026
  • Pareto Set Learning (PSL-MORL) is a framework that conditions policies on user-specified preference vectors to densely approximate Pareto-optimal fronts in multi-objective decision-making.
  • Hypernetwork-based architectures combine global anchor parameters with preference-conditioned weights, ensuring stability and continuous interpolation across high-dimensional policy spaces.
  • Evolutionary and adaptive preference sampling strategies enhance front coverage and accelerate convergence, effectively addressing non-convex and disconnected Pareto regions.

Pareto Set Learning (PSL-MORL) refers to a class of multi-objective reinforcement learning (MORL) algorithms that aim to generate a continuous or dense representation of the Pareto-optimal policy set, allowing for controllable trade-offs between competing objectives by conditioning policies on user-specified preference vectors. PSL-MORL leverages neural parameterizations, preference-conditioned hypernetworks, and evolutionary or adaptive preference sampling strategies to address the challenges of coverage, diversity, and efficiency in approximating complex Pareto fronts in high-dimensional sequential decision-making problems. This entry surveys the mathematical foundations, principal algorithmic approaches, scalarization and preference sampling schemes, theoretical guarantees, and key experimental results in PSL-MORL.

1. Mathematical Formulation of PSL-MORL

PSL-MORL operates in the context of a multi-objective Markov Decision Process (MOMDP) defined as $\mathcal{M} = (S, A, P, \mathbf{R}, \gamma)$, where $S$ is the (possibly continuous) state space, $A$ is the action space, $P$ is the transition kernel, and $\mathbf{R}(s,a,s') = [r_1, \dots, r_m]^\top$ is the $m$-dimensional vector reward function. For a stochastic policy $\pi_\theta(a|s)$ parameterized by $\theta \in \mathbb{R}^n$, the vector of expected discounted returns is

$$\mathbf{J}^\pi = [J_1^\pi, \dots, J_m^\pi]^\top, \quad J_i^\pi = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^\infty \gamma^t r_i(s_t, a_t, s_{t+1})\Big]$$

The Pareto front is defined in objective space as the image of the set of Pareto-optimal policies, i.e., policies for which no other policy achieves at least as high a return on every objective and a strictly higher return on at least one.
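This dominance relation can be made concrete with a short sketch (maximization convention; `dominates` and `pareto_front` are illustrative helpers, not code from the cited papers):

```python
import numpy as np

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximization):
    a is at least as good everywhere and strictly better somewhere."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a >= b) and np.any(a > b))

def pareto_front(points):
    """Filter a finite set of objective vectors down to its non-dominated subset."""
    points = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]
    return points[keep]

front = pareto_front([[1, 3], [2, 2], [3, 1], [1, 1], [2, 1]])
# [1, 1] and [2, 1] are dominated by [2, 2]; the other three survive.
```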

A central idea is to introduce a preference (weight) vector $\omega \in \Delta^{m-1}$ (the $(m-1)$-simplex) and consider scalarized objectives $J(\pi, \omega) = \omega^\top \mathbf{J}^\pi$. Classical scalarization reduces multi-objective optimization to scalar single-objective learning for each $\omega$, but naive approaches either require retraining per $\omega$ or compromise front coverage and diversity.

In PSL-MORL, the goal is to learn a parameterized mapping from $\omega$ (or analogous preference representations) to policies $\pi_{\theta(\omega)}$ such that the resulting image covers the Pareto set densely, supporting user-controllable trade-offs (Liu et al., 12 Jan 2025). Variants employ hypernetworks, meta-learning, or universal function approximators to share learning across preference conditionings.

2. Hypernetwork-Based PSL Architectures

A widely adopted paradigm in PSL-MORL employs a hypernetwork $H_\phi: \omega \mapsto \theta_2$, which generates policy parameters for each preference vector. These are blended with global anchor parameters $\theta_1$ by parameter fusion (e.g., $\theta = (1-\alpha)\theta_1 + \alpha\theta_2$), supporting both stability and preference adaptation (Liu et al., 12 Jan 2025, Shu et al., 2024). The policy $\pi_{\theta(\omega)}$ can thus be efficiently instantiated for any $\omega$.

The training objective is to minimize the expected scalarized RL loss over the preference simplex:

$$\min_{\phi, \theta_1} \; \mathbb{E}_{\omega \sim U(\Delta^{m-1})}\Big[ L_\mathrm{RL}\big(\pi_{(1-\alpha)\theta_1 + \alpha H_\phi(\omega)}, \omega\big) \Big]$$

where $L_\mathrm{RL}$ is the loss for an off-the-shelf RL algorithm (e.g., PPO, DDPG, SAC) applied to the scalarized objective (Liu et al., 12 Jan 2025, Shu et al., 2024).
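A minimal sketch of the parameter-fusion idea, assuming a single linear layer as the hypernetwork and uniform simplex sampling via a $\mathrm{Dirichlet}(1,\dots,1)$ distribution (all names, dimensions, and the linear-layer choice are illustrative, not the architecture of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_params, alpha = 3, 8, 0.5               # objectives, policy dim, fusion weight

theta1 = rng.normal(size=n_params)           # global anchor parameters theta_1
W = 0.1 * rng.normal(size=(n_params, m))     # hypernetwork H_phi (one linear layer here)

def policy_params(omega):
    """Fuse anchor and preference-conditioned weights:
    theta = (1 - alpha) * theta_1 + alpha * H_phi(omega)."""
    theta2 = W @ omega                       # hypernetwork output for this preference
    return (1 - alpha) * theta1 + alpha * theta2

# Uniform sampling over the preference simplex = Dirichlet with all ones.
omega = rng.dirichlet(np.ones(m))
theta = policy_params(omega)                 # instantiate a policy for this omega
```

In a full training loop, `theta` would parameterize the policy whose scalarized RL loss is minimized; only `theta1` and `W` are learned, so a single model serves every preference.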

In pure hypernetwork parameterizations for continuous control, the PSL-MORL approach induces a low-dimensional manifold of policy parameters in $\mathbb{R}^n$. Empirical studies show that the learned set of policy weights for different $\omega$ lies on a smooth $(m-1)$-dimensional surface, supporting efficient representation and continuous coverage of the front (Shu et al., 2024).

3. Preference Sampling: Evolutionary and Adaptive Strategies

Uniform preference sampling may lead to inefficient model utilization and poor Pareto front coverage, especially on disconnected or degenerate fronts. Evolutionary Preference Sampling (EPS) addresses this with an evolutionary loop in which preference populations are evolved via simulated binary crossover (SBX) and polynomial mutation, with fitness assigned by non-dominated sorting and crowding distance of policy performance in objective space (Ye et al., 2024). Key stages:

  • Training is segmented into fixed periods (e.g., $T$ iterations).
  • At each period's end, the elite subset of preferences (according to approximated Pareto status and coverage) forms the new parent pool.
  • SBX crossover and mutation are applied to parents to generate the next batch of preferences, projected onto the simplex.
  • This adaptive sampling concentrates training on preference regions corresponding to both well-converged and underexplored Pareto front segments.

EPS can be plugged into any PSL-MORL approach, with all other components (network and loss) unchanged. Experiments show that EPS reduces convergence time and improves HV on challenging fronts (Ye et al., 2024).
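A simplified single-generation sketch of the EPS variation operators (SBX, polynomial mutation, and a crude clip-and-renormalize simplex projection; the distribution indices and mutation rate are illustrative values, not those of Ye et al., 2024):

```python
import numpy as np

rng = np.random.default_rng(1)

def sbx_crossover(p1, p2, eta=15.0):
    """Simulated binary crossover between two preference vectors."""
    u = rng.random(p1.shape)
    beta = np.where(u <= 0.5,
                    (2 * u) ** (1 / (eta + 1)),
                    (1 / (2 * (1 - u))) ** (1 / (eta + 1)))
    c1 = 0.5 * ((1 + beta) * p1 + (1 - beta) * p2)
    c2 = 0.5 * ((1 - beta) * p1 + (1 + beta) * p2)
    return c1, c2

def poly_mutation(p, eta=20.0, rate=0.3):
    """Polynomial mutation applied componentwise with probability `rate`."""
    u = rng.random(p.shape)
    delta = np.where(u < 0.5,
                     (2 * u) ** (1 / (eta + 1)) - 1,
                     1 - (2 * (1 - u)) ** (1 / (eta + 1)))
    mask = rng.random(p.shape) < rate
    return p + mask * delta

def project_simplex(p, eps=1e-8):
    """Crude projection back onto the probability simplex: clip, then renormalize."""
    p = np.clip(p, eps, None)
    return p / p.sum()

# One EPS generation: recombine two elite preferences into two valid offspring.
parent1, parent2 = np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.3, 0.6])
c1, c2 = sbx_crossover(parent1, parent2)
offspring = [project_simplex(poly_mutation(c)) for c in (c1, c2)]
```

Each offspring is again a valid preference vector and can be fed directly to the scalarized RL loss of the next training period.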

Gaussian Splatting methods further partition the preference space, with local experts learning within Gaussian regions and an aggregator MLP fusing outputs, dynamically controlling partitioning so as to allocate representational capacity to complex front structures (Dinh et al., 22 Sep 2025).

4. Scalarization Schemes and Front Coverage

Scalarization converts the vector-valued MORL objective into a family of scalar objectives parameterized by the user or algorithm. Widely used forms include:

  • Linear scalarization: $s_{\mathrm{ls}}(x \mid \lambda) = \sum_i \lambda_i f_i(x)$;
  • Tchebycheff and variants: $s_{\mathrm{tch}}(x \mid \lambda) = \max_i \lambda_i \big(f_i(x) - (z^*_i - \epsilon)\big)$, or normalized variants to enhance non-convexity sensitivity;
  • Hypervolume-based loss: promotes outward expansion of the Pareto front (Ye et al., 2024, Dinh et al., 22 Sep 2025).

Tchebycheff scalarization is favored for guaranteed Pareto coverage: every $\lambda > 0$ yields a weakly Pareto-optimal policy, and restricting $\lambda$ to the interior of the simplex covers the entire Pareto set, including nonconvex regions. Smooth Tchebycheff scalarization provides better discrimination and sampling properties (Qiu et al., 2024).

When non-linear or non-convex fronts are targeted (e.g., via Tchebycheff or hypervolume-gradient methods), PSL-MORL can approximate even disconnected or degenerate Pareto frontiers, provided the preference sampling scheme is sufficiently adaptive and the parameterization sufficiently expressive (Ye et al., 2024, Dinh et al., 22 Sep 2025).
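The coverage difference between linear and Tchebycheff scalarization can be seen on a toy concave front (minimization convention; the point, weight, and $\epsilon$ values are illustrative):

```python
import numpy as np

def linear_scalarization(f, lam):
    """s_ls(x | lambda) = sum_i lambda_i * f_i(x)."""
    return float(np.dot(lam, f))

def tchebycheff(f, lam, z_star, eps=0.05):
    """s_tch(x | lambda) = max_i lambda_i * (f_i(x) - (z*_i - eps)), minimization."""
    return float(np.max(lam * (np.asarray(f) - (np.asarray(z_star) - eps))))

# Three points on a concave (minimization) front. The middle point (0.6, 0.6)
# lies off the convex hull of the front, so no linear weighting ever selects
# it, while Tchebycheff with equal weights does.
points = np.array([[0.0, 1.0], [0.6, 0.6], [1.0, 0.0]])
lam, z_star = np.array([0.5, 0.5]), np.zeros(2)

best_tch = int(np.argmin([tchebycheff(p, lam, z_star) for p in points]))
best_lin = int(np.argmin([linear_scalarization(p, lam) for p in points]))
```

Here `best_tch` picks the interior point (index 1), while `best_lin` never can, since its score $0.6$ exceeds $\min(\lambda_1, 1-\lambda_1) \le 0.5$ achievable at the endpoints for any weighting.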

5. Theoretical Guarantees and Sample Complexity

Convergence and optimality guarantees for PSL-MORL derive from fixed-point results and the expressive capacity of the hypernetwork parameterization (Liu et al., 12 Jan 2025). Under contraction mappings for Bellman-like operators on multi-objective Q-functions and standard regularity assumptions, hypernetwork-generated policies converge to Pareto-optimal solutions for each $\omega$.

Model capacity results based on Rademacher complexity confirm that PSL-MORL's hypernetwork-based policy class strictly dominates the capacity of universal preference-conditioned networks, enabling more personalized and distinct policies across preferences (Liu et al., 12 Jan 2025).

For weighted Chebyshev actor-critic approaches, finite-time sample complexity can be bounded as $\tilde{\mathcal{O}}(\epsilon^{-2})$ per preference, provided learning rates and batch sizes are tuned appropriately, even for non-convex fronts (Hairi et al., 29 Jul 2025). Preference-free exploration schemes enable one-time environment traversal with $\tilde{\mathcal{O}}(\epsilon^{-2})$ complexity, after which solutions for any number of preferences can be recovered offline with no further exploration (Qiu et al., 2024). Smooth scalarization schemes further improve convergence properties in the mirror-ascent steps (Qiu et al., 2024).

6. Empirical Results and Benchmarks

Extensive experiments compare PSL-MORL and its evolutionary and hypernetwork-based instantiations to state-of-the-art MORL algorithms on standard multi-objective benchmarks.

Key findings:

  • PSL-MORL consistently achieves higher hypervolume and denser front coverage (lower sparsity) than universal or separate-policy baselines (Liu et al., 12 Jan 2025, Shu et al., 2024).
  • Hypernetworks offer substantial reductions in parameter count and allow for continuous front interpolation.
  • EPS and Gaussian Splatting methods accelerate convergence and improve coverage on non-convex/discontinuous fronts (Ye et al., 2024, Dinh et al., 22 Sep 2025).
  • PSL-MORL shows strong scalability to high-dimensional state/action spaces and many-objective tasks (up to $m=6$ in FTN).
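The hypervolume (HV) metric reported above can be computed exactly in two dimensions by a sweep over the sorted front. This helper is an illustrative sketch, assuming maximization and a non-dominated input front:

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hypervolume (dominated area) of a non-dominated 2-objective front
    under maximization, relative to reference point `ref`. Assumes every
    front point strictly dominates `ref`."""
    pts = sorted(map(tuple, front), reverse=True)   # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:                                # y increases as x decreases
        hv += (x - ref[0]) * (y - prev_y)           # add the new horizontal slab
        prev_y = y
    return hv

# Union of rectangles [0,3]x[0,1], [0,2]x[0,2], [0,1]x[0,3] has area 6.
hv = hypervolume_2d([[1, 3], [2, 2], [3, 1]], (0, 0))
```

Higher HV means the front pushes further out from the reference point; the sparsity metric mentioned alongside it instead measures gaps between neighboring front points.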

7. Limitations and Future Directions

Current PSL-MORL frameworks encounter challenges including:

  • Difficulty in exact coverage of highly nonconvex fronts with only linear scalarization; non-linear or Chebyshev-type scalarizations, or hypervolume maximization, are promising but computationally more demanding (Liu et al., 12 Jan 2025, Ye et al., 2024, Nguyen et al., 7 Jun 2025).
  • Preference sampling in very high-dimensional preference spaces may require more sophisticated adaptive or dependence-regularized methods (Ye et al., 2024, Dinh et al., 22 Sep 2025).
  • Extending hypernetwork-based PSL-MORL to off-policy, offline, or partially observed settings remains an open area (Zhu et al., 2023).
  • Theoretical sample complexity for deep (nonlinear) function approximators is largely unproven, though empirical results are favorable (Hairi et al., 29 Jul 2025).

Open problems include integrating constraint handling, robust and adaptive scalarization, efficient adaptation to online/dynamic changes in objective priorities, and learning from noisy or human-in-the-loop preference queries (Mu et al., 18 Jul 2025, Liu et al., 12 Jan 2025, Qiu et al., 2024).

