Vector Policy Optimization (VPO)

Updated 22 May 2026

Vector Policy Optimization (VPO) is a set of reinforcement learning algorithms that optimize diverse, multi-objective policies to approximate the Pareto frontier.
It leverages polyhedral analysis and vector linear programming to extract Pareto-efficient, deterministic policies through vertex enumeration.
Modern VPO employs stochastic scalarization and diversity-driven techniques to enhance test-time search in RL applications and large language model training.

Vector Policy Optimization (VPO) refers to a family of reinforcement learning (RL) and sequential decision-making algorithms in which policies are explicitly trained to optimize with respect to vector-valued (multi-objective) reward functions. Rather than collapsing multiple objectives into a single scalar via fixed linear weighting, VPO seeks to generate diverse solutions that collectively approximate the Pareto frontier in the reward space. This approach is particularly relevant in settings where candidate policies must support downstream search (e.g., best-of-$k$ or evolutionary selection), or must trade off among user-dependent reward criteria. The term encompasses both finite-horizon Markov decision process (MDP) formulations addressed with vector-valued linear programming and modern scalable policy optimization of LLMs using vector-valued rewards and stochastic scalarization mechanisms (Mifrani et al., 19 Feb 2025, Bahlous-Boldi et al., 21 May 2026).

1. Vector-Valued MDPs and Polyhedral Foundations

The prototypical vector-valued MDP (vMDP) framework defines a model with

state space $S = \{1,\ldots,N\}$,
at each state $s\in S$ a finite action set $A_s$,
finite time-horizon $H$,
time- and action-dependent vector rewards $r_t(s,a) \in \mathbb{R}^d$ for $t=1,\ldots,H-1$ as well as terminal rewards $r_H(s)\in\mathbb{R}^d$,
time-dependent transitions $P_t(s'|s,a)$.

A policy $\pi$ specifies (possibly randomized) decisions at each epoch, and the core objective is to maximize the expected total vector reward in the Pareto sense:

$$\text{V-max}_{\pi} \; v^{\pi}$$

where $v^{\pi} = \mathbb{E}^{\pi}\left[\sum_{t=1}^{H-1} r_t(s_t, a_t) + r_H(s_H)\right] \in \mathbb{R}^d$.

The feasible set of achievable vectors coincides with the image of a polyhedron $X$ (state-action frequency constraints) under a linear map $x \mapsto Rx$ representing stacked reward vectors. Pareto-efficient policies correspond precisely to efficient (i.e., undominated) points of $X$, which in the regular case correspond one-to-one with the vertices of $X$ representing deterministic policies (Mifrani et al., 19 Feb 2025).
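The notion of an undominated point can be made concrete on a finite set of vector returns. The following is a minimal sketch under a maximization convention; the function names `dominates` and `pareto_filter` and the sample values are illustrative assumptions, not taken from the cited work.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]

def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u is at least as good as v in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_filter(points: List[Vector]) -> List[Vector]:
    """Keep the points that no other point dominates (simple quadratic-time scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical expected vector returns of five candidate policies (d = 2).
returns = [(1.0, 4.0), (2.0, 3.0), (2.0, 2.0), (3.0, 1.0), (0.5, 0.5)]
efficient = pareto_filter(returns)
print(efficient)
```

Here `(2.0, 2.0)` is dropped because `(2.0, 3.0)` is at least as good in both objectives and strictly better in the second. Note that this filter compares only the listed points; it does not account for randomized mixtures of policies.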

2. Vector Linear Programming and Enumeration

Vector Policy Optimization in this polyhedral context is cast as a Vector Linear Program (VLP):

$$\text{V-max} \; Rx \quad \text{subject to} \quad x \in X = \{x \geq 0 : Ax = b\}$$

where $x$ encodes state-action frequencies over the horizon and the constraints $Ax = b$ encode probability mass conservation and transitions. Scalarizations via $\max_{x \in X} \lambda^{\top} R x$ for $\lambda \in \mathbb{R}^d$, $\lambda > 0$ identify Pareto-efficient solutions; efficient deterministic policies are precisely those $x$ lying at vertices of $X$ that are themselves Pareto-optimal (Mifrani et al., 19 Feb 2025).
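As a rough illustration of weighted-sum scalarization, the sketch below solves a tiny finite-horizon vector-valued MDP by backward induction on the scalar reward obtained from a strictly positive weight vector. It recovers one Pareto-efficient deterministic policy per weight vector and is not the enumeration procedure of the cited work. The function `solve_scalarized`, the nested-list data layout, the 0-based time index, and the two-state example are all assumptions made for illustration.

```python
from typing import Sequence

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def solve_scalarized(lam, P, r, r_H):
    """Backward induction that maximizes lam . (total vector reward).

    P[t][s][a] is a list of next-state probabilities, r[t][s][a] a reward
    vector, r_H[s] a terminal reward vector; t indexes decision epochs (0-based).
    Returns (policy, value): policy[t][s] is the chosen action and value[s]
    is the expected total vector reward from state s at the first epoch.
    """
    n_states, d = len(r_H), len(lam)
    value = [list(r_H[s]) for s in range(n_states)]  # vector value-to-go at the horizon
    policy = []
    for t in reversed(range(len(r))):
        new_value, decisions = [], []
        for s in range(n_states):
            best_a, best_q = None, None
            for a in range(len(r[t][s])):
                # Vector Q-value: immediate reward plus expected vector value-to-go.
                q = [r[t][s][a][i]
                     + sum(P[t][s][a][s2] * value[s2][i] for s2 in range(n_states))
                     for i in range(d)]
                if best_q is None or dot(lam, q) > dot(lam, best_q):
                    best_a, best_q = a, q
            decisions.append(best_a)
            new_value.append(best_q)
        value = new_value
        policy.insert(0, decisions)
    return policy, value

# Hypothetical two-state, two-objective example with two decision epochs (H = 3).
P_step = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 switches
          [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 switches
r_step = [[(1.0, 0.0), (0.0, 1.0)],   # action 0 pays objective 1, action 1 pays objective 2
          [(1.0, 0.0), (0.0, 1.0)]]
P, r, r_H = [P_step, P_step], [r_step, r_step], [(0.0, 0.0), (0.0, 0.0)]

policy_a, value_a = solve_scalarized((0.9, 0.1), P, r, r_H)
policy_b, value_b = solve_scalarized((0.1, 0.9), P, r, r_H)
print(policy_a, value_a[0])
print(policy_b, value_b[0])
```

Changing the weights moves the solution between different efficient vertices: weights favoring the first objective select action 0 everywhere, while weights favoring the second select action 1 everywhere.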

An explicit adjacency-based enumeration algorithm deploys BFS pivots and an Evans–Steuer efficiency test to compute all efficient deterministic policies. Complexity is exponential in the number of state-action products, but in practice adjacency structure and pruning yield scalable enumeration for high-dimensional problems (e.g., system design problems with multiple components per state) (Mifrani et al., 19 Feb 2025).

3. Stochastic Scalarization in Policy Optimization

Modern RL and LLM RLHF settings typically lack access to tabular MDP structure and instead operate in high-dimensional spaces (e.g., autoregressive generative models). The contemporary VPO approach (Bahlous-Boldi et al., 21 May 2026) conceptualizes the vector-reward RL objective as follows:

The policy $\pi_\theta$ is trained to emit sets $Y$ of diverse outputs (e.g., completions, solutions).
Rollouts are evaluated via random scalarization: for each rollout, sample a weight vector $\lambda$ over the simplex $\Delta^{d-1}$ and reward the best candidate in $Y$ under $\lambda$.
Formally, the VPO objective is

$$J(\theta) = \mathbb{E}_{Y \sim \pi_\theta,\; \lambda \sim \Delta^{d-1}} \left[ \max_{y \in Y} \lambda^{\top} r(y) \right]$$

where $r(y) \in \mathbb{R}^d$ denotes the vector reward of output $y$.

Iteratively maximizing this objective via policy gradients encourages the policy to generate outputs diversified along the Pareto frontier, rather than collapsing support onto one high-scoring region under a fixed $\lambda$.
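The effect of random scalarization on a set-level reward can be sketched with a small Monte Carlo estimate. This is an illustration only: the uniform distribution over the simplex, the helper names (`sample_simplex`, `set_reward`, `estimate_objective`), and the reward vectors are assumptions, since the text above only states that weights are sampled over the simplex.

```python
import random
from typing import List, Sequence

def sample_simplex(d: int, rng: random.Random) -> List[float]:
    """Draw weights uniformly from the (d-1)-simplex via normalized exponentials."""
    e = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(e)
    return [x / total for x in e]

def set_reward(rewards: Sequence[Sequence[float]], weights: Sequence[float]) -> float:
    """Scalarized reward of the best candidate in the set under the given weights."""
    return max(sum(w * r for w, r in zip(weights, vec)) for vec in rewards)

def estimate_objective(rewards: Sequence[Sequence[float]],
                       n_samples: int = 2000, seed: int = 0) -> float:
    """Average the best-candidate reward over randomly sampled weight vectors."""
    rng = random.Random(seed)
    d = len(rewards[0])
    draws = (set_reward(rewards, sample_simplex(d, rng)) for _ in range(n_samples))
    return sum(draws) / n_samples

# Hypothetical reward vectors for two candidate sets (d = 2).
diverse = [(1.0, 0.0), (0.0, 1.0)]     # each output excels at one objective
collapsed = [(0.6, 0.6), (0.6, 0.6)]   # two copies of the same balanced output

print(estimate_objective(diverse), estimate_objective(collapsed))
```

In this toy case each output in `diverse` has a lower average scalarized reward (0.5) than the balanced output (0.6), yet the diverse set scores higher under the set-level objective (about 0.75 versus 0.6) because some member is always well suited to the sampled weights.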

VPO can be implemented as a direct replacement for the GRPO (Group Relative Policy Optimization) advantage estimator, broadcasting the set-level advantage to all tokens in a rollout. This maintains compatibility with established RLHF optimization pipelines.
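A group-normalized advantage with token-level broadcasting might look roughly as follows. This is a sketch, not the cited implementation: the grouping of rollouts, the use of the population standard deviation, the `eps` constant, and the function names are all assumptions.

```python
from statistics import mean, pstdev
from typing import List

def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale set-level rewards within one group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantages: List[float], token_counts: List[int]) -> List[List[float]]:
    """Assign each rollout's advantage to every one of its tokens."""
    return [[a] * n for a, n in zip(advantages, token_counts)]

# Hypothetical set-level rewards (best candidate under sampled weights) and rollout lengths.
set_rewards = [0.9, 0.5, 0.7, 0.3]
token_counts = [3, 2, 4, 1]

adv = group_normalized_advantages(set_rewards)
per_token = broadcast_to_tokens(adv, token_counts)
print(per_token)
```

Because only the advantage computation changes, the surrounding policy-gradient loss can stay as it is in a GRPO-style pipeline.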

4. Diversity, Test-Time Search, and Downstream Integration

A central insight underlying VPO is the recognition that in post-training RL for models deployed under test-time search procedures (e.g., best-of-$k$ selection, evolutionary loops), diversity within the candidate pool becomes as crucial as the mean quality under any one reward combination. Generating diverse, high-quality solutions across the reward vector dimensions allows downstream search processes to extract Pareto-optimal or user-tuned candidates efficiently.

In empirical evaluations spanning constrained navigation, multi-hop QA, function-calling, and program synthesis, VPO-trained models outperform scalar RL baselines in pass@$k$ and best@$k$ metrics, with the gap increasing for larger search budgets $k$. In regimes such as evolutionary search, VPO unlocks solution spaces that are inaccessible to scalar-trained baselines (Bahlous-Boldi et al., 21 May 2026).
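For orientation, the sketch below computes search-budget metrics of this kind. The text does not say which estimators the evaluations use; `pass_at_k` here is the commonly used unbiased combinatorial estimator, and `best_at_k` is one plausible reading (maximum score among the first $k$ samples). Both function names and the example numbers are assumptions.

```python
from math import comb
from typing import Sequence

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_at_k(scores: Sequence[float], k: int) -> float:
    """Highest score among the first k samples."""
    return max(scores[:k])

# Hypothetical evaluation: 10 samples, 3 of them correct.
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
print(best_at_k([0.2, 0.7, 0.4, 0.9], 2), best_at_k([0.2, 0.7, 0.4, 0.9], 4))
```

Both quantities are non-decreasing in $k$, which is why a more diverse candidate pool can widen the gap at larger search budgets.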

5. Extensions: Behavior Control and Steering

Extensions such as Vector-Steered Policy Optimization (VSPO) focus on multi-objective behavioral control via latent steering vectors within the policy network (Zhang et al., 15 May 2026). Here, behavioral dimensions (e.g., verbosity, expertise) are captured by explicit latent directions $v$. During each rollout, a steering coefficient $\alpha$ is applied to $v$ to control the behavior intensity, and rollouts with varying $\alpha$ are aggregated for group-normalized advantage estimation.

In VSPO, the policy undergoes on-policy self-distillation, internalizing behavioral trade-offs along interpretable axes. Theoretical results under a bandit abstraction demonstrate that VSPO can accelerate convergence compared to vanilla reward shaping, provided the steering-induced probability shifts are sufficiently well-aligned with the behavior of interest. Empirical studies show significant gains in controllable attribute expression (e.g., expertise, confidence, robustness) without sacrificing core task accuracy.
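The steering step can be pictured with a toy example. The sketch assumes the common additive form of activation steering, in which a scaled direction is added to a hidden state; the text does not specify how VSPO applies the coefficient, so the function `steer`, the vectors, and the coefficient values are hypothetical.

```python
from typing import List, Sequence

def steer(hidden: Sequence[float], direction: Sequence[float], coeff: float) -> List[float]:
    """Shift a hidden state along a steering direction, scaled by the coefficient."""
    return [h + coeff * v for h, v in zip(hidden, direction)]

def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Component of the hidden state along the (unit-norm) steering direction."""
    return sum(h * v for h, v in zip(hidden, direction))

# Hypothetical hidden state and unit-norm behavioral direction.
hidden = [1.0, 2.0]
direction = [0.0, 1.0]

# One steered copy per coefficient, as if preparing a group of rollouts.
steered = {c: steer(hidden, direction, c) for c in (-1.0, 0.0, 1.0)}
print({c: projection(h, direction) for c, h in steered.items()})
```

Larger coefficients move the hidden state further along the direction, which is the sense in which the coefficient controls behavior intensity.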

6. Theoretical Analysis and Algorithmic Properties

The polyhedral vMDP perspective provides an explicit bijection between Pareto-efficient occupation measures and policy representations. The full polytopal structure gives access to LP duality and sensitivity analysis, and offers a foundation for robust and constrained extensions via standard linear programming tools (Mifrani et al., 19 Feb 2025).

In high-dimensional, black-box domains (e.g., LLM RL), VPO leverages stochastic scalarization and set-level rewards, directly encouraging solution diversity without requiring explicit entropy regularization. Theoretical insights link group-level advantage normalization and intra-group diversity incentives to improved scaling and search efficiency.

7. Limitations and Open Directions

VPO requires an explicit vector decomposition of the reward; benefits vanish in the scalar reward regime. Trade-offs between single-shot accuracy and multi-answer diversity mandate careful rollout and compute budget management.

Promising avenues for advancement include:

Adaptive or biased scalarization distributions for finer Pareto front coverage,
Integration with value-based critics and auxiliary diversity regularizers,
Combination of multiple steering vectors for control across several behavioral axes,
Extension to robust RLHF frameworks with multiple, possibly conflicting, human preference models (Mifrani et al., 19 Feb 2025, Bahlous-Boldi et al., 21 May 2026, Zhang et al., 15 May 2026).

VPO establishes a unified paradigm for RL in vector-valued settings, spanning from classic MDP polyhedral analysis to scalable, diversity-driven post-training of generative models.