
Vector Policy Optimization (VPO)

Updated 22 May 2026
  • Vector Policy Optimization (VPO) is a set of reinforcement learning algorithms that optimize diverse, multi-objective policies to approximate the Pareto frontier.
  • It leverages polyhedral analysis and vector linear programming to extract Pareto-efficient, deterministic policies through vertex enumeration.
  • Modern VPO employs stochastic scalarization and diversity-driven techniques to enhance test-time search in RL applications and large language model training.

Vector Policy Optimization (VPO) refers to a family of reinforcement learning (RL) and sequential decision-making algorithms in which policies are explicitly trained to optimize with respect to vector-valued (multi-objective) reward functions. Rather than collapsing multiple objectives into a single scalar via fixed linear weighting, VPO seeks to generate diverse solutions that collectively approximate the Pareto frontier in the reward space. This approach is particularly relevant in settings where candidate policies must support downstream search (e.g., best-of-$k$ or evolutionary selection), or must trade off among user-dependent reward criteria. The term encompasses both finite-horizon Markov decision process (MDP) formulations addressed with vector-valued linear programming and modern scalable policy optimization of LLMs using vector-valued rewards and stochastic scalarization mechanisms (Mifrani et al., 19 Feb 2025, Bahlous-Boldi et al., 21 May 2026).

1. Vector-Valued MDPs and Polyhedral Foundations

The prototypical vector-valued MDP (vMDP) framework defines a model with

  • state space $S = \{1,\ldots,N\}$,
  • at each state $s \in S$, a finite action set $A_s$,
  • finite time horizon $H$,
  • time- and action-dependent vector rewards $r_t(s,a) \in \mathbb{R}^d$ for $t = 1,\ldots,H-1$, as well as terminal rewards $r_H(s) \in \mathbb{R}^d$,
  • time-dependent transitions $P_t(s' \mid s,a)$.

A policy $\pi$ specifies (possibly randomized) decisions at each epoch, and the core objective is to maximize the expected total vector reward in the Pareto sense:

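$$\operatorname{Pareto\text{-}max}_{\pi}\; v^{\pi} = \mathbb{E}^{\pi}\!\left[\sum_{t=1}^{H-1} r_t(s_t, a_t) + r_H(s_H)\right],$$

where $(s_t, a_t)$ denotes the state-action pair visited at epoch $t$ under $\pi$.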

The feasible set of achievable vectors coincides with the image of a polyhedron $X$ (state-action frequency constraints) under a linear map $R$ representing stacked reward vectors. Pareto-efficient policies correspond precisely to efficient (i.e., undominated) points of $X$, which in the regular case correspond one-to-one with the vertices of $X$ representing deterministic policies (Mifrani et al., 19 Feb 2025).
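To make the frequency view concrete, the following minimal Python sketch computes state-action frequencies for a randomized Markov policy with a forward pass and then applies the linear reward map. The toy two-state instance, the initial distribution `mu`, the function names, and the simplification that every state has the same number of actions are assumptions made for illustration; none of them come from the cited paper.

```python
def state_action_frequencies(P, policy, mu):
    """Forward pass: x[t][s][a] = Pr(s_t = s, a_t = a) under a Markov policy.

    P[t][s][a][s2]  : transition probability at decision epoch t
    policy[t][s][a] : probability of choosing action a in state s at epoch t
    mu[s]           : initial state distribution (assumed for this sketch)
    Returns (x, terminal) where terminal[s] = Pr(s_H = s).
    """
    n = len(mu)
    dist, x = list(mu), []
    for t in range(len(policy)):
        x_t = [[dist[s] * policy[t][s][a] for a in range(len(policy[t][s]))]
               for s in range(n)]
        x.append(x_t)
        # Push the probability mass through the transition kernel.
        dist = [sum(x_t[s][a] * P[t][s][a][s2]
                    for s in range(n) for a in range(len(x_t[s])))
                for s2 in range(n)]
    return x, dist


def vector_value(x, terminal, r, r_H):
    """Linear map from frequencies to the expected total vector reward."""
    d = len(r_H[0])
    return [sum(x[t][s][a] * r[t][s][a][i]
                for t in range(len(x))
                for s in range(len(x[t]))
                for a in range(len(x[t][s])))
            + sum(terminal[s] * r_H[s][i] for s in range(len(terminal)))
            for i in range(d)]


# Toy instance: 2 states, actions 0 = stay / 1 = switch, d = 2, two decision epochs.
P_step = [[[1.0, 0.0], [0.0, 1.0]],   # from state 0: stay -> 0, switch -> 1
          [[0.0, 1.0], [1.0, 0.0]]]   # from state 1: stay -> 1, switch -> 0
r_step = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.0, 2.0], [1.0, 1.0]]]
r_H = [[0.0, 0.0], [1.0, 0.0]]
P, r = [P_step, P_step], [r_step, r_step]
mu = [1.0, 0.0]

# Mix "stay" and "switch" 50/50 in state 0 at the first epoch, then always stay.
mixed = [[[0.5, 0.5], [1.0, 0.0]],
         [[1.0, 0.0], [1.0, 0.0]]]
x, terminal = state_action_frequencies(P, mixed, mu)
value = vector_value(x, terminal, r, r_H)
print(value)
```

In this toy instance the mixed policy's value is the average of the values of the two deterministic policies it mixes, (2, 0) and (1, 3), which illustrates why the achievable set is the linear image of a convex polyhedron.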

2. Vector Linear Programming and Enumeration

Vector Policy Optimization in this polyhedral context is cast as a Vector Linear Program (VLP):

$$\operatorname{Pareto\text{-}max}\; Rx \quad \text{subject to} \quad x \in X,$$

where $x$ encodes state-action frequencies over the horizon, $R$ stacks the reward vectors, and the constraints defining the polyhedron $X$ encode probability mass conservation and transitions. Scalarizations via $\lambda^\top R x$ for $\lambda > 0$ (componentwise) identify Pareto-efficient solutions; efficient deterministic policies are precisely those $x$ lying at vertices of $X$ that are themselves Pareto-optimal (Mifrani et al., 19 Feb 2025).
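The scalarization idea can be sketched without an LP solver: for a fixed positive weight vector, backward induction on the weighted reward yields a deterministic policy, and its vector value is Pareto-efficient. The code below is an illustrative sketch under assumptions: the function name `solve_scalarized`, the nested-list layout, and the toy two-state instance are hypothetical, and dynamic programming is used here in place of the vector linear program described above.

```python
def solve_scalarized(P, r, r_H, lam):
    """Backward induction on the scalar reward lam . r_t(s, a).

    P[t][s][a][s2] : transition probability at decision epoch t
    r[t][s][a][i]  : i-th reward component at decision epoch t
    r_H[s][i]      : i-th terminal reward component
    Returns (policy, V): policy[t][s] is a greedy action and V[s] is the
    vector value of that policy starting from state s.
    """
    dot = lambda u: sum(l * x for l, x in zip(lam, u))
    n, d = len(r_H), len(r_H[0])
    V, policy = [list(v) for v in r_H], []
    for t in reversed(range(len(r))):
        V_next, V, actions = V, [], []
        for s in range(n):
            # Vector-valued Q for every action; choose by scalarized value.
            Q = [[r[t][s][a][i]
                  + sum(P[t][s][a][s2] * V_next[s2][i] for s2 in range(n))
                  for i in range(d)]
                 for a in range(len(r[t][s]))]
            best = max(range(len(Q)), key=lambda a: dot(Q[a]))
            actions.append(best)
            V.append(Q[best])
        policy.insert(0, actions)
    return policy, V


# Toy instance: 2 states, actions 0 = stay / 1 = switch, d = 2, two decision epochs.
P_step = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.0, 1.0], [1.0, 0.0]]]
r_step = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.0, 2.0], [1.0, 1.0]]]
r_H = [[0.0, 0.0], [1.0, 0.0]]
P, r = [P_step, P_step], [r_step, r_step]

policy_a, V_a = solve_scalarized(P, r, r_H, [0.8, 0.2])  # favors objective 1
policy_b, V_b = solve_scalarized(P, r, r_H, [0.2, 0.8])  # favors objective 2
print(policy_a, V_a)
print(policy_b, V_b)
```

Sweeping the weight vector over a grid recovers only policies that are optimal for some weighting, so a sweep like this is a way to probe the frontier rather than a substitute for exhaustive vertex enumeration.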

An explicit adjacency-based enumeration algorithm deploys BFS pivots and an Evans–Steuer efficiency test to compute all efficient deterministic policies. Worst-case complexity is exponential in the number of state-action products, but in practice adjacency structure and pruning keep enumeration tractable for high-dimensional problems (e.g., system design with varying numbers of components per state) (Mifrani et al., 19 Feb 2025).
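For intuition about what an efficiency test decides, the sketch below applies a brute-force pairwise dominance filter to a handful of candidate value vectors. It is deliberately not the Evans–Steuer test or the adjacency-based pivoting scheme: it is a quadratic-time stand-in that only compares the listed candidates with one another. The policy names and value vectors are hypothetical.

```python
def dominates(u, v):
    """True if u is at least as good as v in every objective and strictly
    better in at least one (maximization)."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))


def pareto_front(values):
    """Keep the entries of `values` (name -> vector) that no other entry dominates."""
    return {name: v for name, v in values.items()
            if not any(dominates(w, v) for w in values.values())}


# Hypothetical value vectors of five deterministic policies (two objectives).
candidates = {
    "pi_a": (2.0, 1.0),
    "pi_b": (1.0, 3.0),
    "pi_c": (1.0, 1.0),   # dominated by pi_a and pi_b
    "pi_d": (2.0, 0.0),   # dominated by pi_a
    "pi_e": (0.0, 3.0),   # dominated by pi_b
}
front = pareto_front(candidates)
print(sorted(front))
```

A pairwise filter needs every candidate's value up front, which is exactly what becomes infeasible when the number of deterministic policies grows exponentially; adjacency-based methods avoid this by moving between neighboring vertices.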

3. Stochastic Scalarization in Policy Optimization

Modern RL and LLM RLHF settings typically lack access to tabular MDP structure and instead operate in high-dimensional spaces (e.g., autoregressive generative models). The contemporary VPO approach (Bahlous-Boldi et al., 21 May 2026) conceptualizes the vector-reward RL objective as follows:

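  • The policy $\pi_\theta$ is trained to emit sets $Y = \{y_1,\ldots,y_k\}$ of diverse outputs (e.g., completions, solutions).
  • Rollouts are evaluated via random scalarization: for each rollout, sample a weight vector $\lambda \sim p$ over the simplex $\Delta^{d-1}$ and reward the best candidate in $Y$ under $\lambda$.
  • Formally, the VPO objective is

$$J(\theta) = \mathbb{E}_{Y \sim \pi_\theta,\ \lambda \sim p}\left[\max_{y \in Y} \lambda^\top r(y)\right],$$

where $r(y) \in \mathbb{R}^d$ is the vector reward of output $y$.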

Iteratively maximizing this objective via policy gradients encourages the policy to generate outputs diversified along the Pareto frontier, rather than collapsing support onto one high-scoring region under a fixed weight vector $\lambda$.
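A small Monte Carlo experiment illustrates why a best-in-set reward under random scalarization favors diverse sets. The sketch below estimates the expected best scalarized reward for two fixed candidate sets. It assumes weights drawn uniformly from the simplex (the source does not specify the sampling distribution), holds the sets fixed rather than sampling them from a policy, and uses hypothetical function names and reward vectors.

```python
import random


def sample_simplex(d, rng):
    """Uniform draw from the probability simplex, i.e. Dirichlet(1, ..., 1),
    obtained by normalizing independent Exp(1) variables."""
    g = [rng.expovariate(1.0) for _ in range(d)]
    total = sum(g)
    return [x / total for x in g]


def set_reward(reward_vectors, lam):
    """Best scalarized reward among the candidates in one set."""
    return max(sum(l * x for l, x in zip(lam, vec)) for vec in reward_vectors)


def estimate_objective(reward_vectors, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E_lambda[max_y lambda . r(y)] for a fixed set."""
    rng = random.Random(seed)
    d = len(reward_vectors[0])
    total = sum(set_reward(reward_vectors, sample_simplex(d, rng))
                for _ in range(n_samples))
    return total / n_samples


# Two hypothetical candidate sets with two reward dimensions each.
diverse_set = [(1.0, 0.0), (0.0, 1.0)]      # one specialist per objective
collapsed_set = [(0.6, 0.6), (0.6, 0.6)]    # two copies of one compromise

diverse = estimate_objective(diverse_set)
collapsed = estimate_objective(collapsed_set)
print(diverse, collapsed)
```

With two objectives and uniform weights, the diverse set's expected reward is E[max(u, 1 - u)] = 0.75 for u uniform on [0, 1], while the collapsed set earns 0.6 for every weight vector, so the set-level objective ranks the diverse set higher even though the compromise output beats either specialist under equal weights.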

VPO can be implemented as a direct replacement for the GRPO (Group Relative Policy Optimization) advantage estimator, broadcasting the set-level advantage to all tokens in a rollout. This maintains compatibility with established RLHF optimization pipelines.
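The broadcast step can be sketched as follows, assuming each rollout in a group has already been assigned one scalar set-level reward. The normalization mirrors the usual group-relative form (subtract the group mean, divide by the group standard deviation); the use of the population standard deviation, the `eps` constant, and all names are choices of this sketch rather than details confirmed by the source.

```python
from statistics import mean, pstdev


def group_advantages(set_rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu, sigma = mean(set_rewards), pstdev(set_rewards)
    return [(r - mu) / (sigma + eps) for r in set_rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Give every token of a rollout that rollout's advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of four rollouts with their set-level rewards and lengths.
set_rewards = [1.0, 0.0, 1.0, 0.0]
token_counts = [3, 5, 2, 4]

advantages = group_advantages(set_rewards)
token_advantages = broadcast_to_tokens(advantages, token_counts)
print(advantages)
```

Because only the per-rollout scalar changes, the rest of a group-relative training loop (ratio clipping, KL penalty, token-level loss) can in principle stay as it is.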

4. Diversity, Test-Time Search, and Downstream Integration

A central insight underlying VPO is the recognition that in post-training RL for models deployed under test-time search procedures (e.g., best-of-$k$ selection, evolutionary loops), diversity within the candidate pool becomes as crucial as the mean quality under any one reward combination. Generating diverse, high-quality solutions across the reward vector dimensions allows downstream search processes to extract Pareto-optimal or user-tuned candidates efficiently.

In empirical evaluations spanning constrained navigation, multi-hop QA, function-calling, and program synthesis, VPO-trained models outperform scalar RL baselines in pass@$k$ and best@$k$ metrics, with the gap increasing for larger search budgets $k$. In regimes such as evolutionary search, VPO unlocks solution spaces that are inaccessible to scalar-trained baselines (Bahlous-Boldi et al., 21 May 2026).

5. Extensions: Behavior Control and Steering

Extensions such as Vector-Steered Policy Optimization (VSPO) focus on multi-objective behavioral control via latent steering vectors within the policy network (Zhang et al., 15 May 2026). Here, behavioral dimensions (e.g., verbosity, expertise) are captured by explicit latent directions $v$. During each rollout, a steering coefficient $\alpha$ is applied to $v$ to control the behavior intensity, and rollouts with varying $\alpha$ are aggregated for group-normalized advantage estimation.
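One common way to apply a steering coefficient is additive activation steering, where a scaled direction is added to a hidden state. The sketch below assumes that form; the source does not spell out the exact operation used in VSPO, and the function names, the unit normalization, and the example vectors are hypothetical.

```python
def unit(direction):
    """Normalize a direction to unit Euclidean length."""
    norm = sum(x * x for x in direction) ** 0.5
    return [x / norm for x in direction]


def steer(hidden, direction, alpha):
    """Additive steering (assumed form): hidden + alpha * unit(direction)."""
    return [h + alpha * u for h, u in zip(hidden, unit(direction))]


def projection(hidden, direction):
    """Component of the hidden state along the steering direction."""
    return sum(h * u for h, u in zip(hidden, unit(direction)))


hidden = [0.5, -1.0, 2.0]        # hypothetical hidden activation
direction = [3.0, 0.0, 4.0]      # hypothetical behavior direction

# Rollouts in one group use different steering intensities.
alphas = [0.0, 1.0, 2.0]
projections = [projection(steer(hidden, direction, a), direction) for a in alphas]
print(projections)
```

Under this assumed form, the projection onto the direction grows linearly with the coefficient, which is what makes the coefficient usable as an intensity knob across the rollouts of a group.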

In VSPO, the policy undergoes on-policy self-distillation, internalizing behavioral trade-offs along interpretable axes. Theoretical results under a bandit abstraction demonstrate that VSPO can accelerate convergence compared to vanilla reward shaping, provided the steering-induced probability shifts are sufficiently well-aligned with the behavior of interest. Empirical studies show significant gains in controllable attribute expression (e.g., expertise, confidence, robustness) without sacrificing core task accuracy.

6. Theoretical Analysis and Algorithmic Properties

The polyhedral vMDP perspective provides an explicit bijection between Pareto-efficient occupation measures and policy representations. The full polytopal structure gives access to LP duality and sensitivity analysis, and offers a foundation for robust and constrained extensions via standard linear programming tools (Mifrani et al., 19 Feb 2025).

In high-dimensional, black-box domains (e.g., LLM RL), VPO leverages stochastic scalarization and set-level rewards, directly encouraging solution diversity without requiring explicit entropy regularization. Theoretical insights link group-level advantage normalization and intra-group diversity incentives to improved scaling and search efficiency.

7. Limitations and Open Directions

VPO requires an explicit vector decomposition of the reward; benefits vanish in the scalar reward regime. Trade-offs between single-shot accuracy and multi-answer diversity mandate careful rollout and compute budget management.

Promising avenues for advancement include:

VPO establishes a unified paradigm for RL in vector-valued settings, spanning from classic MDP polyhedral analysis to scalable, diversity-driven post-training of generative models.
