
Lexicographic Value Iteration

Updated 21 December 2025
  • Lexicographic Value Iteration is a dynamic programming method that solves multi-objective sequential decision problems by optimizing objectives in strict priority order.
  • The algorithm recursively applies Bellman-style updates with allowable action pruning, ensuring each layer’s decisions respect higher-priority constraints.
  • It guarantees convergence in finite MDPs and supports applications in controller synthesis, multi-objective reinforcement learning, and stochastic games.

A lexicographic value iteration algorithm is a dynamic programming method for solving sequential decision problems in which multiple objectives are ranked in a strict priority order and optimized lexicographically—i.e., the first objective is maximized, then, among all optimal solutions for the first, the second is optimized, and so on. Such algorithms have been developed for diverse settings: Markov decision processes (MDPs), stochastic games, and stochastic shortest-path (SSP) problems, accommodating arbitrary arrangements of reachability, safety, cost, or reward objectives. Applications include the synthesis and verification of controllers under multiple competing requirements, specification-constrained planning, and multi-objective reinforcement learning. The key technical feature is a recursive reduction to a sequence of single-objective Bellman-style updates, each subject to admissibility constraints determined by higher-priority objectives (Skalse et al., 2022, Chatterjee et al., 2020, Zhang et al., 14 Dec 2025).

1. Formal Models and Problem Statement

In the lexicographic value iteration context, the environment is modeled as an MDP or a turn-based stochastic game:

  • State and action space: $S$ (finite), $A$ (finite); for games, the states are partitioned into $S_{\max}$ and $S_{\min}$.
  • Transition function: $P(s' \mid s, a)$ (probabilities), also written $T(s, a, s')$.
  • Objective vector: $(\Phi_1, \Phi_2, \dots, \Phi_n)$, with $\Phi_i$ typically a reachability, safety, or cost/reward function. For cost-based problems, objectives may be sum- or max-aggregated over trajectories (Zhang et al., 14 Dec 2025).
  • Lexicographic ordering: For vectors $x, y \in \mathbb{R}^n$, $x <_{\rm lex} y$ if there exists $k$ such that $x_i = y_i$ for $i < k$ and $x_k < y_k$.
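For intuition, the lexicographic order defined above coincides with Python's built-in tuple comparison, so candidate value vectors can be ranked without scalarization. This is a minimal illustration, not code from the cited papers:

```python
# Lexicographic comparison: tuples compare element by element, so the
# first differing objective decides the order, exactly as in x <_lex y.
x = (3.0, 1.0, 7.0)
y = (3.0, 2.0, 0.0)
assert x < y  # equal on objective 1; x loses on objective 2, so x <_lex y

# max under the lex order picks the vector best on objective 1,
# breaking ties by objective 2, and so on.
candidates = [(3.0, 1.0, 7.0), (3.0, 2.0, 0.0), (2.0, 9.0, 9.0)]
best = max(candidates)  # (3.0, 2.0, 0.0)
```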

The canonical goal is to compute an (approximately) lexicographically optimal policy $\pi^*$ such that, for all policies $\pi$, the expected cumulative objective vector $v^{\pi^*}(s)$ is lex-preferred over $v^\pi(s)$ for every $s \in S$ (Skalse et al., 2022).

2. Lexicographic Bellman Equations and Operators

The central construct in lexicographic value iteration is the nested Bellman update, recursively defined over layers:

  • Allowable action sets: For each state $s$ and layer $i$, recursively define

$\Delta^0(s) = A \qquad \Delta^{i}(s) = \arg\max_{a \in \Delta^{i-1}(s)} Q_i^*(s,a)$

  • Layered Bellman equations: For $i = 1, \dots, n$,

$Q_i^*(s,a) = R_i(s,a) + \gamma_i\,\mathbb{E}_{s'}\left[\max_{a' \in \Delta^{i-1}(s')} Q_i^*(s',a')\right]$

with maximization or minimization taken over admissible actions for layer $i$, as determined by the previous layers (Skalse et al., 2022).

In cost-minimizing contexts or stochastic shortest-path setups, the Bellman equations extend to accommodate sum- and max-aggregation objectives in the recursion, requiring state augmentation for max-objectives to retain the Markov property (Zhang et al., 14 Dec 2025).

3. Algorithmic Structure and Pseudocode

A high-level outline of the lexicographic value iteration procedure is as follows:

  1. Initialization: Set $Q_i^0(s,a) \gets 0$ (or to large values for cost minimization).
  2. Outer iteration: For $k = 0, 1, \dots$
    • Set $\Delta^0_k(s) = A$.
    • For $i = 1, \dots, n$ (objectives in order of priority):
      • For each $(s,a)$, compute $Q_i^{k+1}(s,a)$ using the Bellman update restricted to actions in $\Delta^{i-1}_k(s)$.
      • For each $s$, set $\Delta^i_k(s) = \arg\max_{a \in \Delta^{i-1}_k(s)} Q_i^{k+1}(s,a)$.
    • Terminate when the maximum-norm difference between successive iterates falls below a set tolerance.
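The pseudocode above can be sketched in a few dozen lines for a finite, discounted, reward-maximizing MDP. Everything here (the function name, the dictionary-based MDP encoding, the tie-breaking tolerance of 1e-9) is an illustrative assumption rather than code from the cited papers:

```python
# A minimal sketch of lexicographic value iteration: per sweep, each layer
# runs a Bellman backup restricted to the actions admissible for all
# higher-priority layers, then prunes the admissible set further.

def lex_value_iteration(states, actions, P, R, gammas, tol=1e-8, max_sweeps=10_000):
    """P[s][a]: list of (next_state, prob) pairs; R[i][s][a]: reward for
    objective i (highest priority first); gammas[i]: its discount factor."""
    n = len(R)
    Q = [{s: {a: 0.0 for a in actions} for s in states} for _ in range(n)]
    for _ in range(max_sweeps):
        delta = 0.0
        # Delta^0(s) = A: start each sweep with all actions allowed.
        allowed = {s: set(actions) for s in states}
        for i in range(n):
            new_allowed = {}
            for s in states:
                for a in actions:
                    # Bellman backup with greedy values over admissible actions.
                    ev = sum(p * max(Q[i][s2][a2] for a2 in allowed[s2])
                             for s2, p in P[s][a])
                    q = R[i][s][a] + gammas[i] * ev
                    delta = max(delta, abs(q - Q[i][s][a]))
                    Q[i][s][a] = q
                # Prune: keep only (near-)optimal actions for layer i.
                best = max(Q[i][s][a] for a in allowed[s])
                new_allowed[s] = {a for a in allowed[s] if Q[i][s][a] >= best - 1e-9}
            allowed = new_allowed
        if delta < tol:
            break
    return Q, allowed  # allowed[s] holds the lexicographically optimal actions
```

The updates are applied in place (Gauss-Seidel style), and the small tolerance inside the argmax guards against spurious pruning from floating-point ties.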

For stochastic games with reachability and safety objectives, the algorithm proceeds by sequentially solving single-objective subgames, each time restricting to actions that achieve optimal value for higher-priority objectives. For non-absorbing targets, dynamic programming over all $2^n - 1$ non-empty subsets of objectives is used to stitch together finite-memory strategies (Chatterjee et al., 2020).

For mixed max-sum SSP problems, each max-objective requires augmentation of the state space with the running maximum cost, and Bellman updates are carried out in this expanded space, layer by layer (Zhang et al., 14 Dec 2025).
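The max-objective augmentation described above can be sketched as follows; the interface (transitions and cost as callables, cost charged per state-action pair) is an assumption for illustration:

```python
# State augmentation for a max-aggregated cost: the augmented state (s, m)
# carries the running maximum cost m, so the max-objective becomes Markovian.

def augmented_successors(s, m, a, transitions, cost):
    """Successors of augmented state (s, m) under action a: the MDP moves
    with its original probabilities while m is updated to max(m, cost(s, a))."""
    c = cost(s, a)
    return [((s2, max(m, c)), p) for s2, p in transitions(s, a)]
```

Bellman updates then run over $(s, m)$ pairs, layer by layer; for a finite set of possible costs, the augmented state space remains finite.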

4. Theoretical Guarantees and Complexity

Lexicographic value iteration enjoys the following properties:

  • Convergence: In finite MDPs or games with bounded rewards/costs and discount factors strictly below 1, the algorithm converges to the unique lexicographic optimum, reaching any fixed tolerance in finitely many sweeps. Each Bellman operator in the lexicographic sequence is a contraction or, in finite-horizon formulations, monotone and bounded (Skalse et al., 2022, Zhang et al., 14 Dec 2025).
  • Memory requirements: For stochastic games, a strategy composed via the algorithm requires at most $2^n - 1$ memory classes, i.e., $n$ bits, which is tight (Chatterjee et al., 2020).
  • Decision complexity: For a constant number $n$ of objectives, lexicographic decision problems are in NP $\cap$ coNP; in general they are PSPACE-hard and solvable in NEXPTIME $\cap$ coNEXPTIME. For MDPs with multiple reachability objectives, the problem is already PSPACE-hard (Chatterjee et al., 2020).
  • Scalability: In dense MDPs, each sweep costs $O(m n^2 |A|)$; for product MDPs arising from LTL specifications, the state space and sweep costs depend on automaton size, discretization level, and horizon (Skalse et al., 2022, Zhang et al., 14 Dec 2025).

5. Illustrative Examples

The following examples concretize lexicographic value iteration:

  • Two-objective MDP: Suppose $S = \{s_1, s_2\}$, $A = \{a_1, a_2\}$, rewards $R(s_1, a_1) = (2, 1)$, $R(s_1, a_2) = (1, 10)$, and discount factors of 0.9. The algorithm locks in $a_1$ at $s_1$ for objective 1 and only consults objective 2 among actions remaining optimal for objective 1, avoiding scalarization (Skalse et al., 2022).
  • Gridworld with bottleneck and sum costs: For a $5 \times 5$ grid with high-cost “bottleneck” cells, the algorithm finds policies that avoid catastrophic costs under a max-objective and subsequently minimize summed path length over detours, in contrast to standard value iteration, which may accept large individual costs to minimize mean cost (Zhang et al., 14 Dec 2025).
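The two-objective example can be worked through numerically. Since the transition structure is not specified here, a self-loop at $s_1$ is assumed purely for illustration:

```python
# Rewards at s1: objective-1 reward first, objective-2 reward second
# (assumed self-loop dynamics, not from the cited paper).
R = {"a1": (2.0, 1.0), "a2": (1.0, 10.0)}
gamma = 0.9

# Objective 1: under a self-loop, V1 = max_a r1(a) + gamma * V1.
V1 = max(r[0] for r in R.values()) / (1 - gamma)      # 20.0
Q1 = {a: r[0] + gamma * V1 for a, r in R.items()}     # a1: 20.0, a2: 19.0

# Prune: only actions optimal for objective 1 remain admissible.
best = max(Q1.values())
allowed = {a for a in Q1 if Q1[a] >= best - 1e-9}     # {'a1'}

# Objective 2 consults only admissible actions, so a2's reward of 10
# never enters the computation -- no scalarization trade-off occurs.
V2 = max(R[a][1] for a in allowed) / (1 - gamma)      # 10.0
```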

6. Integration with Temporal Logic and Controller Synthesis

Lexicographic value iteration extends to rich specification domains:

  • LTL/LTLf Specifications: Automata for linear temporal logic (LTL) specifications are composed with the base MDP to form a product MDP; lexicographic value iteration is then deployed on this structure. This synthesizes policies guaranteed to satisfy complex temporal constraints in prioritized fashion (Zhang et al., 14 Dec 2025).
  • Stochastic games for verification/synthesis: In controller synthesis and verification, lexicographic reach-safety objectives directly encode multi-tiered requirements (e.g., safety-critical followed by performance), with guarantees of determinacy and memory bounds (Chatterjee et al., 2020).
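A minimal sketch of the product construction, assuming a deterministic automaton with transition function `delta` and a state-labeling function `label` (all names here are illustrative):

```python
# Product MDP step: pair each MDP state with an automaton state; the MDP
# moves stochastically while the automaton advances deterministically on
# the label of the successor state.

def product_successors(s, q, a, transitions, label, delta):
    """Successors of product state (s, q) under action a."""
    return [((s2, delta(q, label(s2))), p) for s2, p in transitions(s, a)]
```

Reaching accepting product states then encodes satisfaction of the temporal specification, and lexicographic value iteration runs unchanged on the product state space.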

7. Implementation and Experiments

Prototype implementations are reported using both explicit dynamic programming and integration into model-checking frameworks. For instance, Chatterjee et al. implemented the approach in PRISM-games, using standard value iteration as a single-objective engine. Experimental results on structured benchmarks with up to $1.4 \times 10^6$ states report small constant-factor overheads over single-objective runs. For mixed max-sum SSP with LTL constraints, careful state indexing, discretization, and horizon selection are recommended, with explicit notes on parallelization and numerical stability (Chatterjee et al., 2020, Zhang et al., 14 Dec 2025). In practice, the action pruning induced by higher-priority objectives reduces the effective search space.

| Paper Title | Setting | Notable Features/Advances |
|---|---|---|
| Stochastic Games with Lexicographic Reachability-Safety Objectives (Chatterjee et al., 2020) | Stochastic games | Lexicographic reach/safety priorities; memory and complexity bounds |
| Lexicographic Multi-Objective RL (Skalse et al., 2022) | Multi-objective RL, MDPs | RL algorithms; convergence |
| Lexicographic Multi-Objective SSP w/ Mixed Max-Sum Costs (Zhang et al., 14 Dec 2025) | SSP, LTL constraints | SSP with bottleneck costs; product MDPs |

This table juxtaposes major lexicographic VI research papers, their domains, and core algorithmic contributions.
