
Lexicographic Value Iteration

Updated 21 December 2025
  • Lexicographic Value Iteration is a dynamic programming method that solves multi-objective sequential decision problems by optimizing objectives in strict priority order.
  • The algorithm recursively applies Bellman-style updates with allowable action pruning, ensuring each layer’s decisions respect higher-priority constraints.
  • It guarantees convergence in finite MDPs and supports applications in controller synthesis, multi-objective reinforcement learning, and stochastic games.

A lexicographic value iteration algorithm is a dynamic programming method for solving sequential decision problems in which multiple objectives are ranked in a strict priority order and optimized lexicographically—i.e., the first objective is maximized, then, among all optimal solutions for the first, the second is optimized, and so on. Such algorithms have been developed for diverse settings: Markov decision processes (MDPs), stochastic games, and stochastic shortest-path (SSP) problems, accommodating arbitrary arrangements of reachability, safety, cost, or reward objectives. Applications include the synthesis and verification of controllers under multiple competing requirements, specification-constrained planning, and multi-objective reinforcement learning. The key technical feature is a recursive reduction to a sequence of single-objective Bellman-style updates, each subject to admissibility constraints determined by higher-priority objectives (Skalse et al., 2022, Chatterjee et al., 2020, Zhang et al., 14 Dec 2025).

1. Formal Models and Problem Statement

In the lexicographic value iteration context, the environment is modeled as an MDP or a turn-based stochastic game:

  • State and action space: $S$ (finite), $A$ (finite); for games, the states are partitioned into $S_{\max}$ and $S_{\min}$.
  • Transition function: $P(s' \mid s, a)$ (probabilities), also written $T(s, a, s')$.
  • Objective vector: $(\Phi_1, \Phi_2, \dots, \Phi_n)$, with $\Phi_i$ typically a reachability, safety, or cost/reward function. For cost-based problems, objectives may be sum- or max-aggregated over trajectories (Zhang et al., 14 Dec 2025).
  • Lexicographic ordering: For vectors $x, y \in \mathbb{R}^n$, $x <_{\rm lex} y$ if there exists $k$ such that $x_i = y_i$ for $i < k$ and $x_k < y_k$.
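For intuition, the lexicographic order defined above coincides with Python's built-in tuple comparison, so candidate value vectors can be ranked without scalarization. This is a minimal illustration, not code from the cited papers:

```python
# Lexicographic comparison: tuples compare element by element, so the
# first differing objective decides the order, exactly as in x <_lex y.
x = (3.0, 1.0, 7.0)
y = (3.0, 2.0, 0.0)
assert x < y  # equal on objective 1; x loses on objective 2, so x <_lex y

# max under the lex order picks the vector best on objective 1,
# breaking ties by objective 2, and so on.
candidates = [(3.0, 1.0, 7.0), (3.0, 2.0, 0.0), (2.0, 9.0, 9.0)]
best = max(candidates)  # (3.0, 2.0, 0.0)
```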

The canonical goal is to compute an (approximately) lexicographically optimal policy $\pi^*$ such that, for all policies $\pi$, the expected cumulative objective vector $v^{\pi^*}(s)$ is lex-preferred over $v^\pi(s)$ for every $s \in S$ (Skalse et al., 2022).

2. Lexicographic Bellman Equations and Operators

The central construct in lexicographic value iteration is the nested Bellman update, recursively defined over layers:

  • Allowable action sets: For each state $s$ and layer $i$, recursively define

$\Delta^0(s) = A \qquad \Delta^{i}(s) = \arg\max_{a \in \Delta^{i-1}(s)} Q_i^*(s,a)$

  • Layered Bellman equations: For $i = 1, \dots, n$,

$Q_i^*(s,a) = R_i(s,a) + \gamma_i\,\mathbb{E}_{s'}\left[\max_{a' \in \Delta^{i-1}(s')} Q_i^*(s',a')\right]$

with maximization or minimization taken over admissible actions for layer $i$, as determined by the previous layers (Skalse et al., 2022).

In cost-minimizing contexts or stochastic shortest-path setups, the Bellman equations extend to accommodate sum- and max-aggregation objectives in the recursion, requiring state augmentation for max-objectives to retain the Markov property (Zhang et al., 14 Dec 2025).

3. Algorithmic Structure and Pseudocode

A high-level outline of the lexicographic value iteration procedure is as follows:

  1. Initialization: Set $Q_i^0(s,a) \gets 0$ (or to large values for cost minimization).
  2. Outer iteration: For $k = 0, 1, \dots$
    • Set $\Delta^0_k(s) = A$.
    • For $i = 1, \dots, n$ (objectives in order of priority):
      • For each $(s,a)$, compute $Q_i^{k+1}(s,a)$ using the Bellman update restricted to actions in $\Delta^{i-1}_k(s)$.
      • For each $s$, set $\Delta^i_k(s) = \arg\max_{a \in \Delta^{i-1}_k(s)} Q_i^{k+1}(s,a)$.
    • Terminate when the maximum-norm difference between successive iterates falls below a set tolerance.
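The pseudocode above can be sketched in a few dozen lines for a finite, discounted, reward-maximizing MDP. Everything here (the function name, the dictionary-based MDP encoding, the tie-breaking tolerance of 1e-9) is an illustrative assumption rather than code from the cited papers:

```python
# A minimal sketch of lexicographic value iteration: per sweep, each layer
# runs a Bellman backup restricted to the actions admissible for all
# higher-priority layers, then prunes the admissible set further.

def lex_value_iteration(states, actions, P, R, gammas, tol=1e-8, max_sweeps=10_000):
    """P[s][a]: list of (next_state, prob) pairs; R[i][s][a]: reward for
    objective i (highest priority first); gammas[i]: its discount factor."""
    n = len(R)
    Q = [{s: {a: 0.0 for a in actions} for s in states} for _ in range(n)]
    for _ in range(max_sweeps):
        delta = 0.0
        # Delta^0(s) = A: start each sweep with all actions allowed.
        allowed = {s: set(actions) for s in states}
        for i in range(n):
            new_allowed = {}
            for s in states:
                for a in actions:
                    # Bellman backup with greedy values over admissible actions.
                    ev = sum(p * max(Q[i][s2][a2] for a2 in allowed[s2])
                             for s2, p in P[s][a])
                    q = R[i][s][a] + gammas[i] * ev
                    delta = max(delta, abs(q - Q[i][s][a]))
                    Q[i][s][a] = q
                # Prune: keep only (near-)optimal actions for layer i.
                best = max(Q[i][s][a] for a in allowed[s])
                new_allowed[s] = {a for a in allowed[s] if Q[i][s][a] >= best - 1e-9}
            allowed = new_allowed
        if delta < tol:
            break
    return Q, allowed  # allowed[s] holds the lexicographically optimal actions
```

The updates are applied in place (Gauss-Seidel style), and the small tolerance inside the argmax guards against spurious pruning from floating-point ties.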

For stochastic games with reachability and safety objectives, the algorithm proceeds by sequentially solving single-objective subgames, each time restricting to actions that achieve optimal value for higher-priority objectives. For non-absorbing targets, dynamic programming over all $2^n - 1$ non-empty subsets of objectives is used to stitch together finite-memory strategies (Chatterjee et al., 2020).

For mixed max-sum SSP problems, each max-objective requires augmentation of the state space with the running maximum cost, and Bellman updates are carried out in this expanded space, layer by layer (Zhang et al., 14 Dec 2025).
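The max-objective augmentation described above can be sketched as follows; the interface (transitions and cost as callables, cost charged per state-action pair) is an assumption for illustration:

```python
# State augmentation for a max-aggregated cost: the augmented state (s, m)
# carries the running maximum cost m, so the max-objective becomes Markovian.

def augmented_successors(s, m, a, transitions, cost):
    """Successors of augmented state (s, m) under action a: the MDP moves
    with its original probabilities while m is updated to max(m, cost(s, a))."""
    c = cost(s, a)
    return [((s2, max(m, c)), p) for s2, p in transitions(s, a)]
```

Bellman updates then run over $(s, m)$ pairs, layer by layer; for a finite set of possible costs, the augmented state space remains finite.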

4. Theoretical Guarantees and Complexity

Lexicographic value iteration enjoys the following properties:

  • Convergence: In finite MDPs or games with bounded rewards/costs and discount factors strictly below 1, the algorithm converges to the unique lexicographic optimum, reaching any fixed tolerance in finitely many sweeps. Each Bellman operator in the lexicographic sequence is a contraction or, in finite-horizon formulations, monotone and bounded (Skalse et al., 2022, Zhang et al., 14 Dec 2025).
  • Memory requirements: For stochastic games, a strategy composed via the algorithm requires at most $2^n - 1$ memory classes, i.e., $n$ bits, which is tight (Chatterjee et al., 2020).
  • Decision complexity: For a constant number $n$ of objectives, lexicographic decision problems are in NP $\cap$ coNP; in general they are PSPACE-hard and solvable in NEXPTIME $\cap$ coNEXPTIME. For MDPs with multiple reachability objectives, the problem is already PSPACE-hard (Chatterjee et al., 2020).
  • Scalability: In dense MDPs, each sweep costs $O(m n^2 |A|)$; for product MDPs arising from LTL specifications, the state space and sweep costs depend on automaton size, discretization level, and horizon (Skalse et al., 2022, Zhang et al., 14 Dec 2025).

5. Illustrative Examples

The following examples concretize lexicographic value iteration:

  • Two-objective MDP: Suppose $S = \{s_1, s_2\}$, $A = \{a_1, a_2\}$, rewards $R(s_1, a_1) = (2, 1)$, $R(s_1, a_2) = (1, 10)$, and discount factors of 0.9. The algorithm locks in $a_1$ at $s_1$ for objective 1 and only consults objective 2 among actions remaining optimal for objective 1, avoiding scalarization (Skalse et al., 2022).
  • Gridworld with bottleneck and sum costs: For a $5 \times 5$ grid with high-cost “bottleneck” cells, the algorithm finds policies that avoid catastrophic costs under a max-objective and subsequently minimize summed path length over detours, in contrast to standard value iteration, which may accept large individual costs to minimize mean cost (Zhang et al., 14 Dec 2025).
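The two-objective example can be worked through numerically. Since the transition structure is not specified here, a self-loop at $s_1$ is assumed purely for illustration:

```python
# Rewards at s1: objective-1 reward first, objective-2 reward second
# (assumed self-loop dynamics, not from the cited paper).
R = {"a1": (2.0, 1.0), "a2": (1.0, 10.0)}
gamma = 0.9

# Objective 1: under a self-loop, V1 = max_a r1(a) + gamma * V1.
V1 = max(r[0] for r in R.values()) / (1 - gamma)      # 20.0
Q1 = {a: r[0] + gamma * V1 for a, r in R.items()}     # a1: 20.0, a2: 19.0

# Prune: only actions optimal for objective 1 remain admissible.
best = max(Q1.values())
allowed = {a for a in Q1 if Q1[a] >= best - 1e-9}     # {'a1'}

# Objective 2 consults only admissible actions, so a2's reward of 10
# never enters the computation -- no scalarization trade-off occurs.
V2 = max(R[a][1] for a in allowed) / (1 - gamma)      # 10.0
```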

6. Integration with Temporal Logic and Controller Synthesis

Lexicographic value iteration extends to rich specification domains:

  • LTL/LTLf Specifications: Automata for linear temporal logic (LTL) specifications are composed with the base MDP to form a product MDP; lexicographic value iteration is then deployed on this structure. This synthesizes policies guaranteed to satisfy complex temporal constraints in prioritized fashion (Zhang et al., 14 Dec 2025).
  • Stochastic games for verification/synthesis: In controller synthesis and verification, lexicographic reach-safety objectives directly encode multi-tiered requirements (e.g., safety-critical followed by performance), with guarantees of determinacy and memory bounds (Chatterjee et al., 2020).
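A minimal sketch of the product construction, assuming a deterministic automaton with transition function `delta` and a state-labeling function `label` (all names here are illustrative):

```python
# Product MDP step: pair each MDP state with an automaton state; the MDP
# moves stochastically while the automaton advances deterministically on
# the label of the successor state.

def product_successors(s, q, a, transitions, label, delta):
    """Successors of product state (s, q) under action a."""
    return [((s2, delta(q, label(s2))), p) for s2, p in transitions(s, a)]
```

Reaching accepting product states then encodes satisfaction of the temporal specification, and lexicographic value iteration runs unchanged on the product state space.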

7. Implementation and Experiments

Prototype implementations are reported using both explicit dynamic programming and integration into model-checking frameworks. For instance, Chatterjee et al. implemented the approach in PRISM-games, using standard value iteration as a single-objective engine. Experimental results on structured benchmarks with up to $1.4 \times 10^6$ states report small constant-factor overheads over single-objective runs. For mixed max-sum SSP with LTL constraints, careful state indexing, discretization, and horizon selection are recommended, with explicit notes on parallelization and numerical stability (Chatterjee et al., 2020, Zhang et al., 14 Dec 2025). In practice, the action pruning induced by higher-priority objectives reduces the effective search space.

| Paper Title | Setting | Notable Features/Advances |
|---|---|---|
| Stochastic Games with Lexicographic Reachability-Safety Objectives (Chatterjee et al., 2020) | Stochastic games | Lexicographic reach/safety priorities; memory and complexity bounds |
| Lexicographic Multi-Objective RL (Skalse et al., 2022) | Multi-objective RL, MDPs | RL algorithms; convergence |
| Lexicographic Multi-Objective SSP w/ Mixed Max-Sum Costs (Zhang et al., 14 Dec 2025) | SSP, LTL constraints | SSP with bottleneck costs; product MDPs |

This table juxtaposes major lexicographic VI research papers, their domains, and core algorithmic contributions.
