ICVaR Sparse Sampling in Risk-Averse POMDPs

Updated 29 January 2026
  • ICVaR Sparse Sampling is an online planning algorithm that leverages the Iterated Conditional Value-at-Risk (ICVaR) objective to manage tail risks in partially observable Markov decision processes.
  • It adapts the sparse sampling framework by replacing expectation with empirical CVaR estimators, ensuring robust risk control and finite-time performance guarantees.
  • The method integrates risk-aware exploration strategies into Monte Carlo tree search algorithms, reducing tail cost by 17% to 51% relative to risk-neutral planners on the LaserTag and LightDark benchmarks.

Iterated Conditional Value-at-Risk (ICVaR) Sparse Sampling is an online planning algorithm designed for risk-averse policy construction in partially observable Markov decision processes (POMDPs) using dynamic, tail-focused risk metrics. Unlike standard expectation-based sparse sampling, ICVaR Sparse Sampling targets the ICVaR objective—a time-consistent extension of CVaR—offering finite-time guarantees regardless of action set cardinality. This approach enables robust handling of risk under partial observability and is foundational in extending risk-averse planning to modern Monte Carlo search frameworks (Pariente et al., 28 Jan 2026).

1. Formulation of the ICVaR Objective in POMDPs

Consider a finite-horizon POMDP $M=(X, A, Z, T, O, c, \gamma, b_0)$ with policy $\pi$. To address intractable belief updates, the process is reformulated as a particle-belief MDP $M_p = (\Sigma, A, \tau, \rho, \gamma)$ using $N_p$ weighted particles, where each belief $b_t$ is represented by the empirical measure $\bar{b}_t = \{(x_i, w_i)\}_{i=1}^{N_p}$, and $\tilde{w}_i$ denotes the normalized weights.

The ICVaR action-value function (Eq. 10–11) is defined recursively by

$$Q_{M, t}^{\pi}(b_t, a, \alpha) := c(b_t, a) + \gamma \cdot \text{CVaR}_\alpha^P \big[V_{M, t+1}^\pi(b_{t+1}, \alpha) \mid b_t, a, \pi\big],$$

where

$$\text{CVaR}_\alpha^P\left[V \mid b_t, a\right] := \text{CVaR}_\alpha^{b_{t+1} \sim P(\cdot \mid b_t, a)} [V(b_{t+1}, \alpha)].$$

The value function is $V_{M,t}^{\pi}(b_t, \alpha) = Q_{M,t}^\pi(b_t, \pi(b_t), \alpha)$ with terminal condition $V_{M, t}^{\pi} = 0$ for $t > T$.

For tractable computation, the same recursion is evaluated on the particle-belief MDP:

$$Q_{M_p, t}^{\pi}(\bar{b}_t, a, \alpha) = \rho(\bar{b}_t, a) + \gamma\, \text{CVaR}_\alpha^{M_p}\big[V_{M_p, t+1}^{\pi}(\bar{b}_{t+1}, \alpha) \mid \bar{b}_t, a\big],$$

and $V_{M_p, t}^{\pi}(\bar{b}_t, \alpha) = Q_{M_p, t}^{\pi}(\bar{b}_t, \pi(\bar{b}_t), \alpha)$.

The parameter $\alpha \in (0,1]$ modulates risk: $\alpha = 1$ recovers expectation (risk-neutral), while smaller $\alpha$ increases risk aversion.
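As a small worked illustration of how $\alpha$ shifts the objective, take a successor value $V$ that is equally likely to be 1, 2, 3, or 10. These four numbers are made up for this example and do not come from the paper. Under the cost convention used here, CVaR at level $\alpha$ averages the worst $\alpha$-fraction of outcomes:

```latex
\begin{align*}
\text{CVaR}_{1}[V]    &= \tfrac{1}{4}(1 + 2 + 3 + 10) = 4 && \text{(plain expectation)} \\
\text{CVaR}_{0.5}[V]  &= \tfrac{1}{2}(3 + 10) = 6.5       && \text{(mean of the worst half)} \\
\text{CVaR}_{0.25}[V] &= 10                               && \text{(worst quarter only)}
\end{align*}
```

The rare outcome of 10 barely moves the expectation but dominates the objective once $\alpha$ is small.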

2. ICVaR Sparse Sampling Algorithm

ICVaR Sparse Sampling adapts the depth-$T$ sparse sampling paradigm to the ICVaR objective, changing the recursion to optimize the empirical CVaR tail mean. For each action at each decision node, successor values are aggregated with the empirical estimator $\widehat{C}_\alpha(\cdot)$ (Brown, 2007) rather than the mean. The main steps are as follows:

Function EstimateV*(\bar{b}, t):
    if t ≥ T: return 0
    For each a ∈ A:
        Q̂(a) ← EstimateQ*(\bar{b}, a, t)
    a* ← argmin_a Q̂(a)
    return V̂*(\bar{b}, t) = Q̂(a*)

Function EstimateQ*(\bar{b}, a, t):
    For i = 1…N_b:
        (\bar{b}_i', cost_i) ← GenPF(\bar{b}, a)   # Generate successor belief/cost
        V_i ← EstimateV*(\bar{b}_i', t+1)
    ρ̄ ← (1/N_b) ∑_i cost_i
    return Q̂*(\bar{b}, a, t) = ρ̄ + γ · \widehat{C}_α({V_i}_{i=1}^{N_b})

Here,

$$\widehat{C}_\alpha(\{V_i\}) = \inf_{r \in \mathbb{R}} \left[ r + \frac{1}{\alpha N_b} \sum_{i=1}^{N_b} (V_i - r)^+ \right],$$

where $(\cdot)^+ = \max(\cdot, 0)$. The key distinctions from standard sparse sampling are:

  • Successor aggregation uses the tail mean (empirical CVaR) via $\widehat{C}_\alpha$ instead of the arithmetic mean.
  • Action selection minimizes the CVaR-based $Q$ estimates.
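The recursion and the estimator can be sketched together in Python. This is a minimal illustration, not the authors' implementation: the `Planner` class, the `Belief` placeholder type, and `toy_gen_pf` (a deterministic stand-in for GenPF) are hypothetical names introduced here. The estimator relies on the fact that the objective inside the infimum is convex and piecewise linear in $r$, so it is enough to evaluate it at the sample values.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence, Tuple

Belief = Hashable  # placeholder; a real planner would carry a weighted particle set
GenPF = Callable[[Belief, Hashable], Tuple[Belief, float]]


def empirical_cvar(values: Sequence[float], alpha: float) -> float:
    """Empirical CVaR: inf over r of r + sum((v - r)^+) / (alpha * n)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(values)

    def objective(r: float) -> float:
        return r + sum(max(v - r, 0.0) for v in values) / (alpha * n)

    # Convex, piecewise linear, kinks at the samples: the infimum is
    # attained at one of the sample values.
    return min(objective(r) for r in values)


@dataclass(frozen=True)
class Planner:
    """Hypothetical container for the sparse-sampling parameters."""
    actions: Sequence[Hashable]
    gen_pf: GenPF          # generative model: (belief, action) -> (next belief, cost)
    horizon: int           # T
    n_b: int               # successor beliefs sampled per action node
    alpha: float           # risk level
    gamma: float = 1.0     # discount factor

    def estimate_v(self, belief: Belief, t: int) -> float:
        if t >= self.horizon:
            return 0.0
        # Costs are minimized, so the value is the smallest action value.
        return min(self.estimate_q(belief, a, t) for a in self.actions)

    def estimate_q(self, belief: Belief, action: Hashable, t: int) -> float:
        costs, values = [], []
        for _ in range(self.n_b):
            next_belief, cost = self.gen_pf(belief, action)
            costs.append(cost)
            values.append(self.estimate_v(next_belief, t + 1))
        mean_cost = sum(costs) / self.n_b
        # Successor values are aggregated by empirical CVaR, not the mean.
        return mean_cost + self.gamma * empirical_cvar(values, self.alpha)


def toy_gen_pf(belief: Belief, action: Hashable) -> Tuple[Belief, float]:
    """Toy stand-in for GenPF: belief unchanged, cost equal to the action."""
    return belief, float(action)


planner = Planner(actions=[1.0, 2.0], gen_pf=toy_gen_pf, horizon=2, n_b=3, alpha=0.1)
print(planner.estimate_v(belief=0, t=0))  # prints 2.0 (cost 1 at each of 2 steps)
```

The toy model is deterministic, so the CVaR aggregation has no effect there; with a stochastic `gen_pf`, lowering `alpha` would weight the worst sampled successors more heavily. The number of `gen_pf` calls grows as $(|A| N_b)^T$, which is why the sketch is only practical for very small horizons.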

3. Finite-Time Performance Guarantees

Finite-time error bounds for ICVaR Sparse Sampling ensure the estimated value function remains close to the optimal despite sampling variability and risk aversion. Let $\Delta R = R_{\max} - R_{\min}$. Define

$$T_{\alpha, t} := \sum_{k=0}^{T-t-1} \frac{T-t-k}{\alpha^k}, \quad T'_{\alpha, t} := \sum_{j=0}^{T-t-1} \frac{T-t+1-j}{\alpha^j}.$$

Fix $\delta \in (0,1)$. With $N_b>1$ and $|A|$ actions, at belief $\bar{b}_t$, the following bounds hold with probability at least $1-\delta$.

Upper bound:

$$V_{M_p, t}^*(\bar{b}_t, \alpha) - \widehat{V}_{M_p, t}^*(\bar{b}_t, \alpha) \leq \gamma \Delta R \, T_{\alpha, t} \sqrt{ \frac{5 \ln \left( \frac{3|A|((|A|N_b)^{T-t}-1)}{\delta(|A|N_b-1)} \right)}{\alpha N_b} }$$

Lower bound:

$$V_{M_p, t}^*(\bar{b}_t, \alpha) - \widehat{V}_{M_p, t}^*(\bar{b}_t, \alpha) \geq -\frac{\gamma \Delta R}{\alpha} \sqrt{ \frac{ \ln \left( \frac{|A|((|A|N_b)^{T-t}-1)}{ \delta (|A|N_b-1) } \right) }{ 2 N_b } T'_{\alpha, t} }$$

The proof decomposes error into (I) concentration of empirical CVaR and (II) propagated estimation error using a one-sided subtraction bound, leveraging union bounds across actions and tree depths to obtain explicit dependence on $N_b$, $\alpha$, $T$, and $|A|$.
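To get a feel for how the guarantee scales, the following sketch evaluates $T_{\alpha,t}$ and the width of the upper bound stated above for a few values of $N_b$. The function names and the example parameter values are assumptions chosen for illustration; only the upper bound is coded, and the logarithm is split into sums so that large tree sizes do not overflow.

```python
import math


def t_alpha(alpha: float, horizon: int, t: int) -> float:
    """T_{alpha,t} = sum_{k=0}^{T-t-1} (T - t - k) / alpha**k."""
    return sum((horizon - t - k) / alpha**k for k in range(horizon - t))


def upper_bound_width(n_b: int, n_actions: int, alpha: float, delta: float,
                      horizon: int, t: int,
                      gamma: float = 1.0, delta_r: float = 1.0) -> float:
    """Right-hand side of the upper bound on V* - V_hat*."""
    depth = horizon - t
    if n_b <= 1 or depth < 1:
        raise ValueError("requires N_b > 1 and t < T")
    branching = n_actions * n_b
    # ln( 3|A| ((|A| N_b)^(T-t) - 1) / (delta (|A| N_b - 1)) ), term by term.
    log_term = (math.log(3 * n_actions)
                + math.log(branching**depth - 1)
                - math.log(delta * (branching - 1)))
    return (gamma * delta_r * t_alpha(alpha, horizon, t)
            * math.sqrt(5.0 * log_term / (alpha * n_b)))


# Illustrative parameters (not taken from the paper's experiments).
for n_b in (10, 100, 1000):
    width = upper_bound_width(n_b, n_actions=2, alpha=0.1, delta=0.05,
                              horizon=3, t=0)
    print(f"N_b={n_b:5d}  upper-bound width = {width:.2f}")
```

Because $T_{\alpha,t}$ contains powers of $1/\alpha$, small risk levels inflate the bound quickly, so the required $N_b$ grows as $\alpha$ shrinks or the horizon lengthens.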

4. Exploration Strategy Tailored to ICVaR

For efficient exploration in tree search, ICVaR exploits a UCB-type bonus derived from the risk-sensitive lower confidence bound (Theorem 2 in the original work). At any history $h$ and among expanded actions $C(h)$, action selection employs:

$$a \leftarrow \operatorname*{arg\,min}_{a \in C(h)} \left[ V(ha) - c \cdot \sqrt{ \frac{ \ln \left( \frac{1 - M(h)^{T-t}}{\delta (1-M(h))} \right) }{ \alpha M(ha) } } \right],$$

where $M(ha)$ is the visit count for action node $ha$. Progressive widening remains standard for both actions and observations, but when choosing among children, the above ICVaR-UCB replaces the usual $\sqrt{\ln N(h)/N(ha)}$ mean-based uncertainty bonus. This strategy, denoted "ICvarExploration," is directly adapted from the finite-sample lower bound of the policy evaluation error.
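A rough sketch of this selection rule follows. It assumes the bonus has the form $\sqrt{\ln\big((1 - M(h)^{T-t}) / (\delta(1 - M(h)))\big) / (\alpha M(ha))}$, which is one reading of the formula and mirrors the structure of the lower bound in Section 3. The function names are hypothetical, and trying unvisited children first is a common convention added here, not something the source states.

```python
import math
from typing import Dict, Hashable, Tuple


def icvar_bonus(m_h: int, m_ha: int, depth_left: int,
                alpha: float, delta: float) -> float:
    """Exploration bonus under the assumed form of the ICVaR-UCB term."""
    if depth_left < 1 or m_ha < 1:
        raise ValueError("requires T - t >= 1 and M(ha) >= 1")
    # (1 - M^n) / (1 - M) is the geometric sum 1 + M + ... + M^(n-1),
    # which stays well defined when M(h) = 1.
    geometric_sum = sum(m_h**k for k in range(depth_left))
    return math.sqrt(math.log(geometric_sum / delta) / (alpha * m_ha))


def select_action(children: Dict[Hashable, Tuple[float, int]], m_h: int,
                  depth_left: int, alpha: float, delta: float,
                  c: float) -> Hashable:
    """Pick argmin of V(ha) - c * bonus over expanded children.

    `children` maps each action to (value estimate V(ha), visit count M(ha)).
    """
    for action, (_, visits) in children.items():
        if visits == 0:
            return action  # assumption: visit every expanded child once first

    def score(action: Hashable) -> float:
        value, visits = children[action]
        return value - c * icvar_bonus(m_h, visits, depth_left, alpha, delta)

    return min(children, key=score)


# Hypothetical node: "A" looks cheaper but "B" has been visited far less.
children = {"A": (4.0, 8), "B": (5.0, 2)}
print(select_action(children, m_h=10, depth_left=3, alpha=0.1, delta=0.05, c=1.0))
```

With `c=1.0` the rarely visited child `"B"` wins because its bonus is larger; setting `c=0.0` reduces the rule to a greedy choice of the lowest estimated cost.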

5. Empirical Evaluation in Online Planners

While direct experiments for exhaustive ICVaR Sparse Sampling are not reported due to its exponential complexity (scaling with $|A|^T$), ICVaR is empirically evaluated within ICVaR-POMCPOW and ICVaR-PFT-DPW on standard POMDP benchmarks:

  • LaserTag: Discrete state-action, continuous observation space.
  • LightDark: Continuous state, action, and observation spaces.

Both planners use a per-step budget of 4 seconds, planning horizon $T=10$, risk level $\alpha=0.1$, and confidence $\delta=0.05$. Value estimation in the tree is performed using the policy evaluation algorithm (with $N_b = 5$, evaluation horizon 3). The metric is $\text{ICVaR}_{0.1}$ of total cost (lower is better).

| Method        | LaserTag     | LightDark    |
|---------------|--------------|--------------|
| POMCPOW       | 15.06 ± 0.40 | 25.73 ± 0.96 |
| ICVaR-POMCPOW | 12.47 ± 0.46 | 16.72 ± 0.08 |
| PFT-DPW       | 26.04 ± 0.91 | 37.68 ± 1.68 |
| ICVaR-PFT-DPW | 16.33 ± 0.61 | 18.52 ± 0.23 |

Tail-risk reductions are pronounced: ICVaR-POMCPOW reduces tail cost by 17% (LaserTag) and 35% (LightDark), while ICVaR-PFT-DPW achieves 37% (LaserTag) and 51% (LightDark) lower tail cost compared to risk-neutral planners. This demonstrates the impact of targeting tail risks rather than expected costs in domains with pronounced risk structures.
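The quoted percentages follow directly from the reported means. The short check below recomputes them as relative reductions, using the table values and ignoring the ± terms; the dictionary layout and function name are just for illustration.

```python
# Mean ICVaR_0.1 of total cost from the results table: (baseline, ICVaR variant).
results = {
    ("POMCPOW", "LaserTag"): (15.06, 12.47),
    ("POMCPOW", "LightDark"): (25.73, 16.72),
    ("PFT-DPW", "LaserTag"): (26.04, 16.33),
    ("PFT-DPW", "LightDark"): (37.68, 18.52),
}


def relative_reduction(baseline: float, risk_averse: float) -> float:
    """Percentage by which the risk-averse planner lowers the tail cost."""
    return 100.0 * (baseline - risk_averse) / baseline


for (planner, benchmark), (baseline, risk_averse) in results.items():
    pct = relative_reduction(baseline, risk_averse)
    print(f"ICVaR-{planner} on {benchmark}: {pct:.1f}% lower tail cost")
```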

6. Discussion and Scope of Application

ICVaR Sparse Sampling provides a risk-sensitive alternative to expectation-based sparse sampling by leveraging time-consistent tail measures and enabling policy construction under explicit tail risk constraints. Its finite-sample guarantees depend on the action set size only through a logarithmic term and introduce explicit dependence on the risk-aversion parameter $\alpha$, providing meaningful control for practitioners in safety-critical or risk-aware planning scenarios.

Although direct use in exhaustive tree search is computationally intensive, the methodology enables scalable ICVaR-based planning in online settings through incorporation in progressive-widening–based algorithms such as POMCPOW and PFT-DPW. This framework is thus significant for extending robust, risk-averse planning to practical POMDPs where tail outcomes dictate performance and safety (Pariente et al., 28 Jan 2026).
