Linear Contextual Stochastic Shortest Path
- Linear CSSP is a framework where both transition probabilities and costs depend linearly on exogenous context variables, extending traditional SSP models.
- Key algorithmic advances include the use of optimistic planning, ridge regression-based confidence sets, and surrogate optimization to ensure proper policy termination.
- Theoretical results provide minimax regret bounds and practical insights into balancing model uncertainty with contextual variability in online learning.
A Linear Contextual Stochastic Shortest Path (CSSP) problem generalizes classical stochastic shortest path (SSP) settings by allowing both transitions and costs to depend on exogenous context variables via an unknown linear mapping. The learner’s goal is to minimize cumulative expected loss to an absorbing goal state across sequential episodes, under model uncertainty and online feedback. Recent research rigorously formulates CSSP and provides minimax regret guarantees, efficient algorithms, and theoretical insights into the dynamics between contextual variability, policy learning, and termination reliability (Polikar et al., 16 Nov 2025, Vial et al., 2021, Chen et al., 2021, Min et al., 2021, Hu et al., 2024).
1. Problem Formulation and Model Structure
A linear CSSP instance consists of a finite state space $\mathcal{S}$, action space $\mathcal{A}$, and a distinguished absorbing goal state $g$. At episode $k$, the learner observes a context $x_k \in \mathcal{X}$ (typically the simplex), which adversarially or stochastically parameterizes the underlying MDP:
- Cost function: $c_k(s,a) = \langle x_k, \theta_{s,a} \rangle$, where the $\theta_{s,a}$ are unknown cost embeddings.
- Transition kernel: $P_k(s' \mid s,a) = \langle x_k, \mu_{s,a,s'} \rangle$, with $\sum_{s'} P_k(s' \mid s,a) = 1$ for every $(s,a)$.
- Each episode proceeds until the process hits $g$, incurring cumulative loss.
The objective is to minimize the expected cumulative loss from the starting state under the episode-specific context $x_k$, without prior knowledge of $\theta$ or $\mu$, and under adversarial or stochastic selection of contexts (Polikar et al., 16 Nov 2025, Hu et al., 2024).
A policy is proper if it reaches $g$ with probability one under any feasible context.
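As a concrete illustration, the linear parameterization above can be instantiated in a few lines. The shapes and variable names below are illustrative, not the papers' notation:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 4, 2, 3                    # states, actions, context dimension
g = S - 1                            # absorbing goal state

# Unknown cost embeddings theta[s, a] in R^d and, for each feature j,
# a transition distribution mu[s, a, j, :] over next states.
theta = rng.uniform(0.0, 1.0, size=(S, A, d))
mu = rng.dirichlet(np.ones(S), size=(S, A, d))   # shape (S, A, d, S)

# Make the goal state absorbing and cost-free under every feature.
mu[g] = 0.0
mu[g, :, :, g] = 1.0
theta[g] = 0.0

def instantiate(x):
    """Given a context x on the simplex, return the episode's costs and transitions."""
    c = np.einsum('saj,j->sa', theta, x)             # c_k(s,a) = <x, theta[s,a]>
    P = np.einsum('j,sajt->sat', x, mu)              # P_k(s'|s,a) = <x, mu[s,a,:,s']>
    return c, P

x = rng.dirichlet(np.ones(d))                        # episode-specific context
c, P = instantiate(x)
assert np.allclose(P.sum(axis=-1), 1.0)              # mixtures of simplices are simplices
assert np.isclose(P[g, 0, g], 1.0)                   # goal stays absorbing for every context
```

Because each feature's transition slice is itself a probability distribution and the context lies on the simplex, every context yields a valid MDP automatically.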
2. Algorithmic Frameworks for CSSP
The principal approaches for online learning in linear CSSP involve optimistic planning, ridge regression-based confidence sets, and episodic replanning.
LR-CSSP Algorithm (Polikar et al., 16 Nov 2025):
- Maintains per-$(s,a)$ ridge regression estimates for cost and transition parameters using all historical data across episodes.
- Uses confidence ellipsoids in parameter space, combined with the current context $x_k$, to derive context-specific confidence sets for costs and transitions.
- Performs optimistic extended value iteration (EVI) over the confidence sets to compute a proper policy at each interval.
- Declares a pair $(s,a)$ "known" for context $x_k$ when the associated regression uncertainty is below a threshold, and otherwise replans upon visiting an "unknown" pair.
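A minimal sketch of the ridge-regression machinery behind such context-specific confidence sets, for a single state-action pair; the bonus constant and "known" threshold are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 3, 200, 0.1

theta_star = rng.uniform(0.0, 1.0, size=d)           # true cost embedding for one (s, a)
X = rng.dirichlet(np.ones(d), size=n)                # historical contexts
y = X @ theta_star + 0.05 * rng.standard_normal(n)   # noisy realized costs

V = X.T @ X + lam * np.eye(d)                        # regularized design matrix
theta_hat = np.linalg.solve(V, X.T @ y)              # ridge estimate

def width(x, beta=1.0):
    # Context-specific confidence width: beta * ||x||_{V^{-1}}.
    return beta * np.sqrt(x @ np.linalg.solve(V, x))

x_new = rng.dirichlet(np.ones(d))                    # a fresh context
lcb = max(0.0, float(x_new @ theta_hat) - width(x_new))  # optimistic cost estimate
known = width(x_new) < 0.2                           # "known" test against a threshold
```

The same ellipsoid serves every context: a single parameter estimate generalizes across the continuous context space, with the width $\|x\|_{V^{-1}}$ shrinking as data accumulates in the directions the contexts span.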
Induced Empirical Risk Minimization (IERM) Under Bandit Feedback (Hu et al., 2024):
- Observes only realized loss for each played path/decision, not the full contextual cost vector.
- Forms cross-fitted nuisance estimators (context-cost functions, propensity scores) and optimizes convex surrogate losses (SPO+) for the policy function class.
- Scores can be constructed using the direct method, inverse propensity weighting, or doubly-robust estimation to obtain theoretical fast rates.
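The doubly-robust scores mentioned above can be sketched for a simple discrete-decision case under bandit feedback; the nuisance models, logging policy, and cost functions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 1000, 3                                        # samples, candidate decisions

x = rng.uniform(size=n)                               # scalar context
true_cost = np.stack([0.2 + 0.3 * x, 0.5 * np.ones(n), 0.6 - 0.2 * x], axis=1)
prop = np.full(K, 1.0 / K)                            # logging policy: uniform propensities
a = rng.integers(0, K, size=n)                        # logged decisions
y = true_cost[np.arange(n), a] + 0.05 * rng.standard_normal(n)  # only realized losses observed

# Nuisance estimate: per-decision linear regression of y on x (direct method).
mu_hat = np.zeros((n, K))
for k in range(K):
    mask = a == k
    coef = np.polyfit(x[mask], y[mask], 1)
    mu_hat[:, k] = np.polyval(coef, x)

# Doubly-robust scores: model prediction plus propensity-corrected residual.
psi = mu_hat.copy()
psi[np.arange(n), a] += (y - mu_hat[np.arange(n), a]) / prop[a]

dr_means = psi.mean(axis=0)                           # (nearly) unbiased per-decision costs
```

The correction term vanishes in expectation when either nuisance (outcome model or propensity) is correct, which is the source of the robustness to misspecification.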
Linear Mixture SSP Algorithms (LEVIS/LEVIS+) (Min et al., 2021):
- Utilize Hoeffding or Bernstein-type confidence sets for the unknown transition model, updated at cost-doubling intervals.
- Execute damped extended value iteration (DEVI) to ensure contraction and optimism.
- Bernstein refinement uses transition variance to speed up confidence contraction.
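A simplified sketch of damped value iteration on a known SSP model, shrinking each Bellman backup by a factor rho < 1 to force contraction (the actual DEVI additionally optimizes over the confidence set; the model below is a toy instance):

```python
import numpy as np

S, A = 3, 2
g = S - 1                                             # absorbing goal state
c = np.array([[1.0, 0.5], [0.8, 1.2], [0.0, 0.0]])    # zero cost at the goal
P = np.zeros((S, A, S))
P[0, 0] = [0.6, 0.3, 0.1]
P[0, 1] = [0.2, 0.5, 0.3]
P[1, 0] = [0.1, 0.4, 0.5]
P[1, 1] = [0.3, 0.3, 0.4]
P[g, :, g] = 1.0                                      # goal is absorbing

rho = 0.99                                            # damping factor: forces contraction
V = np.zeros(S)
for _ in range(2000):
    Q = c + rho * np.einsum('sat,t->sa', P, V)        # damped Bellman backup
    V_new = Q.min(axis=1)
    V_new[g] = 0.0                                    # pin the goal value to zero
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmin(axis=1)                             # greedy policy w.r.t. converged values
```

Without damping, value iteration on an SSP need not contract when some policies are improper; the factor rho guarantees geometric convergence at the cost of a small, controllable bias.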
3. Regret Guarantees and Theoretical Results
General Regret Bounds:
- LR-CSSP achieves horizon-free regret for arbitrary, possibly tiny costs. If costs are bounded below by $c_{\min} > 0$, the bound improves further, where $B_\star$ bounds the optimal cumulative cost and $T_\star$ bounds the optimal proper policy's expected hitting time (Polikar et al., 16 Nov 2025).
- For linear mixture SSPs, Bernstein-type LEVIS+ achieves minimax-optimal regret up to poly-logarithmic factors and lower-order terms (Min et al., 2021).
- In bandit-feedback contexts, IERM achieves fast rates proportional to the critical radius of the policy function class and the nuisance estimators. When misspecification is negligible and the nuisances are well estimated, the regret rate improves further under a margin condition (Hu et al., 2024).
Comparison Table of Regret Guarantees
| Algorithm/Class | Regret Bound | Assumptions/Notes |
|---|---|---|
| LR-CSSP (Polikar et al., 16 Nov 2025) | Horizon-free | Arbitrarily small cost, adversarial context |
| LR-CSSP ($c_{\min} > 0$) | Improved under cost lower bound | Cost bounded below by $c_{\min}$ |
| LEVIS (Min et al., 2021) | Sublinear in episodes | Linear mixture, Hoeffding confidence |
| LEVIS+ (Bernstein) (Min et al., 2021) | Minimax up to slack/log factors | Linear mixture, Bernstein confidence |
| IERM (Hu et al., 2024) | Fast rate in critical radius | Offline, bandit feedback, misspecification |
These bounds indicate statistical efficiency in the context/feature dimension $d$ and in problem size, with termination ensured unless improper policies are selected.
4. Proper Policy Class and Termination Guarantees
Unlike in finite-horizon contextual MDPs, in CSSP the challenge extends beyond parameter estimation:
- The episode length is a random variable dependent on transition uncertainty and policy quality.
- Insufficient knowledge may lead to recurrent cycles or even infinite-horizon episodes, violating problem well-posedness.
- Algorithms guarantee properness by replanning upon visiting any state-action pair with insufficient confidence, and by constructing policies only over known $(s,a)$ pairs (Polikar et al., 16 Nov 2025, Min et al., 2021).
- The key model-dependent control parameters for termination are the optimal-cost bound $B_\star$ and the hitting-time bound $T_\star$.
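Properness of a stationary policy under a fixed context can be checked numerically: the expected-hitting-time system T = 1 + P_pi T, restricted to non-goal states, has a finite positive solution exactly when the policy reaches the goal almost surely. A toy check on a hypothetical induced chain:

```python
import numpy as np

S, g = 3, 2                                           # 3 states; state 2 is the goal
P_pi = np.array([[0.3, 0.4, 0.3],                     # transition matrix induced by the
                 [0.1, 0.2, 0.7],                     # policy under one context
                 [0.0, 0.0, 1.0]])                    # goal row is absorbing

Q = P_pi[:g, :g]                                      # sub-chain over non-goal states
# T solves (I - Q) T = 1; the system is nonsingular iff the policy is proper.
T = np.linalg.solve(np.eye(g) - Q, np.ones(g))
assert np.all(T > 0) and np.all(np.isfinite(T))       # finite expected hitting times
```

An improper policy makes (I - Q) singular (or yields negative/divergent solutions), which is exactly the failure mode the replanning rules above are designed to avoid.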
5. Methodological Advances and Practical Insights
Key advances include:
- Construction of context-dependent confidence sets via ridge regression embedding, enabling generalization across arbitrary continuous context spaces (Polikar et al., 16 Nov 2025).
- Surrogate optimization (SPO+) and doubly-robust scores, which provide robust learning even under model misspecification and bandit feedback (Hu et al., 2024). Empirically, convex surrogates and regularized DR estimators outperform plug-in methods, especially in high-dimensional and misspecified settings.
- Exploitation of problem linearity: regret bounds, computational cost, and confidence set updates all scale in the feature dimension $d$ rather than directly in the size of the state-action space (with the exception of count-based modeling for unknown $(s,a)$ pairs).
- Variance-aware confidence sets (Bernstein-type) remove extraneous factors in regret bounds, matching minimax limits (Min et al., 2021).
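The advantage of variance-aware bonuses can be seen numerically: when the outcome variance is far below the squared range, a Bernstein-type bonus shrinks much faster than a Hoeffding-type one. Constants below are illustrative, not the paper's:

```python
import numpy as np

n = np.arange(1, 2001)                                # sample counts
b_range, var, delta = 1.0, 0.01, 0.05                 # outcome range, true variance, failure prob.

# Hoeffding bonus depends only on the range; Bernstein's leading term uses the variance.
hoeffding = b_range * np.sqrt(np.log(1 / delta) / (2 * n))
bernstein = np.sqrt(2 * var * np.log(1 / delta) / n) + b_range * np.log(1 / delta) / (3 * n)
```

With var = 0.01 the Bernstein bonus is several times tighter at n = 2000, which is the mechanism by which variance-aware confidence sets remove extraneous factors from the regret.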
Example empirical results (Hu et al., 2024):
- SPO+ with the direct method attains low relative regret in well-specified settings and degrades only moderately under misspecification, while naive methods that ignore the linear structure incur substantially higher regret.
6. Connections to Related Paradigms
- The CSSP framework generalizes tabular SSP, linear MDPs, and discounted/episodic contextual MDPs (Chen et al., 2021).
- Importantly, the policy termination issue is unique to CSSP: finite-horizon models always terminate after $H$ steps, but CSSP must guarantee properness under continuous context variability.
- Regret-optimal algorithms for CSSP make explicit use of optimism, confidence-driven replanning, and robust estimators, resolving longstanding open questions on termination assurance and statistical rates under adversarial context (Polikar et al., 16 Nov 2025, Vial et al., 2021).
- Current methods are robust to adversarial context selection, and tolerate model misspecification to the extent that the underlying linear structure is preserved.
7. Open Problems and Future Directions
- Removing the remaining slack factors from Bernstein-type algorithms so that minimax rates are matched even for arbitrarily small or zero costs (Min et al., 2021).
- Extending confidence-based approaches to nonlinear function approximation regimes—potentially under low Bellman rank or other generalized features.
- Closing remaining gaps (e.g., residual poly-logarithmic factors) in horizon-free regret bounds with computationally efficient, scalable algorithms (Chen et al., 2021).
- Improving practical integration of variance-aware confidence and surrogate losses for scalable deployment of CSSP learning in large, dynamic, and continuous-context domains.
Linear Contextual Stochastic Shortest Path thus represents an active research area at the interface of reinforcement learning, robust online optimization, and contextual decision processes, with recent work establishing statistical, computational, and termination guarantees under rigorous linearity-based modeling assumptions.