
DeepOS: Deep Optimal Stopping Algorithm

Updated 6 October 2025
  • DeepOS is a deep learning algorithm that learns data-driven stopping rules from simulated sample paths, addressing discrete and continuous optimal stopping problems.
  • It employs a sequence of neural networks to make binary stop-or-continue decisions, enabling efficient backward induction without nested simulations.
  • DeepOS provides tight lower and dual upper bound estimates for optimal stopping values, demonstrating effectiveness in high-dimensional and non-Markovian applications like Bermudan options.

The Deep Optimal Stopping (DeepOS) algorithm refers to a class of deep learning methodologies that address discrete- or continuous-time optimal stopping problems by directly learning stopping rules from simulated sample paths. The central goal is to maximize (or minimize) the expected reward by determining a data-driven stopping time, usually in high or very high-dimensional settings. DeepOS unifies the representation of stopping rules, neural network function approximation, Monte Carlo simulation, and stochastic optimization into a practical algorithmic framework that generalizes across Markovian, non-Markovian, and path-dependent scenarios, as exemplified in applications such as Bermudan max-call options, callable multi-barrier convertibles, and optimal stopping of fractional Brownian motion (Becker et al., 2018).

1. Mathematical Framing and Neural Representation of Stopping Times

The DeepOS method solves the problem

V_0 = \sup_{\tau \in \mathcal{T}} \mathbb{E}\left[ g(\tau, X_\tau) \right]

where X = (X_n)_{n=0}^N \subset \mathbb{R}^d is a (potentially high-dimensional) stochastic process, \mathcal{T} is the set of admissible discrete stopping times, and g(n, X_n) is the reward for stopping at time n with process state X_n. The key insight is to restructure any stopping time \tau as a sequence of Markovian binary decisions: \tau = \sum_{n=1}^{N} n\, f_n(X_n) \prod_{j=0}^{n-1} (1 - f_j(X_j)), where each f_n : \mathbb{R}^d \to \{0,1\} corresponds to the decision to stop (1) or continue (0) at time n.
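
As a sanity check, the decomposition above can be evaluated on a simulated sequence of binary decisions. The following minimal NumPy sketch (the function name is ours, not the paper's) recovers \tau as the first index at which f_n = 1, with forced exercise at the final date:

```python
import numpy as np

def stopping_time(decisions):
    """Convert binary stop/continue decisions f_0(X_0), ..., f_N(X_N) into
    tau = sum_n n * f_n(X_n) * prod_{j<n} (1 - f_j(X_j)).
    The product zeroes out every term after the first f_n = 1, so tau is
    simply the first index with a stop decision."""
    d = np.array(decisions, dtype=int)   # copy so we can enforce f_N = 1
    d[-1] = 1                            # forced exercise at maturity N
    return int(np.argmax(d))             # first index where d == 1
```

For example, `stopping_time([0, 0, 0, 1, 0])` returns 3: the path continues through times 0–2 and first stops at time 3.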

In DeepOS, each f_n is approximated via a neural network; in the canonical architecture:

  • Each f_n is implemented as a deep feedforward network with affine transformations, ReLU activations in the hidden layers, and a logistic (sigmoid) output: F^\theta(x) = \psi \circ a_I^\theta \circ \varphi_{q_{I-1}} \circ a_{I-1}^\theta \circ \cdots \circ \varphi_{q_1} \circ a_1^\theta(x), where a_i^\theta(x) = A_i x + b_i, \varphi_{q_i} is applied componentwise, and \psi(z) = \frac{1}{1 + e^{-z}}.
  • For hard 0–1 decisions, f^\theta(x) = 1_{[0,\infty)}(a_I^\theta \circ \varphi_{q_{I-1}} \circ \cdots \circ a_1^\theta(x)).

The parameter set \theta = (\theta_0, \ldots, \theta_{N-1}) for all times is optimized via stochastic gradient ascent over mini-batches of simulated paths.
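
A minimal NumPy sketch of this architecture follows; the layer sizes and the Xavier-style initialization are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(dims):
    """Xavier-style initialization for a stack of affine layers a_i(x) = A_i x + b_i."""
    return [(rng.normal(0.0, np.sqrt(2.0 / (m + n)), size=(n, m)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def F_theta(x, params):
    """Soft stopping probability: affine maps with ReLU hidden activations
    and a logistic output psi(z) = 1 / (1 + exp(-z))."""
    h = x
    for A, b in params[:-1]:
        h = np.maximum(A @ h + b, 0.0)   # ReLU hidden layers
    A, b = params[-1]
    z = (A @ h + b).item()               # scalar pre-activation
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid output in (0, 1)

def f_theta(x, params):
    """Hard 0/1 decision: indicator that the pre-sigmoid output is >= 0,
    equivalent to thresholding F_theta at 1/2."""
    return int(F_theta(x, params) >= 0.5)
```

With `dims = [d, 16, 16, 1]` this gives I = 3 affine maps and two ReLU layers, matching the composition structure above.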

2. Training Procedure and Backward Optimization

The optimization objective at time n is to maximize \mathbb{E}\left[ g(n, X_n)\, F^\theta(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - F^\theta(X_n)) \right]. For each path at time n there are two scenarios: immediate stopping, or continuation, with the continuation value approximated by the future reward along the recursively determined stopping time \tau_{n+1} (built from the networks for later steps, which have already been trained in the backward pass). This backward induction enables direct training of the stopping policy without nested simulation or a parametric approximation of continuation values.
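
The backward recursion can be sketched as follows. For simplicity this version propagates the soft stopping probabilities everywhere, whereas the paper uses the hard 0–1 decisions to define the realized future reward g(\tau_{n+1}, X_{\tau_{n+1}}); treat it as an illustration of the objective, not a faithful reimplementation:

```python
import numpy as np

def backward_targets(g, stop_prob):
    """Backward pass of the DeepOS objective on simulated paths (sketch).

    g         : array (K, N+1), g[k, n] = reward of path k if stopped at time n
    stop_prob : array (K, N),   soft stopping probabilities F^theta_n(X_n)
                (taken as given here; in training, the network at each time n
                is optimized in turn, from n = N-1 down to n = 0)

    Returns the per-path value g(n, X_n) F(X_n) + G_{n+1} (1 - F(X_n)) rolled
    back to n = 0, where G_{n+1} is the reward realized along the stopping
    rule already fixed for times n+1, ..., N."""
    _, Np1 = g.shape
    G = g[:, -1].copy()                  # forced stop at maturity N
    for n in range(Np1 - 2, -1, -1):
        p = stop_prob[:, n]
        G = g[:, n] * p + G * (1.0 - p)  # mean of G is maximized over theta_n
    return G
```

Setting all probabilities to 1 stops every path immediately (value g[:, 0]); setting them to 0 holds to maturity (value g[:, -1]).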

The training loop combines Monte Carlo simulation of the process X with empirical gradient computation, batch normalization, Xavier initialization, and adaptive optimizers such as Adam to mitigate instability and variance in the stochastic gradients.
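
To make the gradient-ascent step concrete, here is a deliberately simplified version with a single linear-logistic decision function standing in for the deep network, so the gradient of the time-n objective is available in closed form; all names and the learning-rate choice are illustrative:

```python
import numpy as np

def train_step(w, b, x, g_now, g_future, lr=0.1):
    """One gradient-ascent step on the time-n objective
        mean( g_now * F(x) + g_future * (1 - F(x)) )
    for a linear-logistic decision F(x) = sigmoid(w.x + b).

    x        : (K, d) batch of states X_n
    g_now    : (K,)   reward g(n, X_n) if stopping now
    g_future : (K,)   realized reward g(tau_{n+1}, X_{tau_{n+1}}) if continuing
    """
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))              # F^theta(X_n)
    # chain rule: d(objective)/dz = (g_now - g_future) * p * (1 - p)
    dz = (g_now - g_future) * p * (1.0 - p)
    w = w + lr * (x.T @ dz) / len(dz)         # ascent: maximize expected reward
    b = b + lr * dz.mean()
    return w, b
```

Repeating this step over fresh mini-batches drives F toward stopping exactly where the immediate reward exceeds the continuation reward.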

3. Applications and High-Dimensional Use Cases

DeepOS is validated on several classes of canonical optimal stopping problems:

| Application | Model/State Description | Payoff Structure |
| --- | --- | --- |
| Bermudan max-call | d-dimensional Black–Scholes, S^i_t = s^i_0 \exp(\ldots) | g(n, x) = e^{-r t_n} (\max_i x^i - K)^+ |
| Callable MBRC | d+1 state (multi-asset + barrier), complex path dependence | Piecewise by early exercise vs. maturity/barrier breach |
| fBm stopping | fBm with Hurst parameter H, non-Markovian, history embedded: X_n = (W^H_{t_n}, \ldots) | g(x) = x^1 (last fBm entry) as reward |

Notably, DeepOS is effective up to dimension d = 500 for the max-call, with runtimes of ~100–150 seconds for lower and upper bound estimates. Non-Markovian applications (e.g., fBm) involve state-vector augmentation to recast the process as Markovian in a higher-dimensional space.

4. Performance Evaluation: Lower/Upper Bounds and Efficiency

A distinguishing feature is the simultaneous computation of lower and dual upper bounds for the optimal value V_0:

  • The lower bound is the expected reward under the learned stopping policy.
  • The upper bound is computed via a dual representation, requiring nested simulation from candidate states (but not nested inside the value estimation for each sample).
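
The lower bound amounts to a sample mean over fresh simulated paths stopped with the learned rule, reported together with a one-sided confidence margin. A sketch (the 95% normal quantile is our assumption; reporting conventions vary):

```python
import numpy as np

def lower_bound(rewards_at_tau):
    """Monte Carlo lower-bound estimate for V_0 from K_L fresh paths.

    rewards_at_tau : array of realized rewards g(tau_k, X^k_{tau_k}) obtained
                     by stopping each independent path with the learned rule.
    Returns (point estimate, one-sided 95% lower confidence bound)."""
    r = np.asarray(rewards_at_tau, dtype=float)
    mean = r.mean()
    se = r.std(ddof=1) / np.sqrt(r.size)   # standard error of the mean
    z95 = 1.6448536269514722               # one-sided normal 95% quantile
    return mean, mean - z95 * se
```

Because the learned stopping time is feasible but possibly suboptimal, this estimate is biased low, which is exactly what makes it a valid lower bound on V_0.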

Empirical results show the bounds are typically very tight, providing strong numerical certification of (near-)optimality. For example, with d = 100 assets, the lower and upper bounds on the Bermudan max-call differ by less than 0.5%, and the approach remains computationally feasible in dimensions intractable for lattice methods.

5. Simulation and Neural Network Architecture Considerations

Crucial to DeepOS is the simulation of the underlying stochastic process:

  • For Black–Scholes models, asset paths are generated using the standard SDE discretization, with independent or correlated Brownian increments.
  • For non-Markovian (e.g., fBm) processes, simulation employs covariance-based sampling—e.g., via Cholesky decomposition—to generate correlated increments.
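
A covariance-based (Cholesky) fBm sampler can be sketched as follows; the time grid should start strictly after 0 so the covariance matrix is positive definite, and a small jitter term guards against numerical rank deficiency:

```python
import numpy as np

def sample_fbm(H, times, n_paths, rng):
    """Exact sampling of fractional Brownian motion W^H on a time grid via
    Cholesky factorization of its covariance
        Cov(W^H_s, W^H_t) = 0.5 * (s^{2H} + t^{2H} - |t - s|^{2H}).
    times must be strictly positive. Returns shape (n_paths, len(times))."""
    t = np.asarray(times, dtype=float)
    s, u = np.meshgrid(t, t, indexing="ij")
    cov = 0.5 * (s**(2 * H) + u**(2 * H) - np.abs(s - u)**(2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(t.size))  # numerical jitter
    z = rng.standard_normal((n_paths, t.size))            # iid N(0, 1) draws
    return z @ L.T                                        # rows ~ N(0, cov)
```

Cholesky sampling is exact but costs O(n^3) in the grid size; for long grids, circulant-embedding methods are the usual faster alternative.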

Architecturally, neural networks are problem-agnostic and consist of several dense layers (2–4 hidden layers typical, hard sigmoid for outputs). Proposition 2 in (Becker et al., 2018) establishes the universality of this structure: given enough width/depth, arbitrarily accurate approximation of the optimal stopping rule is guaranteed.

Highly accurate training requires large-scale simulation: for instance, K_L = 4\,096\,000 sample paths for lower-bound estimation in high-dimensional test cases.

6. Advantages, Scope, and Limitations

Advantages:

  • Direct learning from simulation, bypassing the need for manually engineered basis functions or explicit backward dynamic programming.
  • Applicability to both maximization and minimization problems, as well as to non-Markovian and high-dimensional settings.
  • Provides both an explicit (neural) optimal stopping rule and certified bounds on V_0.

Limitations and Open Directions:

  • Training requires substantial computational resources in high dimensions, especially when millions of simulated paths are needed for statistical accuracy.
  • The approach relies on the ability to simulate the underlying process efficiently; in problems where simulation is expensive, further algorithmic development is required.
  • Hyperparameter selection (depth, width) and architecture tuning remain manual; automation or meta-learning could be explored.
  • Extensions to more general control tasks and continuous-time formulations (e.g., via recurrent or signature-based architectures) comprise active research areas.

7. Impact and Generalizations

The DeepOS methodology—by combining probabilistic representation of stopping times with neural network parametrization and direct Monte Carlo optimization—has significantly broadened the practical reach of optimal stopping analysis in high-dimensional settings. It offers a framework for general stochastic and financial models, including those with complex, path-dependent or non-Markovian structures, where traditional methods are infeasible (Becker et al., 2018).

Subsequent research has elaborated on the DeepOS paradigm to include combinatorial relaxations, randomized neural networks, signature-based functionals, primal-dual BSDE approaches, and penalization schemes (see e.g. (Peng et al., 18 May 2024, Yang et al., 11 Sep 2024, Gao et al., 2022)). The theoretical underpinnings relating to the polynomial complexity and expressivity of deep neural networks in optimal stopping (see (Gonon, 2022)) further justify scaling the DeepOS approach to very high dimensions.

In summary, DeepOS establishes a scalable, data-driven approach to optimal stopping that is adaptive to high-dimensional, path-dependent, and simulation-based settings, providing both practical algorithms and a template for ongoing methodological innovations in computational stochastic control.
