
DeepOS: Deep Optimal Stopping Algorithm

Updated 6 October 2025
  • DeepOS is a deep learning algorithm that learns data-driven stopping rules from simulated sample paths, addressing discrete and continuous optimal stopping problems.
  • It employs a sequence of neural networks to make binary stop-or-continue decisions, enabling efficient backward induction without nested simulations.
  • DeepOS provides tight lower and dual upper bound estimates for optimal stopping values, demonstrating effectiveness in high-dimensional and non-Markovian applications like Bermudan options.

The Deep Optimal Stopping (DeepOS) algorithm refers to a class of deep learning methodologies that address discrete- or continuous-time optimal stopping problems by directly learning stopping rules from simulated sample paths. The central goal is to maximize (or minimize) the expected reward by determining a data-driven stopping time, usually in high or very high-dimensional settings. DeepOS unifies the representation of stopping rules, neural network function approximation, Monte Carlo simulation, and stochastic optimization into a practical algorithmic framework that generalizes across Markovian, non-Markovian, and path-dependent scenarios, as exemplified in applications such as Bermudan max-call options, callable multi-barrier convertibles, and optimal stopping of fractional Brownian motion (Becker et al., 2018).

1. Mathematical Framing and Neural Representation of Stopping Times

The DeepOS method solves the problem

V_0 = \sup_{\tau \in \mathcal{T}} \mathbb{E}\left[ g(\tau, X_\tau) \right]

where X = (X_n)_{n=0}^N \subset \mathbb{R}^d is a (potentially high-dimensional) stochastic process, \mathcal{T} is the set of admissible discrete stopping times, and g(n, X_n) is the reward for stopping at time n with process state X_n. The key insight is to restructure any stopping time \tau as a sequence of Markovian binary decisions: \tau = \sum_{n=1}^{N} n\, f_n(X_n) \prod_{j=0}^{n-1} (1 - f_j(X_j)), where each f_n : \mathbb{R}^d \to \{0,1\} corresponds to the decision to stop (1) or continue (0) at time n.
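
As a sanity check, the decomposition above can be evaluated on a simulated sequence of binary decisions. The following minimal NumPy sketch (the function name is ours, not the paper's) recovers \tau as the first index at which f_n = 1, with forced exercise at the final date:

```python
import numpy as np

def stopping_time(decisions):
    """Convert binary stop/continue decisions f_0(X_0), ..., f_N(X_N) into
    tau = sum_n n * f_n(X_n) * prod_{j<n} (1 - f_j(X_j)).
    The product zeroes out every term after the first f_n = 1, so tau is
    simply the first index with a stop decision."""
    d = np.array(decisions, dtype=int)   # copy so we can enforce f_N = 1
    d[-1] = 1                            # forced exercise at maturity N
    return int(np.argmax(d))             # first index where d == 1
```

For example, `stopping_time([0, 0, 0, 1, 0])` returns 3: the path continues through times 0–2 and first stops at time 3.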

In DeepOS, each f_n is approximated via a neural network; in the canonical architecture:

  • Each f_n is implemented as a deep feedforward network with affine transformations, ReLU activations in the hidden layers, and a logistic (sigmoid) output: F^\theta(x) = \psi \circ a_I^\theta \circ \varphi_{q_{I-1}} \circ a_{I-1}^\theta \circ \cdots \circ \varphi_{q_1} \circ a_1^\theta(x), where a_i^\theta(x) = A_i x + b_i, \varphi_{q_i} is applied componentwise, and \psi(z) = \frac{1}{1 + e^{-z}}.
  • For hard 0–1 decisions, f^\theta(x) = 1_{[0,\infty)}(a_I^\theta \circ \varphi_{q_{I-1}} \circ \cdots \circ a_1^\theta(x)).

The parameter set \theta = (\theta_0, \ldots, \theta_{N-1}) for all times is optimized via stochastic gradient ascent over mini-batches of simulated paths.
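
A minimal NumPy sketch of this architecture follows; the layer sizes and the Xavier-style initialization are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(dims):
    """Xavier-style initialization for a stack of affine layers a_i(x) = A_i x + b_i."""
    return [(rng.normal(0.0, np.sqrt(2.0 / (m + n)), size=(n, m)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def F_theta(x, params):
    """Soft stopping probability: affine maps with ReLU hidden activations
    and a logistic output psi(z) = 1 / (1 + exp(-z))."""
    h = x
    for A, b in params[:-1]:
        h = np.maximum(A @ h + b, 0.0)   # ReLU hidden layers
    A, b = params[-1]
    z = (A @ h + b).item()               # scalar pre-activation
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid output in (0, 1)

def f_theta(x, params):
    """Hard 0/1 decision: indicator that the pre-sigmoid output is >= 0,
    equivalent to thresholding F_theta at 1/2."""
    return int(F_theta(x, params) >= 0.5)
```

With `dims = [d, 16, 16, 1]` this gives I = 3 affine maps and two ReLU layers, matching the composition structure above.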

2. Training Procedure and Backward Optimization

The optimization objective at time n is to maximize \mathbb{E}\left[ g(n, X_n)\, F^\theta(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - F^\theta(X_n)) \right]. For each path at time n there are two scenarios: immediate stopping, or continuation, with the continuation value approximated by the future reward along the recursively determined stopping time \tau_{n+1} (built from the networks for later steps, which have already been trained in the backward pass). This backward induction enables direct training of the stopping policy without nested simulation or a parametric approximation of continuation values.
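
The backward recursion can be sketched as follows. For simplicity this version propagates the soft stopping probabilities everywhere, whereas the paper uses the hard 0–1 decisions to define the realized future reward g(\tau_{n+1}, X_{\tau_{n+1}}); treat it as an illustration of the objective, not a faithful reimplementation:

```python
import numpy as np

def backward_targets(g, stop_prob):
    """Backward pass of the DeepOS objective on simulated paths (sketch).

    g         : array (K, N+1), g[k, n] = reward of path k if stopped at time n
    stop_prob : array (K, N),   soft stopping probabilities F^theta_n(X_n)
                (taken as given here; in training, the network at each time n
                is optimized in turn, from n = N-1 down to n = 0)

    Returns the per-path value g(n, X_n) F(X_n) + G_{n+1} (1 - F(X_n)) rolled
    back to n = 0, where G_{n+1} is the reward realized along the stopping
    rule already fixed for times n+1, ..., N."""
    _, Np1 = g.shape
    G = g[:, -1].copy()                  # forced stop at maturity N
    for n in range(Np1 - 2, -1, -1):
        p = stop_prob[:, n]
        G = g[:, n] * p + G * (1.0 - p)  # mean of G is maximized over theta_n
    return G
```

Setting all probabilities to 1 stops every path immediately (value g[:, 0]); setting them to 0 holds to maturity (value g[:, -1]).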

The training loop combines Monte Carlo simulation of the process X with empirical gradient computation, batch normalization, Xavier initialization, and adaptive optimizers such as Adam to mitigate instability and variance in the stochastic gradients.
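
To make the gradient-ascent step concrete, here is a deliberately simplified version with a single linear-logistic decision function standing in for the deep network, so the gradient of the time-n objective is available in closed form; all names and the learning-rate choice are illustrative:

```python
import numpy as np

def train_step(w, b, x, g_now, g_future, lr=0.1):
    """One gradient-ascent step on the time-n objective
        mean( g_now * F(x) + g_future * (1 - F(x)) )
    for a linear-logistic decision F(x) = sigmoid(w.x + b).

    x        : (K, d) batch of states X_n
    g_now    : (K,)   reward g(n, X_n) if stopping now
    g_future : (K,)   realized reward g(tau_{n+1}, X_{tau_{n+1}}) if continuing
    """
    z = x @ w + b
    p = 1.0 / (1.0 + np.exp(-z))              # F^theta(X_n)
    # chain rule: d(objective)/dz = (g_now - g_future) * p * (1 - p)
    dz = (g_now - g_future) * p * (1.0 - p)
    w = w + lr * (x.T @ dz) / len(dz)         # ascent: maximize expected reward
    b = b + lr * dz.mean()
    return w, b
```

Repeating this step over fresh mini-batches drives F toward stopping exactly where the immediate reward exceeds the continuation reward.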

3. Applications and High-Dimensional Use Cases

DeepOS is validated on several classes of canonical optimal stopping problems:

| Application | Model/State Description | Payoff Structure |
| --- | --- | --- |
| Bermudan max-call | d-dimensional Black–Scholes, S^i_t = s^i_0 \exp(\ldots) | g(n, x) = e^{-r t_n} (\max_i x^i - K)^+ |
| Callable MBRC | d+1 state (multi-asset + barrier), complex path dependence | Piecewise by early exercise vs. maturity/barrier breach |
| fBm stopping | fBm with Hurst parameter H, non-Markovian, history embedded: X_n = (W^H_{t_n}, \ldots) | g(x) = x^1 (last fBm entry) as reward |

Notably, DeepOS is effective up to dimension d = 500 for the max-call, with runtimes of ~100–150 seconds for lower and upper bound estimates. Non-Markovian applications (e.g., fBm) involve state-vector augmentation to recast the process as Markovian in a higher-dimensional space.

4. Performance Evaluation: Lower/Upper Bounds and Efficiency

A distinguishing feature is the simultaneous computation of lower and dual upper bounds for the optimal value V_0:

  • The lower bound is the expected reward under the learned stopping policy.
  • The upper bound is computed via a dual representation, requiring nested simulation from candidate states (but not nested inside the value estimation for each sample).
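
The lower bound amounts to a sample mean over fresh simulated paths stopped with the learned rule, reported together with a one-sided confidence margin. A sketch (the 95% normal quantile is our assumption; reporting conventions vary):

```python
import numpy as np

def lower_bound(rewards_at_tau):
    """Monte Carlo lower-bound estimate for V_0 from K_L fresh paths.

    rewards_at_tau : array of realized rewards g(tau_k, X^k_{tau_k}) obtained
                     by stopping each independent path with the learned rule.
    Returns (point estimate, one-sided 95% lower confidence bound)."""
    r = np.asarray(rewards_at_tau, dtype=float)
    mean = r.mean()
    se = r.std(ddof=1) / np.sqrt(r.size)   # standard error of the mean
    z95 = 1.6448536269514722               # one-sided normal 95% quantile
    return mean, mean - z95 * se
```

Because the learned stopping time is feasible but possibly suboptimal, this estimate is biased low, which is exactly what makes it a valid lower bound on V_0.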

Empirical results show the bounds are typically very tight, providing strong numerical certification of (near-)optimality. For example, with d = 100 assets, the lower and upper bounds on the Bermudan max-call differ by less than 0.5%, and the approach remains computationally feasible in dimensions intractable for lattice methods.

5. Simulation and Neural Network Architecture Considerations

Crucial to DeepOS is the simulation of the underlying stochastic process:

  • For Black–Scholes models, asset paths are generated using the standard SDE discretization, with independent or correlated Brownian increments.
  • For non-Markovian (e.g., fBm) processes, simulation employs covariance-based sampling—e.g., via Cholesky decomposition—to generate correlated increments.
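
A covariance-based (Cholesky) fBm sampler can be sketched as follows; the time grid should start strictly after 0 so the covariance matrix is positive definite, and a small jitter term guards against numerical rank deficiency:

```python
import numpy as np

def sample_fbm(H, times, n_paths, rng):
    """Exact sampling of fractional Brownian motion W^H on a time grid via
    Cholesky factorization of its covariance
        Cov(W^H_s, W^H_t) = 0.5 * (s^{2H} + t^{2H} - |t - s|^{2H}).
    times must be strictly positive. Returns shape (n_paths, len(times))."""
    t = np.asarray(times, dtype=float)
    s, u = np.meshgrid(t, t, indexing="ij")
    cov = 0.5 * (s**(2 * H) + u**(2 * H) - np.abs(s - u)**(2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(t.size))  # numerical jitter
    z = rng.standard_normal((n_paths, t.size))            # iid N(0, 1) draws
    return z @ L.T                                        # rows ~ N(0, cov)
```

Cholesky sampling is exact but costs O(n^3) in the grid size; for long grids, circulant-embedding methods are the usual faster alternative.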

Architecturally, neural networks are problem-agnostic and consist of several dense layers (2–4 hidden layers typical, hard sigmoid for outputs). Proposition 2 in (Becker et al., 2018) establishes the universality of this structure: given enough width/depth, arbitrarily accurate approximation of the optimal stopping rule is guaranteed.

Highly accurate training requires large-scale simulation: for instance, K_L = 4\,096\,000 sample paths for lower-bound estimation in high-dimensional test cases.

6. Advantages, Scope, and Limitations

Advantages:

  • Direct learning from simulation, bypassing the need for manually engineered basis functions or explicit backward dynamic programming.
  • Applicability to both maximization and minimization problems, as well as to non-Markovian and high-dimensional settings.
  • Provides both an explicit (neural) optimal stopping rule and certified bounds on V_0.

Limitations and Open Directions:

  • Training requires substantial computational resources in high dimensions, especially when millions of simulated paths are needed for statistical accuracy.
  • The approach relies on the ability to simulate the underlying process efficiently; in problems where simulation is expensive, further algorithmic development is required.
  • Hyperparameter selection (depth, width) and architecture tuning remain manual; automation or meta-learning could be explored.
  • Extensions to more general control tasks and continuous-time formulations (e.g., via recurrent or signature-based architectures) comprise active research areas.

7. Impact and Generalizations

The DeepOS methodology—by combining probabilistic representation of stopping times with neural network parametrization and direct Monte Carlo optimization—has significantly broadened the practical reach of optimal stopping analysis in high-dimensional settings. It offers a framework for general stochastic and financial models, including those with complex, path-dependent or non-Markovian structures, where traditional methods are infeasible (Becker et al., 2018).

Subsequent research has elaborated on the DeepOS paradigm to include combinatorial relaxations, randomized neural networks, signature-based functionals, primal-dual BSDE approaches, and penalization schemes (see e.g. (Peng et al., 18 May 2024, Yang et al., 11 Sep 2024, Gao et al., 2022)). The theoretical underpinnings relating to the polynomial complexity and expressivity of deep neural networks in optimal stopping (see (Gonon, 2022)) further justify scaling the DeepOS approach to very high dimensions.

In summary, DeepOS establishes a scalable, data-driven approach to optimal stopping that is adaptive to high-dimensional, path-dependent, and simulation-based settings, providing both practical algorithms and a template for ongoing methodological innovations in computational stochastic control.
