- The paper introduces entropy regularization in American option valuation, reformulating the optimal stopping problem via reflected BSDEs with singular generators.
- It establishes well-posedness and provides quantitative convergence rates under various penalization regimes, including limiting cases with logarithmic singularities.
- Numerical experiments confirm that the entropy-regularized scheme and the policy improvement algorithm converge efficiently, with improved stability over classical penalization methods.
Entropy-Regularized Penalization Schemes for American Options and Reflected BSDEs with Singular Generators
Overview
This paper (2602.18078) analyzes entropy-regularized penalization schemes for the valuation of American options via reflected backward stochastic differential equations (RBSDEs), focusing on continuous-time optimal stopping with singular generator structures. The work introduces a probabilistic, model-agnostic approach that leverages entropy regularization both for theoretical regularity and for algorithmic feasibility within reinforcement learning frameworks. It establishes well-posedness and convergence for various penalization regimes, quantifies approximation rates, and elucidates the limiting behavior as the penalization parameters diverge, including the emergence of RBSDEs with logarithmic singularities. Numerical experiments demonstrate practical implementation, and the theoretical analysis expands the literature on singular RBSDEs.
The classical optimal stopping problem for American options, with value process $V_t$, is reframed using randomized stopping via stopping intensities $\gamma$, following Gyöngy and Šiška. The associated BSDE is:
$V_t = P_T - (M_T - M_t) + \esssup_{\gamma \in \Lambda} \int_t^T (P_s - V_s)\gamma_s ds,$
where the optimal $\gamma$ takes degenerate values ($0$ or $\infty$), introducing analytical and numerical difficulties for gradient-based RL methods. The paper proposes truncating $\gamma$ to $[0, n]$ and applies classical penalization:
$V_t^n = P_T - (M_T^n - M_t^n) + \int_t^T n (P_s - V_s^n)^+ \, ds,$
then extends to entropy-regularized controls (with temperature parameter $\lambda$):
$V^{\lambda,n}_t = P_T - (M^{\lambda,n}_T - M^{\lambda,n}_t) + \esssup_{\pi \in \Pi_n} \int_t^T \int_0^n [(P_s - V^{\lambda,n}_s)u \pi_s(u) - \lambda \pi_s(u)\ln \pi_s(u)] du ds,$
where $\pi$ denotes the control distribution over intensities. The entropic term ensures nondegenerate, smooth policies, enabling gradient methods and policy improvement algorithms.
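Maximizing the inner integrand over densities $\pi_s$ on $[0, n]$ gives a Gibbs (softmax) optimizer, $\pi_s^*(u) \propto \exp((P_s - V_s^{\lambda,n})u/\lambda)$, and the supremum equals $\lambda \ln \int_0^n e^{(P_s - V_s^{\lambda,n})u/\lambda}\,du$. A minimal sketch of this closed form (function names and grids are illustrative, not from the paper):

```python
import numpy as np

def gibbs_driver(p_minus_v, lam, n):
    """Value of sup_pi int_0^n [(P-V) u pi(u) - lam pi(u) ln pi(u)] du,
    i.e. lam * ln( int_0^n exp((P-V) u / lam) du ), attained at the
    Gibbs density pi*(u) proportional to exp((P-V) u / lam)."""
    a = p_minus_v / lam
    if abs(a) < 1e-12:                      # near-uniform degenerate case
        return lam * np.log(n)
    return lam * np.log(np.expm1(a * n) / a)

def gibbs_policy(p_minus_v, lam, n, m=1000):
    """Optimal density pi*(u) on a grid over [0, n]: smooth and nondegenerate."""
    us = np.linspace(0.0, n, m)
    logits = p_minus_v * us / lam
    w = np.exp(logits - logits.max())       # numerically stabilized softmax
    return us, w / (w.sum() * (us[1] - us[0]))
```

The smoothness of $\pi^*$ in $V$ is exactly what makes gradient-based policy updates well defined, in contrast to the bang-bang optimal intensities of the unregularized problem.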
Analytical Results: Well-Posedness and Convergence
The entropy-regularized penalization BSDE admits a unique solution for fixed $(\lambda, n)$ under mild assumptions. Key findings:
- As $\lambda \downarrow 0$ (temperature decreases), with the scaling $n = 1/\lambda$, $V^{\lambda,n}$ converges uniformly to the classical American option value $V$ (Theorem 3.1).
- Quantitative convergence rates are established: for bounded payoffs and a Brownian filtration, $|V_t^n - V_t^{\lambda,n}| \leq C(\lambda - \lambda \ln \lambda)$ uniformly in $t$.
- The scheme admits a monotone policy improvement algorithm (PIA), improving the value with each iteration and converging to the entropy-regularized value at an explicitly controlled rate; a toy sketch of the two-step iteration follows this list.
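To make the PIA structure concrete, the following deterministic toy drops the martingale part, so the BSDE reduces to a backward ODE, and alternates Gibbs policy improvement with explicit backward-Euler policy evaluation. The flat payoff and all parameter values are hypothetical, not the paper's experiments:

```python
import numpy as np

lam, n_trunc, T = 0.1, 10.0, 1.0            # temperature, truncation, horizon
payoff = lambda t: 1.0                      # hypothetical flat payoff P(t)
ts = np.linspace(0.0, T, 401); dt = ts[1] - ts[0]
us = np.linspace(0.0, n_trunc, 500); du = us[1] - us[0]

def improve(p_minus_v):
    """Policy improvement: Gibbs density pi(u) proportional to exp((P-V)u/lam)."""
    logits = p_minus_v * us / lam
    w = np.exp(logits - logits.max())
    return w / (w.sum() * du)

def eval_driver(pi, p, v):
    """Policy evaluation driver: int_0^n [(p - v) u pi - lam pi ln pi] du."""
    entropy = -np.sum(pi * np.log(np.maximum(pi, 1e-300))) * du
    return (p - v) * np.sum(us * pi) * du + lam * entropy

V = np.array([payoff(t) for t in ts])       # start from the payoff itself
for it in range(100):
    pis = [improve(payoff(t) - v) for t, v in zip(ts, V)]   # improvement step
    Vn = V.copy(); Vn[-1] = payoff(T)
    for k in range(len(ts) - 2, -1, -1):    # evaluation: backward Euler step
        Vn[k] = Vn[k + 1] + dt * eval_driver(pis[k + 1], payoff(ts[k + 1]), Vn[k + 1])
    if np.max(np.abs(Vn - V)) < 1e-9:
        break
    V = Vn
print(f"PIA iterations: {it + 1}, V_0 = {V[0]:.6f}")
```

Even in this stripped-down setting the entropy bonus lifts the value strictly above the payoff, mirroring the exploration incentive discussed later.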
Limiting Regimes and Singular RBSDEs
The regime $n \to \infty$ (unbounded penalization), with $\lambda$ fixed, yields a limiting RBSDE:
$V_t^\lambda = P_T - \int_t^T dM_s^\lambda + \int_t^T \lambda \ln\!\left(\frac{\lambda}{V_s^\lambda - P_s}\right) ds + (A_T^\lambda - A_t^\lambda),$
subject to Skorokhod reflection conditions. The generator is logarithmically singular as $V^\lambda$ approaches the payoff barrier $P$, a structure not previously analyzed in the literature (contrasting with quadratic and other singular BSDE frameworks). A monotone limit argument yields existence and uniqueness, relying on comparison and boundedness results rather than quadratic-growth or domination techniques.
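The singular generator can be sanity-checked against the penalized one: under the Gibbs closed form above, $\lambda \ln \int_0^n e^{-(v - P)u/\lambda}\,du \to \lambda \ln\big(\lambda/(v - P)\big)$ as $n \to \infty$ whenever $v > P$. A quick numerical check with illustrative values:

```python
import numpy as np

lam, gap = 0.2, 0.05                        # temperature and distance V - P > 0

def f_penalized(gap, n, lam):
    """Closed form of lam * ln( int_0^n exp(-gap * u / lam) du )."""
    a = -gap / lam
    return lam * np.log(np.expm1(a * n) / a)

f_singular = lam * np.log(lam / gap)        # driver of the limiting RBSDE

for n in [1, 10, 100, 1000]:
    print(n, f_penalized(gap, n, lam), f_singular)
```

The penalized driver saturates to the singular one already for moderate $n$, and blows up as the gap $v - P$ shrinks, which is the logarithmic singularity at the barrier.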
Probabilistic and Financial Interpretation
The limiting process $V^\lambda$ corresponds to an American option under endogenous default risk, where the default intensity $\gamma_s^\lambda$ is a function of $V_s^\lambda - P_s$:
$\gamma_s^\lambda = \frac{\lambda \ln\!\left(\frac{\lambda}{V_s^\lambda - P_s}\right)}{P_s + \lambda - V_s^\lambda}.$
Default (exercise with recovery) occurs at a Cox time determined by $\gamma^\lambda$. The value process solves a modified stopping problem:
$V_t^\lambda = \esssup_{\tau_t \in \mathcal{T}_{t,T}} \mathbb{E}[P_{\tau_t} \mathbf{1}_{\{\sigma_t^\lambda > \tau_t\}} + (P_{\sigma_t^\lambda} + \lambda) \mathbf{1}_{\{\sigma_t^\lambda \leq \tau_t\}} | \mathcal{F}_t],$
where $\sigma_t^\lambda$ denotes the Cox default time. As $\lambda \to 0$, the default intensity vanishes and the classical option value is recovered.
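Concretely, a Cox time is the first instant at which the cumulative hazard crosses an independent unit-exponential threshold. A minimal simulation sketch, with a hypothetical grid and intensity path standing in for $\gamma^\lambda$:

```python
import numpy as np

def cox_time(ts, gamma, rng):
    """First t with int_0^t gamma_s ds >= E, E ~ Exp(1); np.inf if none by T."""
    E = rng.exponential(1.0)
    increments = 0.5 * (gamma[1:] + gamma[:-1]) * np.diff(ts)   # trapezoid rule
    hazard = np.concatenate([[0.0], np.cumsum(increments)])
    idx = np.searchsorted(hazard, E)
    return ts[idx] if idx < len(ts) else np.inf

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 1001)
gamma = 0.5 + 0.3 * np.sin(2 * np.pi * ts)   # hypothetical intensity path
print(cox_time(ts, gamma, rng))
```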
Numerical Methodology and Results
For practical computation, the entropy-regularized scheme is implemented via regression-based least-squares Monte Carlo (LSMC) and implicit time stepping for the BSDEs, with the PIA incorporated to accelerate convergence; a bare-bones sketch of the implicit penalization step follows the list below. Numerical experiments on a 2D American max-call option show:
- Both the entropy-regularized BSDE solver and PIA converge efficiently to benchmark binomial tree prices, with improved stability and monotonicity over classical penalization.
- The classical penalization exhibits slower convergence and systematic over-estimation for moderate penalization levels.
- Entropy regularization and PIA are shown to be effective for RL-based option pricing, especially for moderate truncation parameters.
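For concreteness, here is a bare-bones LSMC sketch of the implicit time step for the classical penalization scheme on a 1D American put; the paper's experiments use a 2D max-call and the entropy-regularized driver, and all parameters below are hypothetical. The implicit step $v = c + \Delta t\, n\, (P - v)^+$ admits the closed-form solution used in the loop:

```python
import numpy as np

rng = np.random.default_rng(1)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0   # hypothetical market data
n_pen, steps, paths = 50.0, 50, 50_000              # penalization level n, grids
dt = T / steps

# simulate geometric Brownian motion paths
dW = rng.standard_normal((paths, steps)) * np.sqrt(dt)
logS = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * dW, axis=1)
S = np.hstack([np.full((paths, 1), S0), S0 * np.exp(logS)])

payoff = lambda s: np.maximum(K - s, 0.0)           # American put payoff

V = payoff(S[:, -1])
for k in range(steps - 1, 0, -1):
    # conditional expectation via degree-4 polynomial regression
    A = np.vander(S[:, k] / K, 5)
    coef, *_ = np.linalg.lstsq(A, np.exp(-r * dt) * V, rcond=None)
    cont, P_k = A @ coef, payoff(S[:, k])
    # implicit penalization step: solve v = cont + dt * n * (P - v)^+
    V = np.where(cont >= P_k, cont,
                 (cont + dt * n_pen * P_k) / (1.0 + dt * n_pen))

print(f"penalized LSMC price ~ {np.exp(-r * dt) * V.mean():.4f}")
```

The entropy-regularized variant would replace the piecewise-linear penalization by the Gibbs driver sketched earlier; the implicit-step structure is otherwise unchanged.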
Significance and Implications
The entropy-regularized penalization approach provides a rigorous, model-independent framework for continuous-time optimal stopping, robust to degeneracies of classical optimal controls. The identification and analysis of limiting singular RBSDEs substantially extend the theoretical landscape, offering new insights for risk-sensitive control and pricing under endogenous shocks.
In contrast to prevailing analyses, the paper establishes existence and uniqueness for RBSDEs with logarithmically singular drivers without requiring quadratic growth or domination techniques. The financial interpretation via defaultable options reveals novel connections between regularization, exploration incentives, and barrier reflection.
From a practical standpoint, entropy regularization enables gradient-based RL and policy improvement for optimal stopping, scalable to higher dimensions and generic market specifications. The convergence and stability improvements observed numerically suggest significant benefits in RL-based algorithm design.
Future Directions
- Extension to high-dimensional settings and non-Markovian payoffs using neural approximators and deep BSDE solvers.
- Adaptation of entropy-regularized penalization to multi-barrier and Dynkin game settings, relevant for credit derivatives and real options.
- Analysis of further classes of singular generators in reflected BSDEs, especially with applications to stochastic control and reinforcement learning in finance.
Conclusion
The paper provides a comprehensive analysis of entropy-regularized penalization schemes for American option pricing, both from probabilistic and numerical perspectives. The novel characterization of singular RBSDEs, practical convergence rates, and robust algorithmic implementations establish this framework as a technically viable and theoretically significant approach for optimal stopping and RL applications in financial mathematics.