- The paper introduces entropy regularization in American option valuation, reformulating the optimal stopping problem via reflected BSDEs with singular generators.
- It establishes well-posedness and provides quantitative convergence rates under various penalization regimes, including limiting cases with logarithmic singularities.
- Numerical experiments confirm that the entropy-regularized scheme and the policy improvement algorithm converge efficiently, with improved stability over classical penalization methods.
Entropy-Regularized Penalization Schemes for American Options and Reflected BSDEs with Singular Generators
Overview
This paper (2602.18078) analyzes entropy-regularized penalization schemes for the valuation of American options via reflected backward stochastic differential equations (RBSDEs), focusing on continuous-time optimal stopping with singular generator structures. The work introduces a probabilistic, model-agnostic approach that leverages entropy regularization both for theoretical regularity and for algorithmic feasibility within reinforcement learning frameworks. It establishes well-posedness and convergence for various penalization regimes, quantifies approximation rates, and elucidates the limiting behavior as the penalization parameters diverge, including the emergence of RBSDEs with logarithmic singularities. Numerical experiments demonstrate practical implementation, and the theoretical analysis expands the literature on singular RBSDEs.
The classical optimal stopping problem for American options, with value process $V_t$, is reframed using randomized stopping via stopping intensities $\gamma$, following Gyöngy and Šiška. The associated BSDE is:
$V_t = P_T - (M_T - M_t) + \esssup_{\gamma \in \Lambda} \int_t^T (P_s - V_s)\gamma_s ds,$
where the optimal $\gamma$ takes degenerate values ($0$ or $\infty$), introducing analytical and numerical difficulties for gradient-based RL methods. The paper proposes truncating $\gamma$ to $[0, n]$ and applies classical penalization:
$V_t^n = P_T - (M_T^n - M_t^n) + \int_t^T n (P_s - V_s^n)^+ \, ds,$
then extends to entropy-regularized controls (with temperature parameter $\lambda$):
$V^{\lambda,n}_t = P_T - (M^{\lambda,n}_T - M^{\lambda,n}_t) + \esssup_{\pi \in \Pi_n} \int_t^T \int_0^n [(P_s - V^{\lambda,n}_s)u \pi_s(u) - \lambda \pi_s(u)\ln \pi_s(u)] du ds,$
where $\pi$ denotes the control distribution over intensities. The entropic term ensures nondegenerate, smooth policies, enabling gradient methods and policy improvement algorithms.
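Maximizing the inner integrand over densities $\pi_s$ on $[0, n]$ gives a Gibbs (softmax) optimizer, $\pi_s^*(u) \propto \exp((P_s - V_s^{\lambda,n})u/\lambda)$, and the supremum equals $\lambda \ln \int_0^n e^{(P_s - V_s^{\lambda,n})u/\lambda}\,du$. A minimal sketch of this closed form (function names and grids are illustrative, not from the paper):

```python
import numpy as np

def gibbs_driver(p_minus_v, lam, n):
    """Value of sup_pi int_0^n [(P-V) u pi(u) - lam pi(u) ln pi(u)] du,
    i.e. lam * ln( int_0^n exp((P-V) u / lam) du ), attained at the
    Gibbs density pi*(u) proportional to exp((P-V) u / lam)."""
    a = p_minus_v / lam
    if abs(a) < 1e-12:                      # near-uniform degenerate case
        return lam * np.log(n)
    return lam * np.log(np.expm1(a * n) / a)

def gibbs_policy(p_minus_v, lam, n, m=1000):
    """Optimal density pi*(u) on a grid over [0, n]: smooth and nondegenerate."""
    us = np.linspace(0.0, n, m)
    logits = p_minus_v * us / lam
    w = np.exp(logits - logits.max())       # numerically stabilized softmax
    return us, w / (w.sum() * (us[1] - us[0]))
```

The smoothness of $\pi^*$ in $V$ is exactly what makes gradient-based policy updates well defined, in contrast to the bang-bang optimal intensities of the unregularized problem.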
Analytical Results: Well-Posedness and Convergence
The entropy-regularized penalization BSDE admits a unique solution for fixed $(\lambda, n)$ under mild assumptions. Key findings:
- As $\lambda \downarrow 0$ (temperature decreases), with the scaling $n = 1/\lambda$, $V^{\lambda,n}$ converges uniformly to the classical American option value $V$ (Theorem 3.1).
- Quantitative convergence rates are established: for bounded payoffs and a Brownian filtration, $|V_t^n - V_t^{\lambda,n}| \leq C(\lambda - \lambda \ln \lambda)$ uniformly in $t$.
- The scheme admits a monotone policy improvement algorithm (PIA), improving the value with each iteration and converging to the entropy-regularized value at an explicitly controlled rate; a toy sketch of the two-step iteration follows this list.
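To make the PIA structure concrete, the following deterministic toy drops the martingale part, so the BSDE reduces to a backward ODE, and alternates Gibbs policy improvement with explicit backward-Euler policy evaluation. The flat payoff and all parameter values are hypothetical, not the paper's experiments:

```python
import numpy as np

lam, n_trunc, T = 0.1, 10.0, 1.0            # temperature, truncation, horizon
payoff = lambda t: 1.0                      # hypothetical flat payoff P(t)
ts = np.linspace(0.0, T, 401); dt = ts[1] - ts[0]
us = np.linspace(0.0, n_trunc, 500); du = us[1] - us[0]

def improve(p_minus_v):
    """Policy improvement: Gibbs density pi(u) proportional to exp((P-V)u/lam)."""
    logits = p_minus_v * us / lam
    w = np.exp(logits - logits.max())
    return w / (w.sum() * du)

def eval_driver(pi, p, v):
    """Policy evaluation driver: int_0^n [(p - v) u pi - lam pi ln pi] du."""
    entropy = -np.sum(pi * np.log(np.maximum(pi, 1e-300))) * du
    return (p - v) * np.sum(us * pi) * du + lam * entropy

V = np.array([payoff(t) for t in ts])       # start from the payoff itself
for it in range(100):
    pis = [improve(payoff(t) - v) for t, v in zip(ts, V)]   # improvement step
    Vn = V.copy(); Vn[-1] = payoff(T)
    for k in range(len(ts) - 2, -1, -1):    # evaluation: backward Euler step
        Vn[k] = Vn[k + 1] + dt * eval_driver(pis[k + 1], payoff(ts[k + 1]), Vn[k + 1])
    if np.max(np.abs(Vn - V)) < 1e-9:
        break
    V = Vn
print(f"PIA iterations: {it + 1}, V_0 = {V[0]:.6f}")
```

Even in this stripped-down setting the entropy bonus lifts the value strictly above the payoff, mirroring the exploration incentive discussed later.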
Limiting Regimes and Singular RBSDEs
The regime $n \to \infty$ (unbounded penalization), with $\lambda$ fixed, yields a limiting RBSDE:
$V_t^\lambda = P_T - \int_t^T dM_s^\lambda + \int_t^T \lambda \ln\!\left(\frac{\lambda}{V_s^\lambda - P_s}\right) ds + (A_T^\lambda - A_t^\lambda),$
subject to Skorokhod reflection conditions. The generator is logarithmically singular as $V^\lambda$ approaches the payoff barrier $P$, a structure not previously analyzed in the literature (contrasting with quadratic and other singular BSDE frameworks). A monotone limit argument yields existence and uniqueness, relying on comparison and boundedness results rather than quadratic-growth or domination techniques.
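The singular generator can be sanity-checked against the penalized one: under the Gibbs closed form above, $\lambda \ln \int_0^n e^{-(v - P)u/\lambda}\,du \to \lambda \ln\big(\lambda/(v - P)\big)$ as $n \to \infty$ whenever $v > P$. A quick numerical check with illustrative values:

```python
import numpy as np

lam, gap = 0.2, 0.05                        # temperature and distance V - P > 0

def f_penalized(gap, n, lam):
    """Closed form of lam * ln( int_0^n exp(-gap * u / lam) du )."""
    a = -gap / lam
    return lam * np.log(np.expm1(a * n) / a)

f_singular = lam * np.log(lam / gap)        # driver of the limiting RBSDE

for n in [1, 10, 100, 1000]:
    print(n, f_penalized(gap, n, lam), f_singular)
```

The penalized driver saturates to the singular one already for moderate $n$, and blows up as the gap $v - P$ shrinks, which is the logarithmic singularity at the barrier.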
Probabilistic and Financial Interpretation
The limiting process $V^\lambda$ corresponds to an American option under endogenous default risk, where the default intensity $\gamma_s^\lambda$ is a function of $V_s^\lambda - P_s$:
$\gamma_s^\lambda = \frac{\lambda \ln\!\left(\frac{\lambda}{V_s^\lambda - P_s}\right)}{P_s + \lambda - V_s^\lambda}.$
Default (exercise with recovery) occurs at a Cox time determined by $\gamma^\lambda$. The value process solves a modified stopping problem:
$V_t^\lambda = \esssup_{\tau_t \in \mathcal{T}_{t,T}} \mathbb{E}[P_{\tau_t} \mathbf{1}_{\{\sigma_t^\lambda > \tau_t\}} + (P_{\sigma_t^\lambda} + \lambda) \mathbf{1}_{\{\sigma_t^\lambda \leq \tau_t\}} | \mathcal{F}_t],$
where $\sigma_t^\lambda$ denotes the Cox default time. As $\lambda \to 0$, the default intensity vanishes and the classical option value is recovered.
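Concretely, a Cox time is the first instant at which the cumulative hazard crosses an independent unit-exponential threshold. A minimal simulation sketch, with a hypothetical grid and intensity path standing in for $\gamma^\lambda$:

```python
import numpy as np

def cox_time(ts, gamma, rng):
    """First t with int_0^t gamma_s ds >= E, E ~ Exp(1); np.inf if none by T."""
    E = rng.exponential(1.0)
    increments = 0.5 * (gamma[1:] + gamma[:-1]) * np.diff(ts)   # trapezoid rule
    hazard = np.concatenate([[0.0], np.cumsum(increments)])
    idx = np.searchsorted(hazard, E)
    return ts[idx] if idx < len(ts) else np.inf

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 1001)
gamma = 0.5 + 0.3 * np.sin(2 * np.pi * ts)   # hypothetical intensity path
print(cox_time(ts, gamma, rng))
```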
Numerical Methodology and Results
For practical computation, the entropy-regularized scheme is implemented via regression-based least-squares Monte Carlo (LSMC) and implicit time stepping for the BSDEs, with the PIA incorporated to accelerate convergence; a bare-bones sketch of the implicit penalization step follows the list below. Numerical experiments on a 2D American max-call option show:
- Both the entropy-regularized BSDE solver and PIA converge efficiently to benchmark binomial tree prices, with improved stability and monotonicity over classical penalization.
- The classical penalization exhibits slower convergence and systematic over-estimation for moderate penalization levels.
- Entropy regularization and PIA are shown to be effective for RL-based option pricing, especially for moderate truncation parameters.
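For concreteness, here is a bare-bones LSMC sketch of the implicit time step for the classical penalization scheme on a 1D American put; the paper's experiments use a 2D max-call and the entropy-regularized driver, and all parameters below are hypothetical. The implicit step $v = c + \Delta t\, n\, (P - v)^+$ admits the closed-form solution used in the loop:

```python
import numpy as np

rng = np.random.default_rng(1)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0   # hypothetical market data
n_pen, steps, paths = 50.0, 50, 50_000              # penalization level n, grids
dt = T / steps

# simulate geometric Brownian motion paths
dW = rng.standard_normal((paths, steps)) * np.sqrt(dt)
logS = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * dW, axis=1)
S = np.hstack([np.full((paths, 1), S0), S0 * np.exp(logS)])

payoff = lambda s: np.maximum(K - s, 0.0)           # American put payoff

V = payoff(S[:, -1])
for k in range(steps - 1, 0, -1):
    # conditional expectation via degree-4 polynomial regression
    A = np.vander(S[:, k] / K, 5)
    coef, *_ = np.linalg.lstsq(A, np.exp(-r * dt) * V, rcond=None)
    cont, P_k = A @ coef, payoff(S[:, k])
    # implicit penalization step: solve v = cont + dt * n * (P - v)^+
    V = np.where(cont >= P_k, cont,
                 (cont + dt * n_pen * P_k) / (1.0 + dt * n_pen))

print(f"penalized LSMC price ~ {np.exp(-r * dt) * V.mean():.4f}")
```

The entropy-regularized variant would replace the piecewise-linear penalization by the Gibbs driver sketched earlier; the implicit-step structure is otherwise unchanged.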
Significance and Implications
The entropy-regularized penalization approach provides a rigorous, model-independent framework for continuous-time optimal stopping, robust to degeneracies of classical optimal controls. The identification and analysis of limiting singular RBSDEs substantially extend the theoretical landscape, offering new insights for risk-sensitive control and pricing under endogenous shocks.
In contrast to prevailing analyses, the paper establishes existence and uniqueness for RBSDEs with logarithmically singular drivers without requiring quadratic growth or domination techniques. The financial interpretation via defaultable options reveals novel connections between regularization, exploration incentives, and barrier reflection.
From a practical standpoint, entropy regularization enables gradient-based RL and policy improvement for optimal stopping, scalable to higher dimensions and generic market specifications. The convergence and stability improvements observed numerically suggest significant benefits in RL-based algorithm design.
Future Directions
- Extension to high-dimensional settings and non-Markovian payoffs using neural approximators and deep BSDE solvers.
- Adaptation of entropy-regularized penalization to multi-barrier and Dynkin game settings, relevant for credit derivatives and real options.
- Analysis of further classes of singular generators in reflected BSDEs, especially with applications to stochastic control and reinforcement learning in finance.
Conclusion
The paper provides a comprehensive analysis of entropy-regularized penalization schemes for American option pricing, both from probabilistic and numerical perspectives. The novel characterization of singular RBSDEs, practical convergence rates, and robust algorithmic implementations establish this framework as a technically viable and theoretically significant approach for optimal stopping and RL applications in financial mathematics.