Bermudan Game Options: Pricing & Hedging

Updated 24 September 2025

Bermudan game options are discrete-time contingent claims that combine early exercise rights with game-theoretic cancellation features.
Their valuation leverages techniques such as entropy-regularized reflected BSDEs and reinforcement learning to address the challenges of discontinuous optimal stopping.
Dual representations and numerical algorithms incorporating transaction costs and market frictions offer robust frameworks for effective pricing and hedging.

Bermudan game options are discrete-time two-player contingent claims that combine the early exercise features of Bermudan options with game-theoretic features, notably an issuer’s right of cancellation as in Israeli or game options. The valuation, hedging, and numerical analysis of Bermudan game options present foundational challenges in discrete-time optimal stopping theory, stochastic control, and reflected backward stochastic differential equations (RBSDEs), especially under market imperfections such as transaction costs and risk constraints. Modern approaches to these derivatives leverage entropy regularization, duality, reinforcement learning, and deep learning to yield robust pricing, hedging, and risk management frameworks.

1. Mathematical Framework: Bermudan Game Options as Discrete Dynkin Games

A Bermudan game option is defined by a finite set $\mathcal{S}$ of exercise (and, for the seller, cancellation) dates. The buyer selects an exercise time $\tau$ from $\mathcal{S}$ ; the seller (counterparty) selects a cancellation time $\sigma$ (also in $\mathcal{S}$ ). The payoff process is specified via $(P_t, R_t)_{t\in\mathcal{S}}$ , where $P_t$ is the buyer's payoff upon exercise and $R_t$ is the (typically higher) payoff delivered if the seller cancels. The contract payoff to the buyer is, for example,

$Q_{\tau,\sigma} = P_{\tau} \mathbf{1}_{\{\tau \leq \sigma\}} + R_{\sigma} \mathbf{1}_{\{\sigma < \tau\}}.$

This leads to a zero-sum discrete Dynkin game where both players can act only at the discrete exercise dates. Extensions to non-zero-sum or non-linear assessment functionals are covered in the literature on non-linear non-zero-sum games with Bermudan strategies (Grigorova et al., 2023).

The pricing and hedging problem is formulated in terms of recursive constructions or reflected BSDEs with double obstacles, often under market frictions such as proportional transaction costs or in incomplete markets, as in multi-currency settings (Roux et al., 2011, Roux, 2015). The presence of discrete exercise/cancellation times implies that all recursive constructions, dualities, and hedging algorithms must be conducted over time-grids, making the Bermudan setting fundamentally different from fully continuous-time games.

2. Entropy-Regularized RBSDEs and Policy Improvement

The entropy-regularized BSDE framework (Frikha et al., 23 Sep 2025) addresses the main computational bottleneck in Dynkin/Bermudan games: the discontinuous and non-differentiable structure of the optimal stopping policies, which appears as a "bang–bang" control in classical reflected BSDEs. The entropy penalty, governed by a temperature parameter $\lambda>0$ , perturbs the sharp control into a randomized stopping density, leading to a smoother problem: $V_t^\lambda = P_T - (M^\lambda_T - M^\lambda_t) + \sum_{t \le t_i < T} \lambda \Phi\bigg(\frac{P_{t_i}-V^\lambda_{t_{i+1}}}{\lambda}\bigg),$ where $\Phi(x) = x\Psi(x)$ , with $\Psi(x) = (1/x)\log\frac{e^x-1}{x}$ for $x \neq 0$ . In the game extension, the corresponding double obstacle reflected BSDE (DRBSDE) includes both the upper and lower rewards: $V_t^\lambda = P_T - \int_{]t,T]} dM_s^\lambda + \sum_{t \le t_i < T} \Big[ \lambda \Phi\Big(\frac{P_{t_i}-V^\lambda_{t_{i+1}}}{\lambda}\Big) - \lambda \Phi\Big(\frac{V^\lambda_{t_{i+1}}-R_{t_i}}{\lambda}\Big) \Big].$ The randomization induced by the entropy term not only regularizes the discontinuity in the stopping rule but also enables the application of smooth reinforcement learning algorithms.

A key result is that, as $\lambda \downarrow 0$ , $V_t^\lambda \uparrow V_t$ , achieving the true (classical) Bermudan game price. The error is quantified by

$0 \leq V_t - V_t^\lambda \leq C(N(t))(\lambda - \lambda \log\lambda),$

where $N(t)$ is the number of remaining exercise dates. This error control provides a practical guideline for choosing $\lambda$ in numerics.

3. Reinforcement Learning and Convergence of Policy Improvement

An efficient policy improvement algorithm is constructed in the entropy-regularized framework. The RL algorithm alternately updates the policies of each player (in the game setting: the minimizer for the issuer and maximizer for the holder) via closed-form Gibbs distributions: $\pi_{t_i}^*(u) = \frac{(P_{t_i}-V_{t_{i+1}}^\lambda)/\lambda}{\exp\left(\frac{P_{t_i}-V_{t_{i+1}}^\lambda}{\lambda}\right)-1} \exp\left( \frac{P_{t_i}-V_{t_{i+1}}^\lambda}{\lambda} u \right),\quad u\in[0,1].$ The value function is then recalculated backward via a temporal-difference recursion respecting the entropy-regularized martingale property. The scheme is provably monotonic and converges in at most as many iterations as there are exercise dates (Theorems "policy_convergence" and "policy_convergence_game" in (Frikha et al., 23 Sep 2025)), thus yielding the unique entropy-regularized value and stopping strategies for both players.

The explicit use of TD errors and Gibbs-form update renders the algorithm compatible with neural network value function approximation, as implemented in deep RL architectures.

4. Duality Representations, Super/Hedging, and Transaction Costs

The duality theory for Bermudan game options in markets with bid–ask spreads (Roux, 2015, Roux et al., 2011) yields minimax representations for the seller’s (ask) and buyer’s (bid) prices: $\pi^{ask}_i(Y, X, X') = \min_{\sigma \in D} \max_{(P,S) \in \mathcal{P}_i(\chi^\sigma)} \mathbb{E}_P[ (Q^\sigma \cdot S^\sigma)_{\chi^\sigma} ],$ with $D$ the discrete exercise/cancellation set, $Q^\sigma$ the payoff process (accounting for cancellation penalties), and $(P,S)$ approximate martingale pairs consistent with the solvency cones induced by transaction costs.

Recursive polyhedral set constructions (using intersections, unions, Minkowski addition) characterize the admissible hedging portfolios for both players. For instance, at terminal time $T$ : $Z_T = \{ X_T + K_T \}, \qquad Z_t = (V_t \cap W_t) \cup X_t,$ and analogous recursions for the buyer. These update algorithms generalize to strictly discrete (Bermudan) games by skipping non-admissible exercise dates.

The non-convexity of the buyer’s problem, arising from set unions in the recursion, leads to technical challenges in constructing optimal strategies—these are addressed in (Roux et al., 2011) via explicit dual representations, albeit at the cost of increased computational complexity.

5. Numerical Algorithms: Neural Networks, Policy Improvement, and TD-BSDE Methods

Recent advances leverage deep neural networks and temporal-difference learning for high-dimensional Bermudan game options (Frikha et al., 23 Sep 2025). Two classes of numerical schemes are implemented:

TD-based BSDE solver: The martingale representation is discretized using neural network-based approximations for the continuation value between exercise dates. The adjustment condition at each exercise date is enforced via the entropy-regularized formula:

$v^{(\lambda)}(t_i, x) = \mathcal{V}^{(\eta_i)}(x) + (P_{t_i}(x) - \mathcal{V}^{(\eta_i)}(x)) \Psi\left( \frac{P_{t_i}(x) - \mathcal{V}^{(\eta_i)}(x)}{\lambda} \right).$

Between exercise dates the value is trained by minimizing the squared TD error with Monte Carlo samples.

Policy improvement algorithm: Alternates between closed-form policy (Gibbs) updates for each player and value evaluation via backward recursion, ensuring monotone convergence (error is non-increasing per iteration and vanishes after the number of exercise dates).

Practical illustrations demonstrate that these neural algorithmic approaches efficiently recover theoretical values as $\lambda \to 0$ , providing scalable methods for problems that are otherwise intractable with classical finite-difference or tree-based methods.

6. Nonlinear, Nonzero-Sum, and Risk-Constrained Bermudan Games

Extensions to non-zero-sum and nonlinear assessment settings are covered in (Grigorova et al., 2023). Here, each agent’s payoff is assessed via a general (possibly concave) nonlinear functional (e.g., determined via risk measures or $g$ -expectations), leading to the dynamic programming recursion: $V^{2n+1}(S) = \esssup_{\tau \in \Theta_S} \rho^{(1)}_{S, \tau \wedge \tau^{(2n)}} \left[ X^1(\tau) \mathbf{1}_{\{\tau < \tau^{(2n)}\}} + Y^1(\tau^{(2n)}) \mathbf{1}_{\{\tau^{(2n)} \le \tau\}} \right].$ Alternating optimal stopping problems are solved with the opponent’s strategy "frozen," and the construction is shown (given suitable monotonicity and continuity conditions on the risk functionals) to converge to a Nash equilibrium, providing a robust theoretical foundation for risk-aware hedging and pricing in sophisticated game option contexts.

7. Impact, Limitations, and Research Directions

The entropy-regularized RBSDE and RL-based policy improvement approaches provide a unified and computationally tractable methodology for pricing and hedging Bermudan game options, transforming discrete stopping games into smooth control problems amenable to modern machine learning (e.g., neural networks, stochastic gradient optimization). The rigorous analysis ensures existence, uniqueness, error bounds, and finite convergence guarantees for both option values and optimal randomized stopping/cancellation strategies.

Transaction costs and market frictions are naturally incorporated through dual cone and martingale representations, ensuring applicability in realistic multi-currency markets with bid-ask spreads.

Extensions to high dimensions, non-linear evaluations, and risk management settings are directly supported via deep learning and policy iteration. However, the non-convexity of the buyer’s problem, especially under transaction costs, and the numerics for the nonzero-sum, non-linear cases, still present substantive implementation and theory challenges.

A plausible implication is that as lambda decreases and network capacity increases, the entropy-regularized policy improvement algorithms can provide effective hedging and pricing tools for a wide class of discrete-time stochastic games arising in modern financial engineering. Future work may focus on further scaling these algorithms (e.g., combining with transfer learning or parallelization), adapting to continuous-time settings, and refining dual representations for robust risk assessment.