Policy-Gradient Dividend Optimisation
- Policy-gradient methods leverage reinforcement learning to optimise expected discounted dividend payouts in stochastic insurance surplus models.
- Mathematical formulations use both randomised controls with entropy regularisation and band strategies, enabling unbiased gradient estimation.
- Empirical validations and theoretical guarantees demonstrate rapid convergence and practical advantages under complex claim distributions and ruin dynamics.
Policy-gradient techniques for dividend optimisation address the challenge of maximising expected discounted dividend payouts in stochastic insurance surplus models with ruin, using parameterised or learned control policies. These methods transpose ideas from reinforcement learning (RL) and classical stochastic control, leveraging continuous or band-type policy parameterisations and stochastic gradient ascent—either analytically or via simulation—to approximate or identify optimal payout strategies. The development of efficient policy-gradient approaches has enabled robust, model-free optimisation in settings where model coefficients, claim-size distributions, or claim-arrival processes are complex or not amenable to classical analysis.
1. Mathematical Formulation of the Dividend Optimisation Problem
The canonical formulation is set in either a classical Cramér–Lundberg model or a diffusion (Brownian) surplus process. Let $X_t$ denote the insurer's surplus at time $t$ under a dividend-control policy $a = (a_t)_{t \ge 0}$. For the diffusion setting (Hamdouche et al., 2023, Bai et al., 2023):

$$dX_t = (\mu - a_t)\,dt + \sigma\,dW_t, \qquad X_0 = x \ge 0,$$

where $\mu \in \mathbb{R}$, $\sigma > 0$, and $a_t \in [0, a_{\max}]$ is the admissible (possibly randomised) dividend rate at time $t$. Ruin occurs at $\tau = \inf\{t \ge 0 : X_t \le 0\}$.
The objective is to maximise the expected discounted value of cumulative dividends up to ruin:

$$V(x) = \sup_{D \in \mathcal{A}} \mathbb{E}_x\!\left[\int_0^{\tau} e^{-\rho t}\, dD_t\right],$$

where $\rho > 0$ is the discount rate and, in the rate-controlled diffusion case, $dD_t = a_t\,dt$.
In the Cramér–Lundberg framework, the surplus evolves according to:

$$X_t = x + c\,t - \sum_{i=1}^{N_t} U_i - D_t,$$

where $c > 0$ is the premium rate, $N_t$ is a Poisson claim-arrival process with intensity $\lambda$, the claim sizes $U_i$ are i.i.d., and $D_t$ is the cumulative dividend paid up to time $t$.
The dividend strategy $D = (D_t)_{t \ge 0}$ is a nondecreasing, adapted (càdlàg) process, with $D_{0-} = 0$ and with no payment allowed to exceed the currently available surplus (Albrecher et al., 2022).
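For concreteness, the following minimal sketch estimates the objective above by Monte Carlo for a Cramér–Lundberg surplus under a simple barrier strategy; the premium rate, claim intensity, unit-mean exponential claims, discount rate, and truncation horizon are illustrative assumptions rather than values taken from the cited papers.

```python
import numpy as np

def discounted_dividends_barrier(x0, b, c=1.5, lam=1.0, rho=0.05,
                                 claim_sampler=None, horizon=100.0,
                                 n_paths=2_000, seed=0):
    """Monte Carlo estimate of E[ integral of e^{-rho t} dD_t ] for a barrier
    strategy at level b in a Cramér–Lundberg model: premium rate c, Poisson(lam)
    claim arrivals, i.i.d. claim sizes, ruin when the surplus becomes negative."""
    rng = np.random.default_rng(seed)
    if claim_sampler is None:
        claim_sampler = lambda g: g.exponential(1.0)       # unit-mean claims
    total = 0.0
    for _ in range(n_paths):
        x, t, value = float(x0), 0.0, 0.0
        while t < horizon:
            if x > b:                                      # lump payout down to the barrier
                value += np.exp(-rho * t) * (x - b)
                x = b
            w = rng.exponential(1.0 / lam)                 # time until the next claim
            reach = max((b - x) / c, 0.0)                  # time needed to reach the barrier
            pay_len = max(w - reach, 0.0)                  # time spent paying at rate c
            if pay_len > 0.0:
                # discounted value of a dividend stream at rate c on [t+reach, t+w]
                value += (c / rho) * np.exp(-rho * (t + reach)) * (1.0 - np.exp(-rho * pay_len))
            x = min(x + c * w, b) - claim_sampler(rng)     # surplus just after the claim
            t += w
            if x < 0.0:                                    # ruin: stop this path
                break
        total += value
    return total / n_paths

# Example: initial surplus x0 = 1.0, barrier level b = 2.0
print(discounted_dividends_barrier(x0=1.0, b=2.0))
```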
2. Policy Parameterisation and Randomisation
Policy-gradient techniques require differentiable parameterisations of admissible policies. Two RL-inspired approaches predominate:
- Randomised/relaxed control with entropy regularisation: Policies are parameterised by densities $\pi_t(\cdot)$ over the action space $[0, a_{\max}]$, inducing randomised payouts and an entropy-regularised reward

$$J(x;\pi) = \mathbb{E}_x\!\left[\int_0^{\tau} e^{-\rho t}\Big(\int_0^{a_{\max}} a\,\pi_t(a)\,da + \gamma\,\mathcal{H}(\pi_t)\Big)\,dt\right], \qquad \mathcal{H}(\pi_t) = -\int_0^{a_{\max}} \pi_t(a)\,\ln \pi_t(a)\,da,$$

where $\gamma > 0$ is the temperature parameter. Maximisation is over all progressively measurable densities $\pi = (\pi_t)_{t \ge 0}$ (Bai et al., 2023); a small parametric sketch follows after this list.
- Band strategies as parametric policies: In the Cramér–Lundberg model, the classical optimality of band strategies allows viewing the control as defined by a finite vector of thresholds $\theta = (b_1, \ldots, b_m)$ delimiting inaction regions and payout zones (Albrecher et al., 2022). The policy is thus finite-dimensional and deterministic, but can be interpreted as a parametric RL policy with parameter vector $\theta$ given by the band locations.
Randomisation serves to smooth the optimisation landscape, enabling unbiased gradient estimation in the presence of non-differentiable events such as ruin (Hamdouche et al., 2023, Bai et al., 2023).
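As a minimal sketch of such a randomised parameterisation, the snippet below samples dividend rates from a Gibbs (truncated-exponential) density over a discretised action grid, whose tilt is a simple two-parameter function of the surplus; the parameterisation `theta = (threshold, slope)`, the temperature, and the action grid are illustrative assumptions, not the specification used in the cited papers.

```python
import numpy as np

def gibbs_density(a_grid, tilt, temperature):
    """Discretised Gibbs density pi(a) proportional to exp(a * tilt / temperature),
    mirroring the shape of entropy-regularised optimal relaxed controls."""
    logits = a_grid * tilt / temperature
    logits -= logits.max()                      # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def sample_dividend_rate(x, theta, a_max=1.0, temperature=0.2, n_actions=51, rng=None):
    """Randomised policy: the tilt is a simple parametric function of the surplus x,
    theta = (threshold, slope); payouts concentrate near a_max when x exceeds the
    threshold and near 0 below it."""
    rng = rng or np.random.default_rng()
    threshold, slope = theta
    a_grid = np.linspace(0.0, a_max, n_actions)
    p = gibbs_density(a_grid, slope * (x - threshold), temperature)
    a = rng.choice(a_grid, p=p)
    entropy = -np.sum(p * np.log(p + 1e-12))    # entropy term of the regularised reward
    return a, float(p @ a_grid), entropy

# Below the threshold the policy rarely pays; above it, payouts concentrate near a_max.
theta = (2.0, 5.0)
for x in (1.0, 3.0):
    a, mean_rate, H = sample_dividend_rate(x, theta, rng=np.random.default_rng(0))
    print(f"x={x:.1f}  sampled rate={a:.2f}  mean rate={mean_rate:.2f}  entropy={H:.2f}")
```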
3. Gradient Computation and Policy-Gradient Theorems
For parametric policies $\pi_\theta$, gradient ascent is used to optimise the performance objective $J(\theta)$. In the randomised/entropy-regularised setting, the gradient admits REINFORCE-type representations of the form

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{k \ge 0} e^{-\rho t_k}\,\big(G_{t_k} - b(X_{t_k})\big)\,\nabla_\theta \log \pi_\theta\big(a_{t_k} \mid X_{t_k}\big)\right],$$

where $G_{t_k}$ is the discounted return accumulated from time $t_k$ onward and $b(\cdot)$ is a control variate or baseline for variance reduction (Bai et al., 2023).
For band policies, the mapping from thresholds to expected reward is smooth (under regularity conditions on the claim-size law), admitting analytic expressions for the partial derivatives $\partial J / \partial b_i$ via scale functions and Gerber–Shiu deficit densities (Albrecher et al., 2022). These gradients can be used in Newton-type or quasi-Newton root-finding routines for optimal band identification.
In the simulation-based, model-free regime, empirical gradient estimates are constructed via likelihood-ratio tricks:

$$\widehat{\nabla_\theta J}(\theta) = \frac{1}{M}\sum_{m=1}^{M} \Big(\sum_{k} e^{-\rho t_k}\, a^{(m)}_{t_k}\,\Delta t\Big)\,\sum_{k} \nabla_\theta \log \pi_\theta\big(a^{(m)}_{t_k} \mid X^{(m)}_{t_k}\big),$$

with the average taken over $M$ trajectories sampled under the current policy and simulated until ruin or a fixed horizon (Hamdouche et al., 2023).
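A self-contained sketch of such a likelihood-ratio estimate for an Euler-discretised diffusion surplus is given below; the two-parameter pay/no-pay policy, the model constants, and the mean-return baseline are deliberately simple illustrative choices, not those of the cited papers.

```python
import numpy as np

def simulate_and_score(theta, mu=0.3, sigma=0.5, a_max=0.5, rho=0.05,
                       x0=1.0, dt=0.02, t_max=20.0, rng=None):
    """One Euler-discretised trajectory of dX = (mu - a) dt + sigma dW under a
    Bernoulli pay/no-pay policy pi_theta(pay | x) = sigmoid(slope * (x - threshold)).
    Returns the discounted dividend return and the summed score function."""
    rng = rng or np.random.default_rng()
    threshold, slope = theta
    x, t = x0, 0.0
    ret, score = 0.0, np.zeros(2)
    while t < t_max and x > 0.0:
        p_pay = 1.0 / (1.0 + np.exp(-slope * (x - threshold)))
        pay = rng.random() < p_pay
        a = a_max if pay else 0.0
        # grad_theta log pi_theta(a | x) for theta = (threshold, slope)
        score += ((1.0 - p_pay) if pay else -p_pay) * np.array([-slope, x - threshold])
        ret += np.exp(-rho * t) * a * dt                   # discounted dividend this step
        x += (mu - a) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return ret, score

def reinforce_gradient(theta, n_paths=100, seed=0):
    """Likelihood-ratio (REINFORCE) gradient estimate with a mean-return baseline."""
    rng = np.random.default_rng(seed)
    rets, scores = zip(*(simulate_and_score(theta, rng=rng) for _ in range(n_paths)))
    rets, scores = np.array(rets), np.array(scores)
    baseline = rets.mean()                                 # baseline for variance reduction
    return ((rets - baseline)[:, None] * scores).mean(axis=0)

theta = np.array([1.0, 4.0])                               # (threshold, slope)
for it in range(5):                                        # a few gradient-ascent steps
    theta = theta + 0.5 * reinforce_gradient(theta, seed=it)
    print(it, theta)
```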
4. Algorithmic Schemes: REINFORCE, Actor–Critic, and Policy Improvement
Policy-gradient algorithms are implemented in several families:
- Direct Policy-Gradient (REINFORCE-EXIT) Algorithms: Trajectories are sampled under the current policy; return-weighted score-function gradients are computed and parameters updated by stochastic gradient ascent. This is unbiased and model-free (Hamdouche et al., 2023).
- Actor–Critic Algorithms: Online temporal-difference updates are performed using a trained critic to estimate the value function and compute advantage signals (Hamdouche et al., 2023); both actor and critic parameters are updated after each step (see the sketch at the end of this section).
- Entropy-Regularised Policy Improvement: Starting from an initial policy, repeated cycles of policy evaluation (using martingale-loss or temporal-difference learning) and policy improvement (analytic Gibbs-form updates) are performed, with theoretical monotonic improvement and global convergence guarantees (Bai et al., 2023).
- Gradient-Based Root-Finding for Deterministic Bands: For finite-band strategies, a Newton-type or local root-finding is used for the last two band parameters at each step, ascending the analytical gradient until the verification condition (integro-differential generator inequality) is satisfied (Albrecher et al., 2022).
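The sketch below illustrates the gradient-ascent/root-finding idea in its simplest one-dimensional form: the analytic scale-function gradients of the Cramér–Lundberg setting are replaced, purely for illustration, by the closed-form value of a barrier strategy in a Brownian surplus model and a central finite-difference derivative.

```python
import numpy as np

# Closed-form value of a barrier strategy at level b for a Brownian surplus
# dX = mu dt + sigma dW with discount rho (an illustrative benchmark, not the
# Cramér–Lundberg scale-function machinery of the cited paper).
mu, sigma, rho = 0.3, 0.5, 0.05
disc = np.sqrt(mu**2 + 2.0 * rho * sigma**2)
r1, r2 = (-mu + disc) / sigma**2, (-mu - disc) / sigma**2

def barrier_value(b, x=1.0):
    """V(x; b): expected discounted dividends from initial surplus x under a
    barrier strategy at level b (excess above b is paid out immediately)."""
    W = lambda y: np.exp(r1 * y) - np.exp(r2 * y)
    Wp = lambda y: r1 * np.exp(r1 * y) - r2 * np.exp(r2 * y)
    return W(x) / Wp(b) if x <= b else (x - b) + W(b) / Wp(b)

def ascend_barrier(value_fn, b0, lr=0.5, h=1e-4, n_steps=500, tol=1e-8):
    """Gradient ascent on the barrier level b using a central finite-difference
    derivative of b -> value_fn(b); in the analytic setting this derivative
    would come from scale functions and Gerber–Shiu deficit densities."""
    b = b0
    for _ in range(n_steps):
        grad = (value_fn(b + h) - value_fn(b - h)) / (2.0 * h)
        b_new = max(b + lr * grad, 0.0)
        if abs(b_new - b) < tol:
            break
        b = b_new
    return b

b_star = ascend_barrier(barrier_value, b0=1.0)
print("barrier found:", b_star, " analytic optimum:", np.log(r2**2 / r1**2) / (r1 - r2))
```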
Numerical stability and convergence rely on careful discretisation, accurate interpolation of scale functions and deficit densities, and judicious choice of hyperparameters (Albrecher et al., 2022, Hamdouche et al., 2023).
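To make the actor–critic updates concrete, here is a minimal single-step TD(0) sketch for the Euler-discretised diffusion model (see the actor–critic bullet above); the RBF critic features, learning rates, and the two-parameter actor are illustrative assumptions rather than the architectures of the cited papers.

```python
import numpy as np

CENTERS = np.linspace(0.0, 5.0, 11)                          # RBF centres for the critic

def features(x, width=0.5):
    """Radial-basis features of the surplus level for a linear critic."""
    return np.exp(-0.5 * ((x - CENTERS) / width) ** 2)

def actor_critic_step(x, theta, w, rng, mu=0.3, sigma=0.5, a_max=0.5,
                      rho=0.05, dt=0.02, lr_actor=0.01, lr_critic=0.1):
    """One online TD(0) actor-critic update for the Euler-discretised diffusion.
    The actor is the same two-parameter pay/no-pay policy as in the REINFORCE
    sketch above; the critic is linear in RBF features of the surplus."""
    threshold, slope = theta
    p_pay = 1.0 / (1.0 + np.exp(-slope * (x - threshold)))
    pay = rng.random() < p_pay
    a = a_max if pay else 0.0
    reward = a * dt                                          # dividend paid this step
    x_next = x + (mu - a) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    ruined = x_next <= 0.0
    v = features(x) @ w
    v_next = 0.0 if ruined else features(x_next) @ w         # value is zero at ruin
    td_error = reward + np.exp(-rho * dt) * v_next - v       # advantage signal
    w += lr_critic * td_error * features(x)                  # semi-gradient TD(0) critic update
    score = ((1.0 - p_pay) if pay else -p_pay) * np.array([-slope, x - threshold])
    theta += lr_actor * td_error * score                     # actor update weighted by TD error
    return x_next, ruined

rng = np.random.default_rng(0)
theta, w = np.array([1.0, 4.0]), np.zeros(len(CENTERS))
for episode in range(50):
    x, done, steps = 1.0, False, 0
    while not done and steps < 5_000:                        # cap episode length
        x, done = actor_critic_step(x, theta, w, rng)
        steps += 1
print("learned (threshold, slope):", theta)
```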
5. Numerical Performance and Empirical Validation
Empirical studies demonstrate rapid and accurate identification of optimal dividend strategies, confirming both the theoretical predictions and the computational efficiency of policy-gradient methods; representative Cramér–Lundberg results (Albrecher et al., 2022) are summarised below.
| Case | Method | CPU time | Bands | Optimal band/barrier levels |
|---|---|---|---|---|
| Erlang(2,1) claims | Gradient | <1 s | 2 | (1.803, 10.216) |
| Erlang(2,1) claims | ES | ≈1,000 s | 2 | (1.806, 10.215) |
| Pareto(1.5,1) claims (barrier) | Gradient | <1 s | 1 | 2.71036 |
| Pareto(1.5,1) claims (barrier) | ES | ≈1,000 s | 1 | 2.71036 |
| Erlang–Pareto mixture | Gradient | ≈3,600 s | 2 | (0.0053, 3.8877) |
| Erlang–Pareto mixture | ES | ≈28,800 s | 2 | (0.1524, 3.5115) |

Here "Gradient" denotes the policy-gradient method and "ES" an evolutionary-strategy baseline.
In diffusion models, both the value function and the critical band (barrier) parameter estimates converge to known analytic benchmarks as the discretisation is refined (Bai et al., 2023). In the Cramér–Lundberg setting, the gradient method matches or improves on evolutionary strategies in accuracy while reducing computation time by several orders of magnitude (Albrecher et al., 2022).
6. Extensions and Theoretical Guarantees
Policy-gradient methods generalise to models with:
- Unknown or non-parametric model coefficients: Model-free algorithms remain applicable by employing relaxed/entropy-regularised reward structures and sampling (Bai et al., 2023).
- Spectrally negative Lévy processes: Using Laplace inversion for scale functions, analogous root-finding applies (Albrecher et al., 2022).
- Higher-dimensional or path-dependent controls: Actor–critic and policy-gradient mechanisms remain tractable for neural parameterisations of policies or value functions (Hamdouche et al., 2023, Aubert et al., 24 Nov 2025).
- Self-exciting claim processes (Hawkes): RL-based policy-gradient techniques remain robust and match PDE-based benchmarks, even under claim clustering effects (Aubert et al., 24 Nov 2025).
Theoretical results establish monotone policy improvement and global convergence for entropy-regularised policy iteration, as well as existence and uniqueness of bounded classical or viscosity solutions of the associated HJB equations (Bai et al., 2023, Aubert et al., 24 Nov 2025).
7. Connections, Limitations, and Practical Guidance
The policy-gradient approach unifies model-based analytic optimisation and model-free RL. Analytic gradients enable fast and robust optimisation for low-dimensional, band-type policies; in simulation-based or high-dimensional settings, policy-gradient RL provides a feasible alternative. Limitations include the smoothness conditions (on the claim-size law) required for analytic gradients and the variance of simulation-based gradient estimates (Albrecher et al., 2022).
Diagnostic recommendations include monitoring objective convergence, comparing to analytic or classical barrier strategies, and inspecting policy behaviour to confirm alignment with known structural properties (e.g., thresholding and bang–bang payouts) (Hamdouche et al., 2023).
The integration of policy-gradient machinery with stochastic control for dividend optimisation has considerably widened the class of tractable problems, enabling solutions in more general, realistic insurance risk models and supporting scalable computation for practical applications (Albrecher et al., 2022, Hamdouche et al., 2023, Bai et al., 2023, Aubert et al., 24 Nov 2025).