Policy-Gradient Dividend Optimisation

Updated 27 November 2025
  • The paper introduces policy-gradient methods that leverage reinforcement learning to optimize expected discounted dividend payouts in stochastic insurance surplus models.
  • It details mathematical formulations using both randomised controls with entropy regularisation and band strategies, enabling unbiased gradient estimation.
  • Empirical validations and theoretical guarantees demonstrate rapid convergence and practical advantages in complex claim and ruin scenarios.

Policy-gradient techniques for dividend optimisation address the challenge of maximising expected discounted dividend payouts in stochastic insurance surplus models with ruin, using parameterised or learned control policies. These methods transpose ideas from reinforcement learning (RL) and classical stochastic control, leveraging continuous or band-type policy parameterisations and stochastic gradient ascent—either analytically or via simulation—to approximate or identify optimal payout strategies. The development of efficient policy-gradient approaches has enabled robust, model-free optimisation in settings where model coefficients, claim-size distributions, or claim-arrival processes are complex or not amenable to classical analysis.

1. Mathematical Formulation of the Dividend Optimisation Problem

The canonical formulation is set in either a classical Cramér–Lundberg model or a diffusion (Brownian) surplus process. Let $X_t$ denote the insurer's surplus at time $t$ under a dividend-control policy $u_t$. For the diffusion setting (Hamdouche et al., 2023, Bai et al., 2023):

$$dX_t = (\mu - u_t)\,dt + \sigma\,dW_t, \qquad X_0 = x \geq 0$$

where $\mu > 0$, $\sigma > 0$, and $u_t \in [0, a]$ is the admissible (possibly randomised) dividend rate at time $t$. Ruin occurs at $\tau = \inf\{ t > 0 : X_t < 0 \}$.

The objective is to maximise the expected discounted value of cumulative dividends up to ruin:

$$V(x) = \sup_{u_\cdot \in \mathcal{A}_{[0,a]}} \mathbb{E}\Big[\int_0^{\tau} e^{-c t}\, u_t\, dt \,\Big|\, X_0 = x\Big]$$
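
As a concrete reference point, the sketch below (not taken from the cited papers) evaluates this objective by Monte Carlo for the simplest admissible policy, a constant payout rate $u_t \equiv u$, using an Euler discretisation of the diffusion. The parameter values, truncation horizon, and the function name `mc_value_constant_rate` are illustrative assumptions.

```python
import numpy as np

def mc_value_constant_rate(x0=1.0, u=0.3, mu=0.5, sigma=1.0,
                           c=0.05, dt=1e-2, horizon=20.0,
                           n_paths=10_000, seed=0):
    """Monte Carlo estimate of E[∫_0^τ e^{-ct} u dt | X_0 = x0] for a constant
    dividend rate u (assumed to lie in the admissible range [0, a])."""
    rng = np.random.default_rng(seed)
    n_steps = int(horizon / dt)
    x = np.full(n_paths, x0)
    alive = np.ones(n_paths, dtype=bool)        # paths not yet ruined
    value = np.zeros(n_paths)
    for i in range(n_steps):
        t = i * dt
        # accumulate discounted dividends on surviving paths
        value[alive] += np.exp(-c * t) * u * dt
        # Euler step of dX_t = (mu - u) dt + sigma dW_t
        x[alive] += (mu - u) * dt + sigma * np.sqrt(dt) * rng.standard_normal(alive.sum())
        alive &= (x >= 0.0)                     # ruin once the surplus drops below 0
    return value.mean()

print(mc_value_constant_rate())
```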

In the Cramér–Lundberg framework, the surplus evolves according to:

$$C_t = u + p t - \sum_{k=1}^{N_t} Y_k, \qquad p > 0,$$

where $u$ is the initial surplus, $p$ the premium rate, $(N_t)$ a Poisson process with intensity $\lambda$, and the $Y_k$ are i.i.d. claim sizes.

The dividend strategy $U_t$ is a nondecreasing adapted process, with $X_t = C_t - U_t$ and $V_\pi(u) = \mathbb{E}\big[\int_0^\tau e^{-\delta t}\,dU_t \,\big|\, X_0 = u\big]$ (Albrecher et al., 2022).
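
For intuition, here is a hedged Monte Carlo sketch of the value of a simple barrier strategy in this model: dividends are paid at the premium rate whenever the surplus sits at a barrier $b$. The parameters, the Exp(1) claim law, and the truncation horizon are illustrative assumptions, not the papers' setup.

```python
import numpy as np

def cramer_lundberg_barrier_value(u=2.0, b=10.0, p=1.5, lam=1.0,
                                  claim_sampler=None, delta=0.05,
                                  n_paths=5_000, t_max=100.0, seed=1):
    """Monte Carlo value of a barrier strategy at level b in the Cramér-Lundberg
    model: premiums at rate p, Poisson(lam) claim arrivals, dividends paid at
    rate p whenever the surplus sits at the barrier."""
    rng = np.random.default_rng(seed)
    if claim_sampler is None:
        claim_sampler = lambda rng: rng.exponential(1.0)    # Exp(1) claims (assumption)
    values = np.zeros(n_paths)
    for k in range(n_paths):
        x, t, v = u, 0.0, 0.0
        while t < t_max:
            dt_claim = rng.exponential(1.0 / lam)           # time to next claim
            t_hit = max(b - x, 0.0) / p                     # time to reach the barrier
            if dt_claim > t_hit:
                # dividends at rate p from t + t_hit until the claim, discounted
                t0, t1 = t + t_hit, t + dt_claim
                v += p * (np.exp(-delta * t0) - np.exp(-delta * t1)) / delta
                x = b
            else:
                x += p * dt_claim
            t += dt_claim
            x -= claim_sampler(rng)                          # claim payment
            if x < 0.0:                                      # ruin ends the path
                break
        values[k] = v
    return values.mean()

print(cramer_lundberg_barrier_value())
```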

2. Policy Parameterisation and Randomisation

Policy-gradient techniques require differentiable parameterisations of admissible policies. Two RL-inspired approaches predominate:

  • Randomised/relaxed control with entropy regularisation: Policies are parameterised by densities $\pi_t(w)$ over $[0, a]$, inducing randomised payouts and an entropy-regularised reward

$$r^\pi(t) = \int_0^a \big(w - \lambda \ln \pi_t(w)\big)\,\pi_t(w)\,dw$$

Maximisation is over all progressively measurable densities $\pi_t(w)$ (Bai et al., 2023); a small numerical sketch of this reward appears below.

  • Band strategies as parametric policies: In the Cramér–Lundberg model, the classical optimality of band strategies allows viewing the control as defined by thresholds $(b_0, a_1, b_1, \dots, a_{m-1}, b_{m-1})$ associated with inaction regions and payout zones (Albrecher et al., 2022). The policy is thus finite-dimensional and deterministic, but can be interpreted as a parametric RL policy $\pi_\theta$ with $\theta$ given by the band locations.

Randomisation serves to smooth the optimisation landscape, enabling unbiased gradient estimation in the presence of non-differentiable events such as ruin (Hamdouche et al., 2023, Bai et al., 2023).
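
A small numerical sketch (an assumption-laden illustration, not the papers' construction) makes the entropy-regularised reward concrete: for a Gibbs-type density $\pi(w) \propto e^{\theta w}$ on $[0,a]$, the reward trades expected payout against entropy, and sharpening the density with larger $\theta$ raises the payout term at the cost of the entropy bonus.

```python
import numpy as np

# Hedged sketch: evaluate r^pi = ∫_0^a (w - lam*ln pi(w)) pi(w) dw by a simple
# Riemann sum for a Gibbs-type density pi(w) ∝ exp(theta*w) on [0, a].
# The density family, lam, a, and the grid size are illustrative assumptions.
def entropy_regularised_reward(theta, a=1.0, lam=0.1, n_grid=4001):
    w = np.linspace(0.0, a, n_grid)
    dw = w[1] - w[0]
    unnorm = np.exp(theta * w)
    pi = unnorm / (unnorm.sum() * dw)                        # normalised density on [0, a]
    return float(((w - lam * np.log(pi)) * pi).sum() * dw)   # expected payout + lam * entropy

# Larger theta concentrates mass near the maximal rate a: the expected payout
# grows while the entropy bonus shrinks.
for theta in (0.0, 2.0, 10.0):
    print(theta, round(entropy_regularised_reward(theta), 4))
```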

3. Gradient Computation and Policy-Gradient Theorems

For parametric policies $\pi_\theta(u \mid x)$, gradient ascent is used to optimise the performance objective $J(\theta)$. In the randomised/entropy-regularised setting, the gradient admits REINFORCE-type representations:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \int_0^\tau e^{-c t}\, \big(r^\pi(t) - \mathcal{V}(X_t)\big)\, \nabla_\theta \ln \pi_\theta(a_t \mid X_t)\, dt \right]$$

with $\mathcal{V}$ a control variate or baseline for variance reduction (Bai et al., 2023).
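
The baseline can be any function of the state because the score function has zero mean under the policy, so subtracting $\mathcal{V}(X_t)$ reduces variance without introducing bias. The tiny check below illustrates this numerically for an assumed Gaussian payout proposal with mean $\theta$; the policy family and parameters are illustrative.

```python
import numpy as np

# Illustrative check (not from the papers): E_pi[grad_theta log pi_theta] = 0,
# so any baseline multiplied by the score averages out of the gradient.
rng = np.random.default_rng(0)
theta, sigma_pi = 0.4, 0.2                          # Gaussian policy N(theta, sigma_pi^2)
actions = rng.normal(theta, sigma_pi, size=1_000_000)
score = (actions - theta) / sigma_pi**2             # d/dtheta of the log-density at the samples
print(score.mean())                                 # ≈ 0, up to Monte Carlo noise
```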

For band policies, the mapping from thresholds to the expected reward $V^m(b_0, a_1, \dots, b_{m-1})$ is smooth (under regularity of the claim-size law), admitting analytic expressions for $\partial V/\partial b_k$ and $\partial V/\partial a_k$ via scale functions $W_\delta$ and Gerber–Shiu deficit densities $f_D$ (Albrecher et al., 2022). These gradients can be used in Newton-type or quasi-Newton root-finding routines for optimal band identification.
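
When the analytic gradients are unavailable, or as a cross-check, one can ascend the band thresholds numerically. The sketch below is a hedged stand-in that uses central finite differences of a user-supplied value function rather than the scale-function formulas of Albrecher et al. (2022); the helper name `ascend_thresholds` and the toy quadratic surrogate are purely illustrative.

```python
import numpy as np

# Hedged sketch: gradient ascent on band thresholds when only a (possibly
# Monte Carlo) value function V(thresholds) is available.
def ascend_thresholds(value_fn, theta0, lr=0.05, h=1e-2, n_iter=200):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        grad = np.zeros_like(theta)
        for j in range(theta.size):                 # central finite differences
            e = np.zeros_like(theta); e[j] = h
            grad[j] = (value_fn(theta + e) - value_fn(theta - e)) / (2 * h)
        theta += lr * grad                          # plain gradient ascent step
        # simple projection: keep thresholds nonnegative and nondecreasing
        theta = np.maximum.accumulate(np.maximum(theta, 0.0))
    return theta

# Toy concave surrogate standing in for V^m (illustrative only):
toy_value = lambda th: -np.sum((th - np.array([1.8, 10.2]))**2)
print(ascend_thresholds(toy_value, theta0=[1.0, 5.0]))
```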

In the simulation-based, model-free regime, empirical gradient estimates are constructed via likelihood-ratio tricks:

$$\widehat{g} = \frac{1}{K} \sum_{k=1}^K G^{(k)} \sum_{i:\, t_i<\tau^{(k)}} \nabla_\theta \log \pi_\theta\big(u_i^{(k)} \mid X_i^{(k)}\big)$$

with $G^{(k)} = \sum_{i:\, t_i<\tau^{(k)}} e^{-c t_i}\, u_i^{(k)}\, \Delta t$ over $K$ sampled trajectories (Hamdouche et al., 2023).
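
This estimator can be dropped into a plain stochastic-gradient-ascent loop. The hedged sketch below does so for the diffusion model with an assumed bang-bang policy that pays at the maximal rate $a$ with probability $\mathrm{sigmoid}(\theta_0 + \theta_1 x)$; this is one convenient differentiable parameterisation, not the papers' architecture, and all hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_dividends(mu=0.5, sigma=1.0, a=0.6, c=0.05, x0=1.0,
                        dt=0.05, horizon=20.0, n_iter=50, K=32,
                        lr=0.1, seed=0):
    """Score-function (REINFORCE-style) ascent for the diffusion dividend problem
    with a bang-bang policy: pay rate a with probability sigmoid(theta0 + theta1*x)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                             # policy parameters
    n_steps = int(horizon / dt)
    for _ in range(n_iter):
        grad_acc = np.zeros(2)
        for _k in range(K):                         # K sampled trajectories
            x, t, G = x0, 0.0, 0.0
            scores = np.zeros(2)                    # sum of grad log pi along the path
            for _i in range(n_steps):
                p = sigmoid(theta[0] + theta[1] * x)
                pay = rng.random() < p              # sample the binary payout action
                u = a if pay else 0.0
                scores += ((1.0 if pay else 0.0) - p) * np.array([1.0, x])
                G += np.exp(-c * t) * u * dt        # discounted dividend this step
                x += (mu - u) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
                t += dt
                if x < 0.0:                         # ruin ends the episode
                    break
            grad_acc += G * scores                  # likelihood-ratio gradient term
        theta += lr * grad_acc / K                  # stochastic gradient ascent
    return theta

print(reinforce_dividends())
```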

4. Algorithmic Schemes: REINFORCE, Actor–Critic, and Policy Improvement

Policy-gradient algorithms are implemented in several families:

  • Direct Policy-Gradient (REINFORCE-EXIT) Algorithms: Trajectories are sampled under the current policy; return-weighted score-function gradients are computed and parameters updated by stochastic gradient ascent. This is unbiased and model-free (Hamdouche et al., 2023).
  • Actor–Critic Algorithms: Online temporal-difference style updates are performed using a trained critic $V_w(x)$ to estimate the value function and compute advantage signals (Hamdouche et al., 2023); both actor and critic parameters are updated after each step (see the sketch after this list).
  • Entropy-Regularized Policy Improvement: Starting from an initial policy, repeated cycles of policy evaluation (using martingale-loss or temporal-difference learning) and policy improvement (analytic Gibbs-form updates) are performed, with theoretical monotonic improvement and global convergence guarantees (Bai et al., 2023).
  • Gradient-Based Root-Finding for Deterministic Bands: For finite-band strategies, a Newton-type or local root-finding step is applied to the last two band parameters at each stage, ascending the analytical gradient until the verification condition (an integro-differential generator inequality) is satisfied (Albrecher et al., 2022).
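
As a companion to the REINFORCE sketch above, here is a hedged one-step actor–critic update for the same diffusion setting: a quadratic-feature critic $V_w$ and the bang-bang actor, with the temporal-difference error used as the advantage signal. The features, learning rates, and episode-restart convention are assumptions for illustration only.

```python
import numpy as np

def actor_critic_step(theta, w, x, rng, mu=0.5, sigma=1.0, a=0.6,
                      c=0.05, dt=0.05, lr_actor=0.01, lr_critic=0.05):
    """One online actor-critic update: critic V_w(x) = w·[1, x, x^2],
    bang-bang actor paying rate a with probability sigmoid(theta0 + theta1*x)."""
    feats = lambda s: np.array([1.0, s, s * s])         # critic features (assumption)
    p = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * x)))
    pay = rng.random() < p
    u = a if pay else 0.0
    reward = u * dt                                      # dividend paid over this step
    x_next = x + (mu - u) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    ruined = x_next < 0.0
    v = feats(x) @ w
    v_next = 0.0 if ruined else feats(x_next) @ w        # value is zero after ruin
    td_error = reward + np.exp(-c * dt) * v_next - v     # temporal-difference error
    w += lr_critic * td_error * feats(x)                 # critic update
    score = ((1.0 if pay else 0.0) - p) * np.array([1.0, x])
    theta += lr_actor * td_error * score                 # actor update (TD error as advantage)
    return theta, w, x_next, ruined

rng = np.random.default_rng(0)
theta, w, x = np.zeros(2), np.zeros(3), 1.0
for _ in range(20_000):                                  # online learning over restarted episodes
    theta, w, x, ruined = actor_critic_step(theta, w, x, rng)
    if ruined:
        x = 1.0                                          # restart a fresh episode at x0
print(theta, w)
```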

Numerical stability and convergence rely on careful discretisation, accurate interpolation of scale functions and deficit densities, and judicious choice of hyperparameters (Albrecher et al., 2022, Hamdouche et al., 2023).

5. Numerical Performance and Empirical Validation

Empirical studies demonstrate rapid and accurate identification of optimal dividend strategies, confirming both the theoretical performance and computational efficiency of policy-gradient methods.

| Case | Method | Time | Bands | Values |
| --- | --- | --- | --- | --- |
| Erlang(2,1) | Gradient | <1 s | 2 | (1.803, 10.216) |
| Erlang(2,1) | ES | ≈1 000 s | 2 | (1.806, 10.215) |
| Pareto(1.5,1) barrier | Gradient | <1 s | 1 | 2.71036 |
| Pareto(1.5,1) barrier | ES | ≈1 000 s | 1 | 2.71036 |
| Erlang–Pareto mixture | Gradient | ≈3 600 s | 2 | (0.0053, 3.8877) |
| Erlang–Pareto mixture | ES | ≈28 800 s | 2 | (0.1524, 3.5115) |

Here "Gradient" denotes the policy-gradient method and "ES" the evolutionary-strategy baseline; times are computation times (Albrecher et al., 2022).

In diffusion models, both value function and critical band (barrier) parameter estimates converge to known analytic benchmarks as discretisation is refined (Bai et al., 2023). In the Cramér–Lundberg setting, the gradient method matches or exceeds evolutionary strategies by several orders of magnitude in computational time (Albrecher et al., 2022).

6. Extensions and Theoretical Guarantees

Policy-gradient methods generalise to models with:

  • Unknown or non-parametric model coefficients: Model-free algorithms remain applicable by employing relaxed/entropy-regularised reward structures and sampling (Bai et al., 2023).
  • Spectrally negative Lévy processes: Using Laplace inversion for scale functions, analogous root-finding applies (Albrecher et al., 2022).
  • Higher-dimensional or path-dependent controls: Actor–critic and policy-gradient mechanisms remain tractable for neural parameterisations of policies or value functions (Hamdouche et al., 2023, Aubert et al., 24 Nov 2025).
  • Self-exciting claim processes (Hawkes): RL-based policy-gradient techniques remain robust and match PDE-based benchmarks, even under claim clustering effects (Aubert et al., 24 Nov 2025); a simulation sketch of such clustered arrivals follows this list.
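
To make the clustering concrete, the following hedged sketch simulates Hawkes claim-arrival times by Ogata thinning for an exponential kernel. The intensity parameters are illustrative assumptions; such simulated arrivals could drive the surplus process in model-free policy-gradient training.

```python
import numpy as np

# Hedged sketch: Ogata thinning for claim arrivals from a Hawkes process with
# exponential kernel, lambda(t) = mu_h + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).
# Parameters are illustrative; stationarity requires alpha < beta.
def simulate_hawkes(mu_h=0.5, alpha=0.8, beta=1.5, t_max=100.0, seed=0):
    rng = np.random.default_rng(seed)
    t, events = 0.0, []
    while t < t_max:
        # intensity at the current time also bounds the decaying intensity ahead
        lam_bar = mu_h + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)              # candidate arrival time
        lam_t = mu_h + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
        if rng.random() <= lam_t / lam_bar:              # accept with prob lambda(t)/lam_bar
            events.append(t)
    return [ti for ti in events if ti <= t_max]

print(len(simulate_hawkes()))
```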

Theoretical results confirm global convergence (monotone improvement) for entropy-regularised policy improvement and existence and uniqueness of bounded classical or viscosity solutions to the associated HJB equations (Bai et al., 2023, Aubert et al., 24 Nov 2025).

7. Connections, Limitations, and Practical Guidance

The policy-gradient approach unifies model-based analytic optimisation and model-free RL. Analytic gradients enable fast and robust optimisation for low-dimensional, band-type policies. In simulation-based or high-dimensional settings, policy-gradient RL provides a feasible alternative. Limitations include the requirement of smoothness for analytic gradients and variance in simulation-based estimation (Albrecher et al., 2022).

Diagnostic recommendations include monitoring objective convergence, comparing to analytic or classical barrier strategies, and inspecting policy behaviour to confirm alignment with known structural properties (e.g., thresholding and bang–bang payouts) (Hamdouche et al., 2023).

The integration of policy-gradient machinery with stochastic control for dividend optimisation has considerably widened the class of tractable problems, enabling solutions in more general, realistic insurance risk models and supporting scalable computation for practical applications (Albrecher et al., 2022, Hamdouche et al., 2023, Bai et al., 2023, Aubert et al., 24 Nov 2025).
