- The paper introduces the Loss Dirichlet Multinomial Mixture (LDMM) model for insurance claim prediction, integrating both structured claim amounts and unstructured textual descriptions.
- LDMM creates a probabilistic link between claim descriptions and loss amounts, enhancing claims clustering and improving the ability to capture complex loss characteristics like multimodality and heavy tails.
- The model, calibrated using EM or MH-within-Gibbs algorithms, allows for estimating risk measures like VaR and CTE, providing a new approach for individual claims reserving by leveraging textual data.
This paper introduces a novel topic-based finite mixture model, the Loss Dirichlet Multinomial Mixture (LDMM), for insurance claim prediction by combining both structured (claim amounts) and unstructured (claim descriptions) data. The core idea is to establish a probabilistic link between textual descriptions and loss amounts, thereby enhancing the accuracy of claims clustering and prediction.
The LDMM model represents a claim case as a triplet $(Y, D, Z)$, where $Y$ is the claim loss, $D$ is the textual claim description, and $Z$ is an unobserved categorical variable indicating the topic or component. Conditional on the topic/component indicator $Z$, the claim description $D$ follows a multinomial distribution and the claim amount $Y$ follows a component loss distribution. The words $D_j$ in a document are assumed to be generated from a Dirichlet Multinomial Mixture (DMM) model.
The joint distribution is given by:
$$Z_i \mid \theta \sim \mathrm{Dis}(\theta), \quad i = 1, \ldots, n,$$
$$Y_i \mid Z_i, \Phi \sim p_{Z_i}(\phi_{Z_i}), \quad i = 1, \ldots, n,$$
$$D_{i,j} \mid Z_i, \Psi \sim \mathrm{Multi}(\psi_{Z_i}), \quad j = 1, \ldots, n_i, \quad i = 1, \ldots, n,$$
$$\theta \sim \mathrm{Dir}(\alpha),$$
$$\phi_k \sim G_k(\gamma_k), \quad k = 1, \ldots, K,$$
$$\psi_k \sim \mathrm{Dir}(\beta), \quad k = 1, \ldots, K,$$
where:
- $Z_i$ is the topic indicator for the $i$-th claim.
- $\theta$ is the parameter vector for the discrete distribution of topics.
- $\mathrm{Dis}$ denotes the discrete distribution.
- $Y_i$ is the claim amount for the $i$-th claim.
- $\Phi$ is the set of parameters for the component loss distributions.
- $p_k$ is the probability density function of the loss distribution for topic $k$.
- $\phi_k$ are the parameters of the $p_k$ distribution.
- $D_{i,j}$ is the $j$-th word in the claim description of the $i$-th claim.
- $\Psi$ is the set of parameters for the multinomial distributions of words given topics.
- $\psi_k$ is the parameter vector for the multinomial distribution of words given topic $k$.
- $\mathrm{Dir}$ denotes the Dirichlet distribution.
- $\alpha$ is the hyperparameter vector for the Dirichlet prior on $\theta$.
- $G_k$ is the prior distribution for the loss distribution parameters in component $k$.
- $\gamma_k$ are the hyperparameters for the prior distribution $G_k$.
- $\beta$ is the hyperparameter vector for the Dirichlet prior on $\psi_k$.
- $K$ is the number of topics/components.
- $n$ is the number of claims.
- $n_i$ is the length of the $i$-th document.
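To make the generative story concrete, the following is a minimal simulation sketch rather than the authors' code. It assumes log-normal component losses with fixed parameters (no prior draw for the loss parameters), integer word indices in place of a real vocabulary, and uniformly distributed document lengths; the names `dirichlet`, `simulate_ldmm`, and `max_len` are hypothetical.

```python
import random


def dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_k, 1) variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def simulate_ldmm(n, alpha, beta, lognormal_params, max_len=14, seed=0):
    """Simulate n claims as (topic, loss, words); words are vocabulary indices."""
    rng = random.Random(seed)
    K = len(alpha)
    theta = dirichlet(alpha, rng)                    # topic proportions ~ Dir(alpha)
    psi = [dirichlet(beta, rng) for _ in range(K)]   # word distribution per topic ~ Dir(beta)
    claims = []
    for _ in range(n):
        z = rng.choices(range(K), weights=theta)[0]  # topic indicator ~ Dis(theta)
        mu, sigma = lognormal_params[z]              # loss parameters held fixed (assumption)
        y = rng.lognormvariate(mu, sigma)            # loss | topic ~ log-normal (assumption)
        n_i = rng.randint(1, max_len)                # document length (assumption)
        words = rng.choices(range(len(beta)), weights=psi[z], k=n_i)
        claims.append((z, y, words))
    return claims


claims = simulate_ldmm(
    n=1000,
    alpha=[1.0, 1.0],
    beta=[0.5] * 20,
    lognormal_params=[(7.0, 0.5), (10.0, 1.0)],
)
```

Because every word in a document shares the single topic drawn for that claim, the same latent indicator drives both the text and the loss, which is the probabilistic link the model relies on.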
Two methods are proposed for model calibration:
- An Expectation-Maximization (EM) algorithm for maximum a posteriori (MAP) estimates.
- A Metropolis-Hastings (MH)-within-Gibbs sampler for the posterior distribution.
The EM algorithm yields the MAP estimates of the parameters, while the MH-within-Gibbs sampler draws from their posterior distribution. The full conditional distributions of the latent variable $Z_i$, the topic distribution $\theta$, and the word distributions $\psi_k$ are derived, which makes the Gibbs steps straightforward to implement. For the loss distribution parameters $\phi_k$, an MH step samples from the full conditional distribution when no conjugate prior is available.
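As an illustration of the step shared by both calibration methods, the sketch below computes the conditional probability of each topic for one claim given the current parameters. Under the model's conditional independence, this probability is proportional to the topic weight times the loss density times the product of word probabilities. It serves as the responsibility in an E-step or as the sampling distribution of the topic indicator in a Gibbs sweep. Log-normal components and the function names are assumptions, not taken from the paper.

```python
import math


def lognormal_logpdf(y, mu, sigma):
    """Log density of a log-normal distribution at y > 0."""
    return (-math.log(y * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(y) - mu) ** 2 / (2.0 * sigma ** 2))


def topic_posterior(y, words, theta, psi, lognormal_params):
    """P(topic = k | loss, words, parameters) for one claim, computed in log space."""
    log_w = []
    for k, theta_k in enumerate(theta):
        mu, sigma = lognormal_params[k]
        lw = math.log(theta_k) + lognormal_logpdf(y, mu, sigma)
        lw += sum(math.log(psi[k][w]) for w in words)  # words are vocabulary indices
        log_w.append(lw)
    m = max(log_w)                                     # log-sum-exp for numerical stability
    weights = [math.exp(lw - m) for lw in log_w]
    total = sum(weights)
    return [w / total for w in weights]


probs = topic_posterior(
    y=math.exp(10.0),
    words=[0, 0],
    theta=[0.5, 0.5],
    psi=[[0.9, 0.1], [0.1, 0.9]],
    lognormal_params=[(10.0, 0.5), (7.0, 0.5)],
)
```

In this toy call both the loss and the words favour the first topic, so nearly all of the probability mass lands there.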
Model selection is performed using four metrics: the deviance information criterion (DIC), Wasserstein distance, perplexity, and stability. The DIC is used to evaluate the goodness-of-fit of the entire model. The Wasserstein distance evaluates the goodness-of-fit of the finite mixture model for the claim loss, and perplexity and stability evaluate the performance of the DMM for the claim description.
The DIC is defined as:
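$$\mathrm{DIC} = \overline{D(\Theta)} + p_D, \qquad p_D = \overline{D(\Theta)} - D(\bar{\Theta}),$$
where:
- $p_D$ is the effective number of parameters.
- $\overline{D(\Theta)}$ is the posterior expectation of $D(\Theta)$.
- $\bar{\Theta}$ is the posterior expectation of $\Theta$.
- $D(\Theta)$ is the deviance.
- $\Theta$ indicates all the parameters.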
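A small sketch of how the DIC could be computed from MCMC output, assuming the common convention that the deviance is minus twice the log-likelihood and that the effective number of parameters is the posterior mean deviance minus the deviance at the posterior mean. The function name and inputs are hypothetical; the log-likelihood values would come from the fitted model.

```python
def dic(loglik_draws, loglik_at_posterior_mean):
    """Return (DIC, effective number of parameters) from log-likelihood values.

    loglik_draws: log-likelihood evaluated at each posterior draw.
    loglik_at_posterior_mean: log-likelihood evaluated at the posterior mean.
    """
    dev_draws = [-2.0 * ll for ll in loglik_draws]   # deviance at each draw
    mean_dev = sum(dev_draws) / len(dev_draws)       # posterior mean deviance
    dev_at_mean = -2.0 * loglik_at_posterior_mean    # deviance at posterior mean
    p_d = mean_dev - dev_at_mean                     # effective number of parameters
    return mean_dev + p_d, p_d


dic_value, p_d = dic([-51.0, -49.0, -50.0], -48.5)
```

Lower DIC values indicate a better trade-off between fit and complexity, so candidate models with different numbers of components can be ranked by this quantity.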
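The Wasserstein distance with index $\rho$ between two probability measures $\mu$ and $\nu$ is defined by:
$$W_\rho(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R} \times \mathbb{R}} |x - y|^{\rho} \, d\pi(x, y) \right)^{1/\rho},$$
where:
- $\mu$ and $\nu$ are the probability measures on the real line.
- $\Pi(\mu, \nu)$ is the set of couplings between $\mu$ and $\nu$.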
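On the real line, the Wasserstein distance between two empirical measures with the same number of equally weighted points reduces to matching sorted samples, since the monotone coupling is optimal for any index of at least 1. The sketch below uses that fact to compare, say, observed losses with losses simulated from a fitted mixture. The function name and the equal-sample-size restriction are simplifying assumptions.

```python
def wasserstein_empirical(xs, ys, rho=1.0):
    """Wasserstein distance of index rho between two equal-size samples on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    if rho < 1.0:
        raise ValueError("rho must be at least 1")
    xs, ys = sorted(xs), sorted(ys)                  # monotone (quantile) coupling
    cost = sum(abs(x - y) ** rho for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / rho)


shifted = wasserstein_empirical([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
spread = wasserstein_empirical([0.0, 0.0], [3.0, 4.0], rho=2.0)
```

A pure shift of one unit gives a distance of one for any index, which is a quick sanity check on an implementation.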
Perplexity is calculated as:
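$$\mathrm{Perplexity} = \exp\left( - \frac{\sum_{i \in \mathcal{T}} \log p(D_i)}{\sum_{i \in \mathcal{T}} n_i} \right),$$
where:
- $\mathcal{T}$ denotes the set of test indices.
- $n_i$ is the number of words in the $i$-th document.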
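The sketch below shows one way to evaluate perplexity for held-out descriptions under a fitted mixture of multinomials: the exponential of the negative total log-likelihood divided by the total word count. It assumes point estimates of the topic proportions and word distributions are available, and it treats each document as an ordered word sequence (so the multinomial coefficient is dropped). Function and argument names are hypothetical.

```python
import math


def doc_log_likelihood(words, theta, psi):
    """log p(document) = log sum_k theta_k * prod_j psi_k[word_j], via log-sum-exp."""
    log_terms = [
        math.log(theta_k) + sum(math.log(psi[k][w]) for w in words)
        for k, theta_k in enumerate(theta)
    ]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def perplexity(test_docs, theta, psi):
    """exp of minus the total held-out log-likelihood per word."""
    total_ll = sum(doc_log_likelihood(doc, theta, psi) for doc in test_docs)
    total_words = sum(len(doc) for doc in test_docs)
    return math.exp(-total_ll / total_words)


uniform_psi = [[0.25] * 4, [0.25] * 4]
ppl = perplexity([[0, 1], [2, 3, 3]], theta=[0.3, 0.7], psi=uniform_psi)
```

With uniform word distributions over a four-word vocabulary the perplexity equals the vocabulary size, which gives a baseline that any informative topic model should beat.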
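The stability of parameters for topic $k$ is defined as:
$$\mathrm{Stability}_k = \frac{1}{T} \sum_{t=1}^{T} \mathrm{sim}\left(\psi_k^{(t)}, \bar{\psi}_k\right),$$
where:
- $\mathrm{sim}(\cdot, \cdot)$ is a vector similarity function.
- $\bar{\psi}_k$ is the posterior mean of the topic-$k$ word distribution $\psi_k$.
- $\psi_k^{(t)}$ is the $t$-th of $T$ posterior draws of $\psi_k$.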
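A minimal sketch of a stability score for one topic, assuming the similarity function is cosine similarity and that stability is the average similarity between each posterior draw of the topic's word distribution and the posterior mean of those draws. Both choices are assumptions made for illustration; the paper only states that a vector similarity function is used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_stability(draws):
    """Average similarity between each draw of a word distribution and the draws' mean."""
    T = len(draws)
    V = len(draws[0])
    mean = [sum(d[v] for d in draws) / T for v in range(V)]
    return sum(cosine(d, mean) for d in draws) / T


stable = topic_stability([[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]])
unstable = topic_stability([[1.0, 0.0], [0.0, 1.0]])
```

Draws that barely move across iterations score close to one, while a topic whose word distribution keeps changing scores lower.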
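The posterior predictive distribution of the claim loss $Y$ given the claim description $D$ is:
$$p(y \mid D, \mathcal{X}) = \int \sum_{k=1}^{K} \Pr(Z = k \mid D, \theta, \Psi) \, p_k(y \mid \phi_k) \; p(\theta, \Phi, \Psi \mid \mathcal{X}) \, d\theta \, d\Phi \, d\Psi,$$
where $\mathcal{X}$ denotes the observed claims and $\Pr(Z = k \mid D, \theta, \Psi) \propto \theta_k \prod_{j} \psi_{k, D_j}$ is the topic weight implied by the description.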
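One way to draw from such a predictive distribution by simulation is sketched below, assuming posterior draws of the topic proportions, word distributions, and log-normal loss parameters are already available. For each draw, the description determines topic weights, a topic is sampled, and a loss is simulated from that component. The data layout and function names are hypothetical.

```python
import math
import random


def topic_weights(words, theta, psi):
    """P(topic = k | description, parameters), proportional to theta_k * prod_j psi_k[word_j]."""
    log_w = [
        math.log(theta_k) + sum(math.log(psi[k][w]) for w in words)
        for k, theta_k in enumerate(theta)
    ]
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]


def sample_predictive_losses(words, posterior_draws, seed=0):
    """One simulated loss per posterior draw (theta, psi, lognormal_params)."""
    rng = random.Random(seed)
    losses = []
    for theta, psi, lognormal_params in posterior_draws:
        w = topic_weights(words, theta, psi)
        k = rng.choices(range(len(w)), weights=w)[0]   # topic given the description
        mu, sigma = lognormal_params[k]
        losses.append(rng.lognormvariate(mu, sigma))   # loss given the topic
    return losses


# Toy "posterior": the same parameter values repeated, for illustration only.
draw = ([0.5, 0.5], [[0.99, 0.01], [0.01, 0.99]], [(10.0, 0.1), (2.0, 0.1)])
losses = sample_predictive_losses([0, 0, 0], [draw] * 200)
```

The resulting sample can be fed directly into empirical risk measures, since it already averages over both topic uncertainty and parameter uncertainty.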
The paper estimates the risk measures Value-at-Risk (VaR) and Conditional Tail Expectation (CTE) for reported but not settled (RBNS) claims. VaR quantifies the maximum loss that a portfolio might suffer over a given time frame, at a given confidence level:
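$$\mathrm{VaR}_q(\tilde{Y}) = \inf\left\{ y \in \mathbb{R} : \Pr(\tilde{Y} \le y) \ge q \right\},$$
where:
- $\tilde{Y}$ follows the empirical distribution of $\{Y^{(t)}\}_{t=1}^{T}$.
- $q$ is the confidence level.
CTE is a risk measure that considers the average of losses above the VaR threshold:
$$\mathrm{CTE}_q(\tilde{Y}) = \frac{\sum_{t=1}^{T} Y^{(t)} \, \mathbb{1}\{Y^{(t)} > \mathrm{VaR}_q(\tilde{Y})\}}{\sum_{t=1}^{T} \mathbb{1}\{Y^{(t)} > \mathrm{VaR}_q(\tilde{Y})\}},$$
where:
- $Y^{(t)}$ is the $t$-th simulated sample of the claim loss.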
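Both risk measures can be estimated from simulated losses with a few lines of code. The sketch below takes VaR as the smallest simulated loss whose empirical distribution function reaches the confidence level, and CTE as the mean of simulated losses strictly above that value. The function name is hypothetical, and the fallback when no sample exceeds the VaR is a choice made here, not taken from the paper.

```python
import math


def var_cte(samples, q):
    """Empirical VaR and CTE at confidence level q from simulated losses."""
    ordered = sorted(samples)
    n = len(ordered)
    # Smallest order statistic whose empirical CDF is at least q;
    # the small offset guards against floating-point error in q * n.
    idx = max(math.ceil(q * n - 1e-9) - 1, 0)
    var = ordered[idx]
    tail = [y for y in ordered if y > var]
    cte = sum(tail) / len(tail) if tail else var     # fallback: no loss above VaR
    return var, cte


var_90, cte_90 = var_cte(list(range(1, 11)), 0.9)
var_50, cte_50 = var_cte(list(range(1, 11)), 0.5)
```

For ten equally likely losses of 1 to 10, the 90% VaR is 9 and the corresponding CTE is the single loss above it, 10.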
The empirical study uses a realistic synthetic dataset of 90,000 workers' compensation insurance policies. The dataset includes claim amounts and textual claim descriptions. The claim amount distribution exhibits multimodality, right skewness, and thick tails. The claim descriptions are short texts, typically ranging from 1 to 14 words. The authors pre-process the textual data by lowercasing, stemming, lemmatization, and stop word removal.
The paper presents results for various LDMM models with different numbers of components ($K$) and different component loss distributions, including log-normal, GB2, and Pareto distributions. The MAP estimates of the parameters are obtained using the EM algorithm, and the posterior distributions are estimated using the MH-within-Gibbs sampler.
The results indicate that the LDMM model can effectively capture the complex characteristics of claim amounts. The model selection results suggest that a mixture of 3 or 4 components is sufficient to capture the complexity in the claims amount distribution. The component analysis reveals that the LDMM model can decompose the empirical loss distribution into components with different characteristics, which can be explained by the corresponding claim description topics.
For example, in a 2-component model, one component may be associated with severe claims related to back and shoulder injuries, while the other component may be associated with mild claims related to finger and eye injuries. In a 4-component model, the components can be further differentiated based on the types of injuries and the circumstances of the accident.
The paper concludes that the LDMM model represents an advancement in individual claims reserving by effectively integrating textual claim descriptions and can capture complex characteristics of claims amount, including their multimodality, skewness, and heavy-tailed nature.