Combining Structural and Unstructured Data: A Topic-based Finite Mixture Model for Insurance Claim Prediction
(2410.04684v1)
Published 7 Oct 2024 in stat.AP and cs.LG
Abstract: Modeling insurance claim amounts and classifying claims into different risk levels are critical yet challenging tasks. Traditional predictive models for insurance claims often overlook the valuable information embedded in claim descriptions. This paper introduces a novel approach by developing a joint mixture model that integrates both claim descriptions and claim amounts. Our method establishes a probabilistic link between textual descriptions and loss amounts, enhancing the accuracy of claims clustering and prediction. In our proposed model, the latent topic/component indicator serves as a proxy for both the thematic content of the claim description and the component of loss distributions. Specifically, conditioned on the topic/component indicator, the claim description follows a multinomial distribution, while the claim amount follows a component loss distribution. We propose two methods for model calibration: an EM algorithm for maximum a posteriori estimates, and an MH-within-Gibbs sampler algorithm for the posterior distribution. The empirical study demonstrates that the proposed methods work effectively, providing interpretable claims clustering and prediction.
Summary
The paper introduces the Loss Dirichlet Multinomial Mixture (LDMM) model for insurance claim prediction, integrating both structured claim amounts and unstructured textual descriptions.
LDMM creates a probabilistic link between claim descriptions and loss amounts, enhancing claims clustering and improving the ability to capture complex loss characteristics like multimodality and heavy tails.
The model, calibrated using EM or MH-within-Gibbs algorithms, allows for estimating risk measures like VaR and CTE, providing a new approach for individual claims reserving by leveraging textual data.
This paper introduces a novel topic-based finite mixture model, the Loss Dirichlet Multinomial Mixture (LDMM), for insurance claim prediction by combining both structured (claim amounts) and unstructured (claim descriptions) data. The core idea is to establish a probabilistic link between textual descriptions and loss amounts, thereby enhancing the accuracy of claims clustering and prediction.
The LDMM model posits that a claim case can be represented as a triplet (Y, D, Z), where Y is the claim loss, D is the textual claim description, and Z is an unobserved categorical variable indicating the topic or component. The model assumes that, conditioned on the topic/component indicator Z, the claim description D follows a multinomial distribution, while the claim amount Y follows a component loss distribution. The words $D_{i,j}$ in a document are assumed to be generated from a Dirichlet Multinomial Mixture (DMM) model.
The joint distribution is given by:
$$
\begin{aligned}
Z_i \mid \theta &\sim \mathrm{Dis}(\theta), && i = 1, \dots, n,\\
Y_i \mid Z_i, \Phi &\sim p_{Z_i}(\phi_{Z_i}), && i = 1, \dots, n,\\
D_{i,j} \mid Z_i, \Psi &\sim \mathrm{Dis}(\psi_{Z_i}), && i = 1, \dots, n,\ j = 1, \dots, |D_i|,\\
\theta &\sim \mathrm{Dir}(\alpha),\\
\phi_k &\sim q_k(\beta_k), && k = 1, \dots, K,\\
\psi_k &\sim \mathrm{Dir}(\gamma), && k = 1, \dots, K,
\end{aligned}
$$
where:
$Z_i$ is the topic indicator for the $i$-th claim.
$\theta$ is the parameter vector for the discrete distribution of topics.
$\mathrm{Dis}$ denotes the discrete distribution.
$Y_i$ is the claim amount for the $i$-th claim.
$\Phi$ is the set of parameters for the component loss distributions.
$p_{Z_i}$ is the probability density function of the loss distribution for topic $Z_i$.
$\phi_{Z_i}$ are the parameters of the $p_{Z_i}$ distribution.
$D_{i,j}$ is the $j$-th word in the claim description of the $i$-th claim.
$\Psi$ is the set of parameters for the multinomial distributions of words given topics.
$\psi_{Z_i}$ is the parameter vector for the multinomial distribution of words given topic $Z_i$.
$\mathrm{Dir}$ denotes the Dirichlet distribution.
$\alpha$ is the hyperparameter vector for the Dirichlet prior on $\theta$.
$q_k$ is the prior distribution for the loss distribution parameters in component $k$.
$\beta_k$ are the hyperparameters for the prior distribution $q_k$.
$\gamma$ is the hyperparameter vector for the Dirichlet prior on $\psi_k$.
$K$ is the number of topics/components.
$n$ is the number of claims.
$|D_i|$ is the length of the $i$-th document.
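To make the generative hierarchy concrete, here is a minimal simulation sketch of the LDMM, assuming log-normal component loss distributions; all dimensions, hyperparameter values, and variable names below are illustrative placeholders rather than the paper's choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and hyperparameters (not taken from the paper).
K, V, n = 3, 50, 1000          # topics, vocabulary size, number of claims
alpha = np.ones(K)             # Dirichlet prior on topic proportions theta
gamma = np.full(V, 0.1)        # Dirichlet prior on word distributions psi_k

# Draw global parameters from their priors.
theta = rng.dirichlet(alpha)                 # topic proportions
psi = rng.dirichlet(gamma, size=K)           # K word distributions over V words
# Component loss parameters phi_k = (mu_k, sigma_k), drawn from a vague prior q_k.
mu = rng.normal(8.0, 1.0, size=K)
sigma = rng.gamma(2.0, 0.5, size=K)

claims = []
for i in range(n):
    z = rng.choice(K, p=theta)                       # Z_i | theta ~ Dis(theta)
    doc_len = rng.integers(1, 15)                    # short descriptions (1-14 words)
    words = rng.choice(V, size=doc_len, p=psi[z])    # D_{i,j} | Z_i ~ Dis(psi_{Z_i})
    y = rng.lognormal(mu[z], sigma[z])               # Y_i | Z_i ~ component loss distribution
    claims.append((y, words, z))
```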
Two methods are proposed for model calibration:
An Expectation-Maximization (EM) algorithm for maximum a posteriori (MAP) estimates.
A Metropolis-Hastings (MH)-within-Gibbs sampler algorithm for the posterior distribution.
The EM algorithm is used to obtain the MAP estimates of the parameters, while the MH-within-Gibbs sampler is employed to estimate the posterior distribution. The full conditional distributions for the latent variable $Z$, the topic distribution $\theta$, and the word distributions $\psi_k$ are derived, facilitating the implementation of the Gibbs sampler. For the loss distribution parameters $\phi_k$, when conjugate priors are unavailable, the MH algorithm is used to sample from the full conditional distribution.
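As a rough illustration of the sampler's central step, the snippet below sketches how the full conditional of a single topic indicator $Z_i$ might be evaluated and sampled, assuming log-normal component loss distributions; the function signature and inputs are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.stats import lognorm

def sample_z_i(word_counts_i, y_i, theta, psi, mu, sigma, rng):
    """Draw Z_i from its full conditional, which is proportional to
    theta_k * prod_j psi_{k, d_ij} * p_k(y_i; phi_k) for each component k."""
    K = len(theta)
    log_prob = np.log(theta)                                    # prior term
    log_prob += word_counts_i @ np.log(psi).T                   # multinomial word likelihood
    log_prob += lognorm.logpdf(y_i, s=sigma, scale=np.exp(mu))  # component loss likelihood
    log_prob -= log_prob.max()                                  # stabilise before exponentiating
    prob = np.exp(log_prob)
    return rng.choice(K, p=prob / prob.sum())
```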
Model selection is performed using four metrics: the deviance information criterion (DIC), Wasserstein distance, perplexity, and stability. The DIC is used to evaluate the goodness-of-fit of the entire model. The Wasserstein distance evaluates the goodness-of-fit of the finite mixture model for the claim loss, and perplexity and stability evaluate the performance of the DMM for the claim description.
The DIC is defined as:
$$
\mathrm{DIC} = p_D + \overline{D(\theta)}, \quad \text{or equivalently} \quad \mathrm{DIC} = D(\bar{\theta}) + 2 p_D,
$$
where:
$p_D = \overline{D(\theta)} - D(\bar{\theta})$ is the effective number of parameters.
$\bar{\theta}$ is the posterior expectation of $\theta$.
$\overline{D(\theta)}$ is the posterior expectation of $D(\theta)$.
$D(\theta) = -2 \log p(Y, D \mid \theta, \Phi, \Psi)$ is the deviance.
$\theta$ here denotes the full set of model parameters.
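Given posterior draws of the deviance from the MCMC output, DIC can be estimated along the following lines (a minimal sketch; the deviance samples and the deviance at the posterior mean are assumed to be precomputed):

```python
import numpy as np

def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC = D(theta_bar) + 2 * p_D, with p_D = mean(D(theta)) - D(theta_bar)."""
    d_bar = np.mean(deviance_samples)             # posterior expectation of the deviance
    p_d = d_bar - deviance_at_posterior_mean      # effective number of parameters
    return deviance_at_posterior_mean + 2 * p_d   # equivalently p_d + d_bar
```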
The Wasserstein distance with index $\rho \ge 1$ between two probability measures $\xi$ and $\nu$ on the real line is defined by:
$$
W_\rho(\xi, \nu) = \inf_{\pi \in \Pi(\xi, \nu)} \left( \int_{\mathbb{R} \times \mathbb{R}} |x - y|^{\rho} \, \pi(dx, dy) \right)^{1/\rho},
$$
where $\Pi(\xi, \nu)$ denotes the set of joint probability measures on $\mathbb{R} \times \mathbb{R}$ with marginals $\xi$ and $\nu$.
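In practice, the order-1 Wasserstein distance between the observed losses and losses simulated from a fitted mixture can be estimated empirically, for example with SciPy; the arrays below are placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
observed_losses = rng.lognormal(8.0, 1.2, size=5000)   # placeholder for the claim data
simulated_losses = rng.lognormal(8.1, 1.1, size=5000)  # placeholder for draws from a fitted mixture

# Order-1 Wasserstein distance between the two empirical distributions.
w1 = wasserstein_distance(observed_losses, simulated_losses)
print(f"W_1 distance: {w1:.2f}")
```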
The paper estimates the risk measures Value-at-Risk (VaR) and Conditional Tail Expectation (CTE) for reported but not settled (RBNS) claims. VaR quantifies the maximum loss that a portfolio might suffer over a given time frame, at a given confidence level:
$$
\mathrm{VaR}_{n+1}(\alpha) = \inf\left\{ y \in Y_{n+1}^{[1:T]} : \Pr(L > y) \le 1 - \alpha \right\},
$$
where:
$L$ follows the empirical distribution of $Y_{n+1}^{[1:T]}$.
$\alpha$ is the confidence level.
CTE is a risk measure that considers the average of losses above the VaR threshold:
$$
\mathrm{CTE}_{n+1}(\alpha) = \mathbb{E}\left[ L \mid L > \mathrm{VaR}_{n+1}(\alpha) \right],
$$
where $Y_{n+1}^{[t]}$ is the $t$-th simulated sample of the claim loss, $t = 1, \dots, T$.
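Given $T$ simulated losses for a claim, the empirical VaR and CTE can be computed as below (a sketch consistent with the definitions above; the confidence level is a placeholder):

```python
import numpy as np

def var_cte(simulated_losses, alpha=0.95):
    """Empirical VaR and CTE at confidence level alpha from simulated samples."""
    var = np.quantile(simulated_losses, alpha)        # smallest y with Pr(L > y) <= 1 - alpha
    tail = simulated_losses[simulated_losses > var]   # losses exceeding the VaR threshold
    cte = tail.mean() if tail.size else var           # average loss beyond VaR
    return var, cte
```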
The empirical study used a realistic synthetic dataset of 90,000 workers' compensation insurance policies. The dataset includes claim amounts and textual claim descriptions. The claim amount distribution exhibits multimodality, right skewness, and thick tails. The claim descriptions are short texts, typically ranging from 1 to 14 words. The authors performed pre-processing steps such as lowercasing, stemming, lemmatization, and stop-word removal on the textual data.
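A typical pipeline for these pre-processing steps might look as follows, using NLTK as one common choice; the paper does not specify its tooling, so this is purely illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Fetch the resources needed for stop words and lemmatization.
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(description: str) -> list[str]:
    """Lowercase, drop stop words and non-alphabetic tokens, then lemmatize and stem."""
    tokens = description.lower().split()
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

print(preprocess("Worker strained lower back while lifting boxes"))
```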
The paper presents results for various LDMM models with different numbers of components (K=2,3,4,5,6) and different component loss distributions, including log-normal, GB2, and Pareto distributions. The MAP estimates of the parameters are obtained using the EM algorithm, and the posterior distributions are estimated using the MH-within-Gibbs sampler.
The results indicate that the LDMM model can effectively capture the complex characteristics of claim amounts. The model selection results suggest that a mixture of 3 or 4 components is sufficient to capture the complexity in the claims amount distribution. The component analysis reveals that the LDMM model can decompose the empirical loss distribution into components with different characteristics, which can be explained by the corresponding claim description topics.
For example, in a 2-component model, one component may be associated with severe claims related to back and shoulder injuries, while the other component may be associated with mild claims related to finger and eye injuries. In a 4-component model, the components can be further differentiated based on the types of injuries and the circumstances of the accident.
The paper concludes that the LDMM model advances individual claims reserving by effectively integrating textual claim descriptions, and that it captures complex characteristics of claim amounts, including their multimodality, skewness, and heavy-tailed nature.