
Combining Structural and Unstructured Data: A Topic-based Finite Mixture Model for Insurance Claim Prediction (2410.04684v1)

Published 7 Oct 2024 in stat.AP and cs.LG

Abstract: Modeling insurance claim amounts and classifying claims into different risk levels are critical yet challenging tasks. Traditional predictive models for insurance claims often overlook the valuable information embedded in claim descriptions. This paper introduces a novel approach by developing a joint mixture model that integrates both claim descriptions and claim amounts. Our method establishes a probabilistic link between textual descriptions and loss amounts, enhancing the accuracy of claims clustering and prediction. In our proposed model, the latent topic/component indicator serves as a proxy for both the thematic content of the claim description and the component of loss distributions. Specifically, conditioned on the topic/component indicator, the claim description follows a multinomial distribution, while the claim amount follows a component loss distribution. We propose two methods for model calibration: an EM algorithm for maximum a posteriori estimates, and an MH-within-Gibbs sampler algorithm for the posterior distribution. The empirical study demonstrates that the proposed methods work effectively, providing interpretable claims clustering and prediction.

Summary

  • The paper introduces the Loss Dirichlet Multinomial Mixture (LDMM) model for insurance claim prediction, integrating both structured claim amounts and unstructured textual descriptions.
  • LDMM creates a probabilistic link between claim descriptions and loss amounts, enhancing claims clustering and improving the ability to capture complex loss characteristics like multimodality and heavy tails.
  • The model, calibrated using EM or MH-within-Gibbs algorithms, allows for estimating risk measures like VaR and CTE, providing a new approach for individual claims reserving by leveraging textual data.

This paper introduces a novel topic-based finite mixture model, the Loss Dirichlet Multinomial Mixture (LDMM), for insurance claim prediction by combining both structured (claim amounts) and unstructured (claim descriptions) data. The core idea is to establish a probabilistic link between textual descriptions and loss amounts, thereby enhancing the accuracy of claims clustering and prediction.

The LDMM model posits that a claim case can be represented as a triplet $(Y, D, Z)$, where $Y$ is the claim loss, $D$ is the textual claim description, and $Z$ is an unobserved categorical variable indicating the topic or component. The model assumes that, conditioned on the topic/component indicator $Z$, the claim description $D$ follows a multinomial distribution, while the claim amount $Y$ follows a component loss distribution. The words $D_j$ in a document are assumed to be generated from a Dirichlet Multinomial Mixture (DMM) model.

The joint distribution is given by:

$$Z_i \mid \boldsymbol{\theta} \sim Dis(\boldsymbol{\theta}), \quad i=1,\ldots,n,$$

$$Y_i \mid Z_i, \boldsymbol{\Phi} \sim p_{Z_i}(\boldsymbol{\phi}_{Z_i}), \quad i=1,\ldots,n,$$

$$D_{i,j} \mid Z_i, \boldsymbol{\Psi} \sim Dis(\boldsymbol{\psi}_{Z_i}), \quad i=1,\ldots,n,\ j=1,\ldots,|D_i|,$$

$$\boldsymbol{\theta} \sim Dir(\boldsymbol{\alpha}),$$

$$\boldsymbol{\phi}_k \sim q_k(\boldsymbol{\beta}_k), \quad k=1,\ldots,K,$$

$$\boldsymbol{\psi}_k \sim Dir(\boldsymbol{\gamma}), \quad k=1,\ldots,K,$$

where:

  • $Z_i$ is the topic indicator for the $i$-th claim.
  • $\boldsymbol{\theta}$ is the parameter vector for the discrete distribution of topics.
  • $Dis$ denotes the discrete (categorical) distribution.
  • $Y_i$ is the claim amount for the $i$-th claim.
  • $\boldsymbol{\Phi}$ is the set of parameters for the component loss distributions.
  • $p_{Z_i}$ is the probability density function of the loss distribution for topic $Z_i$.
  • $\boldsymbol{\phi}_{Z_i}$ are the parameters of the $p_{Z_i}$ distribution.
  • $D_{i,j}$ is the $j$-th word in the claim description of the $i$-th claim.
  • $\boldsymbol{\Psi}$ is the set of parameters for the multinomial distributions of words given topics.
  • $\boldsymbol{\psi}_{Z_i}$ is the parameter vector for the multinomial distribution of words given topic $Z_i$.
  • $Dir$ denotes the Dirichlet distribution.
  • $\boldsymbol{\alpha}$ is the hyperparameter vector for the Dirichlet prior on $\boldsymbol{\theta}$.
  • $q_k$ is the prior distribution for the loss distribution parameters in component $k$.
  • $\boldsymbol{\beta}_k$ are the hyperparameters for the prior distribution $q_k$.
  • $\boldsymbol{\gamma}$ is the hyperparameter vector for the Dirichlet prior on $\boldsymbol{\psi}_k$.
  • $K$ is the number of topics/components.
  • $n$ is the number of claims.
  • $|D_i|$ is the length of the $i$-th document.
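
To make the generative process concrete, here is a minimal NumPy sketch that samples from the hierarchy above, assuming log-normal component losses (one of the component families used later in the paper); the dimensions and hyperparameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and hyperparameters (assumptions, not the paper's).
K, V, n = 3, 50, 1000            # topics/components, vocabulary size, claims
alpha = np.ones(K)               # Dirichlet hyperparameters for theta
gamma = np.full(V, 0.1)          # Dirichlet hyperparameters for each psi_k

# Global parameters: theta ~ Dir(alpha), psi_k ~ Dir(gamma), and log-normal
# parameters phi_k = (mu_k, sigma_k) drawn from a vague prior q_k.
theta = rng.dirichlet(alpha)
psi = rng.dirichlet(gamma, size=K)               # (K, V) word distributions
mu = rng.normal(8.0, 1.0, size=K)
sigma = rng.uniform(0.5, 1.5, size=K)

claims = []
for i in range(n):
    z = rng.choice(K, p=theta)                   # Z_i | theta ~ Dis(theta)
    y = rng.lognormal(mu[z], sigma[z])           # Y_i | Z_i ~ component loss
    length = rng.integers(1, 15)                 # short descriptions (1-14 words)
    d = rng.choice(V, size=length, p=psi[z])     # D_{i,j} | Z_i ~ Dis(psi_{Z_i})
    claims.append((y, d, z))
```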

Two methods are proposed for model calibration:

  1. An Expectation-Maximization (EM) algorithm for maximum a posteriori (MAP) estimates.
  2. A Metropolis-Hastings (MH)-within-Gibbs sampler algorithm for the posterior distribution.

The EM algorithm is used to obtain the MAP estimates of the parameters, while the MH-within-Gibbs sampler is employed to estimate the posterior distribution. The full conditional distributions for the latent variable ZZ, the topic distribution θ\boldsymbol{\theta}, and the word distributions ψk\boldsymbol{\psi}_k are derived, facilitating the implementation of the Gibbs sampler. For the loss distribution parameters ϕk\boldsymbol{\phi}_k, when conjugate priors are unavailable, the MH algorithm is used to sample from the full conditional distribution.
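
As an illustration of the sampler's structure, the sketch below implements one MH-within-Gibbs sweep for an LDMM with log-normal components: $Z$, $\boldsymbol{\theta}$, and $\boldsymbol{\psi}_k$ have conjugate full conditionals, while $\mu_k$ is updated by a random-walk MH step (with $\sigma_k$ held fixed and a flat prior assumed for brevity). The function layout and proposal scale are assumptions, not the paper's implementation.

```python
import numpy as np

def gibbs_sweep(y, docs, z, theta, psi, mu, sigma, alpha, gamma, rng, step=0.1):
    """One MH-within-Gibbs sweep for an LDMM with log-normal components.

    y: (n,) claim amounts; docs: list of word-index arrays; z: (n,) labels;
    theta: (K,); psi: (K, V); mu, sigma: (K,) log-normal parameters.
    """
    n, K = len(y), len(theta)
    V = psi.shape[1]

    # 1) Z_i | rest is proportional to
    #    theta_k * p_k(y_i; phi_k) * prod_v psi_{k,v}^{N_{i,v}}
    for i in range(n):
        counts = np.bincount(docs[i], minlength=V)
        log_p = (np.log(theta)
                 + counts @ np.log(psi).T                          # word likelihood
                 - np.log(y[i] * sigma * np.sqrt(2 * np.pi))       # log-normal
                 - (np.log(y[i]) - mu) ** 2 / (2 * sigma ** 2))    # log density
        p = np.exp(log_p - log_p.max())
        z[i] = rng.choice(K, p=p / p.sum())

    # 2) theta | Z ~ Dir(alpha + topic counts)  -- conjugate update
    theta[:] = rng.dirichlet(alpha + np.bincount(z, minlength=K))

    # 3) psi_k | Z, D ~ Dir(gamma + per-topic word counts)  -- conjugate update
    for k in range(K):
        wc = np.zeros(V)
        for i in np.where(z == k)[0]:
            wc += np.bincount(docs[i], minlength=V)
        psi[k] = rng.dirichlet(gamma + wc)

    # 4) phi_k: no conjugacy in general, so use random-walk MH. For brevity we
    #    update only mu_k under a flat prior and hold sigma_k fixed.
    for k in range(K):
        logs = np.log(y[z == k])
        if logs.size == 0:
            continue
        prop = mu[k] + step * rng.standard_normal()
        log_ratio = ((logs - mu[k]) ** 2 - (logs - prop) ** 2).sum() / (2 * sigma[k] ** 2)
        if np.log(rng.uniform()) < log_ratio:
            mu[k] = prop
    return z, theta, psi, mu, sigma
```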

Model selection is performed using four metrics: the deviance information criterion (DIC), Wasserstein distance, perplexity, and stability. The DIC is used to evaluate the goodness-of-fit of the entire model. The Wasserstein distance evaluates the goodness-of-fit of the finite mixture model for the claim loss, and perplexity and stability evaluate the performance of the DMM for the claim description.

The DIC is defined as:

$$DIC = p_D + \overline{D(\theta)}, \quad \text{or equivalently} \quad DIC = D(\bar\theta) + 2p_D,$$

where:

  • $p_D = \overline{D(\theta)} - D(\bar\theta)$ is the effective number of parameters.
  • $\bar\theta$ is the posterior expectation of $\theta$.
  • $\overline{D(\theta)}$ is the posterior expectation of $D(\theta)$.
  • $D(\theta) = -2 \log p(Y, D \mid \boldsymbol{\theta}, \boldsymbol{\Phi}, \boldsymbol{\Psi})$ is the deviance.
  • $\theta$ denotes all the parameters.
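
Given the posterior deviance draws produced by the sampler, DIC is a few lines to compute; the two inputs below are assumed to be precomputed from the Gibbs output.

```python
import numpy as np

def dic(deviance_draws, deviance_at_posterior_mean):
    """DIC = D(theta_bar) + 2 * p_D, where p_D = mean(D(theta)) - D(theta_bar)."""
    d_bar = np.mean(deviance_draws)                 # posterior expectation of D
    p_d = d_bar - deviance_at_posterior_mean        # effective number of parameters
    return deviance_at_posterior_mean + 2 * p_d     # equivalently p_d + d_bar
```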

The Wasserstein distance with index $\rho \ge 1$ between two probability measures $\xi$ and $\nu$ is defined by:

$$\mathcal{W}_\rho(\xi,\nu) = \inf_{\pi\in\Pi(\xi,\nu)}\left(\int_{\mathbb{R}\times\mathbb{R}}|x-y|^{\rho}\,\pi(dx,dy)\right)^{1/\rho},$$

where:

  • $\xi$ and $\nu$ are the probability measures on the real line.
  • $\Pi(\xi,\nu)$ is the set of couplings between $\xi$ and $\nu$.
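
For empirical measures on the real line, the infimum is attained by matching sorted samples (the quantile coupling), so the distance can be computed directly; a minimal sketch assuming equal sample sizes:

```python
import numpy as np

def wasserstein_empirical(x, y, rho=1.0):
    """Order-rho Wasserstein distance between equal-size empirical samples.

    On the real line the optimal coupling pairs sorted samples (quantiles).
    """
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert len(x) == len(y), "this sketch assumes equal sample sizes"
    return np.mean(np.abs(x - y) ** rho) ** (1.0 / rho)
```

For $\rho = 1$, `scipy.stats.wasserstein_distance` computes the same quantity without the equal-size restriction.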

Perplexity is calculated as:

$$\text{perplexity} = \exp\left\{-\frac{\sum_{i\in\mathcal{I}_{test}}\sum_{v=1}^{|V|}\log\left(\sum_{k=1}^K (\psi_{k,v})^{N_{i,v}}\,P(Z_i=k)\right)}{\sum_{i\in\mathcal{I}_{test}} |D_i|}\right\},$$

where:

  • $\mathcal{I}_{test}$ denotes the set of test indices.
  • $|V|$ is the vocabulary size, and $N_{i,v}$ is the number of occurrences of word $v$ in the $i$-th document.
  • $|D_i|$ is the number of words in the $i$-th document.
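
A direct NumPy transcription of this formula, assuming the test-set count matrix $N_{i,v}$ and the topic probabilities $P(Z_i = k)$ have been precomputed:

```python
import numpy as np

def perplexity(counts, psi, topic_probs, doc_lengths):
    """counts: (n, V) matrix of N_{i,v}; psi: (K, V) word distributions;
    topic_probs: (n, K) values of P(Z_i = k); doc_lengths: (n,) |D_i|."""
    total = 0.0
    for i in range(counts.shape[0]):
        # sum_k psi_{k,v}^{N_{i,v}} * P(Z_i = k), for every word v
        inner = (psi ** counts[i]) * topic_probs[i][:, None]   # (K, V)
        total += np.log(inner.sum(axis=0)).sum()
    return np.exp(-total / doc_lengths.sum())
```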

The stability of the parameters for topic $k$ is defined as:

$$\text{stability}_k = \frac{1}{T}\sum_{t=1}^T sim(\psi_k^{[t]}, \bar{\psi}_k),$$

where:

  • $sim$ is a vector similarity function.
  • $\psi_k^{[t]}$ is the $t$-th of $T$ posterior samples of $\psi_k$, and $\bar{\psi}_k$ is the posterior mean of $\psi_k$.
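
The paper leaves $sim$ generic; the sketch below uses cosine similarity as one concrete choice (an assumption), applied to the posterior draws $\psi_k^{[t]}$ of a single topic.

```python
import numpy as np

def stability(psi_draws):
    """psi_draws: (T, V) posterior samples psi_k^{[t]} for a single topic k."""
    psi_bar = psi_draws.mean(axis=0)                       # posterior mean
    cos = psi_draws @ psi_bar / (
        np.linalg.norm(psi_draws, axis=1) * np.linalg.norm(psi_bar))
    return cos.mean()                                      # average similarity
```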

The posterior predictive distribution of the claim loss $Y_{n+1}$ given the claim description $D_{n+1}$ is:

$$p(Y_{n+1} \mid D_{n+1}, Y, D) = \sum_{k=1}^K \int_{\boldsymbol{\theta},\boldsymbol{\phi}_k,\boldsymbol{\psi}_k} p_k(Y_{n+1}; \boldsymbol{\phi}_k)\, \Pr(Z_{n+1}=k \mid D_{n+1}, \theta_k, \psi_k)\, p(\boldsymbol{\theta},\boldsymbol{\Phi},\boldsymbol{\Psi} \mid Y, D)\, d\boldsymbol{\theta}\, d\boldsymbol{\Phi}\, d\boldsymbol{\Psi}.$$
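
In practice this integral is approximated by Monte Carlo over the stored posterior draws: for each draw, compute the topic posterior $\Pr(Z_{n+1}=k \mid D_{n+1})$, pick a component, and simulate a loss. A hedged sketch, assuming log-normal components and an assumed storage layout for the draws:

```python
import numpy as np

def predictive_samples(new_doc_counts, draws, rng):
    """Monte Carlo draws from p(Y_{n+1} | D_{n+1}, Y, D).

    new_doc_counts: (V,) word counts of the new description.
    draws: posterior samples, each a dict with keys 'theta' (K,),
    'psi' (K, V), 'mu' (K,), 'sigma' (K,) -- an assumed layout.
    """
    out = []
    for d in draws:
        # Pr(Z_{n+1} = k | D_{n+1}, theta, psi), up to normalization
        log_w = np.log(d['theta']) + new_doc_counts @ np.log(d['psi']).T
        w = np.exp(log_w - log_w.max())
        k = rng.choice(len(w), p=w / w.sum())              # draw a component
        out.append(rng.lognormal(d['mu'][k], d['sigma'][k]))
    return np.array(out)
```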

The paper estimates the risk measures Value-at-Risk (VaR) and Conditional Tail Expectation (CTE) for reported but not settled (RBNS) claims. VaR quantifies the maximum loss that a portfolio might suffer over a given time frame, at a given confidence level:

$$VaR_{n+1}(\alpha) = \inf\left\{y \in Y_{n+1}^{[1:T]} : \Pr(L > y) \leq 1-\alpha\right\},$$

where:

  • $L$ follows the empirical distribution of $Y_{n+1}^{[1:T]}$.
  • $\alpha$ is the confidence level.

CTE is a risk measure that considers the average of losses above the VaR threshold:

$$CTE_{n+1}(\alpha) = \frac{\sum_{t:\, Y_{n+1}^{[t]} > VaR_{n+1}(\alpha)} Y_{n+1}^{[t]}}{\left|\left\{y \in Y_{n+1}^{[1:T]} : y > VaR_{n+1}(\alpha)\right\}\right|},$$

where:

  • $Y_{n+1}^{[t]}$ is the $t$-th of $T$ simulated samples of the claim loss.
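
Both risk measures are then empirical functionals of the simulated losses $Y_{n+1}^{[1:T]}$; a minimal sketch:

```python
import numpy as np

def var_cte(samples, alpha=0.95):
    """Empirical VaR and CTE from simulated losses Y_{n+1}^{[1:T]}."""
    samples = np.sort(np.asarray(samples))
    var = np.quantile(samples, alpha)        # smallest y with P(L > y) <= 1 - alpha
    tail = samples[samples > var]            # losses strictly beyond the VaR
    cte = tail.mean() if tail.size else var
    return var, cte
```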

The empirical study used a realistic synthetic dataset of 90,000 workers' compensation insurance policies. The dataset includes claim amounts and textual claim descriptions. The claim amount distribution exhibits multimodality, right skewness, and thick tails. The claim descriptions are short texts, typically ranging from 1 to 14 words. The authors applied pre-processing steps such as lowercasing, stemming, lemmatization, and stop word removal to the textual data.
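
The exact preprocessing code is not given in the paper; a typical pipeline along the lines described, using NLTK (an assumed tool choice), might look like:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Assumes the NLTK 'stopwords' and 'wordnet' resources have been downloaded.
STOP = set(stopwords.words('english'))
STEMMER, LEMMATIZER = PorterStemmer(), WordNetLemmatizer()

def preprocess(description: str) -> list[str]:
    tokens = re.findall(r'[a-z]+', description.lower())   # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOP]         # stop word removal
    return [STEMMER.stem(LEMMATIZER.lemmatize(t)) for t in tokens]  # lemmatize, stem
```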

The paper presents results for various LDMM models with different numbers of components ($K = 2, 3, 4, 5, 6$) and different component loss distributions, including log-normal, GB2, and Pareto distributions. The MAP estimates of the parameters are obtained using the EM algorithm, and the posterior distributions are estimated using the MH-within-Gibbs sampler.

The results indicate that the LDMM model can effectively capture the complex characteristics of claim amounts. The model selection results suggest that a mixture of 3 or 4 components is sufficient to capture the complexity of the claim amount distribution. The component analysis reveals that the LDMM model can decompose the empirical loss distribution into components with different characteristics, which can be explained by the corresponding claim description topics.

For example, in a 2-component model, one component may be associated with severe claims related to back and shoulder injuries, while the other component may be associated with mild claims related to finger and eye injuries. In a 4-component model, the components can be further differentiated based on the types of injuries and the circumstances of the accident.

The paper concludes that the LDMM model advances individual claims reserving by effectively integrating textual claim descriptions, capturing complex characteristics of claim amounts, including multimodality, skewness, and heavy tails.
