Combining Structural and Unstructured Data: A Topic-based Finite Mixture Model for Insurance Claim Prediction

Published 7 Oct 2024 in stat.AP and cs.LG | (2410.04684v1)

Abstract: Modeling insurance claim amounts and classifying claims into different risk levels are critical yet challenging tasks. Traditional predictive models for insurance claims often overlook the valuable information embedded in claim descriptions. This paper introduces a novel approach by developing a joint mixture model that integrates both claim descriptions and claim amounts. Our method establishes a probabilistic link between textual descriptions and loss amounts, enhancing the accuracy of claims clustering and prediction. In our proposed model, the latent topic/component indicator serves as a proxy for both the thematic content of the claim description and the component of loss distributions. Specifically, conditioned on the topic/component indicator, the claim description follows a multinomial distribution, while the claim amount follows a component loss distribution. We propose two methods for model calibration: an EM algorithm for maximum a posteriori estimates, and an MH-within-Gibbs sampler algorithm for the posterior distribution. The empirical study demonstrates that the proposed methods work effectively, providing interpretable claims clustering and prediction.

Authors (3)

Summary

  • The paper introduces the Loss Dirichlet Multinomial Mixture (LDMM) model for insurance claim prediction, integrating both structured claim amounts and unstructured textual descriptions.
  • LDMM creates a probabilistic link between claim descriptions and loss amounts, enhancing claims clustering and improving the ability to capture complex loss characteristics like multimodality and heavy tails.
  • The model, calibrated using EM or MH-within-Gibbs algorithms, allows for estimating risk measures like VaR and CTE, providing a new approach for individual claims reserving by leveraging textual data.

This paper introduces a novel topic-based finite mixture model, the Loss Dirichlet Multinomial Mixture (LDMM), for insurance claim prediction by combining both structured (claim amounts) and unstructured (claim descriptions) data. The core idea is to establish a probabilistic link between textual descriptions and loss amounts, thereby enhancing the accuracy of claims clustering and prediction.

The LDMM model represents a claim as a triplet $(Y, D, Z)$, where $Y$ is the claim loss, $D$ is the textual claim description, and $Z$ is an unobserved categorical variable indicating the topic or component. Conditioned on the topic/component indicator $Z$, the claim description $D$ follows a multinomial distribution, while the claim amount $Y$ follows a component loss distribution. The words $D_j$ in a document are assumed to be generated from a Dirichlet Multinomial Mixture (DMM) model.

The joint distribution is given by:

$$Z_i \mid \boldsymbol{\theta} \sim Dis(\boldsymbol\theta), \quad i=1,\ldots,n,$$

$$Y_i \mid Z_i, \boldsymbol\Phi \sim p_{Z_i}(\boldsymbol\phi_{Z_i}), \quad i=1,\ldots,n,$$

$$D_{i,j} \mid Z_i, \boldsymbol\Psi \sim Dis(\boldsymbol\psi_{Z_i}), \quad j=1,\ldots,N_i, \; i=1,\ldots,n,$$

$$\boldsymbol\theta \sim Dir(\boldsymbol\alpha),$$

$$\boldsymbol\phi_k \sim \pi_k(\boldsymbol\gamma_k), \quad k=1,\ldots,K,$$

$$\boldsymbol\psi_k \sim Dir(\boldsymbol\beta), \quad k=1,\ldots,K,$$

where:

  • $Z_i$ is the topic indicator for the $i$-th claim.
  • $\boldsymbol\theta$ is the parameter vector for the discrete distribution of topics.
  • $Dis(\cdot)$ denotes the discrete distribution.
  • $Y_i$ is the claim amount for the $i$-th claim.
  • $\boldsymbol\Phi$ is the set of parameters for the component loss distributions.
  • $p_k$ is the probability density function of the loss distribution for topic $k$.
  • $\boldsymbol\phi_k$ are the parameters of the $p_k$ distribution.
  • $D_{i,j}$ is the $j$-th word in the claim description of the $i$-th claim.
  • $\boldsymbol\Psi$ is the set of parameters for the multinomial distributions of words given topics.
  • $\boldsymbol\psi_k$ is the parameter vector for the multinomial distribution of words given topic $k$.
  • $Dir(\cdot)$ denotes the Dirichlet distribution.
  • $\boldsymbol\alpha$ is the hyperparameter vector for the Dirichlet prior on $\boldsymbol\theta$.
  • $\pi_k$ is the prior distribution for the loss distribution parameters in component $k$.
  • $\boldsymbol\gamma_k$ are the hyperparameters for the prior distribution $\pi_k$.
  • $\boldsymbol\beta$ is the hyperparameter vector for the Dirichlet prior on $\boldsymbol\psi_k$.
  • $K$ is the number of topics/components.
  • $n$ is the number of claims.
  • $N_i$ is the length of the $i$-th document.
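The generative story above can be illustrated by simulating from it. The following is a minimal sketch, not the authors' code: it assumes log-normal component losses, and the function names, hyperparameter values, and vocabulary are all hypothetical choices made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_ldmm(n, K, vocab, doc_len, seed=0):
    """Simulate n claims (loss, words, topic) from a K-component model."""
    rng = random.Random(seed)
    # Topic proportions and per-topic word distributions from Dirichlet priors
    # (the concentration values 1.0 and 0.5 are illustrative assumptions).
    theta = sample_dirichlet([1.0] * K, rng)
    word_dists = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(K)]
    # Hypothetical log-normal loss parameters (mu_k, sigma_k), one per component.
    loss_params = [(6.0 + 1.5 * k, 0.5) for k in range(K)]
    claims = []
    for _ in range(n):
        z = rng.choices(range(K), weights=theta)[0]              # topic indicator
        y = rng.lognormvariate(*loss_params[z])                  # claim amount
        d = rng.choices(vocab, weights=word_dists[z], k=doc_len) # description words
        claims.append((y, d, z))
    return theta, word_dists, loss_params, claims
```

Because the topic indicator drives both the words and the loss, claims that share vocabulary tend to share a loss component, which is the link the model exploits.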

Two methods are proposed for model calibration:

  1. An Expectation-Maximization (EM) algorithm for maximum a posteriori (MAP) estimates.
  2. A Metropolis-Hastings (MH)-within-Gibbs sampler algorithm for the posterior distribution.

The EM algorithm is used to obtain the MAP estimates of the parameters, while the MH-within-Gibbs sampler is employed to estimate the posterior distribution. The full conditional distributions for the latent variable $Z_i$, the topic distribution $\boldsymbol\theta$, and the per-topic word distributions are derived, facilitating the implementation of the Gibbs sampler. For the loss distribution parameters $\boldsymbol\phi_k$, when conjugate priors are unavailable, the MH algorithm is used to sample from the full conditional distribution.
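Both calibration methods rely on the conditional probability of the topic indicator given a claim's loss and words: it serves as the responsibility in an E-step and as the full conditional when sampling the indicator in a Gibbs sweep. The sketch below computes it for one claim under stated assumptions: log-normal component losses, word distributions stored as dictionaries, and hypothetical function names. It is an illustration of the standard mixture-model calculation rather than the paper's implementation.

```python
import math

def lognormal_logpdf(y, mu, sigma):
    """Log density of a log-normal distribution at y > 0."""
    return (-math.log(y * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(y) - mu) ** 2 / (2.0 * sigma ** 2))

def topic_posterior(y, words, theta, word_dists, loss_params):
    """Return P(Z = k | y, words, parameters) for k = 0..K-1.

    theta: list of K topic weights
    word_dists: list of K dicts mapping word -> probability
    loss_params: list of K (mu, sigma) pairs (log-normal is an assumption)
    """
    log_w = []
    for k in range(len(theta)):
        lw = math.log(theta[k]) + lognormal_logpdf(y, *loss_params[k])
        lw += sum(math.log(word_dists[k][w]) for w in words)
        log_w.append(lw)
    # Normalise in log space to avoid underflow for long descriptions.
    m = max(log_w)
    weights = [math.exp(l - m) for l in log_w]
    total = sum(weights)
    return [w / total for w in weights]
```

In a Gibbs sweep the indicator would be drawn from these probabilities; in EM they would weight each claim's contribution to the M-step updates.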

Model selection is performed using four metrics: the deviance information criterion (DIC), Wasserstein distance, perplexity, and stability. The DIC is used to evaluate the goodness-of-fit of the entire model. The Wasserstein distance evaluates the goodness-of-fit of the finite mixture model for the claim loss, and perplexity and stability evaluate the performance of the DMM for the claim description.

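The DIC is defined as:

$$\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\boldsymbol\Theta}),$$

where:

  • $p_D$ is the effective number of parameters.
  • $\bar{D}$ is the posterior expectation of $D(\boldsymbol\Theta)$.
  • $\bar{\boldsymbol\Theta}$ is the posterior expectation of $\boldsymbol\Theta$.
  • $D(\boldsymbol\Theta)$ is the deviance, that is, minus twice the log-likelihood.
  • $\boldsymbol\Theta$ indicates all the parameters.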
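Given posterior draws, the DIC reduces to a few averages. The sketch below assumes the standard definition (mean deviance plus effective number of parameters, where the latter is the mean deviance minus the deviance at the posterior mean); the function name and inputs are hypothetical.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, effective number of parameters).

    deviance_draws: deviance evaluated at each posterior draw
    deviance_at_posterior_mean: deviance evaluated at the posterior mean
    """
    d_bar = sum(deviance_draws) / len(deviance_draws)  # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean           # effective parameter count
    return d_bar + p_d, p_d
```

For example, deviance draws of 102, 98, and 100 with a deviance of 97 at the posterior mean give an effective parameter count of 3 and a DIC of 103.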

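The Wasserstein distance with index $p$ between two probability measures $\mu$ and $\nu$ is defined by:

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int |x - y|^p \, d\pi(x, y) \right)^{1/p},$$

where:

  • $\mu$ and $\nu$ are the probability measures on the real line.
  • $\Pi(\mu, \nu)$ is the set of couplings between $\mu$ and $\nu$.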
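On the real line the optimal coupling pairs quantiles, so for two empirical samples of equal size the distance can be computed by sorting. The sketch below relies on that simplification; the function name is hypothetical, and unequal sample sizes would need quantile interpolation that is not shown here.

```python
def wasserstein_1d(xs, ys, p=1):
    """Wasserstein distance of index p between two equal-size empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(xs), sorted(ys)
    # Pair the i-th smallest of xs with the i-th smallest of ys.
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)
```

In the model-selection setting, one sample would be the observed claim amounts and the other a sample simulated from the fitted loss mixture.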

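Perplexity is calculated as:

$$\mathrm{Perplexity} = \exp\left\{ - \frac{\sum_{i \in I_{\text{test}}} \log p(D_i)}{\sum_{i \in I_{\text{test}}} N_i} \right\},$$

where:

  • $I_{\text{test}}$ denotes the set of test indices.
  • $N_i$ is the number of words in the $i$-th document.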
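Perplexity is the exponentiated negative average log-likelihood per word on held-out documents. The sketch below evaluates a document's log-likelihood under a mixture in which all words of a document share one topic, then aggregates over the test set. The function names and the dictionary representation of word distributions are assumptions made for illustration.

```python
import math

def dmm_doc_loglik(words, theta, word_dists):
    """Log-likelihood of one document: log sum_k theta_k * prod_j P(word_j | k)."""
    log_terms = [math.log(t) + sum(math.log(dist[w]) for w in words)
                 for t, dist in zip(theta, word_dists)]
    m = max(log_terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in log_terms))

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-total log-likelihood / total word count) over the test documents."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))
```

As a sanity check, a model that assigns every word of a four-word vocabulary probability 0.25 has perplexity 4; lower values indicate a better fit to the descriptions.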

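The stability of parameters for topic $k$, computed over $T$ posterior draws $\boldsymbol\psi_k^{(t)}$, is defined as:

$$\mathrm{stability}(k) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{sim}\left(\boldsymbol\psi_k^{(t)}, \bar{\boldsymbol\psi}_k\right),$$

where:

  • $\mathrm{sim}(\cdot, \cdot)$ is a vector similarity function.
  • $\bar{\boldsymbol\psi}_k$ is the posterior mean of $\boldsymbol\psi_k$, the word-distribution parameter vector of topic $k$.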
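Stability can be read as the average similarity between sampled parameter vectors for a topic and their posterior mean. The sketch below assumes cosine similarity as the vector similarity function and a simple average over draws; both choices, and the function names, are assumptions rather than details confirmed by the summary.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def topic_stability(draws):
    """Average similarity between each sampled vector and the mean of all draws.

    draws: list of sampled word-probability vectors for a single topic
    """
    T = len(draws)
    mean = [sum(col) / T for col in zip(*draws)]
    return sum(cosine_similarity(d, mean) for d in draws) / T
```

Values near 1 indicate that the sampler returns nearly the same word distribution for the topic at every draw, while lower values signal a topic whose content drifts across draws.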

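The posterior predictive distribution of the claim loss $Y$ given the claim description $D$ is:

$$p(y \mid D) = \sum_{k=1}^{K} \Pr(Z = k \mid D) \, p_k(y; \boldsymbol\phi_k),$$

where $\Pr(Z = k \mid D)$ is the posterior probability of topic $k$ given the description, and the parameters are integrated over their posterior distribution.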

The paper estimates the risk measures Value-at-Risk (VaR) and Conditional Tail Expectation (CTE) for reported but not settled (RBNS) claims. VaR quantifies the maximum loss that a portfolio might suffer over a given time frame, at a given confidence level:

$$\mathrm{VaR}_q(\tilde{Y}) = \inf\left\{ y : \Pr(\tilde{Y} \le y) \ge q \right\},$$

where:

  • $\tilde{Y}$ follows the empirical distribution of $\{Y^{(s)}\}_{s=1}^{S}$.
  • $q$ is the confidence level.

CTE is a risk measure that considers the average of losses above the VaR threshold:

$$\mathrm{CTE}_q(\tilde{Y}) = \frac{\sum_{s=1}^{S} Y^{(s)} \, \mathbb{1}\{Y^{(s)} > \mathrm{VaR}_q(\tilde{Y})\}}{\sum_{s=1}^{S} \mathbb{1}\{Y^{(s)} > \mathrm{VaR}_q(\tilde{Y})\}},$$

where:

  • $Y^{(s)}$ is the $s$-th simulated sample of the claim loss.
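Both risk measures can be estimated directly from simulated losses. The sketch below takes VaR as the smallest sample value whose empirical distribution function reaches the confidence level, and CTE as the mean of the samples strictly above that value. The function name and the fallback for an empty tail are assumptions.

```python
import math

def var_cte(samples, q):
    """Empirical VaR and CTE at confidence level q from simulated losses."""
    ys = sorted(samples)
    n = len(ys)
    # Smallest order statistic with empirical CDF >= q; the small offset guards
    # against floating-point products such as 0.07 * 100 exceeding 7 slightly.
    idx = max(math.ceil(q * n - 1e-9) - 1, 0)
    var = ys[idx]
    tail = [y for y in ys if y > var]
    # If no sample exceeds the VaR, fall back to the VaR itself (an assumption).
    cte = sum(tail) / len(tail) if tail else var
    return var, cte
```

With the ten losses 1 through 10 and a confidence level of 0.8, the VaR is 8 and the CTE is the average of 9 and 10, that is, 9.5.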

The empirical study used a realistic synthetic dataset of 90,000 workers' compensation insurance policies. The dataset includes claim amounts and textual claim descriptions. The claim amount distribution exhibits multimodality, right skewness, and thick tails. The claim descriptions are short texts, typically ranging from 1 to 14 words. The authors pre-processed the textual data by lowercasing, stemming, lemmatization, and stop word removal.

The paper presents results for various LDMM models with different numbers of components ($K$) and different component loss distributions, including log-normal, GB2, and Pareto distributions. The MAP estimates of the parameters are obtained using the EM algorithm, and the posterior distributions are estimated using the MH-within-Gibbs sampler.

The results indicate that the LDMM model can effectively capture the complex characteristics of claim amounts. The model selection results suggest that a mixture of 3 or 4 components is sufficient to capture the complexity of the claim amount distribution. The component analysis reveals that the LDMM model can decompose the empirical loss distribution into components with different characteristics, which can be explained by the corresponding claim description topics.

For example, in a 2-component model, one component may be associated with severe claims related to back and shoulder injuries, while the other component may be associated with mild claims related to finger and eye injuries. In a 4-component model, the components can be further differentiated based on the types of injuries and the circumstances of the accident.

The paper concludes that the LDMM model advances individual claims reserving by effectively integrating textual claim descriptions, and that it can capture complex characteristics of claim amounts, including multimodality, skewness, and heavy tails.
