
Improving Business Insurance Loss Models by Leveraging InsurTech Innovation (2401.16723v1)

Published 30 Jan 2024 in q-fin.RM

Abstract: Recent transformative and disruptive advancements in the insurance industry have embraced various InsurTech innovations. In particular, with the rapid progress in data science and computational capabilities, InsurTech is able to integrate a multitude of emerging data sources, shedding light on opportunities to enhance risk classification and claims management. This paper presents a groundbreaking effort as we combine real-life proprietary insurance claims information together with InsurTech data to enhance the loss model, a fundamental component of insurance companies' risk management. Our study further utilizes various machine learning techniques to quantify the predictive improvement of the InsurTech-enhanced loss model over that of the insurance in-house. The quantification process provides a deeper understanding of the value of the InsurTech innovation and advocates potential risk factors that are unexplored in traditional insurance loss modeling. This study represents a successful undertaking of an academic-industry collaboration, suggesting an inspiring path for future partnerships between industry and academic institutions.


Summary

  • The paper shows that InsurTech-enhanced models consistently outperform traditional in-house models in predicting business insurance loss costs.
  • The paper employs advanced methods such as LightGBM with Bayesian optimization and Tweedie GLM with elastic net, validated via 10-fold cross-validation.
  • The paper provides interpretability using feature importance measures, ALE plots, and SHAP values to explain key risk factors in business insurance.

This paper studies how InsurTech innovations can improve business insurance loss models. It presents a three-party research collaboration between an InsurTech company (Carpe Data), an insurance company, and a university (the IRisk Lab at the University of Illinois Urbana-Champaign). The authors combined proprietary insurance claims information from the insurance company with InsurTech data from Carpe Data.

The goal of the paper is two-fold:

  1. To mine predictive risk characteristics from InsurTech data and show the improvement compared to an insurance in-house loss model.
  2. To explain these risk characteristics using interpretable machine learning techniques and to propose potential rating factors for business insurance.

The paper focuses on Business Owner's Policy (BOP) insurance. BOP bundles multiple insurance coverages, such as property insurance and liability insurance, to protect small- and medium-sized business owners from risks.

The dataset has three coverage types:

  1. Business Building (BG): covers losses related to the business's buildings.
  2. Business Personal Property (BP): covers the risks of potential loss, damage, and liability issues for business-use property.
  3. Liability (LIAB): covers the risks of potential legal liability for losses caused by policyholders to a third party.

The response variable is the BOP loss cost during the observation period (2010 to 2020).

The InsurTech data from Carpe Data includes the following categories of features:

  • Business Information: basic information about business operations (e.g., coordinates, address, operating hours).
  • Firmographics: business segmentation characteristics (e.g., business size, company type).
  • Classification: categorization of a business (e.g., category, segment, NAICS code).
  • Risk Characteristics: features identifying potential risks (e.g., presence of alcohol, chemicals, or outdoor heaters).
  • Index: a suite of indexes on a 1-5 scale targeting dimensions of risk (e.g., customer rating, visibility, reputation).
  • Proximity Score: risks associated with surrounding businesses (e.g., proximity to combustibles, entertainment, traffic).
  • Territory Risk: density scores of risks within a zip code area.
  • Text Data: webpage content and customer reviews.

The combined dataset used for modeling has 825,622 observations and 596 features. The authors performed data cleaning and feature engineering and set up a relational database.
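Combining in-house claims with InsurTech features can be pictured as a relational join. The sketch below uses Python's built-in sqlite3; all table and column names are hypothetical, since the paper does not disclose its schema.

```python
import sqlite3

# In-memory database standing in for the paper's relational setup.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE claims (business_id TEXT, coverage TEXT, loss REAL)")
cur.execute("CREATE TABLE insurtech (business_id TEXT, visibility_index INTEGER, proximity_traffic REAL)")

# Toy rows: one in-house claims record and one InsurTech record per business.
cur.executemany("INSERT INTO claims VALUES (?, ?, ?)",
                [("b1", "BG", 1200.0), ("b2", "LIAB", 0.0)])
cur.executemany("INSERT INTO insurtech VALUES (?, ?, ?)",
                [("b1", 4, 0.7), ("b2", 2, 0.1)])

# Join the two sources on the business identifier to form the modeling table.
rows = cur.execute("""
    SELECT c.business_id, c.coverage, c.loss, i.visibility_index, i.proximity_traffic
    FROM claims c JOIN insurtech i USING (business_id)
    ORDER BY c.business_id
""").fetchall()
print(rows)
```

In practice the join key, deduplication rules, and handling of businesses missing from either source are the substantive data-engineering decisions; the SQL itself is the easy part.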

The paper illustrates how InsurTech data provides additional information using the law of total variance:

\operatorname{Var}(Y) = \operatorname{E}\left[\operatorname{Var}\left(Y \mid X^{IH}, X^{IT}\right)\right] + \operatorname{E}\left[\operatorname{Var}\left(\operatorname{E}\left[Y \mid X^{IH}, X^{IT}\right] \mid X^{IH}\right)\right] + \operatorname{Var}\left(\operatorname{E}\left[Y \mid X^{IH}\right]\right)

Where:

  • $Y$ is the claim amount.
  • $X^{IH}$ represents in-house rating factors.
  • $X^{IT}$ represents risk factors from InsurTech.
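The decomposition can be checked numerically on synthetic data. In the toy simulation below (the factor distributions and coefficients are purely illustrative), the middle term is strictly positive whenever the InsurTech factor carries signal beyond the in-house factor:

```python
import random
from collections import defaultdict

random.seed(42)
n = 100_000
rows = []
for _ in range(n):
    x_ih = random.choice([0, 1])       # in-house rating factor
    x_it = random.choice([0, 1, 2])    # InsurTech risk factor
    y = 2.0 * x_ih + 1.5 * x_it + random.gauss(0.0, 1.0)
    rows.append((x_ih, x_it, y))

def mean(xs):
    return sum(xs) / len(xs)

# Empirical conditional means E[Y | X_IH, X_IT] and E[Y | X_IH].
by_both, by_ih = defaultdict(list), defaultdict(list)
for x_ih, x_it, y in rows:
    by_both[(x_ih, x_it)].append(y)
    by_ih[x_ih].append(y)
m_both = {k: mean(v) for k, v in by_both.items()}
m_ih = {k: mean(v) for k, v in by_ih.items()}
grand = mean([y for _, _, y in rows])

# The three terms of the law-of-total-variance decomposition.
term1 = mean([(y - m_both[(a, b)]) ** 2 for a, b, y in rows])       # irreducible noise
term2 = mean([(m_both[(a, b)] - m_ih[a]) ** 2 for a, b, y in rows]) # extra InsurTech signal
term3 = mean([(m_ih[a] - grand) ** 2 for a, b, y in rows])          # in-house signal
total = mean([(y - grand) ** 2 for _, _, y in rows])

# The three terms sum exactly to Var(Y); term2 > 0 means InsurTech
# data explains variation the in-house factors alone cannot.
print(term1, term2, term3, total)
```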

The authors modeled the three coverage groups (BG, BP, LIAB) separately. They calibrated a Light Gradient-Boosting Machine (LightGBM) and a Tweedie Generalized Linear Model (GLM) with elastic net feature selection. LightGBM hyperparameters were tuned via Bayesian optimization with Mean Absolute Error (MAE) as the objective loss, while the Tweedie GLM was tuned via grid search; 10-fold cross-validation identified the best models. Double lift charts visually compare predictive performance, and the Gini index, Percentage Error (PE), Root Mean Squared Error (RMSE), and MAE quantify predictive accuracy.
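The accuracy metrics named above are standard. Below is a minimal sketch assuming the ordered-Lorenz-curve variant of the Gini index common in insurance ratemaking; the paper's exact normalization may differ.

```python
import math

def percentage_error(actual, pred):
    # Portfolio-level bias: (total predicted - total actual) / total actual.
    return (sum(pred) - sum(actual)) / sum(actual)

def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def gini(actual, pred):
    # Sort policies by predicted loss cost, accumulate actual losses,
    # and take twice the area between the diagonal and the Lorenz curve.
    order = sorted(range(len(actual)), key=lambda i: pred[i])
    total, cum, lorenz = sum(actual), 0.0, []
    for i in order:
        cum += actual[i]
        lorenz.append(cum / total)
    n = len(actual)
    diag = [(k + 1) / n for k in range(n)]
    return 2.0 * sum(d - l for d, l in zip(diag, lorenz)) / n

# Tiny illustrative portfolio: mostly zero losses, as in real claims data.
actual = [0.0, 0.0, 100.0, 50.0, 0.0, 300.0]
pred   = [10.0, 5.0, 120.0, 40.0, 8.0, 250.0]
print(percentage_error(actual, pred), mae(actual, pred), gini(actual, pred))
```

A higher Gini index means the model ranks risks better; a PE near zero means the portfolio-level total is well calibrated, which is the property the paper highlights.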

The InsurTech-enhanced models (LightGBM and Tweedie GLM with elastic net) consistently outperformed the insurance in-house model for each coverage group, and this improvement holds irrespective of the chosen loss model. The InsurTech-enhanced models also substantially reduce the absolute value of PE, indicating better predictive performance at the portfolio level.

The authors use several techniques for model interpretation:

  • Feature Importance:
    • Mean Decrease in Impurity (MDI).
    • Mean Decrease in Accuracy (MDA).
    • SHapley Additive exPlanations (SHAP).
  • Accumulated Local Effects (ALE) Plots: To visualize the average impact of a feature on the predictions.
  • Illustrative Individual Cases: Using SHAP values to explain individual predictions.
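Mean Decrease in Accuracy is typically computed by permutation: shuffle one feature column and measure how much the model's error grows. The sketch below uses a stand-in toy model, not the paper's fitted LightGBM, purely to show the mechanic.

```python
import random

def model(row):
    # Toy model that depends only on the first feature.
    return 3.0 * row[0] + 0.0 * row[1]

def mae(rows, targets, predict):
    return sum(abs(predict(r) - t) for r, t in zip(rows, targets)) / len(rows)

random.seed(1)
rows = [[random.random(), random.random()] for _ in range(1000)]
targets = [3.0 * x0 for x0, _ in rows]  # feature 0 fully determines the target

baseline = mae(rows, targets, model)
importances = {}
for j in range(2):
    col = [r[j] for r in rows]
    random.shuffle(col)  # break the link between feature j and the target
    permuted = [r[:j] + [col[i]] + r[j + 1:] for i, r in enumerate(rows)]
    importances[j] = mae(permuted, targets, model) - baseline

# Permuting feature 0 hurts accuracy; feature 1 is unused, so its
# importance is zero.
print(importances)
```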

The model interpretability revealed several risk factors derived from InsurTech data, including:

  • Coordinates of addresses
  • Proximity scores
  • Territory risks (especially version 2, e.g., TERRITORY.j2, related to fire risk)
  • Visibility indexes
  • Review scores
  • Business classification and segment proportions

ALE plots for proximity traffic scores show how these features influence the model's predicted claim loss, and illustrative individual cases with SHAP values show how the features shaped individual predictions.