Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models (2505.24655v1)

Published 30 May 2025 in cs.AI and cs.LG

Abstract: Cardiovascular disease (CVD) risk prediction models are essential for identifying high-risk individuals and guiding preventive actions. However, existing models struggle with the challenges of real-world clinical practice as they oversimplify patient profiles, rely on rigid input schemas, and are sensitive to distribution shifts. We developed AdaCVD, an adaptable CVD risk prediction framework built on LLMs extensively fine-tuned on over half a million participants from the UK Biobank. In benchmark comparisons, AdaCVD surpasses established risk scores and standard machine learning approaches, achieving state-of-the-art performance. Crucially, for the first time, it addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data. In stratified analyses, it demonstrates robust performance across demographic, socioeconomic, and clinical subgroups, including underrepresented cohorts. AdaCVD offers a promising path toward more flexible, AI-driven clinical decision support tools suited to the realities of heterogeneous and dynamic healthcare environments.

Summary

Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data Using LLMs

The paper introduces AdaCVD, a cardiovascular disease (CVD) risk prediction framework that uses LLMs to address the challenges of real-world clinical settings. Traditional CVD risk prediction models rely on fixed sets of input variables and on complete, structured data, so they cope poorly with the diverse, incomplete, and heterogeneous data typical of clinical practice. AdaCVD addresses these limitations with LLMs fine-tuned on a UK Biobank cohort of over half a million participants.

AdaCVD advances CVD risk prediction in several ways. Trained on a limited set of well-known risk factors, it surpasses established medical risk scores and conventional machine learning models, achieving state-of-the-art performance. It also accepts a wide range of health-related data types, including both structured data and unstructured text, and retains strong predictive performance when information is incomplete. This flexibility lets AdaCVD draw on comprehensive patient profiles, which refines risk assessment particularly for underrepresented groups such as the elderly, smokers, and individuals with diabetes.

A significant aspect of the AdaCVD framework is its capacity to process textual inputs, which are prevalent in clinical practice in the form of clinical notes and physician reports. The paper shows that AdaCVD reasons effectively over unstructured input and adapts from structured data representations to free-text formats with high data efficiency. This suggests potential for broader application across clinical decision-making tasks in which textual data is abundant but has traditionally been underused by machine learning models because of its complexity.
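As a rough illustration of how variable structured fields and free text can share one textual input, the sketch below serializes a patient record into plain text, skipping any field that is missing. It is a minimal sketch and not the paper's prompt format: the function `serialize_patient`, the `FIELD_LABELS` mapping, the field names, and the units are all assumptions made for this example.

```python
from typing import Mapping, Optional

# Hypothetical field names and labels, chosen for illustration only;
# they are not taken from the paper or from the UK Biobank schema.
FIELD_LABELS = {
    "age": "Age (years)",
    "sex": "Sex",
    "systolic_bp": "Systolic blood pressure (mmHg)",
    "total_cholesterol": "Total cholesterol (mmol/L)",
    "smoker": "Current smoker",
    "diabetes": "Diabetes",
}


def serialize_patient(record: Mapping[str, object],
                      note: Optional[str] = None) -> str:
    """Turn a possibly incomplete record into one text block.

    Fields that are absent or None are omitted rather than imputed,
    so the same function handles any subset of the known fields.
    """
    lines = []
    for key, label in FIELD_LABELS.items():
        value = record.get(key)
        if value is None:
            continue  # missing information is simply left out
        if isinstance(value, bool):
            value = "yes" if value else "no"
        lines.append(f"{label}: {value}")
    # Free text (for example a clinical note) is appended verbatim.
    if note and note.strip():
        lines.append(f"Clinical note: {note.strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    patient = {"age": 61, "smoker": True, "systolic_bp": None}
    print(serialize_patient(patient, note="Reports chest pain on exertion."))
```

Because missing fields are dropped instead of filled in, the same routine can serve a sparse record and a comprehensive one, which is the kind of schema flexibility the summary attributes to a text-based model input.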

AdaCVD's adaptability extends to distribution shifts, a common challenge when a model trained in one environment is deployed in another. The research demonstrates successful adaptation to a patient cohort from the Framingham Heart Study using minimal additional data, indicating that the model is robust to geographic and temporal variations in patient data.

Overall, AdaCVD represents a substantial step toward integrating AI models into clinical environments. It highlights the benefit of LLMs for clinical prediction: they can process and reason over variable and unstructured inputs, which enables flexible, robust, and context-aware risk prediction models that align more closely with the dynamic and multifaceted nature of healthcare data.

Future research may explore multimodal inputs, for example combining textual data with imaging, to further improve predictive performance. Wider validation across global populations could also substantiate the model's applicability and address potential biases inherent in the dataset on which it was initially trained.

By applying the pre-training and fine-tuning paradigm, the AdaCVD framework addresses the inherent complexities of predicting CVD risk and lays a foundation for future AI-driven clinical tools. As healthcare increasingly relies on digital data, approaches like AdaCVD could lead to more personalized and effective clinical interventions and could strengthen preventive healthcare strategies.

Authors (7)

Tweets

https://twitter.com/FrederikeLubeck/status/1946270420027498804
