
Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data (2401.06866v2)

Published 12 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are capable of many natural language tasks, yet they are far from perfect. In health applications, grounding and interpreting domain-specific and non-linguistic data is crucial. This paper investigates the capacity of LLMs to make inferences about health based on contextual information (e.g. user demographics, health knowledge) and physiological data (e.g. resting heart rate, sleep minutes). We present a comprehensive evaluation of 12 state-of-the-art LLMs with prompting and fine-tuning techniques on four public health datasets (PMData, LifeSnaps, GLOBEM and AW_FB). Our experiments cover 10 consumer health prediction tasks in mental health, activity, metabolic, and sleep assessment. Our fine-tuned model, HealthAlpaca, exhibits comparable performance to much larger models (GPT-3.5, GPT-4 and Gemini-Pro), achieving the best performance in 8 out of 10 tasks. Ablation studies highlight the effectiveness of context enhancement strategies. Notably, we observe that our context enhancement can yield up to 23.8% improvement in performance. While constructing contextually rich prompts (combining user context, health knowledge and temporal information) exhibits synergistic improvement, the inclusion of health knowledge context in prompts significantly enhances overall performance.

Introduction

LLMs have shown remarkable competencies across various text generation and information retrieval tasks. In healthcare, however, their ability to process multi-modal data, especially time-series physiological and behavioral data from wearable sensors, has yet to be thoroughly examined. The paper addresses this gap by proposing Health-LLM, a framework that comprehensively tests the effectiveness of twelve state-of-the-art LLMs on health predictions augmented with wearables data. The paper ensures robustness by incorporating diverse prompting and fine-tuning techniques and evaluates performance across ten consumer health prediction tasks.

Methodology

Health-LLM's evaluation of the models' performance on health prediction tasks is two-pronged: through zero-shot and few-shot prompting and through instructional fine-tuning. Zero-shot prompting assesses models' in-built knowledge without additional training, while few-shot prompting offers the models a few illustrative examples to learn from. Instructional fine-tuning goes a step further by adapting the whole model to the task-specific data. The paper also examines the benefit of context enhancement in prompts, where supplementary information such as user demographics or health knowledge is strategically included for performance refinement.
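The prompting side of this pipeline can be illustrated with a short sketch. The snippet below is a minimal, hypothetical reconstruction (not the authors' code): `WearableRecord`, `build_prompt`, and all field names are assumptions chosen for illustration. It shows how a zero-shot or few-shot query might be assembled from serialized sensor readings, with the optional context-enhancement strings (user demographics, health knowledge) prepended.

```python
from dataclasses import dataclass

@dataclass
class WearableRecord:
    """One day of (hypothetical) wearable-sensor aggregates."""
    resting_hr: int     # resting heart rate, beats per minute
    sleep_minutes: int  # total sleep duration
    steps: int          # daily step count

def build_prompt(records, user_context="", health_knowledge="",
                 examples=None,
                 question="Rate this user's stress level from 0 to 10."):
    """Assemble a health-prediction prompt.

    Zero-shot: examples is None. Few-shot: examples is a list of
    (record-sequence, answer) pairs shown before the query. The
    user_context and health_knowledge strings are the optional
    context-enhancement components described in the paper.
    """
    parts = []
    if user_context:
        parts.append(f"User context: {user_context}")
    if health_knowledge:
        parts.append(f"Health knowledge: {health_knowledge}")

    def fmt(seq):
        # Serialize the time series day by day (temporal context).
        return "\n".join(
            f"Day {i + 1}: resting HR {r.resting_hr} bpm, "
            f"sleep {r.sleep_minutes} min, {r.steps} steps"
            for i, r in enumerate(seq)
        )

    for seq, answer in (examples or []):
        parts.append(f"{fmt(seq)}\n{question}\nAnswer: {answer}")
    parts.append(f"{fmt(records)}\n{question}\nAnswer:")
    return "\n\n".join(parts)

week = [WearableRecord(62, 410, 8500), WearableRecord(71, 350, 4200)]
prompt = build_prompt(
    week,
    user_context="34-year-old office worker",
    health_knowledge="Elevated resting heart rate and short sleep "
                     "are associated with higher stress.",
)
print(prompt)
```

The same `build_prompt` call with `examples=[...]` yields the few-shot variant, and dropping the two context arguments yields the plain zero-shot baseline, which makes ablating each context component a one-argument change.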

Findings

The paper found that zero-shot prompted LLMs tend to perform on par with designated task-specific baseline models. Few-shot prompting, particularly with larger models such as GPT-3.5 and GPT-4, demonstrated a noteworthy ability to interpret the physiological time-series data. The fine-tuned HealthAlpaca model, despite being significantly smaller than its GPT counterparts, recorded the best performance in 8 of the 10 tasks, underscoring the efficiency LLMs can achieve when fine-tuned on health-specific data. Context enhancement was another highlight: including additional context in prompts led to substantial gains, particularly when health knowledge was involved.

Implications and Ethical Considerations

The implications of this paper are significant for the healthcare domain. The research suggests that LLMs possess a largely untapped potential for predicting health outcomes from wearable sensor data, which could transform patient monitoring and care. However, the authors flag critical ethical considerations such as privacy protection, bias mitigation, and the prevention of "model hallucination," where the model might generate convincing yet incorrect predictions. They call for thorough ethical review to enhance the safety and reliability of LLMs in health applications before real-world deployment.

In conclusion, this paper paves the way for future research dedicated to refining models' reasoning, enhancing personalization, and addressing data security in healthcare settings. The practical deployment of Health-LLMs could mark a significant step towards achieving AI-driven personalized healthcare but must be navigated responsibly.

Authors (5)
  1. Yubin Kim
  2. Xuhai Xu
  3. Daniel McDuff
  4. Cynthia Breazeal
  5. Hae Won Park
Citations (44)