MEIT: Multi-Modal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation (2403.04945v3)

Published 7 Mar 2024 in cs.CL, cs.LG, and eess.SP

Abstract: Electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLM backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT's results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, and resilience to signal perturbation. These findings emphasize the efficacy of our MEIT framework and its potential for real-world clinical application.
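The abstract describes conditioning an LLM on ECG signals through multimodal instruction tuning so that it generates the clinical report. Below is a minimal, hedged sketch of what such a pipeline can look like, assuming a 1D-CNN ECG encoder, a Hugging Face-style causal LM that accepts `inputs_embeds` and `labels`, and a language-modeling loss masked to the report tokens. The encoder design, token counts, and the exact ECG-report alignment mechanism are illustrative assumptions, not MEIT's published architecture.

```python
# Illustrative sketch only: an ECG encoder maps the 12-lead signal into the
# LLM's embedding space, and the LLM is trained to generate the report
# conditioned on [ECG tokens; instruction]. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class ECGEncoder(nn.Module):
    """Assumed 1D-CNN encoder: (batch, 12 leads, T samples) -> (batch, n_tokens, d_model)."""
    def __init__(self, n_leads=12, d_model=768, n_tokens=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_leads, 256, kernel_size=15, stride=4, padding=7),
            nn.GELU(),
            nn.Conv1d(256, d_model, kernel_size=15, stride=4, padding=7),
            nn.GELU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(n_tokens)  # fixed number of ECG "tokens"

    def forward(self, ecg):            # ecg: (B, 12, T)
        h = self.conv(ecg)             # (B, d_model, T')
        h = self.pool(h)               # (B, d_model, n_tokens)
        return h.transpose(1, 2)       # (B, n_tokens, d_model)

def instruction_tuning_step(llm, ecg_encoder, ecg, instr_embeds, report_ids, report_embeds):
    """One training step: prepend projected ECG tokens to the instruction embeddings,
    then compute the next-token loss only over the report span.
    `llm` is assumed to be a Hugging Face causal LM (e.g. AutoModelForCausalLM)."""
    ecg_tokens = ecg_encoder(ecg)                                   # (B, N, d)
    inputs_embeds = torch.cat([ecg_tokens, instr_embeds, report_embeds], dim=1)
    # Labels: -100 (ignored by the loss) everywhere except the report tokens.
    batch_size = report_ids.size(0)
    prefix_len = ecg_tokens.size(1) + instr_embeds.size(1)
    ignore = torch.full((batch_size, prefix_len), -100,
                        dtype=torch.long, device=report_ids.device)
    labels = torch.cat([ignore, report_ids], dim=1)
    out = llm(inputs_embeds=inputs_embeds, labels=labels)
    return out.loss
```

In this sketch, `instr_embeds` and `report_embeds` would be obtained from the LLM's own input embedding table (e.g. `llm.get_input_embeddings()(token_ids)`), so the ECG tokens, instruction, and report all live in the same embedding space; how MEIT actually aligns signal and report representations is not detailed in the abstract.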
