SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval (2401.02369v2)

Published 4 Jan 2024 in cs.CL

Abstract: Clinicians must write a lengthy summary each time a patient is discharged from the hospital. This task is time-consuming due to the sheer number of unique clinical concepts covered in the admission. Identifying and covering salient entities is vital for the summary to be clinically useful. We fine-tune open-source LLMs (Mistral-7B-Instruct and Zephyr-7B-beta) on the task and find that they generate incomplete and unfaithful summaries. To increase entity coverage, we train a smaller, encoder-only model to predict salient entities, which are treated as content plans to guide the LLM. To encourage the LLM to focus on specific mentions in the source notes, we propose SPEER: Sentence-level Planning via Embedded Entity Retrieval. Specifically, we mark each salient entity span with special "{{ }}" boundary tags and instruct the LLM to retrieve marked spans before generating each sentence. Sentence-level planning acts as a form of state tracking in that the model explicitly records the entities it uses. We fine-tune Mistral and Zephyr variants on a large-scale, diverse dataset of ~167k in-patient hospital admissions and evaluate on 3 datasets. SPEER shows gains in both coverage and faithfulness metrics over non-guided and guided baselines.
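The abstract's core mechanism is marking salient entity spans in the source notes with "{{ }}" boundary tags before prompting the LLM. A minimal sketch of that tagging step is below; it is illustrative only, since the paper's actual pipeline uses an encoder-only model to identify salient spans rather than the raw string matching assumed here.

```python
import re


def mark_salient_spans(note: str, salient_entities: list[str]) -> str:
    """Wrap mentions of salient entities in {{ }} boundary tags.

    Hypothetical sketch of the span-marking step described in the
    SPEER abstract. In the paper, salient entities come from a
    fine-tuned encoder-only tagger; here we assume they are given
    as plain strings and matched case-insensitively in the note.
    """
    marked = note
    for entity in salient_entities:
        # Whole-word, case-insensitive match; every occurrence is
        # wrapped so the LLM can retrieve any marked span.
        pattern = re.compile(r"\b" + re.escape(entity) + r"\b", re.IGNORECASE)
        marked = pattern.sub(lambda m: "{{ " + m.group(0) + " }}", marked)
    return marked


note = "Patient admitted with acute pancreatitis; started on IV fluids."
print(mark_salient_spans(note, ["acute pancreatitis", "IV fluids"]))
```

At generation time, the fine-tuned LLM is instructed to copy the relevant marked spans before writing each summary sentence, which is what makes the planning act as explicit state tracking.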

Authors (3)
  1. Griffin Adams (14 papers)
  2. Jason Zucker (4 papers)
  3. Noémie Elhadad (28 papers)
Citations (2)