Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency? (2404.06503v1)

Published 9 Apr 2024 in cs.CL

Abstract: Following an interaction with a patient, physicians are responsible for submitting clinical documentation, often organized as a SOAP note. A clinical note is not simply a summary of the conversation: it requires appropriate medical terminology, and the relevant information must be extracted and organized according to the structure of the SOAP note. In this paper we analyze two approaches to generating the sections of a SOAP note from the audio recording of the conversation, and examine them specifically in terms of note consistency. The first approach generates the sections independently, while the second generates them all together. Using PEGASUS-X Transformer models, we observe that both methods lead to similar ROUGE values (less than 1% difference) and show no difference on the Factuality metric. We perform a human evaluation to measure aspects of consistency, and demonstrate that LLMs such as Llama2 can perform the same tasks with roughly the same agreement as the human annotators. Between the Llama2 analysis and the human reviewers we observe Cohen's kappa inter-rater reliabilities of 0.79, 1.00, and 0.32 for consistency of age, gender, and body-part injury, respectively. This demonstrates the usefulness of leveraging an LLM to measure quality indicators that humans can identify but that automatic metrics do not currently capture, allowing evaluation to scale to larger data sets. Finally, we find that clinical note consistency improves when each new section is generated conditioned on the output of all previously generated sections.
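
The contrast between the two designs can be made concrete with a short sketch. The snippet below is a hypothetical illustration, not the authors' code: the checkpoint name, the "section:" prefix convention, and the generation settings are all assumptions; only the Hugging Face transformers API calls are real.

```python
# Minimal sketch of the two SOAP-note generation designs compared in the
# paper. Assumptions (not from the paper's code): the checkpoint below
# stands in for a fine-tuned PEGASUS-X summarizer, and the "section:"
# prefix is an invented conditioning convention.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "google/pegasus-x-base"  # placeholder; a fine-tuned model is assumed
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

SECTIONS = ["Subjective", "Objective", "Assessment", "Plan"]

def generate(text: str) -> str:
    """One seq2seq pass: encode the input, decode one section."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def independent_note(transcript: str) -> dict:
    # Design 1: every section is generated from the transcript alone,
    # so nothing ties the sections to one another.
    return {s: generate(f"section: {s}\n{transcript}") for s in SECTIONS}

def cascaded_note(transcript: str) -> dict:
    # Design 2: each new section is conditioned on the transcript plus
    # all previously generated sections; per the abstract, this is the
    # variant that improves cross-section consistency.
    note, context = {}, ""
    for s in SECTIONS:
        note[s] = generate(f"section: {s}\n{context}\n{transcript}")
        context += f"\n{s}: {note[s]}"
    return note
```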

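The agreement analysis at the end of the abstract is a standard inter-rater computation. Below is a minimal sketch, assuming binary consistent/inconsistent labels per note and attribute; the label arrays are invented for illustration, and only `cohen_kappa_score` is a real scikit-learn function.

```python
# Sketch of the LLM-vs-human agreement analysis: Cohen's kappa between an
# LLM judge's consistency labels and a human annotator's. All label data
# here is invented; the paper reports kappas of 0.79, 1.00, and 0.32 for
# age, gender, and body-part consistency, respectively.
from sklearn.metrics import cohen_kappa_score

# 1 = attribute is consistent across the note's sections, 0 = inconsistent.
human_labels = {
    "age":       [1, 1, 0, 1, 1, 0, 1, 1],
    "gender":    [1, 1, 1, 1, 0, 1, 1, 1],
    "body_part": [1, 0, 1, 1, 0, 1, 0, 1],
}
llm_labels = {
    "age":       [1, 1, 0, 1, 1, 1, 1, 1],
    "gender":    [1, 1, 1, 1, 0, 1, 1, 1],
    "body_part": [1, 0, 1, 0, 1, 1, 0, 1],
}

for attr in human_labels:
    kappa = cohen_kappa_score(human_labels[attr], llm_labels[attr])
    print(f"{attr}: Cohen's kappa = {kappa:.2f}")
```
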
References (26)
  1. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization & generation of doctor-patient conversations. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 503–513, Toronto, Canada. Association for Computational Linguistics.
  2. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  3. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  4. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30).
  5. Revisiting text decomposition methods for NLI-based factuality scoring of summaries. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 97–105, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  6. Prometheus: Inducing evaluation capability in language models. In The Twelfth International Conference on Learning Representations.
  7. Generating SOAP notes from doctor-patient conversations using modular summarization techniques. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4958–4972, Online. Association for Computational Linguistics.
  8. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  9. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  10. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
  11. MedicalSum: A guided clinical abstractive summarization model for generating medical reports from patient-doctor conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4741–4749, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  12. OpenAI. 2023. GPT-4 technical report. arXiv preprint, arXiv:2303.08774.
  13. Investigating efficiently extending transformers for long input summarization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3946–3961, Singapore. Association for Computational Linguistics.
  14. Generating more faithful and consistent SOAP notes using attribute-specific parameters. Proceedings of Machine Learning Research, 219:1–19, 2023.
  15. Are you dictating to me? Detecting embedded dictations in doctor-patient conversations. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 587–593.
  16. Large scale sequence-to-sequence models for clinical note generation from patient-doctor conversations. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 138–143, Toronto, Canada. Association for Computational Linguistics.
  17. Extract and abstract with BART for clinical notes from doctor-patient conversations. In Proc. Interspeech 2022, pages 2488–2492.
  18. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint, arXiv:2307.09288.
  19. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  20. Enhancing medical text evaluation with GPT-4.
  21. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Scientific Data, 10.
  22. Evaluating large language models at evaluating instruction following. In The Twelfth International Conference on Learning Representations.
  23. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.
  24. Leveraging pretrained models for automatic summarization of doctor-patient conversations. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3693–3712, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  25. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623. Curran Associates, Inc.
  26. Fine-tuning language models from human preferences. arXiv preprint, arXiv:1909.08593.