Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries (2403.01002v2)

Published 1 Mar 2024 in cs.CL and cs.AI

Abstract: Summarizing clinical text is crucial in health decision support and clinical research. LLMs have shown the potential to generate accurate clinical text summaries, but still struggle with grounding and evaluation, especially in safety-critical domains such as health. Holistically evaluating text summaries is challenging because they may contain unsubstantiated information. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process: it decomposes evaluation into a grounded procedure that uses an LLM for relatively simple structuring and scoring tasks, rather than the full task of holistic summary evaluation. Experiments show that AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization. Additionally, AS yields interpretations in the form of a short text span corresponding to each output, which enables efficient human auditing, paving the way towards trustworthy evaluation of clinical information in resource-constrained scenarios. We release our code, prompts, and an open-source benchmark at https://github.com/microsoft/attribute-structuring.
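
To make the two-step decomposition concrete, here is a minimal sketch of an AS-style evaluation loop. The attribute list, prompt wording, and function names below are illustrative assumptions rather than the paper's exact implementation; the released code at the repository above is authoritative.

```python
# Sketch of an Attribute Structuring (AS) style evaluation loop.
# The attribute list, prompts, and helper names are hypothetical;
# see https://github.com/microsoft/attribute-structuring for the
# authors' released code and prompts.

from typing import Callable

# Hypothetical clinical attributes to structure both summaries around.
ATTRIBUTES = ["diagnosis", "medications", "follow-up plan"]


def extract_attribute(llm: Callable[[str], str], summary: str, attribute: str) -> str:
    """Step 1 (structuring): ask the LLM for the short text span covering
    one attribute, instead of judging the whole summary at once."""
    prompt = (
        f"From the summary below, extract the span describing '{attribute}'. "
        f"Answer 'N/A' if it is absent.\n\nSummary:\n{summary}"
    )
    return llm(prompt)


def score_attribute(llm: Callable[[str], str], reference_span: str, candidate_span: str) -> int:
    """Step 2 (scoring): a simple 1-5 agreement judgment on two short spans,
    a much easier LLM task than holistic summary evaluation."""
    prompt = (
        "Rate from 1 (contradictory) to 5 (equivalent) how well these two "
        f"spans agree.\nReference: {reference_span}\nCandidate: {candidate_span}\nScore:"
    )
    return int(llm(prompt).strip()[0])


def attribute_structured_score(llm: Callable[[str], str], reference: str, candidate: str) -> float:
    """Average per-attribute score. Each extracted span pair doubles as an
    interpretable, human-auditable rationale for the final number."""
    scores = []
    for attr in ATTRIBUTES:
        ref_span = extract_attribute(llm, reference, attr)
        cand_span = extract_attribute(llm, candidate, attr)
        scores.append(score_attribute(llm, ref_span, cand_span))
    return sum(scores) / len(scores)
```

Because the final score is an average over attribute-level judgments, an auditor can inspect the extracted span pair behind any single attribute score without rereading the full summaries, which is the efficiency claim the abstract makes.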

Authors (9)
  1. Zelalem Gero (5 papers)
  2. Chandan Singh (42 papers)
  3. Yiqing Xie (22 papers)
  4. Sheng Zhang (212 papers)
  5. Tristan Naumann (41 papers)
  6. Jianfeng Gao (344 papers)
  7. Hoifung Poon (61 papers)
  8. Praveen Subramanian (1 paper)
  9. Paul Vozila (6 papers)