Self-Verification Improves Few-Shot Clinical Information Extraction (2306.00024v1)

Published 30 May 2023 in cs.CL and cs.LG

Abstract: Extracting patient information from unstructured text is a critical task in health decision-support and clinical research. Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning, in contrast to supervised learning, which requires far more costly human annotations. However, despite drastic advances in modern LLMs such as GPT-4, they still struggle with accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and to check its own outputs. This is made possible by the asymmetry between verification and generation: verifying an output is often much easier than generating it. Experimental results show that our method consistently improves accuracy for various LLMs on standard clinical information extraction tasks. Additionally, self-verification yields an interpretation in the form of a short text span corresponding to each output, which makes it very efficient for human experts to audit the results, paving the way towards trustworthy extraction of clinical information in resource-constrained scenarios. To facilitate future research in this direction, we release our code and prompts.
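As a rough illustration of the approach described in the abstract, below is a minimal Python sketch of a two-stage extract-then-verify loop: the model first extracts candidate mentions via few-shot prompting, then is asked to quote a supporting evidence span from the note for each candidate, and unsupported outputs are dropped. The `call_llm` wrapper, the prompts, and the medication-extraction task are illustrative assumptions, not the authors' released code or prompts.

```python
# Hedged sketch of few-shot extraction followed by self-verification.
# `call_llm` is a hypothetical stand-in for any chat/completion API.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion endpoint."""
    raise NotImplementedError("plug in your provider's API here")

def extract_entities(note: str, few_shot_examples: str) -> list[str]:
    # Stage 1: few-shot extraction of clinical mentions (e.g., medications).
    prompt = (
        f"{few_shot_examples}\n"
        f"Clinical note:\n{note}\n"
        "List every medication mentioned, one per line:"
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def verify_entity(note: str, entity: str) -> tuple[bool, str]:
    # Stage 2: self-verification. Ask the model to ground the extraction in a
    # short evidence span from the note; outputs it cannot support are dropped.
    prompt = (
        f"Clinical note:\n{note}\n"
        f"Candidate extraction: {entity}\n"
        "Quote the exact span of the note that supports this extraction, "
        "or answer 'NONE' if it is not supported:"
    )
    evidence = call_llm(prompt).strip()
    return evidence.upper() != "NONE", evidence

def extract_with_self_verification(note: str, few_shot_examples: str) -> list[dict]:
    results = []
    for entity in extract_entities(note, few_shot_examples):
        supported, evidence = verify_entity(note, entity)
        if supported:
            # The evidence span doubles as an interpretation a clinician can audit.
            results.append({"entity": entity, "evidence": evidence})
    return results
```

In this sketch, the per-output evidence span is kept alongside each extraction, mirroring the paper's point that short provenance spans make expert auditing efficient.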

Authors (7)
  1. Zelalem Gero (5 papers)
  2. Chandan Singh (42 papers)
  3. Hao Cheng (190 papers)
  4. Tristan Naumann (41 papers)
  5. Michel Galley (50 papers)
  6. Jianfeng Gao (344 papers)
  7. Hoifung Poon (61 papers)
Citations (46)
